This repo contains code for MIMIC-Extract. It has been divided into the following folders:
mimic_direct_extract.py: extraction script.
If you use this code in your research, please cite the following publication:
Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Michael C. Hughes, Tristan Naumann,
and Marzyeh Ghassemi. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation
Pipeline for MIMIC-III. arXiv:1907.08322.
If you simply wish to use the output of this pipeline in your own research, a preprocessed version built with default parameters is available via GCP, here.
To access it, you will need to be credentialed for MIMIC-III GCP access through PhysioNet. Instructions for that are available on PhysioNet.
This output is released on an as-is basis, with no guarantees, but if you find any issues with it, please let us know via GitHub issues.
The first several steps are the same here as above. These instructions were tested with mimic-code at version 762943eab64deb30bdb2abcf7db43602ccb25908.
Your local system should have the following executables on the PATH:
All instructions below should be executed from a terminal, with the current directory set to utils/.
Next, make a new conda environment from mimic_extract_env_py36.yml and activate that environment.
conda env create --force -f ../mimic_extract_env_py36.yml
This step will report a failure at the pip installation stage. That is not fatal. Simply activate the environment, which should work despite the reported failure:
conda activate mimic_data_extraction
Then install any packages that failed with pip (e.g., pip install [package]). In particular, this may include the packages datapackage, spacy, and scispacy.
You will then also need to install the English language model for spacy, via:
python -m spacy download en_core_web_sm
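Because the pip stage can fail partway, it may help to confirm which packages actually landed in the environment. The following is a minimal sketch and not part of the repository: the helper name missing_packages is hypothetical, and it assumes each of the three packages named above is imported under the same name it is installed under.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Packages this README singles out as likely to fail during env creation.
    missing = missing_packages(["datapackage", "spacy", "scispacy"])
    if missing:
        print("Install with pip:", " ".join(missing))
    else:
        print("All listed packages are importable.")
```

Run it inside the activated environment; anything it lists can then be installed with pip as described above.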
The desired environment will be created and activated.
This typically takes less than 5 minutes and requires a good internet connection.
Materialized views will be generated in the MIMIC PostgreSQL database. These include all concept tables in the MIT-LCP repo, plus tables for extracting non-mechanical ventilation and injections of crystalloid and colloid boluses.
Note that your postgres user needs schema edit permission to make concepts in this way. First, you must clone the mimic-code GitHub repository to a directory, whose path we assume here is stored in the environment variable $MIMIC_CODE_DIR. After cloning, follow these instructions:
cd $MIMIC_CODE_DIR/concepts
psql -d mimic -f postgres-functions.sql
bash postgres_make_concepts.sh
Next, you'll need to build 3 additional materialized views necessary for this pipeline. To do this (again with schema edit permission), navigate to utils and run
bash postgres_make_extended_concepts.sh
followed by
psql -d mimic -f niv-durations.sql
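If you rebuild concepts often, the four commands above can be driven from one place. This is only a sketch under assumptions: the function names are hypothetical, the database name defaults to the mimic used in the commands above, and utils_dir is wherever this repository's utils folder lives on your machine.

```python
import os
import subprocess


def concept_build_steps(mimic_code_dir, utils_dir, dbname="mimic"):
    """Return (working directory, argv) pairs in the order given in the text."""
    concepts_dir = os.path.join(mimic_code_dir, "concepts")
    return [
        (concepts_dir, ["psql", "-d", dbname, "-f", "postgres-functions.sql"]),
        (concepts_dir, ["bash", "postgres_make_concepts.sh"]),
        (utils_dir, ["bash", "postgres_make_extended_concepts.sh"]),
        (utils_dir, ["psql", "-d", dbname, "-f", "niv-durations.sql"]),
    ]


def run_steps(steps, runner=subprocess.run):
    """Run each step in its directory; check=True stops at the first failure."""
    for cwd, argv in steps:
        runner(argv, cwd=cwd, check=True)
```

For example, run_steps(concept_build_steps(os.environ["MIMIC_CODE_DIR"], "utils")) would execute the steps in order from the repository root.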
Next, navigate to the root directory of this repository, activate your conda environment and run
python mimic_direct_extract.py ...
with your args as desired.
The default setting will create an hdf5 file inside MIMIC_EXTRACT_OUTPUT_DIR with four tables:
patients: static demographics, static outcomes
vitals_labs: time-varying vitals and labs (hourly mean, count and standard deviation)
vitals_labs_mean: time-varying vitals and labs (hourly mean only)
interventions: hourly binary indicators for administered interventions
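Once the file exists, the four tables can be read back with pandas. The sketch below is an illustration rather than repository code: the helper name load_extract_tables is hypothetical, it assumes the HDF5 keys match the table names listed above, and pandas.read_hdf additionally needs PyTables installed. The file path is yours to supply.

```python
TABLE_KEYS = ("patients", "vitals_labs", "vitals_labs_mean", "interventions")


def load_extract_tables(h5_path, keys=TABLE_KEYS, reader=None):
    """Return {table name: table} for each requested key in the HDF5 file."""
    if reader is None:
        # Imported lazily so the helper can be exercised without pandas.
        import pandas as pd
        reader = pd.read_hdf
    return {key: reader(str(h5_path), key=key) for key in keys}
```

Passing keys=("patients",) loads only the static table, which is much cheaper than loading the hourly tables.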
This will probably take 5-10 hours and requires a machine with at least 50 GB of RAM.
By default, this step builds a dataset with all eligible patients. Sometimes we wish to run with only a small subset of patients (for debugging, etc.).
To do this, set the POP_SIZE environment variable. For example, setting POP_SIZE to 1000 builds a curated dataset with only the first 1000 patients.
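One way to set POP_SIZE for a single run without changing your shell session is sketched below. It assumes your build entry point reads POP_SIZE from the environment, as the text describes. The helper names are hypothetical, and the command is a placeholder for however you launch the build (directly or via the Makefile).

```python
import os
import subprocess


def env_with_pop_size(pop_size, base_env=None):
    """Copy an environment mapping and set POP_SIZE as a string."""
    env = dict(os.environ if base_env is None else base_env)
    env["POP_SIZE"] = str(int(pop_size))
    return env


def run_build(command, pop_size):
    """Run `command` (an argv list you supply) with POP_SIZE set for that process only."""
    return subprocess.run(command, env=env_with_pop_size(pop_size), check=True)
```

Because the environment is copied, the variable does not leak into later commands run from the same Python session.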
When running mimic_direct_extract.py, I encounter an error of the form:
psycopg2.OperationalError: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
or
psycopg2.OperationalError: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/..."?
For this issue, see this Stack Overflow post and use our --psql_host argument, which you can either pass directly when calling mimic_direct_extract.py or set via the Makefile instructions through the HOST environment variable.
relation "code_status" does not exist
In this error, the table code_status hasn't been built successfully, and you'll need to rebuild your MIMIC-III concepts. Instructions for this can be found in Step 3 of either instruction set. Also see below for our issues specific to building concepts.
If the table exists but sits outside the mimiciii schema, the following statements move it and grant read access:
ALTER TABLE code_status SET SCHEMA mimiciii;
GRANT SELECT ON mimiciii.code_status TO [USER];
Note that you'll need to run these on every concepts table accessed by the script.
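Since the ALTER TABLE and GRANT SELECT statements have to be repeated for each concepts table, a small generator can save typing. This is a sketch with a hypothetical helper name. Only code_status is taken from the text, so the list of tables is something you would fill in yourself. Names are inserted unquoted, so pass only identifiers you trust.

```python
def schema_fix_statements(tables, user, schema="mimiciii"):
    """Build the ALTER TABLE / GRANT SELECT pair for every table name given."""
    statements = []
    for table in tables:
        statements.append(f"ALTER TABLE {table} SET SCHEMA {schema};")
        statements.append(f"GRANT SELECT ON {schema}.{table} TO {user};")
    return statements


if __name__ == "__main__":
    # Only code_status comes from the text above; add the other concept
    # tables that your run of the script reports as missing.
    for line in schema_fix_statements(["code_status"], user="[USER]"):
        print(line)
```

The printed statements can be saved to a file and run with psql -d mimic -f, in the same way as the SQL files used earlier.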