- Introduction
- Modules
- Code Description
- GSoC Experience
- Conclusion
- Team
- License
This project aims to provide an improved tool for clinicians treating sepsis, a life-threatening condition in which the patient has a dysregulated response to infection and from which millions of people die. Because sepsis is time-sensitive, it can quickly escalate to multi-organ failure, which greatly increases the risk of death. Here we try to accurately predict the occurrence of sepsis hours before it actually occurs. This gives doctors time to take contingency actions early, which could reduce mortality.
This project is based on the eICU database, managed by PhysioNet. Critically ill patients are admitted to the ICU, where they receive complex and time-sensitive care from a wide array of clinical staff. Electronic measuring devices attached to them produce data at regular intervals. This data, from multiple hospitals, was assimilated into the eICU database.
The patients' vitals were measured every 5 minutes. This frequency matters because less frequent sampling gives less insight into the patient's condition and, consequently, less accurate models.
In this project, we apply multiple machine learning methods to generate descriptive features that are clinically meaningful and predict the onset of sepsis.
NOTE: For the database features, please go through the documentation of the eICU database here: https://eicu-crd.mit.edu/about/eicu/
- Since there are multiple tables to work with and the Sequential Organ Failure Assessment (SOFA) score needs to be calculated from multiple sources, we merged all the relevant data into a single table. For reference (and for debugging purposes), the sources break down as follows:
- lab.csv was used to extract the lab values.
- nurseCharting.csv was used to extract the GCS scores as well as the MAP and ventilator details.
- infusionDrug.csv was used to extract all relevant vasopressors like Norepinephrine, Dopamine etc.
- vitalPeriodic.csv is where all the patients' vitals were recorded, at a frequency of 5 minutes.
- The IV antibiotics data was collected from the medication.csv table for each registered patient, while the fluid samples data was taken from microlab.csv.
- Apart from the essential parameters needed for the SOFA score calculation, we also included a number of additional variables in the final training data to check how they influence the model, as will be shown in the feature importance curve. Some of them are:
- calcium
- glucose
- lactate
- magnesium
- phosphate
- potassium
- For the SOFA calculation, we first merged all the aforementioned extracted tables. Then we followed the given rubric to calculate the SOFA-3 scores.
- For the feature extraction process, we need to introduce two concepts: the time window and the time before true onset. The time window is the amount of data, measured as a time period, used to train the model. Preprocessing keeps it constant at 6 hours, so we always train the model on 6 hours' worth of data. The time before true onset is how early we want to predict sepsis. We varied this parameter in steps of 2 hours to see how accuracy drops off as the time difference increases, using time priors of 2, 4, 6 and 8 hours, and preprocessed the entire dataframe for each of them. This gives processed data for 2 hours before sepsis with 6 hours of training data, 4 hours before with 6 hours of training data, and so on. After the SOFA calculations are done and the final training table is built from multiple clinical and vital variables, the total number of features is 27. We then extracted 7 statistical features from each of the original 27 features:
- Standard Deviation
- Kurtosis
- Skewness
- Mean
- Minimum
- Maximum
- RMS_Difference
- XGBoost is an efficient, optimized distributed gradient boosting library that provides parallel tree boosting and solves many data science problems quickly and accurately. The data is first partitioned into train (80%) and test (20%) datasets. The train set is used for the cross-validated models, while the test set is used for model validation.
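The windowing and feature extraction steps described in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's `process()` or `feature_fun()` code: the helper names `select_window` and `window_features`, the column names `offset_min` and `heartrate`, the synthetic data, and the definition of RMS_Difference as the root mean square of successive differences are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def select_window(df, onset_min, time_prior_h, time_duration_h=6, time_col="offset_min"):
    """Keep rows in the window that ends time_prior_h hours before onset."""
    end = onset_min - time_prior_h * 60
    start = end - time_duration_h * 60
    return df[(df[time_col] >= start) & (df[time_col] < end)]


def window_features(values):
    """Seven descriptive statistics for one signal inside a window."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing samples
    if x.size < 2:
        raise ValueError("need at least two non-missing samples")
    mean, std = x.mean(), x.std()
    centered = x - mean
    # Population skewness and excess kurtosis; undefined for a constant signal.
    skew = np.mean(centered**3) / std**3 if std > 0 else np.nan
    kurt = np.mean(centered**4) / std**4 - 3.0 if std > 0 else np.nan
    # Assumed definition: root mean square of successive differences.
    rms_diff = np.sqrt(np.mean(np.diff(x) ** 2))
    return {
        "std": std, "kurtosis": kurt, "skewness": skew, "mean": mean,
        "min": x.min(), "max": x.max(), "rms_difference": rms_diff,
    }


# Synthetic vitals sampled every 5 minutes (offsets in minutes from admission).
offsets = np.arange(0, 1200, 5)
vitals = pd.DataFrame({"offset_min": offsets, "heartrate": 60 + offsets / 20})

# 6-hour window ending 2 hours before a hypothetical onset at minute 1000.
window = select_window(vitals, onset_min=1000, time_prior_h=2)
feats = window_features(window["heartrate"])
print(len(window), feats)
```

With 5-minute sampling, a 6-hour window holds 72 rows per signal. Applying `window_features` to each of the 27 columns would then give one row of 27 x 7 values per patient and time prior.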
Here is a small code snippet of one of the parts of SOFA calculation:

    # Liver: score from total bilirubin
    labs_withO2.loc[(labs_withO2['total_bilirubin'] < 1.2), 'SOFA_Liver'] = 0
    labs_withO2.loc[(labs_withO2['total_bilirubin'] >= 1.2) & (labs_withO2['total_bilirubin'] <= 1.9), 'SOFA_Liver'] = 1
    labs_withO2.loc[(labs_withO2['total_bilirubin'] >= 2) & (labs_withO2['total_bilirubin'] <= 5.9), 'SOFA_Liver'] = 2
    labs_withO2.loc[(labs_withO2['total_bilirubin'] >= 6) & (labs_withO2['total_bilirubin'] <= 11.9), 'SOFA_Liver'] = 3
    labs_withO2.loc[(labs_withO2['total_bilirubin'] > 12), 'SOFA_Liver'] = 4

    # Respiration: score from PaO2/FiO2; later rules overwrite earlier ones,
    # and scores 3 and 4 additionally require a ventilator entry
    labs_withO2.loc[(labs_withO2['paO2_FiO2'] >= 400), 'SOFA_Respiration'] = 0
    labs_withO2.loc[(labs_withO2['paO2_FiO2'] < 400), 'SOFA_Respiration'] = 1
    labs_withO2.loc[(labs_withO2['paO2_FiO2'] < 300), 'SOFA_Respiration'] = 2
    labs_withO2.loc[((labs_withO2['paO2_FiO2'] < 200) & (labs_withO2['nursingchartvalue'] == 'ventilator')), 'SOFA_Respiration'] = 3
    labs_withO2.loc[((labs_withO2['paO2_FiO2'] < 100) & (labs_withO2['nursingchartvalue'] == 'ventilator')), 'SOFA_Respiration'] = 4

    # Renal: score from creatinine and urinary creatinine
    labs_withO2.loc[((labs_withO2['creatinine'] >= 0) & (labs_withO2['creatinine'] <= 1.1)), 'SOFA_Renal'] = 0
    labs_withO2.loc[((labs_withO2['creatinine'] >= 1.2) & (labs_withO2['creatinine'] <= 1.9)), 'SOFA_Renal'] = 1
    labs_withO2.loc[((labs_withO2['creatinine'] >= 2) & (labs_withO2['creatinine'] <= 3.4)), 'SOFA_Renal'] = 2
    labs_withO2.loc[((labs_withO2['creatinine'] >= 3.5) & (labs_withO2['creatinine'] <= 4.9)) | (labs_withO2['urinary_creatinine'] < 200), 'SOFA_Renal'] = 3
    labs_withO2.loc[(labs_withO2['creatinine'] > 5) | (labs_withO2['urinary_creatinine'] < 200), 'SOFA_Renal'] = 4
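Threshold rules like the liver score above can also be written as a single binning step. The sketch below is an alternative formulation under stated assumptions, not the repository's code: `score_from_thresholds` is a hypothetical helper, the sample values are made up, and each band is treated as half-open (for example 1.2 <= bilirubin < 2.0), so values such as 1.95 or exactly 12.0 still receive a score, which differs slightly from the snippet above.

```python
import numpy as np
import pandas as pd

# Lower bounds of SOFA liver scores 1 to 4 (total bilirubin),
# using the same cut points as the snippet above.
BILIRUBIN_BOUNDS = [1.2, 2.0, 6.0, 12.0]


def score_from_thresholds(values, bounds):
    """Map each value to the number of lower bounds it has reached."""
    x = np.asarray(values, dtype=float)
    # np.digitize returns i such that bounds[i-1] <= x < bounds[i].
    scores = np.digitize(x, bounds).astype(float)
    # Missing lab values stay missing instead of being scored.
    scores[np.isnan(x)] = np.nan
    return scores


labs = pd.DataFrame({"total_bilirubin": [0.8, 1.2, 1.95, 6.0, 13.5, np.nan]})
labs["SOFA_Liver"] = score_from_thresholds(labs["total_bilirubin"], BILIRUBIN_BOUNDS)
print(labs)
```

Keeping the cut points in one list makes it easier to check them against the scoring rubric, at the cost of being less explicit than one `.loc` assignment per band.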
A five-fold cross-validated model was developed using XGBClassifier. The area under the ROC curve (AUROC) depends on the prediction window. The AUROC for the training set was higher than for the testing set. The average testing AUROC at 2 hours prior to sepsis onset was 0.86, and the AUROC decreases as the prediction window moves further from the time of sepsis onset.
The average cross-validated testing recall and precision for the sepsis class were 73% and 84%, respectively, 2 hours before sepsis onset, and the overall F1-score was 79.5%. The following provides the precision, recall and F1-score for each of the time intervals before sepsis onset.
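For reference, precision, recall and F1-score relate to the confusion counts as in the sketch below. The helper name `precision_recall_f1` and the counts are made up for illustration and are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive (sepsis) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 80 true positives, 20 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(p, r, f1)
```

When metrics are averaged over cross-validation folds, the mean F1-score is generally not equal to the F1-score computed from the mean precision and mean recall.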
Here we compare the XGBoost F1-Score with the other machine learning methods (RF: Random Forest; LR: Logistic Regression; GNB: Gaussian Naïve Bayes).
NOTE: All of the model statistics are specific to the eICU database. A different database might produce better or worse results, and hyperparameter tuning will be required.
- antibiotics.py
  - get_antibiotics()
    - Parameters: medication_table, treatment_table, microlab_table in the format of the eICU dataset.
    - Return: a table with patients fulfilling the suspicion criteria and their max time of suspicion.
- gcs_extract.py
  - extract_GCS_withSOFA()
    - Parameters: nurseCharting_table in the format of the eICU dataset.
    - Return: a table with the patients' SOFA scores based on the GCS score.
  - extract_GCS()
    - Parameters: nurseCharting_table in the format of the eICU dataset.
    - Return: a table with the GCS scores of each admitted patient over the duration of the admission.
  - extract_MAP()
    - Parameters: nurseCharting_table in the format of the eICU dataset.
    - Return: a table with the Mean Arterial Pressure values of each admitted patient over the duration of the admission.
  - extract_VENT()
    - Parameters: nurseCharting_table in the format of the eICU dataset.
    - Return: a table with the ventilator details of each admitted patient over the duration of the admission.
- labs_extract.py
  - extract_lab_format()
    - Parameters: lab_table and respiratoryCharting_table in the format of the eICU dataset, and the ventilator details in the format of the extract_VENT() output.
    - Return: a table with all the lab values in columns for every patient, along with the ventilator details to check for O2.
  - calc_lab_sofa()
    - Parameters: a table in the format of the extract_lab_format() return value.
    - Return: a table with the SOFA scores related to lab values.
- vasopressor_extract.py
  - extract_drugrates()
    - Parameters: infusionDrug_table in the format of the eICU dataset.
    - Return: a table with all the SOFA-related vasopressors, like Dopamine, Norepinephrine etc. The units are separated into a different column so they can be normalized later.
  - incorporate_weights()
    - Parameters: a filtered table of SOFA-related vasopressors (in the format of the extract_drugrates() output), and patient_table in the format of the eICU dataset.
    - Return: a table containing normalized and weighted results.
  - add_separate_cols()
    - Parameters: a normalized table of vasopressors (in the format of the incorporate_weights() output).
    - Return: a table containing normalized and weighted results, with the drug names segregated into different columns for SOFA calculations.
  - calc_SOFA()
    - Parameters: a table in the format of the add_separate_cols() output.
    - Return: a table with the SOFA scores of the cardiovascular parameters.
- sepsis_calc.py
  - calc_tsepsis()
    - Parameters: lab table with SOFA (return value of calc_lab_sofa() from labs_extract), vasopressors table with SOFA (return value of calc_SOFA() from vasopressor_extract), GCS table with SOFA (return value of extract_GCS_withSOFA() from gcs_extract), table with tsuspicion (return value of get_antibiotics() from antibiotics).
    - Return: a table of patients with their time of sepsis onset.
- merge_final_table.py
  - merge_final()
    - Parameters: GCS_scores table (return value of extract_GCS() from gcs_extract), labs_morevars (return value of extract_lab_format() from labs_extract), drugrate_norm_updated (return value of calc_SOFA() from vasopressor_extract), tsus_max (return value of get_antibiotics() from antibiotics), tsepsis_table (return value of calc_tsepsis() from sepsis_calc), vitals_table (vitals table in the format of the eICU dataset).
    - Return: a table ready for training, with all features in columns.
- sepsisprediction.py
  - feature_fun()
    - Parameters: column name, dataframe.
    - Return: values of the 7 descriptive features for the given column name in the dataframe (the features listed above under feature extraction).
  - process()
    - Parameters: merged_table (return value of merge_final() from merge_final_table), index, time_prior (how much time before true onset, in hours), time_duration (duration of time used for training data, in hours).
    - Return: nothing; a CSV file is written to the working directory.
  - case_preprocess()
    - Parameters: training_data after process(), concatenated into a single dataframe (check main.py for reference).
    - Return: a table containing all septic patients.
  - control_preprocess()
    - Parameters: training_data after process(), concatenated into a single dataframe (check main.py for reference).
    - Return: a table containing all control patients.
  - get_controls()
    - Parameters: controls_table (return value of control_preprocess()).
    - Return: a downsampled dataframe split by train_test_split().
  - run_xgboost()
    - Parameters: num_runs, sepsis_training, sepsis_, sepsis_y_cv, control_train, x_crossval, y_crossval.
    - Return: an XGBoost model run for num_runs iterations using the partial fit method, along with the AUROC values.
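The control downsampling mentioned for get_controls() might look roughly like the sketch below. This is an assumption-heavy illustration rather than the repository's implementation: the helper `downsample_controls`, the 1:1 case-to-control ratio, and the `label` column are hypothetical, and the patient IDs are made up.

```python
import pandas as pd


def downsample_controls(cases, controls, ratio=1.0, seed=0):
    """Randomly keep about ratio * len(cases) control rows."""
    n = min(len(controls), int(round(len(cases) * ratio)))
    return controls.sample(n=n, random_state=seed)


# Made-up tables: 10 septic patients and 60 controls.
cases = pd.DataFrame({"patientunitstayid": range(100, 110), "label": 1})
controls = pd.DataFrame({"patientunitstayid": range(200, 260), "label": 0})

sampled = downsample_controls(cases, controls)
balanced = pd.concat([cases, sampled], ignore_index=True)
print(balanced["label"].value_counts())
```

Balancing the classes this way keeps the classifier from being dominated by the much larger control group, though the right ratio depends on the dataset.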
- Ronet Swaminathan (Author)
- Aditya Singh (Author)
- Dr. Akram Mohammed (Mentor, Maintainer)
- Dr. Rishikesan Kamaleswaran (Mentor)