research-article

Open access

MIMIC-Extract: a data extraction, preprocessing, and representation pipeline for MIMIC-III

Authors:

Matthew B. A. McDermott,

Geeticka Chauhan,

Marzyeh Ghassemi,

Michael C. Hughes,

Tristan Naumann

CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning

Pages 222 - 235

https://doi.org/10.1145/3368555.3384469

Published: 02 April 2020

Abstract

Researchers in machine learning for healthcare face challenges to progress and reproducibility due to a lack of standardized processing frameworks for public datasets. We present MIMIC-Extract, an open source pipeline for transforming the raw electronic health record (EHR) data of critical care patients from the publicly available MIMIC-III database into data structures that are directly usable in common time-series prediction pipelines. MIMIC-Extract addresses three challenges in making complex EHR data accessible to the broader machine learning community. First, MIMIC-Extract transforms raw vital sign and laboratory measurements into usable hourly time series, performing essential steps such as unit conversion, outlier handling, and aggregation of semantically similar features to reduce missingness and improve robustness. Second, MIMIC-Extract extracts clinically relevant prediction targets, including outcomes such as mortality and length-of-stay as well as comprehensive hourly intervention signals for ventilators, vasopressors, and fluid therapies. Finally, the pipeline emphasizes reproducibility and extensibility to future research questions. We demonstrate the pipeline's effectiveness by developing several benchmark tasks for outcome and intervention forecasting and assessing the performance of competitive models.
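The preprocessing steps named in the abstract (unit conversion, outlier handling, aggregation of semantically similar features, and hourly bucketing) can be illustrated with a minimal sketch. It is not taken from the MIMIC-Extract code base: the event tuples, the `FEATURE_GROUPS` mapping, the `VALID_RANGE` bounds, and the function names are all hypothetical, and dropping out-of-range values is only one possible outlier policy.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events: (hours since ICU admission, raw item label, value, unit).
RAW_EVENTS = [
    (0.2, "Temperature F", 98.6, "F"),
    (0.7, "Temperature C", 37.5, "C"),
    (1.1, "Temperature C", 370.0, "C"),  # implausible reading, e.g. a data-entry error
    (1.4, "Heart Rate", 88.0, "bpm"),
    (1.9, "HR", 92.0, "bpm"),
]

# Assumed grouping of semantically similar raw labels into one feature.
FEATURE_GROUPS = {
    "Temperature F": "temperature",
    "Temperature C": "temperature",
    "Heart Rate": "heart_rate",
    "HR": "heart_rate",
}

# Assumed plausibility bounds in canonical units; illustrative values only.
VALID_RANGE = {"temperature": (25.0, 45.0), "heart_rate": (0.0, 350.0)}


def to_canonical(value, unit):
    """Convert Fahrenheit to Celsius; pass other units through unchanged."""
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    return value


def hourly_series(events):
    """Return {(hour, feature): mean value} after conversion and outlier removal."""
    buckets = defaultdict(list)
    for hours, label, value, unit in events:
        feature = FEATURE_GROUPS[label]
        value = to_canonical(value, unit)
        low, high = VALID_RANGE[feature]
        if not (low <= value <= high):
            continue  # treat out-of-range values as missing
        # int() floors non-negative offsets, giving the hour bucket index.
        buckets[(int(hours), feature)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


series = hourly_series(RAW_EVENTS)
```

In this sketch the two temperature labels collapse into one feature, so hour 0 has a single averaged temperature in Celsius, while the implausible reading in hour 1 leaves that hour's temperature missing rather than distorting the mean.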

Index Terms

1. Applied computing
  1. Life and medical sciences
    1. Health care information systems
    2. Health informatics

Information & Contributors

Information

Published In

CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning

April 2020

265 pages

ISBN:9781450370462

DOI:10.1145/3368555

General Chair:
Marzyeh Ghassemi
University of Toronto and the Vector Institute

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Microsoft Research
NSERC Discovery Grant
a CIFAR AI chair at Vector Institute
National Institutes of Health (NIH): National Institute of Mental Health (NIMH)
Wistron Corporation
NSF Projects
Mitacs Globalink Research Fellowship

Conference

ACM CHIL '20

Sponsor:

ACM

ACM CHIL '20: ACM Conference on Health, Inference, and Learning

April 2 - 4, 2020

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 27 of 110 submissions, 25%

Bibliometrics

Total Citations: 81

Total Downloads: 5,691

Downloads (Last 12 months): 2,070

Downloads (Last 6 weeks): 376

Reflects downloads up to 21 Nov 2024

Citations

Cited By

Ma M, Wang M, Gao B, Li Y, Huang J, Chen H (2024). Research on Multimodal Fusion of Temporal Electronic Medical Records. Bioengineering 11:1 (94). Online publication date: 18-Jan-2024.
https://doi.org/10.3390/bioengineering11010094
Miriyala G, Sinha A (2024). PSO-XnB: a proposed model for predicting hospital stay of CAD patients. Frontiers in Artificial Intelligence 7. Online publication date: 3-May-2024.
https://doi.org/10.3389/frai.2024.1381430
Yang R, Zeng Q, You K, Qiao Y, Huang L, Hsieh C, Rosand B, Goldwasser J, Dave A, Keenan T, Ke Y, Hong C, Liu N, Chew E, Radev D, Lu Z, Xu H, Chen Q, Li I (2024). Ascle—A Python Natural Language Processing Toolkit for Medical Text Generation: Development and Evaluation Study. Journal of Medical Internet Research 26 (e60601). Online publication date: 3-Oct-2024.
https://doi.org/10.2196/60601
Liao W, Voldman J (2024). Learning and diSentangling patient static information from time-series Electronic hEalth Records (STEER). PLOS Digital Health 3:10 (e0000640). Online publication date: 21-Oct-2024.
https://doi.org/10.1371/journal.pdig.0000640
Giesa N, Haufe S, Menk M, Weiß B, Spies C, Piper S, Balzer F, Boie S (2024). Predicting postoperative delirium assessed by the Nursing Screening Delirium Scale in the recovery room for non-cardiac surgeries without craniotomy: A retrospective study using a machine learning approach. PLOS Digital Health 3:8 (e0000414). Online publication date: 14-Aug-2024.
https://doi.org/10.1371/journal.pdig.0000414
Song Y, Huang H, Ma J, Xing R, Song Y, Li L, Zhou J, Ou C (2024). Early prediction of sepsis in emergency department patients using various methods and scoring systems. Nursing in Critical Care. Online publication date: 25-Oct-2024.
https://doi.org/10.1111/nicc.13201
Nie W, Yu Y, Zhang C, Song D, Zhao L, Bai Y (2024). Temporal-Spatial Correlation Attention Network for Clinical Data Analysis in Intensive Care Unit. IEEE Transactions on Biomedical Engineering 71:2 (583-595). Online publication date: Feb-2024.
https://doi.org/10.1109/TBME.2023.3309956
Chen L, Hung K, Tseng Y, Wang H, Lu T, Huang W, Tsao Y (2024). Self-Supervised Learning-Based General Laboratory Progress Pretrained Model for Cardiovascular Event Detection. IEEE Journal of Translational Engineering in Health and Medicine 12 (43-55). Online publication date: 2024.
https://doi.org/10.1109/JTEHM.2023.3307794
Xu Z, Guo J, Qin L, Xie Y, Xiao Y, Lin X, Li Q, Li X (2024). Predicting ICU Interventions: A Transparent Decision Support Model Based on Multivariate Time Series Graph Convolutional Neural Network. IEEE Journal of Biomedical and Health Informatics 28:6 (3709-3720). Online publication date: Jun-2024.
https://doi.org/10.1109/JBHI.2024.3379998
Hur K, Oh J, Kim J, Kim J, Lee M, Cho E, Moon S, Kim Y, Atallah L, Choi E (2024). GenHPF: General Healthcare Predictive Framework for Multi-Task Multi-Source Learning. IEEE Journal of Biomedical and Health Informatics 28:1 (502-513). Online publication date: Jan-2024.
https://doi.org/10.1109/JBHI.2023.3327951
