DOI: 10.1145/3368555.3384469
research-article
Open access

MIMIC-Extract: a data extraction, preprocessing, and representation pipeline for MIMIC-III

Published: 02 April 2020

Abstract

Researchers in machine learning for healthcare face obstacles to progress and reproducibility because public datasets lack standardized processing frameworks. We present MIMIC-Extract, an open source pipeline for transforming the raw electronic health record (EHR) data of critical care patients from the publicly available MIMIC-III database into data structures that are directly usable in common time-series prediction pipelines. MIMIC-Extract addresses three challenges in making complex EHR data accessible to the broader machine learning community. First, MIMIC-Extract transforms raw vital sign and laboratory measurements into usable hourly time series, performing essential steps such as unit conversion, outlier handling, and aggregation of semantically similar features to reduce missingness and improve robustness. Second, MIMIC-Extract extracts clinically relevant prediction targets, including outcomes such as mortality and length-of-stay as well as comprehensive hourly intervention signals for ventilators, vasopressors, and fluid therapies. Finally, the pipeline emphasizes reproducibility and extensibility to future research questions. We demonstrate the pipeline's effectiveness by developing several benchmark tasks for outcome and intervention forecasting and assessing the performance of competitive models.
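The first step in the abstract (unit conversion, outlier handling, aggregation of semantically similar features, hourly bucketing) can be illustrated with a minimal sketch. This is not MIMIC-Extract's code: the label mapping, the plausibility bounds, the event tuple layout, and the function names `to_canonical` and `hourly_series` are all assumptions made for illustration, and dropping out-of-range values is only one possible outlier policy.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from raw chart labels to one canonical feature name.
CANONICAL = {
    "Heart Rate": "heart_rate",
    "HR": "heart_rate",
    "Temperature F": "temperature_c",
    "Temperature C": "temperature_c",
}

# Hypothetical plausibility bounds (low, high) per canonical feature.
VALID_RANGE = {
    "heart_rate": (0.0, 350.0),
    "temperature_c": (14.0, 47.0),
}


def to_canonical(label, value):
    """Map a raw label to its canonical feature and convert units if needed."""
    if label == "Temperature F":
        value = (value - 32.0) * 5.0 / 9.0  # Fahrenheit to Celsius
    return CANONICAL[label], value


def hourly_series(events):
    """Average measurements per (hour, feature).

    events: iterable of (hours_since_admission, raw_label, value) tuples,
    an assumed layout chosen for this sketch.
    """
    buckets = defaultdict(list)
    for hours, label, value in events:
        feature, value = to_canonical(label, value)
        low, high = VALID_RANGE[feature]
        if not low <= value <= high:
            continue  # assumed policy: treat implausible values as missing
        buckets[(math.floor(hours), feature)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


events = [
    (0.2, "Heart Rate", 80.0),
    (0.7, "HR", 90.0),              # same feature under a different label
    (0.9, "HR", 999.0),             # implausible, dropped
    (1.5, "Temperature F", 98.6),   # converted to Celsius
    (1.8, "Temperature C", 37.0),
]
series = hourly_series(events)
```

Here `series` is keyed by `(hour, feature)`, so the two heart-rate labels collapse into one hourly value and both temperature readings land in the same Celsius bucket. Hours with no valid measurement are simply absent, which leaves the choice of imputation to a later step.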


      Published In

      CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning
      April 2020
      265 pages
      ISBN: 9781450370462
      DOI: 10.1145/3368555

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Healthcare
      2. MIMIC-III
      3. Machine learning
      4. Reproducibility
      5. Time series data

      Qualifiers

      • Research-article

      Funding Sources

      • Microsoft Research
      • NSERC Discovery Grant
      • a CIFAR AI chair at Vector Institute
      • National Institutes of Health (NIH): National Institute of Mental Health (NIMH)
      • Wistron Corporation
      • NSF Projects
      • Mitacs Globalink Research Fellowship

      Conference

      ACM CHIL '20
      Acceptance Rates

      Overall Acceptance Rate 27 of 110 submissions, 25%
