Abstract
Objective
Development of electronic health records (EHR)-based machine learning models for pediatric inpatients is challenged by limited training data. Self-supervised learning using adult data may be a promising approach to creating robust pediatric prediction models. The primary objective was to determine whether a self-supervised model trained in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients, for pediatric inpatient clinical prediction tasks.
Materials and Methods
This retrospective cohort study used EHR data and included patients with at least one admission to an inpatient unit. One admission per patient was randomly selected. Adult inpatients were 18 years or older, while pediatric inpatients were older than 28 days and younger than 18 years. Admissions were temporally split into training (January 1, 2008 to December 31, 2019), validation (January 1, 2020 to December 31, 2020), and test (January 1, 2021 to August 1, 2022) sets. The primary comparison was a self-supervised model trained in adult inpatients versus count-based logistic regression models trained in pediatric inpatients. The primary outcome was mean area-under-the-receiver-operating-characteristic-curve (AUROC) for 11 distinct clinical outcomes. Models were evaluated in pediatric inpatients.
Results
When evaluated in pediatric inpatients, the mean AUROC of the self-supervised model trained in adult inpatients (0.902) was noninferior to that of count-based logistic regression models trained in pediatric inpatients (0.868) (mean difference = 0.034, 95% CI = 0.014-0.057; P < .001 for noninferiority and P = .006 for superiority).
Conclusions
Self-supervised learning in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients. This finding suggests transferability of self-supervised models trained in adult patients to pediatric patients, without requiring costly model retraining.
Keywords: electronic health records, transfer learning, model robustness, machine learning, self-supervised learning, foundation model
Introduction
The recent uptake of electronic health records (EHR) at clinical institutions has facilitated increased use of machine learning-based clinical prediction models to improve patient outcomes.1–7 However, the development of effective models can be challenging for institutions with limited resources,8 in specialized populations for which less data are available for training,9 and for rare clinical conditions.10 This work focuses on pediatric inpatient populations, which often have limited training data and lower outcome rates compared to adult inpatients, making machine learning for pediatric inpatients challenging.
Development of EHR-based clinical prediction models for pediatric populations has largely relied on traditional machine learning approaches that use custom feature engineering (eg, extracting specific data elements or counting the medical codes recorded within a time span11) to create training datasets.11–15 While there has been some research focused on using adult data to train machine learning models intended for pediatric populations (eg, mixing adult and pediatric data to train models using the traditional approach),16 it has been well described that models developed in one institution or patient population may not perform well in another patient population or institution due to dataset shift.17–19
Recent advances in deep learning have demonstrated the utility of self-supervised learning20 in settings that include limited training data. Self-supervised learning occurs in a phase called pretraining, which allows the model to discover informative patterns in the data (without needing labels) and encode that information into its parameters. The resulting pretrained model can then be used as a starting point to train models for new tasks (task-specific training using labeled data) in a process called transfer learning. We used a self-supervised pretraining approach called clinical language model based representations (CLMBR),21 which formulates patient timelines as sequences of clinical events ordered by time, where each event is represented by a medical code. This formulation enables a self-supervised objective in which the model attempts to predict the next day’s codes given all previous codes, thereby allowing CLMBR to learn informative patterns from patient timelines. Here, “language model based” refers to an architecture suited for sequentially ordered data; natural language processing is not a component of CLMBR. CLMBR is compatible with multiple underlying deep learning architectures, and for these experiments we used a recurrent neural network. While CLMBR has demonstrated improved performance,21 sample efficiency (ie, requiring less training data),21 and robustness to temporal dataset shift,22 its application in pediatric populations is largely unknown. We hypothesized that pretraining with task-specific training in adult inpatients (in other words, training performed using only adult data) would be noninferior to traditional machine learning models trained using pediatric inpatients on inpatient clinical prediction tasks. Self-supervised pretraining on adult patients could provide a scalable approach towards developing clinical prediction models for pediatric populations, especially in pediatric centers without the patient population or resources to train their own models.
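To make the pretraining objective concrete, the sketch below illustrates next-day code prediction over a patient timeline with a recurrent neural network in PyTorch. It is a minimal illustration of the idea rather than the authors’ CLMBR implementation; the vocabulary size, embedding dimensions, and the use of summed code embeddings per day are assumptions made for the example.

```python
# Minimal sketch (not the authors' CLMBR code) of a next-day code prediction
# objective over a patient timeline, using a GRU in PyTorch.
import torch
import torch.nn as nn

VOCAB_SIZE = 5000            # hypothetical number of distinct medical codes
EMB_DIM, HIDDEN_DIM = 128, 256

class NextDayCodeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Each day is summarized by summing the embeddings of that day's codes.
        self.code_emb = nn.EmbeddingBag(VOCAB_SIZE, EMB_DIM, mode="sum")
        self.rnn = nn.GRU(EMB_DIM, HIDDEN_DIM, batch_first=True)
        self.head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)   # multi-label logits

    def forward(self, day_code_ids, day_offsets):
        # day_code_ids/day_offsets: flattened codes per day for one patient.
        day_vecs = self.code_emb(day_code_ids, day_offsets)   # (n_days, EMB_DIM)
        hidden, _ = self.rnn(day_vecs.unsqueeze(0))           # (1, n_days, HIDDEN_DIM)
        return self.head(hidden.squeeze(0))                   # (n_days, VOCAB_SIZE)

# Self-supervised objective: logits from day t predict the codes recorded on day t+1.
model = NextDayCodeModel()
codes_by_day = [[10, 42, 42, 7], [10, 99], [3]]               # toy patient timeline
flat = torch.tensor([c for day in codes_by_day for c in day])
offsets = torch.tensor([0, 4, 6])                             # start index of each day
logits = model(flat, offsets)

targets = torch.zeros(len(codes_by_day) - 1, VOCAB_SIZE)
for t, day in enumerate(codes_by_day[1:]):
    targets[t, day] = 1.0                                     # multi-hot next-day codes
loss = nn.BCEWithLogitsLoss()(logits[:-1], targets)
loss.backward()
```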
The primary objective was to determine whether pretraining with task-specific training in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients, for pediatric inpatient clinical prediction tasks. Secondary objectives were to determine if pretraining with task-specific training in pediatric inpatients was better than pretraining with task-specific training in adult inpatients, when evaluated in pediatric inpatients; and to determine if pretraining with task-specific training in adult inpatients performs better when evaluated in adult inpatients (in-distribution [ID] performance) than pediatric inpatients (out-of-distribution [OOD] performance).
Methods
Data source
This was a retrospective cohort study that used data from the STAnford medicine Research data Repository (STARR).23 Data in STARR are routinely collected in the EHR of Stanford Medicine, which includes Stanford Health Care (primarily adult-directed care) and Lucile Packard Children’s Hospital (primarily pediatric-directed care). Each institution has its own separate instance of the EHR; these are brought together in STARR. These data are mapped to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM),24,25 resulting in STARR-OMOP. This study used a deidentified version of STARR-OMOP covering data collected from January 1, 2008 to August 1, 2022. Access to data was restricted to affiliates of Stanford and The Hospital for Sick Children and was subject to a data use agreement with Stanford Medicine.23
Cohorts
Figure 1 illustrates the assignment of patients into adult and pediatric inpatient cohorts and reasons for exclusion. We included patients with at least one admission to an inpatient unit within the study time frame. Adult inpatients were those 18 years or older at admission. Pediatric inpatients were those older than 28 days and younger than 18 years at admission. For patients with multiple admissions, one was randomly selected as the index admission.
We excluded admissions if patient death or discharge occurred between admission and prediction time. For each task, we also excluded admissions if the task-specific outcome occurred between admission and prediction time.
Outcomes and prediction time
We defined 11 binary outcomes for each inpatient admission. The outcomes were hospital mortality, sepsis,26,27 long length of stay (long LOS), readmission within 30 days of discharge (30-day readmission), acute kidney injury,28 hyperkalemia,29 hypoglycemia,30 hyponatremia,31 neutropenia,32 anemia,33 and thrombocytopenia. These outcomes were chosen because they are clinically relevant and we had previously validated the laboratory-based outcomes (definitions provided in Appendix S1).34 The laboratory-based outcomes were defined using commonly used thresholds for the severely abnormal category in the main experiment and the mildly abnormal category in a sensitivity analysis. Prediction time was midnight on the day of discharge for 30-day readmission, and midnight on the day of admission for all other tasks. The prediction window extended until discharge for all tasks other than readmission, which used a 30-day window from the date of discharge.
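As an illustration of these rules, the sketch below assigns a prediction time and labeling window to a single admission. The field names, and the interpretation of “midnight” as the midnight at the end of the relevant day, are assumptions made for the example rather than the study’s exact implementation.

```python
# Minimal sketch (assumed field names) of prediction times and label windows.
from datetime import datetime, timedelta

def prediction_window(admit_time, discharge_time, task):
    """Return (prediction_time, window_start, window_end) for one admission."""
    if task == "30_day_readmission":
        # Predict at midnight ending the day of discharge (assumed interpretation);
        # the outcome window is the 30 days following the date of discharge.
        pred_time = (discharge_time + timedelta(days=1)).replace(
            hour=0, minute=0, second=0, microsecond=0)
        return pred_time, discharge_time, discharge_time + timedelta(days=30)
    # All other tasks: predict at midnight ending the day of admission;
    # the outcome window extends from the prediction time until discharge.
    pred_time = (admit_time + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return pred_time, pred_time, discharge_time

def label(outcome_times, window_start, window_end):
    """Binary label: 1 if the outcome occurred inside the window, else 0."""
    return int(any(window_start <= t <= window_end for t in outcome_times))

admit, discharge = datetime(2021, 3, 1, 14, 30), datetime(2021, 3, 9, 11, 0)
pred_time, start, end = prediction_window(admit, discharge, "long_los")
y = label([datetime(2021, 3, 5)], start, end)   # 1: outcome occurred before discharge
```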
Patient representations
Patient representations (also sometimes referred to as features) are the inputs to task-specific models for training and inference. In this study, we used 2 approaches to create patient representations, namely count-based representations and CLMBR. Appendix S2 provides a detailed description for each approach.
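For intuition, the following is a minimal sketch of count-based representations: each patient is represented by the counts of medical codes observed before their prediction time. The column names and the absence of time binning are assumptions made for the example; the featurization actually used in the study is described in Appendix S2.

```python
# Minimal sketch (assumed column names) of count-based patient representations:
# counting each medical code recorded before the prediction time.
import pandas as pd
from scipy.sparse import csr_matrix

events = pd.DataFrame({
    "person_id":  [1, 1, 1, 2, 2],
    "code":       ["SNOMED/22298006", "LOINC/718-7", "LOINC/718-7",
                   "RxNorm/1049221", "LOINC/718-7"],
    "event_time": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-03",
                                  "2019-02-10", "2019-02-11"]),
})
prediction_times = pd.Series(
    pd.to_datetime(["2019-01-04", "2019-02-11"]), index=[1, 2], name="prediction_time"
)

# Keep only events observed before each patient's prediction time.
df = events.join(prediction_times, on="person_id")
df = df[df["event_time"] < df["prediction_time"]]

# One row per patient, one column per code, values = counts.
counts = pd.crosstab(df["person_id"], df["code"])
X = csr_matrix(counts.values)   # sparse feature matrix for downstream models
```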
Models
Table 1 lists the models and the cohorts used for training. Baseline models were trained using pediatric inpatients. Transfer learning models were trained using adult inpatients with or without some pediatric inpatient data. PedsBaseline used count-based representations to train task-specific logistic regression models in pediatric inpatients. To succinctly summarize the cohorts used for self-supervised learning, each of these models is denoted by an arrow in its name: pretraining was performed in the cohort before the arrow, and task-specific training was performed in the cohort after the arrow. For example, Adults→Peds refers to CLMBR pretraining in adult inpatients with task-specific training in pediatric inpatients.
Table 1.
| | PedsBaseline | Peds→Peds | Adults→Adults | Adults→Peds | Cont.Pretrain→Peds | Combined→Peds |
|---|---|---|---|---|---|---|
| Group | Baseline | Baseline | Transfer learning | Transfer learning | Transfer learning | Transfer learning |
| Pretraining | n/a | Pediatric inpatients | Adult inpatients | Adult inpatients | Adult inpatients | Adult and pediatric inpatients |
| Continued pretraining | n/a | n/a | n/a | n/a | Pediatric inpatients | n/a |
| Task-specific training | Pediatric inpatients | Pediatric inpatients | Adult inpatients | Pediatric inpatients | Pediatric inpatients | Pediatric inpatients |
The table depicts cohorts used for training. Baseline models were trained using pediatric inpatients. Transfer learning models were trained using adult inpatients with or without some pediatric inpatient data. PedsBaseline used count-based representations to train task-specific logistic regression models in pediatric inpatients. To succinctly summarize the cohorts used for self-supervised learning, each of these models is denoted by an arrow in its name: pretraining was performed in the cohort before the arrow, and task-specific training was performed in the cohort after the arrow. For example, Adults→Peds refers to pretraining in adult inpatients with task-specific training in pediatric inpatients.
Abbreviation: n/a, not applicable.
Baseline models
For the primary analysis, the comparator approach used count-based representations to train task-specific logistic regression models in pediatric inpatients (PedsBaseline). As a sensitivity analysis for the primary analysis, we also used count-based representations to train task-specific gradient boosted machine (GBM) models in pediatric inpatients.
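A minimal sketch of how such count-based baselines could be fit is shown below, using scikit-learn for both the logistic regression and a gradient boosted machine; the hyperparameters and synthetic data are purely illustrative, and the study’s actual model configuration is described in Appendix S4.

```python
# Minimal sketch of the count-based baselines: a logistic regression
# (PedsBaseline) and, as a sensitivity analysis, a gradient boosted machine.
# Hyperparameters and data here are illustrative, not the study's.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train, y_train = rng.poisson(1.0, size=(500, 50)), rng.integers(0, 2, 500)
X_test, y_test = rng.poisson(1.0, size=(200, 50)), rng.integers(0, 2, 200)

logreg = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X_train, y_train)
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3).fit(X_train, y_train)

for name, model in [("logistic regression", logreg), ("GBM", gbm)]:
    auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUROC={auroc:.3f}")
```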
For secondary analysis, we evaluated pretraining with task-specific training in pediatric inpatients (Peds→Peds).
Models for transfer learning
The experimental approach in the primary analysis did not use pediatric inpatients for training. This corresponds to a scenario where a model previously developed on adult inpatients is implemented directly in pediatric inpatients. More specifically, the experimental approach used pretraining with task-specific training in adult inpatients (Adults→Adults).
For exploratory analysis, we wanted to determine whether pretraining in adult inpatients might benefit from differential exposure to pediatric inpatient data. Three approaches were evaluated. The first approach used pretraining in adult inpatients with task-specific training in pediatric inpatients (Adults→Peds). The second approach used pretraining in adult inpatients with continued pretraining and task-specific training in pediatric inpatients (Cont.Pretrain→Peds). The third approach used pretraining in combined adult and pediatric inpatients with task-specific training in pediatric inpatients (Combined→Peds).
Transfer learning experiments
The specific hypothesis, experimental approach, and comparator approach for primary, secondary, and exploratory analyses are described in Table 3. Details of each model are included in the Models section above.
Table 3.

| Analysis and hypothesis | Experimental approach (mean AUROC) | Comparator approach (mean AUROC) | Mean difference [95% CI] | P-value | Null hypothesis |
|---|---|---|---|---|---|
| Primary analysis | | | | | |
| Hypothesis: When evaluated in pediatric inpatients, pretraining with task-specific training in adult inpatients is noninferior to logistic regression models trained in pediatric inpatients (target=pediatric inpatients) | Adults→Adults: 0.902 | PedsBaseline: 0.868 | 0.034 [0.014-0.057] | <.001a; .006b | Reject |
| Secondary analyses | | | | | |
| Hypothesis: When evaluated in pediatric inpatients, pretraining with task-specific training in pediatric inpatients is better than pretraining with task-specific training in adult inpatients (target=pediatric inpatients) | Peds→Peds: 0.910 | Adults→Adults: 0.902 | 0.008 [−0.009 to 0.029] | .394b | Do not reject |
| Hypothesis: Pretraining with task-specific training in adult inpatients performs better when evaluated in adult inpatients (ID) than pediatric inpatients (OOD) (target=adult inpatients [ID] and pediatric inpatients [OOD]) | Adults→Adults (ID): 0.921 | Adults→Adults (OOD): 0.902 | 0.019 [0.000-0.040] | .047b | Reject |
| Exploratory analyses | | | | | |
| Hypothesis: When evaluated in pediatric inpatients, pretraining in adult inpatients with task-specific training in pediatric inpatients is better than pretraining with task-specific training only using adult inpatients (target=pediatric inpatients) | Adults→Peds: 0.920 | Adults→Adults: 0.902 | 0.019 [0.003-0.035] | .02b | Reject |
| Hypothesis: When evaluated in pediatric inpatients, pretraining in adult inpatients with continued pretraining and task-specific training in pediatric inpatients is better than pretraining with task-specific training only using adult inpatients (target=pediatric inpatients) | Cont.Pretrain→Peds: 0.927 | Adults→Adults: 0.902 | 0.025 [0.009-0.044] | .006b | Reject |
| Hypothesis: When evaluated in pediatric inpatients, pretraining in combined adult and pediatric inpatients with task-specific training in pediatric inpatients is better than pretraining with task-specific training only using adult inpatients (target=pediatric inpatients) | Combined→Peds: 0.925 | Adults→Adults: 0.902 | 0.023 [0.004-0.044] | .016b | Reject |

a P-value for noninferiority with a margin of 5%, conducted using bootstrapping.
b P-value for superiority, conducted using bootstrapping.
Abbreviations: AUROC, area under the receiver operating characteristic curve; CI, confidence interval; ID, in-distribution; OOD, out-of-distribution.
Primary analysis
The primary analysis compared pretraining with task-specific training in adult inpatients (Adults→Adults) versus logistic regression models trained in pediatric inpatients (PedsBaseline), when evaluated in pediatric inpatients.
Secondary analysis
We compared pretraining with task-specific training in pediatric inpatients (Peds→Peds) versus pretraining with task-specific training in adult inpatients (Adults→Adults), when evaluated in pediatric inpatients. We also compared Adults→Adults when evaluated in adult inpatients (ID), versus the same model evaluated in pediatric inpatients (OOD).
Exploratory analysis
The comparator approach was Adults→Adults for all exploratory analyses. Experimental approaches incorporated increasing amounts of pediatric data: Adults→Peds, Cont.Pretrain→Peds, and Combined→Peds.
Sensitivity analysis
Appendix S3 shows the sensitivity analyses for the primary analysis. First, we compared pretraining with task-specific training in adult inpatients (Adults→Adults) versus GBM models trained in pediatric inpatients (PedsBaseline[GBM]) instead of logistic regression models trained in pediatric inpatients (PedsBaseline). Second, for the primary analysis comparing pretraining with task-specific training in adult inpatients (Adults→Adults) versus logistic regression models trained in pediatric inpatients (PedsBaseline), we used the mildly abnormal category for laboratory-based outcomes instead of the severely abnormal category.
As another sensitivity analysis, we wanted to evaluate the contribution of specific concept domains (eg, measurement) to transfer learning performance. Focusing on Adults→Adults, we shuffled medical codes within one concept domain table at a time and computed the difference in area under the receiver operating characteristic curve (AUROC) between the model trained using the original data and the model trained using the shuffled data. Evaluations were performed in both adult inpatients (ID) and pediatric inpatients (OOD). Code shuffling retained the marginal frequency of codes.
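The sketch below illustrates one way such a shuffle could be implemented for a single OMOP domain table: permuting the code column breaks the association between codes and patients or times while leaving each code’s marginal frequency unchanged. The table and column names follow OMOP conventions, but the snippet is illustrative rather than the study’s code.

```python
# Minimal sketch of the shuffling ablation for one OMOP domain table:
# codes are permuted so marginal code frequencies are preserved.
import numpy as np
import pandas as pd

def shuffle_domain_codes(table: pd.DataFrame, code_col: str, seed: int = 0) -> pd.DataFrame:
    """Return a copy of one domain table with its codes permuted."""
    rng = np.random.default_rng(seed)
    shuffled = table.copy()
    # Permuting the existing code column keeps every code's overall count
    # unchanged but detaches codes from their original rows (patients/times).
    shuffled[code_col] = rng.permutation(shuffled[code_col].to_numpy())
    return shuffled

condition_occurrence = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "condition_concept_id": [201826, 201826, 381591, 4329847],
    "condition_start_date": pd.to_datetime(["2018-01-01", "2018-06-01",
                                            "2019-02-03", "2019-07-22"]),
})
shuffled = shuffle_domain_codes(condition_occurrence, "condition_concept_id")
# The difference in AUROC between models trained on the original vs shuffled
# table quantifies that domain's contribution to transfer learning performance.
```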
Data splitting procedure
We temporally split each cohort as follows: January 1, 2008 to December 31, 2019 for training task-specific models; January 1, 2020 to December 31, 2020 for validation (hyperparameter selection); and January 1, 2021 to August 1, 2022 for the test sets used in evaluation. For pretraining, encounter days between January 1, 2008 and December 31, 2019 were used for model training, and encounter days between January 1, 2020 and December 31, 2020 were used for hyperparameter selection. Only encounter days within the study time frame were used for pretraining.
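A minimal sketch of this temporal split, assuming a pandas DataFrame of admissions with an admission timestamp column, is shown below.

```python
# Minimal sketch (assumed column name) of the temporal train/validation/test split.
import pandas as pd

def temporal_split(admissions: pd.DataFrame, time_col: str = "admit_time"):
    """Split an admissions DataFrame into train/validation/test by calendar time."""
    t = admissions[time_col]
    train = admissions[t < "2020-01-01"]                           # Jan 1, 2008 - Dec 31, 2019
    valid = admissions[(t >= "2020-01-01") & (t < "2021-01-01")]   # Jan 1, 2020 - Dec 31, 2020
    test = admissions[(t >= "2021-01-01") & (t <= "2022-08-01")]   # Jan 1, 2021 - Aug 1, 2022
    return train, valid, test
```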
To prevent data leakage, patients in the validation and test sets of the adult inpatient cohort were excluded from pretraining for Adults→Adults. Patients in the validation and test sets of the pediatric inpatient cohort were excluded from pretraining for Peds→Peds. Patients in all validation and test sets were excluded from pretraining for Combined→Peds. In addition, for Adults→Adults, patients who had any encounter days when they were less than 18 years of age were excluded from pretraining.
Model development
Appendix S4 provides technical details for the development of count-based logistic regression models and self-supervised pretraining and task-specific training.
Model evaluation and statistical analyses
All evaluations were conducted on the pediatric inpatient test set with the exception of the ID evaluation of Adults→Adults, which was conducted on the adult inpatient test set. We evaluated the discrimination performance of each model for each outcome using AUROC. Calibration of transfer learning and baseline models for each outcome was assessed using expected calibration error (ECE) with 10 quantile bins of the predicted risks.
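The following is a minimal sketch of ECE computed with 10 quantile bins of the predicted risks; it is an illustrative implementation rather than necessarily the exact code used in the study.

```python
# Minimal sketch of expected calibration error (ECE) with quantile bins.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    # Quantile bins: each bin holds roughly the same number of predictions.
    edges = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, y_prob, side="right") - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return ece
```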
The primary analysis was to determine whether AUROC of Adults→Adults was noninferior to PedsBaseline. We used a noninferiority margin of 5% to be conservative.35 All other analyses determined whether the experimental approach was superior to the comparator approach. Statistical evaluations were conducted by bootstrapping differences between models using a hierarchical approach (sampling patients and outcomes with replacement).36 For each statistical evaluation, a 1-tailed P-value was computed as the proportion of bootstrapped differences that crossed a predefined threshold (depending on whether a noninferiority or superiority test was conducted). Since the differences represented 1-tailed statistical tests, a 2-tailed P-value was computed as 2 × min(P, 1 − P),37 while taking into account values that were equal to the threshold. For noninferiority tests the threshold used was the 5% margin, and for superiority tests the threshold used was 0. For all tests, a P-value of <.05 was considered statistically significant.
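To illustrate the bootstrap procedure, the sketch below resamples outcomes and patients with replacement, recomputes the mean AUROC difference, and converts the tail proportion into a 2-tailed P-value. The data layout, number of bootstrap replicates, and handling of degenerate resamples are assumptions made for the example, not the study’s exact implementation.

```python
# Minimal sketch of a hierarchical bootstrap comparison of two models.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_p_value(y, probs_a, probs_b, threshold=0.0, n_boot=1000, seed=0):
    """y, probs_a, probs_b: dicts mapping task -> arrays over the same patients."""
    rng = np.random.default_rng(seed)
    tasks = list(y)
    n_patients = len(next(iter(y.values())))
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        boot_tasks = rng.choice(tasks, size=len(tasks), replace=True)  # resample outcomes
        idx = rng.integers(0, n_patients, n_patients)                  # resample patients
        deltas = []
        for t in boot_tasks:
            yt = y[t][idx]
            if yt.min() == yt.max():       # skip degenerate resamples with one class
                continue
            deltas.append(roc_auc_score(yt, probs_a[t][idx]) -
                          roc_auc_score(yt, probs_b[t][idx]))
        diffs[i] = np.mean(deltas)
    # 1-tailed proportion of bootstrap differences at or below the threshold
    # (0 for superiority; -0.05, ie, the 5% noninferiority margin, for noninferiority).
    p_one = np.mean(diffs <= threshold)
    return min(2 * min(p_one, 1 - p_one), 1.0)
```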
Results
Figure 1 shows the flow diagram of cohort construction and the allocation of each cohort into training, validation, and test sets. There were 244 811 adult inpatients and 26 436 pediatric inpatients included. Table 2 shows patient demographic characteristics and outcome prevalence for the adult and pediatric inpatient cohorts.
Table 2.
| | Adult inpatients (n = 244 811) | Pediatric inpatients (n = 26 436) |
|---|---|---|
| Median age, years [IQR] | 56.1 [37.1, 69.8] | 5.6 [0.9, 13.1] |
| Sex, n (%) | | |
| Male | 105 013 (43.0%) | 15 592 (52.6%) |
| Race, n (%) | | |
| American Indian or Alaska Native | 658 (0.3%) | 50 (0.2%) |
| Asian | 38 570 (15.8%) | 4712 (17.8%) |
| Black or African American | 9519 (3.9%) | 650 (2.5%) |
| Hispanic or Latino | 43 689 (17.8%) | 9761 (36.9%) |
| Native Hawaiian or Pacific Islander | 2803 (1.1%) | 359 (1.4%) |
| White | 126 499 (51.7%) | 1843 (7.0%) |
| Other | 23 073 (9.4%) | 9061 (34.3%) |
| Outcomes,a n (%) | | |
| Hospital mortality | 5122 (2.1%) | 238 (0.9%) |
| Sepsis | 5843 (2.4%) | 635 (2.4%) |
| Long LOS | 48 704 (19.9%) | 5148 (19.5%) |
| 30-day readmission | 12 115 (5.0%) | 1601 (6.1%) |
| Acute kidney injury | 2815 (1.2%) | 282 (1.1%) |
| Hyperkalemia | 1301 (0.5%) | 517 (1.6%) |
| Hypoglycemia | 2940 (1.2%) | 348 (1.3%) |
| Hyponatremia | 2574 (1.1%) | 243 (0.9%) |
| Neutropenia | 1682 (0.7%) | 321 (1.1%) |
| Anemia | 11 503 (4.7%) | 1033 (3.9%) |
| Thrombocytopenia | 5543 (2.3%) | 570 (2.2%) |

a See Appendix S1 for outcome definitions.
Abbreviations: IQR, interquartile range; LOS, length of stay.
Table 3 shows that pretraining with task-specific training using only adult inpatients was noninferior and statistically significantly better than logistic regression models trained in pediatric inpatients in terms of AUROC. More specifically, the mean AUROC of Adults→Adults was noninferior to PedsBaseline (mean difference = 0.034, 95% CI = 0.014-0.057; P < .001 for noninferiority and P = .006 for superiority). Individual task-specific AUROCs of each model are displayed in Figure 2 for the primary analysis. The AUROC for Adults→Adults, when compared to PedsBaseline, was higher for several pediatric inpatient tasks, including long LOS, sepsis, anemia, and hyponatremia.
For secondary analysis, Table 3 shows that in terms of AUROC, pretraining with task-specific training in pediatric inpatients (Peds→Peds) was not significantly better than pretraining with task-specific training in adult inpatients (Adults→Adults), when evaluated in pediatric inpatients. Table 3 also shows that AUROC for Adults→Adults was significantly better when evaluated in adult inpatients (ID) than pediatric inpatients (OOD).
For exploratory analysis, Table 3 demonstrates that pretraining in adult inpatients with task-specific training in pediatric inpatients (Adults→Peds), continued pretraining and task-specific training in pediatric inpatients (Cont.Pretrain→Peds), and pretraining in combined adult and pediatric inpatients with task-specific training in pediatric inpatients (Combined→Peds) all displayed significantly better AUROC than Adults→Adults.
Appendix S3 shows results of sensitivity analyses for the primary analysis. When count-based GBM models were used as the comparator approach instead of count-based logistic regression models, pretraining with task-specific training in adult inpatients remained significantly noninferior and statistically significantly better than the count-based GBM models trained in pediatric inpatients. Appendix S3 also shows that in the sensitivity analysis using mild abnormal category of laboratory-based outcomes, pretraining with task-specific training in adult inpatients remained significantly noninferior compared to logistic regression models trained in pediatric inpatients.
Appendix S5 shows radar plots for task-specific ECE, where all models were evaluated in pediatric inpatients. For the primary analysis, Adults→Adults showed worse calibration compared to PedsBaseline and Peds→Peds. Increasing the amount of pediatric data used in training improved calibration.

Appendix S6 shows the impact of shuffling concept domains used in pretraining. Across the 4 evaluated domains, shuffling reduced ID performance although the degree of deterioration was heterogeneous by task. Shuffling condition occurrence codes resulted in qualitatively larger ID and OOD deterioration in performance although OOD performance was particularly variable. Shuffling of codes in general had similar impacts with respect to ID and OOD performance, with the exception of procedure occurrence, which appeared to have a greater impact on OOD compared with ID performance.
Discussion
We found that for pediatric inpatient clinical prediction tasks, adapting a self-supervised machine learning model pretrained using adult inpatients showed noninferior and significantly better discrimination performance compared to count-based models trained specifically in pediatric inpatients. We also showed that self-supervised pretraining in pediatric inpatients was not significantly better than pretraining in adult inpatients, when evaluated in pediatric inpatients. Finally, leveraging pediatric inpatient data for task-specific training, continued pretraining, and combined pretraining significantly improved discrimination and calibration compared to pretraining with task-specific training using only adult inpatient data.
These findings may have implications for how we implement machine learning models for pediatric clinical care, particularly in centers without the capacity to develop their own models. Traditional machine learning, such as count-based logistic regression models, requires sufficient sample size and event rate, as well as considerable resources and efforts.8 Implementing a previously developed model directly in a different patient population reduces the required cost and effort, but often at the expense of performance degradation due to dataset shift.17 Our results suggest that even when pretraining and task-specific training were conducted using only adult inpatient data, a discrimination performance benefit was seen in pediatric inpatients. Furthermore, increasing incorporation of pediatric data improved discrimination and calibration. These benefits could reduce the need to train bespoke models for specific patient populations and demonstrate that a general-purpose model can be flexible enough to provide utility across multiple patient populations despite known biological differences.
It is not clear why pretraining in adult inpatients performed so well when evaluated in pediatric inpatients. In this particular study, both adult and pediatric cohorts were derived from the same healthcare system, although they were on distinct EHR instances. It is possible that transfer learning to an entirely different pediatric setting may not result in similarly excellent performance. We hypothesize that information learned during pretraining about the relationship between medical codes is one reason for its favorable performance in a different population. Our experiments also suggest that all concept domains contribute to transfer learning.
We also found pretraining in pediatric inpatients did not outperform pretraining in adult inpatients. However, pretraining in adult inpatients did have significantly better discrimination when evaluated in adult inpatients (ID) compared to pediatric inpatients (OOD). It is possible that pretraining in pediatric inpatients did not show better performance because of the limited sample size. Whether performance would improve in larger pediatric centers is unknown.
Strengths of our study include a novel application of self-supervised pretrained models for structured EHR in the pediatric setting. Our study also evaluated model performance on multiple clinically important outcomes. Nonetheless, there are several limitations to our study. First, we only included data from a single institution. Evaluation of self-supervised learning in other pediatric institutions is needed. Second, we compared model performance aggregated across tasks, which does not provide insight into the heterogeneity of task-specific performance. In addition, while we evaluated 11 clinical tasks, the number of potential clinical tasks is very large. Third, we did not comprehensively evaluate various transfer learning approaches; these should be explored in future studies. Finally, it is interesting that outcome rates were similar between the adult and pediatric cohorts. It is possible that these results might not be generalizable to scenarios where outcome rates are very dissimilar.
In conclusion, self-supervised learning in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients. This finding shows promise in demonstrating transferability of self-supervised machine learning models trained in adult patients to pediatric patients, without requiring costly model retraining.
Acknowledgments
L.S. is supported by the Canada Research Chair in Pediatric Oncology Supportive Care.
Contributor Information
Joshua Lemmon, Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON M5G1X8, Canada.
Lin Lawrence Guo, Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON M5G1X8, Canada.
Ethan Steinberg, Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94305, United States.
Keith E Morse, Division of Pediatric Hospital Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA 94304, United States.
Scott Lanyon Fleming, Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94305, United States.
Catherine Aftandilian, Division of Hematology/Oncology, Department of Pediatrics, Stanford University, Palo Alto, CA 94304, United States.
Stephen R Pfohl, Google Research, Mountain View, CA 94043, United States.
Jose D Posada, Universidad del Norte, Barranquilla 081007, Colombia.
Nigam Shah, Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94305, United States.
Jason Fries, Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94305, United States.
Lillian Sung, Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON M5G1X8, Canada; Division of Haematology/Oncology, The Hospital for Sick Children, Toronto, ON M5G1X8, Canada.
Author contributions
J.L.—substantial contributions to the analysis of data, drafting the work, final approval of the version to be published and agree to be accountable for all aspects of the work. L.L.G. and L.S.—substantial contributions to the conception and design of the work, acquisition and interpretation of data for the work, drafting the work, final approval of the version to be published and agree to be accountable for all aspects of the work. E.S., K.E.M., S.L.F., C.A., S.R.P., J.P., N.S., and J.F.—substantial contributions to the design of the work, acquisition and interpretation of data for the work, reviewing it critically for important intellectual content, final approval of the version to be published and agree to be accountable for all aspects of the work.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
The funder did not participate in the work or the decision to submit for publication.
Conflicts of interest
None declared.
Data availability
The deidentified Stanford Medicine Research Data Repository is not made publicly available. Access requires affiliation with a Stanford Principal Investigator and a Stanford identity. However, data are available from the corresponding author upon reasonable request.
References
1. Hong JC, Eclov NCW, Dalal NH, et al. System for high-intensity evaluation during radiation therapy (SHIELD-RT): a prospective randomized study of machine learning-directed clinical evaluations during radiation and chemoradiation. J Clin Oncol. 2020;38(31):3652-3661.
2. Escobar GJ, Liu VX, Schuler A, Lawson B, Greene JD, Kipnis P. Automated identification of adults at risk for in-hospital clinical deterioration. N Engl J Med. 2020;383(20):1951-1960.
3. Manz CR, Parikh RB, Small DS, et al. Effect of integrating machine learning mortality estimates with behavioral nudges to clinicians on serious illness conversations among patients with cancer: a stepped-wedge cluster randomized clinical trial. JAMA Oncol. 2020;6(12):e204759.
4. Yelin I, Snitser O, Novich G, et al. Personal clinical history predicts antibiotic resistance of urinary tract infections. Nat Med. 2019;25(7):1143-1152.
5. Tomašev N, Glorot X, Rae JW, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature. 2019;572(7767):116-119.
6. Chan L, Nadkarni GN, Fleming F, et al. Derivation and validation of a machine learning risk score using biomarker and electronic patient data to predict progression of diabetic kidney disease. Diabetologia. 2021;64(7):1504-1515.
7. Yadgir SR, Engstrom C, Jacobsohn GC, et al. Machine learning-assisted screening for cognitive impairment in the emergency department. J Am Geriatr Soc. 2022;70(3):831-837.
8. Sendak MP, Balu S, Schulman KA. Barriers to achieving economies of scale in analysis of EHR data. A cautionary tale. Appl Clin Inform. 2017;8(3):826-831.
9. Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform. 2018;19(6):1236-1246.
10. Herrin J, Abraham NS, Yao X, et al. Comparative effectiveness of machine learning approaches for predicting gastrointestinal bleeds in patients receiving antithrombotic treatment. JAMA Netw Open. 2021;4(5):e2110703.
11. Sung L, Corbin C, Steinberg E, et al. Development and utility assessment of a machine learning bloodstream infection classifier in pediatric patients receiving cancer treatments. BMC Cancer. 2020;20(1):1103.
12. Le S, Hoffman J, Barton C, et al. Pediatric severe sepsis prediction using machine learning. Front Pediatr. 2019;7:413.
13. Singh D, Nagaraj S, Mashouri P, et al. Assessment of machine learning-based medical directives to expedite care in pediatric emergency medicine. JAMA Netw Open. 2022;5(3):e222599.
14. Bertsimas D, Dunn J, Steele DW, Trikalinos TA, Wang Y. Comparison of machine learning optimal classification trees with the pediatric emergency care applied research network head trauma decision rules. JAMA Pediatr. 2019;173(7):648-656.
15. Morse KE, Brown C, Fleming S, et al. Monitoring approaches for a pediatric chronic kidney disease machine learning model. Appl Clin Inform. 2022;13(2):431-438.
16. Sabharwal P, Hurst JH, Tejwani R, Hobbs KT, Routh JC, Goldstein BA. Combining adult with pediatric patient data to develop a clinical decision support tool intended for children: leveraging machine learning to model heterogeneity. BMC Med Inform Decis Mak. 2022;22(1):84.
17. Wong A, Otles E, Donnelly JP, et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern Med. 2021;181(8):1065-1070.
18. Zhang H, Dullerud N, Seyyed-Kalantari L, Morris Q, Joshi S, Ghassemi M. An empirical framework for domain generalization in clinical settings. In: Proceedings of the Conference on Health, Inference, and Learning. Virtual Event; 2021.
19. Quiñonero-Candela J, Sugiyama M, Ben-David S, et al. Dataset Shift in Machine Learning. MIT Press; 2008.
20. Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. arXiv. 2022.
21. Steinberg E, Jung K, Fries JA, Corbin CK, Pfohl SR, Shah NH. Language models are an effective representation learning technique for electronic health record data. J Biomed Inform. 2021;113:103637.
22. Guo LL, Steinberg E, Fleming SL, et al. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci Rep. 2023;13(1):3767.
23. Datta S, Posada J, Olson G, et al. A new paradigm for accelerating clinical data science at Stanford Medicine. arXiv. 2020.
24. Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform. 2015;216:574.
25. Voss EA, Makadia R, Matcho A, et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J Am Med Inform Assoc. 2015;22(3):553-564.
26. Vincent JL, Moreno R, Takala J, et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-Related Problems of the European Society of Intensive Care Medicine. Intensive Care Med. 1996;22(7):707-710.
27. Matics TJ, Sanchez-Pinto LN. Adaptation and validation of a pediatric sequential organ failure assessment score and evaluation of the sepsis-3 definitions in critically ill children. JAMA Pediatr. 2017;171(10):e172352.
28. Khwaja A. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin Pract. 2012;120(4):c179-c184.
29. Daly K, Farrington E. Hypokalemia and hyperkalemia in infants and children: pathophysiology and treatment. J Pediatr Health Care. 2013;27(6):486-496, quiz 497-508.
30. Abraham MB, Jones TW, Naranjo D, et al. ISPAD Clinical Practice Consensus Guidelines 2018: assessment and management of hypoglycemia in children and adolescents with diabetes. Pediatr Diabetes. 2018;19(Suppl 27):178-192.
31. Spasovski G, Vanholder R, Allolio B, et al.; Hyponatraemia Guideline Development Group. Clinical practice guideline on diagnosis and treatment of hyponatraemia. Eur J Endocrinol. 2014;170(3):G1-G47.
32. Lustberg MB. Management of neutropenia in cancer patients. Clin Adv Hematol Oncol. 2012;10(12):825-826.
33. Allali S, Brousse V, Sacri AS, Chalumeau M, de Montalembert M. Anemia in children: prevalence, causes, diagnostic work-up, and long-term consequences. Expert Rev Hematol. 2017;10(11):1023-1028.
34. Guo LL, Morse KE, Aftandilian C, et al. Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare. medRxiv. 2023.
35. Committee for Medicinal Products for Human Use; Efficacy Working Party; Committee for Release for Consultation. Committee for Medicinal Products for Human Use (CHMP) guideline on the choice of the non-inferiority margin. Stat Med. 2006;25(10):1628-1638.
36. Sellam T, Yadlowsky S, Wei J, et al. The MultiBERTs: BERT reproductions for robustness analysis. arXiv. 2022.
37. Rousselet GA, Pernet CR, Wilcox RR. The percentile bootstrap: a primer with step-by-step instructions in R. Adv Methods Pract Psychol Sci. 2021;4(1):251524592091188.