Abstract
Objective
Electronic health records (EHRs) are rich sources of patient-level data and a valuable resource for medical data analysis. However, privacy concerns often restrict access to EHRs, hindering downstream analysis. Current EHR deidentification methods are imperfect and can leak private information. In addition, few EHR databases are publicly available, which limits medical research that relies on EHR data. This study aims to overcome these challenges by efficiently generating realistic and privacy-preserving synthetic EHR time series.
Materials and methods
We introduce a new method for generating diverse and realistic synthetic EHR time series data using denoising diffusion probabilistic models. We conducted experiments on 6 databases: Medical Information Mart for Intensive Care III and IV, the eICU Collaborative Research Database (eICU), and non-EHR datasets on Stocks and Energy. We compared our proposed method with 8 existing methods.
Results
Our results demonstrate that our approach significantly outperforms all existing methods in terms of data fidelity while requiring less training effort. Additionally, data generated by our method yield a lower discriminative accuracy than data generated by the baseline methods, indicating that the proposed method can generate data with less privacy risk.
Discussion
The proposed model utilizes a mixed diffusion process to generate realistic synthetic EHR samples that protect patient privacy. This method could help tackle data availability issues in healthcare by reducing barriers to EHR access and supporting research in machine learning for health.
Conclusion
The proposed diffusion model-based method can reliably and efficiently generate synthetic EHR time series, which facilitates downstream medical data analysis. Our numerical results show the superiority of the proposed method over all other existing methods.
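For readers unfamiliar with denoising diffusion probabilistic models, the following is a minimal sketch of the standard Gaussian forward (noising) step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, applied to a toy time series. It is not the authors' implementation and does not reproduce the mixed diffusion process named in the abstract. The function names, the linear noise schedule with its default values, and the toy heart-rate series are all assumptions made for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return a linearly spaced noise schedule (assumed defaults, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) up to each step t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample a noised series x_t given a clean series x0 (list of floats).

    t is a 0-based index into alpha_bars. Returns the noised series and the
    Gaussian noise that was added, which a denoising network would be trained
    to predict.
    """
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return xt, noise


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    # Hypothetical toy series, standardized so its scale is comparable to the noise.
    heart_rate = [-0.9, -0.1, -1.2, 1.3, 0.9]
    for t in (0, 499, 999):
        xt, _ = forward_diffuse(heart_rate, t, alpha_bars, rng)
        print(t, [round(v, 3) for v in xt])
```

At small t the sampled series stays close to the input, and at the final step it is almost pure Gaussian noise. Generation runs this process in reverse with a learned denoiser, which this sketch omits.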
Funding
Funders who supported this work.
CS+
Department of Computer Science
Duke University
NHLBI NIH HHS (1)
Grant ID: R01 HL168940
NIH (2)
Grant ID: R01HL168940
Grant ID: R01HL169347
NIH HHS (1)
Grant ID: R01HL169347
NSF (1)
Grant ID: CAREER-2203741