Using a pipeline to improve de-identification performance

AMIA Annu Symp Proc. 2009 Nov 14:2009:447-51.

Authors

Frances P Morrison¹, Soumitra Sengupta, George Hripcsak

Affiliation

¹ Columbia University Department of Biomedical Informatics.

PMID: 20351897
PMCID: PMC2815438

Abstract

Effective de-identification methods are needed to support reuse of electronic health record data for research and other purposes. We investigated using two different text-processing systems in tandem as a strategy for de-identification of clinical notes. We ran 100 outpatient notes through deid.pl, from MIT's PhysioToolkit, followed by MedLEE, and manually compared the output with the original notes to determine the amount of protected health information (PHI) retained. Pipelining resulted in an overall error rate of 2%, with 2 personal names retained in the output: one was an initial and the other was a common English word that is also used in medicine. All retained PHI was transformed into standardized medical concepts, making re-identification less likely. Pipelining with deid.pl improved the performance of MedLEE in excluding PHI from its output, and may be a useful strategy for de-identifying clinical data while providing computer-readable output.
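The two-stage idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `scrub_names` and `extract_concepts` are hypothetical stand-ins, not deid.pl or MedLEE, and the name list, the vocabulary, the `[NAME]` placeholder, and the example note are invented. It shows only why a scrubbing stage placed in front of a concept extractor can stop a surname that doubles as a medical term from being emitted as a concept.

```python
import re

def scrub_names(text, known_names):
    """Hypothetical stage 1 (a stand-in, NOT deid.pl): mask listed names."""
    if not known_names:
        return text
    alternatives = "|".join(re.escape(name) for name in sorted(known_names))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.sub("[NAME]", text)

def extract_concepts(text, vocabulary):
    """Hypothetical stage 2 (a stand-in, NOT MedLEE): keep only known terms,
    mapped to concept labels; all other tokens are dropped."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [vocabulary[token] for token in tokens if token in vocabulary]

def pipeline(text, known_names, vocabulary):
    """Run the scrubbing stage first, then the concept extractor."""
    return extract_concepts(scrub_names(text, known_names), vocabulary)

# Invented example data: "Burns" is both a surname and a medical term.
note = "Mr. Burns reports fever."
names = {"Burns"}
vocab = {"burns": "CONCEPT:burn", "fever": "CONCEPT:fever"}

print(extract_concepts(note, vocab))   # extractor alone: the surname leaks as a concept
print(pipeline(note, names, vocab))    # scrubbed first: only the true finding remains
```

In this toy setting the extractor alone already discards most free text, so the scrubbing stage matters mainly for names that collide with vocabulary terms, which is the kind of residual case the abstract reports.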

Publication types

Comparative Study
Research Support, N.I.H., Extramural

MeSH terms

Confidentiality*
Electronic Health Records*
Humans
Methods
Natural Language Processing*
