research-article

Trie-based rule processing for clinical NLP: A use-case study of n-trie, making the ConText algorithm more efficient and scalable

Published: 01 September 2018

Highlights

N-trie, a new hash trie IE rule engine designed for clinical NLP, is introduced.
The engine was designed to increase the efficiency and scalability of rule-based NLP.
Using the ConText algorithm as a use case, N-trie exhibits superior execution time.
N-trie gracefully accommodates the addition of new rules to improve accuracy.
Trie-based hashing has significant potential in other rule-based NLP tasks.

Abstract

Objective

To develop and evaluate an efficient Trie structure for large-scale, rule-based clinical natural language processing (NLP), which we call n-trie.

Background

Despite the popularity of machine learning techniques in natural language processing, rule-based systems boast important advantages: distinctive transparency, ease of incorporating external knowledge, and less demanding annotation requirements. However, processing efficiency remains a major obstacle to adopting standard rule-based NLP solutions in big data analyses.

Methods

We developed n-trie to specifically address the token-based nature of context detection, an important facet of clinical NLP that is known to slow down NLP pipelines. N-trie, a new rule processing engine using a revised Trie structure, allows fast execution of lexicon-based NLP rules. To determine its applicability and evaluate its performance, we applied the n-trie engine in an implementation (called FastContext) of the ConText algorithm and compared its processing speed and accuracy with JavaConText and GeneralConText, two widely used Java ConText implementations, as well as with a standalone machine learning NegEx implementation, NegScope.
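The idea of a token-keyed hash trie for lexicon rules can be illustrated with a minimal Python sketch. This is not the FastContext code: the names `build_trie`, `match`, and `_END`, the sample trigger phrases, and the labels are all hypothetical, and the sketch only assumes that each rule is a whitespace-tokenized phrase mapped to a label.

```python
from typing import Dict, List, Tuple

_END = object()  # sentinel key: a rule ends at this node (illustrative choice)


def build_trie(rules: Dict[str, str]) -> dict:
    """Build a nested-dict trie keyed by lowercased tokens.

    `rules` maps a trigger phrase to a label; both are made-up examples.
    """
    root: dict = {}
    for phrase, label in rules.items():
        node = root
        for token in phrase.lower().split():
            # One hash lookup per token; shared prefixes share nodes.
            node = node.setdefault(token, {})
        node[_END] = label
    return root


def match(trie: dict, tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for the longest rule at each start."""
    lowered = [t.lower() for t in tokens]
    hits = []
    for start in range(len(lowered)):
        node = trie
        best = None
        for pos in range(start, len(lowered)):
            node = node.get(lowered[pos])
            if node is None:
                break  # no rule continues with this token
            if _END in node:
                best = (start, pos + 1, node[_END])  # keep the longest match
        if best is not None:
            hits.append(best)
    return hits


# Hypothetical rule set and sentence, for illustration only.
rules = {
    "no": "negated",
    "no evidence of": "negated",
    "denies": "negated",
    "history of": "historical",
}
trie = build_trie(rules)
tokens = "Patient denies chest pain , no evidence of pneumonia".split()
print(match(trie, tokens))
```

In this sketch, the walk from each start position stops at the first token that no rule continues with, so most positions cost a single dictionary lookup.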

Results

The n-trie engine ran two orders of magnitude faster and was far less sensitive to rule set size than the comparison implementations, and it proved faster than the best machine learning negation detector. Additionally, the engine's accuracy improved consistently as the rule set grew (the desired outcome of adding new rules), whereas the accuracy of the other implementations did not.
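One way to see why a trie can be insensitive to rule set size is to count operations in a toy setting. The sketch below is an illustration of the data-structure property only. It does not reproduce the reported benchmark, and the naive baseline (trying every rule at every position) is an assumption for contrast, not a description of how the comparison implementations work. The function names and the synthetic rules are hypothetical.

```python
from typing import List, Sequence


def naive_cost(rules: Sequence[Sequence[str]], tokens: Sequence[str]) -> int:
    """Count token comparisons when every rule is tried at every position."""
    comparisons = 0
    for start in range(len(tokens)):
        for rule in rules:
            for offset, rule_token in enumerate(rule):
                if start + offset >= len(tokens):
                    break
                comparisons += 1
                if tokens[start + offset] != rule_token:
                    break
    return comparisons


def trie_cost(rules: Sequence[Sequence[str]], tokens: Sequence[str]) -> int:
    """Count hash lookups when the same rules sit in a nested-dict trie."""
    root: dict = {}
    for rule in rules:
        node = root
        for token in rule:
            node = node.setdefault(token, {})
    lookups = 0
    for start in range(len(tokens)):
        node = root
        for pos in range(start, len(tokens)):
            lookups += 1
            node = node.get(tokens[pos])
            if node is None:
                break  # no rule continues here, stop walking
    return lookups


tokens: List[str] = "patient denies chest pain".split()
small = [["denies"], ["no", "evidence", "of"]]
# Synthetic extra rules whose first token never occurs in the input.
large = small + [[f"trigger{i}", "of"] for i in range(1000)]

print("naive:", naive_cost(small, tokens), "->", naive_cost(large, tokens))
print("trie: ", trie_cost(small, tokens), "->", trie_cost(large, tokens))
```

In this toy case the naive count grows with every added rule, while the trie count depends only on the input tokens and on how far matching rule prefixes extend.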

Conclusions

The n-trie engine is an efficient, scalable engine to support NLP rule processing and shows the potential for application in other NLP tasks beyond context detection.

Published In

Journal of Biomedical Informatics, Volume 85, Issue C
Sep 2018
208 pages

Publisher

Elsevier Science
San Diego, CA, United States

Author Tags

1. Natural language processing
2. Medical informatics applications
3. Algorithms
4. Data accuracy
