Abstract
Within the domain of biomedical natural language processing (bioNLP), researchers have used many token features for machine learning models. With recent progress in word embeddings algorithms, it is no longer clear if most of these features are still useful. In this paper we survey the features which have been used in bioNLP, and evaluate each feature’s utility in a sample bioNLP task: the N2C2 2018 named entity recognition challenge. The features we test include two types of word embeddings, syntactic, lexical, and orthographic features, character-embeddings, and clustering and distributional word representations. We find that using fastText word embeddings results in a significantly higher F\(_1\) score than using any other individual feature (0.9142 compared to 0.8750 for the next-best feature). Furthermore, we conducted several experiments using combinations of features, and found that all tested combinations attained a lower F\(_1\) score than using word embeddings only. This indicates that supplementing word embeddings with additional features is not beneficial, and may even be detrimental.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Our code is available at https://github.com/lajesticvantrashell/N2C2_2018_track_2.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
References
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)
Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)
Campos, D., Matos, S., Oliveira, J.L.: Biomedical named entity recognition: a survey of machine-learning tools. In: Theory and Applications for Advanced Text Mining. IntechOpen (2012)
Chen, Y., et al.: Named entity recognition from Chinese adverse drug event reports with lexical feature based BiLSTM-CRF and tri-training. J. Biomed. Inform. 96, 103252 (2019)
Edwards, I.R., Aronson, J.K.: Adverse drug reactions: definitions, diagnosis, and management. Lancet 356(9237), 1255–1259 (2000)
Hettne, K.M., et al.: A dictionary to identify small molecules and drugs in free text. Bioinformatics 25(22), 2983–2991 (2009)
Johnson, A.E., et al.: MIMIC-III, a freely accessible critical care database. Sci. Data 3 (2016). Article number: 160035
Kim, J.D., Ohta, T., Teteisi, Y., Tsujii, J.: Genia corpus manual. Technical report, Citeseer (2006)
Liang, P.: Semi-supervised learning for natural language. Ph.D. thesis, Massachusetts Institute of Technology (2005)
Liu, F., Chen, J., Jagannatha, A., Yu, H.: Learning for biomedical information extraction: methodological review of recent advances. arXiv preprint arXiv:1606.07993 (2016)
Liu, S., Tang, B., Chen, Q., Wang, X.: Effects of semantic features on machine learning-based drug name recognition systems: word embeddings vs. manually constructed dictionaries. Information 6(4), 848–865 (2015)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Muneeb, T., Sahu, S., Anand, A.: Evaluating distributed word representations for capturing semantics of biomedical concepts. In: Proceedings of BioNLP 2015, pp. 158–163 (2015)
Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investig. 30(1), 3–26 (2007)
Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
QasemiZadeh, B., Handschuh, S.: Random indexing explained with high probability. In: Král, P., Matoušek, V. (eds.) TSD 2015. LNCS (LNAI), vol. 9302, pp. 414–423. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24033-6_47
dos Santos, C.N., Zadrozny, B.: Learning character-level representations for part-of-speech tagging. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1818–1826 (2014)
Tang, B., Cao, H., Wang, X., Chen, Q., Xu, H.: Evaluating word representation features in biomedical named entity recognition tasks. BioMed Res. Int. 2014 (2014)
Tsuruoka, Y., et al.: Developing a robust part-of-speech tagger for biomedical text. In: Bozanis, P., Houstis, E.N. (eds.) PCI 2005. LNCS, vol. 3746, pp. 382–392. Springer, Heidelberg (2005). https://doi.org/10.1007/11573036_36
Wang, Y., et al.: A comparison of word embeddings for the biomedical natural language processing. J. Biomed. Inform. 87, 12–20 (2018)
Wang, Y., et al.: Clinical information extraction applications: a literature review. J. Biomed. Inform. 77, 34–49 (2018)
Wishart, D.S., et al.: DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46(D1), D1074–D1082 (2017)
Wu, Y., Xu, J., Jiang, M., Zhang, Y., Xu, H.: A study of neural word embeddings for named entity recognition in clinical text. In: AMIA Annual Symposium Proceedings, vol. 2015, p. 1326. American Medical Informatics Association (2015)
Acknowledgements
This research was supported by the Precision Driven Health Partnership (www.precisiondrivenhealth.com).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Huggard, H., Zhang, A., Zhang, E., Koh, Y.S. (2019). Feature Importance for Biomedical Named Entity Recognition. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science(), vol 11919. Springer, Cham. https://doi.org/10.1007/978-3-030-35288-2_33
Download citation
DOI: https://doi.org/10.1007/978-3-030-35288-2_33
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35287-5
Online ISBN: 978-3-030-35288-2
eBook Packages: Computer ScienceComputer Science (R0)