Abstract
Automatic Text Classification (ATC) is an emerging technology of growing economic importance, given the unprecedented growth of text data. This paper reports on work in progress to develop methods for predicting Cause of Death from Verbal Autopsy (VA) documents, which the World Health Organisation recommends for use in low-income countries. VA documents contain both coded data and open narrative. We formulate the task as a Text Classification problem and explore various combinations of linguistic and statistical approaches to determine how these may improve on the standard bag-of-words approach, using a dataset of over 6400 VA documents that were manually annotated with cause of death. We demonstrate that a significant improvement in prediction accuracy can be obtained through a novel combination of statistical and linguistic features derived from the VA text. The paper explores the methods by which ATC may lead to improved accuracy in Cause of Death prediction.
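To make the contrast between a bag-of-words baseline and an enriched feature set concrete, the following is a minimal sketch and not the authors' implementation. It assumes word bigrams as a simple stand-in for the linguistic and statistical features the abstract refers to, and a small multinomial Naive Bayes classifier as the learner. The function names, class labels, and narratives are all invented for illustration and are not drawn from the VA dataset.

```python
import math
from collections import Counter, defaultdict


def unigram_features(text):
    """Bag-of-words baseline: lowercased whitespace tokens."""
    return text.lower().split()


def combined_features(text):
    """Unigrams plus adjacent-word bigrams (an assumed, illustrative
    stand-in for richer linguistic/statistical features)."""
    tokens = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.featurize(text)
            self.class_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for feat in self.featurize(text):
                if feat in self.vocab:  # skip features never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy narratives and labels, NOT taken from the VA dataset.
train_texts = [
    "child had high fever and convulsions",
    "fever with chills and vomiting for days",
    "baby had difficulty breathing and cough",
    "fast breathing cough and chest indrawing",
]
train_labels = ["malaria", "malaria", "pneumonia", "pneumonia"]

baseline = NaiveBayes(unigram_features).fit(train_texts, train_labels)
combined = NaiveBayes(combined_features).fit(train_texts, train_labels)

query = "child had fever and chills"
print(baseline.predict(query), combined.predict(query))
```

On data this small both feature sets agree, so the sketch only shows how the two representations plug into the same classifier. Any accuracy difference would have to be measured on a real annotated corpus, for example with cross-validation.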
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Danso, S., Atwell, E., Johnson, O. (2013). Linguistic and Statistically Derived Features for Cause of Death Prediction from Verbal Autopsy Text. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science, vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_5
DOI: https://doi.org/10.1007/978-3-642-40722-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40721-5
Online ISBN: 978-3-642-40722-2
eBook Packages: Computer Science (R0)