Abstract
This paper addresses a real-world problem: the classification of text documents in the medical domain. Of the many approaches to text classification, we use a partially supervised one and argue that it is effective and computationally efficient for real-world problems. The approach uses a two-step strategy to reduce the effort required to label documents for classification. Only a small set of positive documents is labeled initially; other documents are labeled automatically as a result of the first step. The second step builds the actual text classifier. Several methods have been proposed for each step. We conduct a comprehensive evaluation of various combinations of these methods, comparing their performance on real-world medical documents. The results show that using EM-based methods to build the classifier yields better results than SVM. We also show experimentally that careful selection of a subset of features to represent the documents can improve classifier performance.
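The two-step strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which methods were combined, so the sketch assumes a Naive Bayes classifier for step 1 (unlabeled documents it scores as negative become "reliable negatives") and Naive Bayes with EM for step 2. The function names, the 0.5 threshold, the iteration count, and the toy tokens are all hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, weights):
    """Multinomial Naive Bayes with Laplace smoothing.

    docs: list of token lists.
    weights: P(positive | doc) for each doc, in [0, 1]; soft labels allow EM.
    """
    vocab = {t for d in docs for t in d}
    counts = {1: Counter(), 0: Counter()}
    mass = {1: 0.0, 0: 0.0}
    for doc, w in zip(docs, weights):
        for c, wc in ((1, w), (0, 1.0 - w)):
            mass[c] += wc
            for t in doc:
                counts[c][t] += wc
    model = {"vocab": vocab, "logprior": {}, "loglik": {}}
    for c in (0, 1):
        model["logprior"][c] = math.log((mass[c] + 1.0) / (len(docs) + 2.0))
        denom = sum(counts[c].values()) + len(vocab)
        model["loglik"][c] = {
            t: math.log((counts[c][t] + 1.0) / denom) for t in vocab
        }
    return model


def prob_positive(model, doc):
    """Posterior probability that doc is positive; unseen tokens are ignored."""
    score = {}
    for c in (0, 1):
        score[c] = model["logprior"][c] + sum(
            model["loglik"][c][t] for t in doc if t in model["vocab"]
        )
    top = max(score.values())  # subtract the max for numerical stability
    e1, e0 = math.exp(score[1] - top), math.exp(score[0] - top)
    return e1 / (e1 + e0)


def two_step_pu(positive, unlabeled, em_iters=5, threshold=0.5):
    """Assumed two-step scheme: NB for step 1, NB with EM for step 2."""
    # Step 1: treat every unlabeled document as negative, train a classifier,
    # and keep the unlabeled documents it scores as negative.
    first = train_nb(
        positive + unlabeled, [1.0] * len(positive) + [0.0] * len(unlabeled)
    )
    reliable_neg = [d for d in unlabeled if prob_positive(first, d) < threshold]
    rest = [d for d in unlabeled if prob_positive(first, d) >= threshold]

    # Step 2: build the final classifier from positives and reliable negatives,
    # then refine it with EM over the remaining unlabeled documents.
    fixed = [1.0] * len(positive) + [0.0] * len(reliable_neg)
    model = train_nb(positive + reliable_neg, fixed)
    for _ in range(em_iters):
        soft = [prob_positive(model, d) for d in rest]               # E-step
        model = train_nb(positive + reliable_neg + rest, fixed + soft)  # M-step
    return model, reliable_neg


# Hypothetical toy data: three labeled positives, five unlabeled documents
# (the first unlabeled one resembles the positives).
positive_docs = [
    ["polyp", "colon", "biopsy"],
    ["polyp", "biopsy", "lesion"],
    ["colon", "polyp", "lesion"],
]
unlabeled_docs = [
    ["polyp", "colon", "lesion"],
    ["fracture", "bone", "xray"],
    ["bone", "xray", "cast"],
    ["fracture", "cast", "bone"],
    ["xray", "fracture", "cast"],
]
model, reliable_neg = two_step_pu(positive_docs, unlabeled_docs)
```

In this simplification the labels of the positives and the reliable negatives stay fixed during EM, and only the remaining unlabeled documents receive soft labels. Published variants of the two-step approach differ on this point and on how the first step selects reliable negatives.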
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Saad, F.H., de la Iglesia, B., Bell, G.D. (2006). Comparison of Documents Classification Techniques to Classify Medical Reports. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_34
DOI: https://doi.org/10.1007/11731139_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer Science (R0)