Abstract
Maintaining the accessibility of biomedical literature databases has led to the development of text classification systems that assist human indexers by recommending thematic categories for biomedical articles. These systems use machine learning methods to learn the association between document terms and predefined categories. The accuracy of a text classification method depends on the metric used to assign a weight to each term. Weighting metrics are classified as supervised or unsupervised according to whether they use prior information on the number of documents belonging to each category. In this paper, we propose two supervised weighting metrics (One-way Klosgen and Loevinger), both of which improve the quality of biomedical document classification. We also show that by using moment generating function centroids, an alternative to the traditional arithmetic average centroids, a nearest centroid classifier with the Loevinger metric performs significantly better than SVM on a biomedical text classification task.
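To make the supervised/unsupervised distinction concrete, the following is a minimal Python sketch, not the chapter's implementation. It contrasts an unsupervised weight (idf, which ignores categories) with a supervised one computed from category counts, and then classifies a document with traditional arithmetic average centroids. The Loevinger form used here, (P(c|t) - P(c)) / (1 - P(c)), is the commonly cited one and is an assumption; the chapter's exact definition may differ. The moment generating function centroids are not attempted because the abstract does not define them. All function names, category labels, and counts are hypothetical.

```python
import math


def idf(n_docs, n_docs_with_term):
    """Unsupervised weight: uses only how many documents contain the term."""
    return math.log(n_docs / n_docs_with_term)


def loevinger(n_docs, n_docs_with_term, n_docs_in_cat, n_both):
    """Supervised weight for the association term -> category.

    Assumed form: (P(c|t) - P(c)) / (1 - P(c)). It uses the number of
    documents in the category, which idf never sees.
    """
    p_cat = n_docs_in_cat / n_docs
    p_cat_given_term = n_both / n_docs_with_term
    return (p_cat_given_term - p_cat) / (1.0 - p_cat)


def centroid(vectors):
    """Arithmetic average centroid of equal-length weighted term vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_centroid(doc, centroids):
    """Return the category whose centroid is most similar to the document."""
    return max(centroids, key=lambda cat: cosine(doc, centroids[cat]))


# Toy counts (hypothetical): 100 documents, the term occurs in 10 of them,
# the category holds 20 documents, and 8 documents have both.
unsupervised_weight = idf(100, 10)
supervised_weight = loevinger(100, 10, 20, 8)

# Toy two-term vectors per category (hypothetical labels and weights).
centroids = {
    "cardio": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "neuro": centroid([[0.0, 1.0], [0.1, 0.9]]),
}
predicted = nearest_centroid([0.9, 0.1], centroids)
```

With these toy counts the supervised weight is high because 8 of the 10 documents containing the term belong to the category, a signal that idf cannot capture since it depends only on how many documents contain the term.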
Notes
1. Table 1 contains the abbreviations used in this paper.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Haddoud, M., Mokhtari, A., Lecroq, T., Abdeddaïm, S. (2016). Supervised Term Weights for Biomedical Text Classification: Improvements in Nearest Centroid Computation. In: Angelini, C., Rancoita, P., Rovetta, S. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2015. Lecture Notes in Computer Science, vol 9874. Springer, Cham. https://doi.org/10.1007/978-3-319-44332-4_8
DOI: https://doi.org/10.1007/978-3-319-44332-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44331-7
Online ISBN: 978-3-319-44332-4
eBook Packages: Computer Science, Computer Science (R0)