Abstract
This paper presents a novel approach to text categorization that fuses “Bag-of-words” (BOW) word features with multilevel semantic features (SF). By extending Online LDA (OLDA) into a multilevel topic model that learns a semantic space at different topic granularities, multilevel semantic features are extracted to represent text content. The effectiveness of our approach is evaluated on both the large-scale Wikipedia corpus and the medium-sized 20newsgroups dataset. The former experiment shows that our approach is able to perform semantic feature extraction on a large-scale dataset. It also demonstrates that topics generated at different topic levels have different semantic scopes, which makes them more appropriate for representing text content. Our classification experiments on 20newsgroups achieved 82.19% accuracy, which illustrates the effectiveness of fusing BOW and SF features. Further investigation of word and semantic feature fusion shows that the Support Vector Machine (SVM) is more sensitive to semantic features than Naive Bayes (NB), K Nearest Neighbor (KNN), and Decision Tree (DT). It is shown that appropriately fusing low-level word features and high-level semantic features can achieve results equal to or better than the state of the art, with reduced feature dimension and computational complexity.
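The abstract does not say how the BOW and semantic features are combined, so the following is only a minimal sketch of one common option, early fusion by concatenation. The function names, the `semantic_weight` parameter, the toy vocabulary, and the topic proportions are all hypothetical and chosen for illustration; in practice the per-level topic proportions would be inferred by a topic model such as the multilevel OLDA extension described above.

```python
import math
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Term-frequency vector over a fixed vocabulary, L2-normalized."""
    counts = Counter(tokens)
    vec = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def fuse_features(bow, topic_levels, semantic_weight=1.0):
    """Concatenate a BOW vector with one topic-proportion vector per level.

    topic_levels: list of vectors, one per topic level (for example a
    coarse level with few topics and a fine level with many topics).
    semantic_weight: hypothetical scaling factor for the semantic block;
    it is an assumption of this sketch, not a parameter from the paper.
    """
    fused = list(bow)
    for proportions in topic_levels:
        total = sum(proportions)
        if total <= 0:
            raise ValueError("topic proportions must sum to a positive value")
        # Renormalize each level so it is a proper distribution, then scale.
        fused.extend(semantic_weight * p / total for p in proportions)
    return fused


# Toy inputs, made up for illustration only.
vocabulary = ["game", "team", "space", "orbit"]
tokens = "the team won the game and the team celebrated".split()
bow = bow_vector(tokens, vocabulary)

coarse = [0.9, 0.1]              # 2 topics at the coarse level (assumed)
fine = [0.6, 0.3, 0.05, 0.05]    # 4 topics at the fine level (assumed)

fused = fuse_features(bow, [coarse, fine], semantic_weight=0.5)
print(len(fused))  # 4 word dimensions + 2 + 4 topic dimensions
```

The fused vector can then be passed to any standard classifier. Using a different number of topics at each level, as in this toy setup, gives each level a different granularity.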
Notes
- 1. The number of topics at each topic level can be arbitrary, but it should differ between levels so that topics with different granularities can be observed.
- 2.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, C., Yang, H., Meinel, C. (2015). Does Multilevel Semantic Representation Improve Text Categorization? In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_22
DOI: https://doi.org/10.1007/978-3-319-22849-5_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22848-8
Online ISBN: 978-3-319-22849-5
eBook Packages: Computer Science, Computer Science (R0)