Abstract
Online social media applications such as Twitter and Weibo have produced a huge volume of short texts. However, mining semantic topics from short texts efficiently is still a challenging problem because of the sparseness of word occurrence and the diversity of topics. To address these problems, we propose a novel supervised pseudo-document-based maximum entropy discrimination latent Dirichlet allocation model (PSLDA for short). Specifically, we first assume that short texts are generated from latent pseudo documents of normal size, and that the topic distributions are sampled from the pseudo documents. In this way, the model reduces the sparseness of word occurrence and the diversity of topics, because it implicitly aggregates short texts into longer, higher-level pseudo documents. To make full use of the label information in the training data, we introduce labels into the model and further propose a supervised topic model that learns a reasonable distribution of topics. Extensive experiments demonstrate that the proposed method outperforms several state-of-the-art methods.
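The pseudo-document assumption described above can be illustrated with a small simulation. This is a minimal sketch, not the PSLDA model itself: it covers only the unsupervised generative story, in which pseudo documents own the topic distributions and each short text is attached to one pseudo document, and it omits the label-aware maximum entropy discrimination component. The function name, the hyperparameter names and values, and the prior `psi` over pseudo documents are all assumptions made for illustration.

```python
import numpy as np


def generate_short_texts(n_texts=200, n_pseudo=10, n_topics=5, vocab_size=50,
                         text_len=8, alpha=0.5, beta=0.1, seed=0):
    """Simulate short texts under a pseudo-document assumption (illustrative only)."""
    rng = np.random.default_rng(seed)

    # One word distribution per topic (rows sum to 1).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    # Topic distributions belong to pseudo documents, not to individual short texts.
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_pseudo)

    # Hypothetical prior over which pseudo document a short text is attached to.
    psi = rng.dirichlet(np.ones(n_pseudo))
    assignments = rng.choice(n_pseudo, size=n_texts, p=psi)

    texts = []
    for pseudo_id in assignments:
        # Each word draws its topic from the topic distribution of the pseudo document.
        topics = rng.choice(n_topics, size=text_len, p=theta[pseudo_id])
        words = [int(rng.choice(vocab_size, p=phi[k])) for k in topics]
        texts.append(words)
    return texts, assignments, theta, phi


texts, assignments, theta, phi = generate_short_texts()
```

Because many short texts share one pseudo document, the number of topic distributions to estimate is `n_pseudo` rather than `n_texts`, which is the sense in which the aggregation is meant to ease sparsity.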
Author information
Mingtao Sun is a PhD candidate in the School of Economics and Management, Beihang University, China. His research interests include big data processing and education administration.
Xiaowei Zhao is currently pursuing the PhD degree in Computer Science with Beihang University, China. Her main research interests include transfer learning and sentiment analysis.
Jingjing Lin is currently a senior student at the School of Instrumentation and Optoelectronic Engineering, Beihang University, China. Her research interests include text classification, natural language inference, and sentiment analysis.
Jian Jing received the MS degree in the Engineering of Computer Technology from Beihang University, China, in 2021. His research interests include knowledge reasoning, algorithms, and big data processing.
Deqing Wang received the PhD degree in computer science from Beihang University, China in 2013. He is currently an Associate Professor with the School of Computer Science and the Deputy Chief Engineer with the National Engineering Research Center for Science Technology Resources Sharing and Service, Beihang University, China. His research focuses on text categorization and data mining for software engineering and machine learning.
Guozhu Jia received the PhD degree from Aalborg University, Denmark. He is currently a Professor in the School of Economics and Management, Beihang University, China, and a member of the Expert Committee of the China Manufacturing Servitization Alliance. He is also a director of the China Innovation Method Society.
About this article
Cite this article
Sun, M., Zhao, X., Lin, J. et al. PSLDA: a novel supervised pseudo document-based topic model for short texts. Front. Comput. Sci. 16, 166350 (2022). https://doi.org/10.1007/s11704-021-0606-3
DOI: https://doi.org/10.1007/s11704-021-0606-3