PSLDA: a novel supervised pseudo document-based topic model for short texts

Mingtao Sun¹,
Xiaowei Zhao²,
Jingjing Lin³,
Jian Jing²,
Deqing Wang² &
Guozhu Jia¹


Abstract

Online social media applications such as Twitter and Weibo generate a huge volume of short texts. However, mining semantic topics from short texts efficiently remains a challenging problem because of the sparseness of word occurrence and the diversity of topics. To address these problems, we propose a novel supervised pseudo-document-based maximum entropy discrimination latent Dirichlet allocation model (PSLDA for short). Specifically, we first assume that short texts are generated from normal-size latent pseudo documents, and that the topic distributions are sampled from the pseudo documents. In this way, the model reduces the sparseness of word occurrence and the diversity of topics because it implicitly aggregates short texts into longer, higher-level pseudo documents. To make full use of the label information in the training data, we introduce labels into the model and further propose a supervised topic model to learn a reasonable distribution of topics. Extensive experiments demonstrate that our proposed method achieves better performance than several state-of-the-art methods.
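
The pseudo-document assumption in the abstract can be illustrated with a toy generative sketch. It is not the PSLDA model itself: the supervised maximum entropy discrimination component and the inference procedure are omitted, and the uniform assignment of short texts to pseudo documents, the symmetric Dirichlet priors, and every function and parameter name (`sample_dirichlet`, `generate_short_texts`, `alpha`, `beta`, `text_len`) are assumptions made purely for illustration.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_short_texts(n_texts, n_pseudo, n_topics, vocab_size,
                         text_len=8, alpha=0.5, beta=0.1, seed=0):
    """Toy generative process: short texts inherit topics from latent pseudo documents.

    All priors and the uniform pseudo-document assignment are illustrative
    assumptions, not values or choices taken from the paper.
    """
    rng = random.Random(seed)
    # One topic distribution per pseudo document, not per short text.
    theta = [sample_dirichlet(alpha, n_topics, rng) for _ in range(n_pseudo)]
    # One word distribution per topic.
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]

    texts, assignments = [], []
    for _ in range(n_texts):
        # Assumption: each short text picks its pseudo document uniformly at random.
        p = rng.randrange(n_pseudo)
        words = []
        for _ in range(text_len):
            # Topic is drawn from the pseudo document's topic distribution.
            z = rng.choices(range(n_topics), weights=theta[p])[0]
            # Word id is drawn from the chosen topic's word distribution.
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            words.append(w)
        texts.append(words)
        assignments.append(p)
    return texts, assignments


if __name__ == "__main__":
    texts, assignments = generate_short_texts(
        n_texts=5, n_pseudo=2, n_topics=3, vocab_size=20)
    for pseudo_id, words in zip(assignments, texts):
        print(pseudo_id, words)
```

Because many short texts share each pseudo document's topic distribution, the topic proportions are in effect estimated from the pooled words of all texts assigned to that pseudo document rather than from a handful of words per text, which is the aggregation effect the abstract describes.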



Author information

Authors and Affiliations

School of Economics and Management, Beihang University, Beijing, 100191, China
Mingtao Sun & Guozhu Jia
School of Computer Science, Beihang University, Beijing, 100191, China
Xiaowei Zhao, Jian Jing & Deqing Wang
School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing, 100191, China
Jingjing Lin


Corresponding author

Correspondence to Guozhu Jia.

Additional information

Mingtao Sun is a PhD candidate in the School of Economics and Management, Beihang University, China. His research interests include big data processing and education administration.

Xiaowei Zhao is currently pursuing the PhD degree in Computer Science with Beihang University, China. Her main research interests include transfer learning and sentiment analysis.

Jingjing Lin is currently a senior student at the School of Instrumentation and Optoelectronic Engineering, Beihang University, China. Her research interests include text classification, natural language inference, and sentiment analysis.

Jian Jing received the MS degree in the Engineering of Computer Technology from Beihang University, China in 2021. His research interests include knowledge reasoning, algorithms, and big data processing.

Deqing Wang received the PhD degree in computer science from Beihang University, China in 2013. He is currently an Associate Professor with the School of Computer Science and the Deputy Chief Engineer with the National Engineering Research Center for Science Technology Resources Sharing and Service, Beihang University, China. His research focuses on text categorization and data mining for software engineering and machine learning.

Guozhu Jia received the PhD degree from Aalborg University, Denmark. He is currently a Professor in the School of Economics and Management, Beihang University, China, and a member of the Expert Committee of the China Manufacturing Servitization Alliance. He is also a director of the China Innovation Method Society.
