Topic Modeling of Short Texts: A Pseudo-Document View

Published: 13 August 2016

Abstract

Recent years have witnessed the unprecedented growth of online social media, which has made short texts the prevalent format for information on the Internet. Because short texts are sparse, however, topic modeling of them remains a critical and closely watched challenge in both academia and industry. Considerable research effort has gone into building different types of probabilistic topic models for short texts, among which self-aggregation methods that use no auxiliary information have emerged as a solution for providing informative cross-text word co-occurrences. However, models along this line are still rare, and the representative one, the Self-Aggregation Topic Model (SATM), is prone to overfitting and computationally expensive. In light of this, we propose a novel probabilistic model called the Pseudo-document-based Topic Model (PTM) for short text topic modeling. PTM introduces the concept of a pseudo document to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo documents rather than of short texts, PTM is expected to perform well in both accuracy and efficiency. A Sparsity-enhanced PTM (SPTM for short) is also proposed by applying a Spike and Slab prior, with the purpose of eliminating undesired correlations between pseudo documents and latent topics. Extensive experiments on various real-world data sets with state-of-the-art baselines demonstrate the high quality of topics learned by PTM and its robustness with reduced training samples. We also show that i) SPTM gains a clear edge over PTM when the number of pseudo documents is relatively small, and ii) the constraint that a short text belongs to only one pseudo document is critically important for the success of PTM. Finally, we conduct an in-depth semantic analysis to show directly how pseudo documents help find cross-text word co-occurrences for topic modeling.
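
The aggregation idea in the abstract can be illustrated with a small generative sketch. This is only one plausible reading of the abstract, not the authors' implementation: it assumes symmetric Dirichlet priors, a fixed text length, and a multinomial over pseudo documents, and every name in it (`generate_corpus`, `alpha`, `beta`, `lam`, and so on) is hypothetical. The two points taken from the abstract are that topic distributions are attached to pseudo documents rather than to short texts, and that each short text belongs to exactly one pseudo document.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    # The small floor guards against underflow to an all-zero vector.
    draws = [max(rng.gammavariate(concentration, 1.0), 1e-12) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_texts, num_pseudo_docs, num_topics, vocab_size,
                    text_len=8, alpha=0.5, beta=0.1, lam=1.0, seed=0):
    """Sample short texts from an assumed PTM-style generative process.

    Hyperparameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # One word distribution per topic.
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]
    # One topic distribution per pseudo document, not per short text.
    theta = [sample_dirichlet(rng, alpha, num_topics) for _ in range(num_pseudo_docs)]
    # Distribution over pseudo documents (assumed multinomial).
    psi = sample_dirichlet(rng, lam, num_pseudo_docs)

    corpus, assignments = [], []
    for _ in range(num_texts):
        # Each short text is assigned to exactly one pseudo document.
        pseudo_doc = rng.choices(range(num_pseudo_docs), weights=psi)[0]
        words = []
        for _ in range(text_len):
            # Topics come from the pseudo document's topic distribution.
            topic = rng.choices(range(num_topics), weights=theta[pseudo_doc])[0]
            words.append(rng.choices(range(vocab_size), weights=phi[topic])[0])
        corpus.append(words)
        assignments.append(pseudo_doc)
    return corpus, assignments


if __name__ == "__main__":
    texts, owners = generate_corpus(num_texts=5, num_pseudo_docs=2,
                                    num_topics=3, vocab_size=20)
    for owner, text in zip(owners, texts):
        print(owner, text)
```

Because many short texts share one pseudo document, the topic distribution of that pseudo document is estimated from far more word occurrences than any single short text provides, which is the intuition behind aggregating against sparsity.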

Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. latent dirichlet allocation
  2. pseudo document
  3. short texts
  4. topic modeling

Qualifiers

  • Research-article

Funding Sources

  • National High Technology Research and Development Program of China
  • National Center for International Joint Research on E-Business Information Processing
  • Foundation for the Author of National Excellent Doctoral Dissertation of China
  • China 863 Project
  • NSFC
  • Fundamental Research Funds for Central Universities

Conference

KDD '16

Acceptance Rates

KDD '16 Paper Acceptance Rate: 66 of 1,115 submissions, 6%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
