DOI: 10.1145/2488388.2488514

A biterm topic model for short texts

Published: 13 May 2013

Abstract

Uncovering the topics within short texts, such as tweets and instant messages, has become an important task for many content analysis applications. However, directly applying conventional topic models (e.g., LDA and PLSA) to such short texts may not work well. The fundamental reason is that conventional topic models implicitly capture document-level word co-occurrence patterns to reveal topics, and thus suffer from severe data sparsity in short documents. In this paper, we propose a novel way of modeling topics in short texts, referred to as the biterm topic model (BTM). Specifically, in BTM we learn the topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the whole corpus. The major advantages of BTM are that 1) it explicitly models word co-occurrence patterns to enhance topic learning; and 2) it uses the aggregated patterns in the whole corpus to learn topics, which addresses the sparsity of word co-occurrence patterns at the document level. We carry out extensive experiments on real-world short text collections. The results demonstrate that our approach discovers more prominent and coherent topics and significantly outperforms baseline methods on several evaluation metrics. Furthermore, we find that BTM can outperform LDA even on normal texts, showing the potential generality and wider usage of the new topic model.
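
To make the notion of a biterm concrete, the following is a minimal Python sketch of the preprocessing idea only, not the paper's implementation. It assumes a biterm is an unordered pair of distinct token positions within one short document, and that each whole document serves as the co-occurrence context. The function names `extract_biterms` and `corpus_biterm_counts` and the toy corpus are hypothetical, introduced purely for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short document.

    Assumption: the whole document is treated as one co-occurrence context.
    """
    # Sort each pair so that (a, b) and (b, a) count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Aggregate biterm counts over the whole corpus, ignoring document boundaries."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["apple", "iphone", "launch"],
    ["apple", "iphone", "price"],
    ["football", "match", "goal"],
]
counts = corpus_biterm_counts(docs)
```

Each three-word document yields only three biterms on its own, but pooling them across the corpus lets a recurring pair such as `("apple", "iphone")` accumulate evidence. This corpus-level aggregation is what the abstract credits with easing document-level sparsity. A topic model would then be fit on these pooled biterms, a step this sketch does not cover.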

Published In

WWW '13: Proceedings of the 22nd international conference on World Wide Web
May 2013
1628 pages
ISBN:9781450320351
DOI:10.1145/2488388

Sponsors

  • NICBR: Núcleo de Informação e Coordenação do Ponto BR
  • CGIBR: Comitê Gestor da Internet no Brasil

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. biterm
  2. content analysis
  3. short text
  4. topic model

Qualifiers

  • Research-article

Conference

WWW '13
Sponsor:
  • NICBR
  • CGIBR
WWW '13: 22nd International World Wide Web Conference
May 13 - 17, 2013
Rio de Janeiro, Brazil

Acceptance Rates

WWW '13 Paper Acceptance Rate: 125 of 831 submissions, 15%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 413
  • Downloads (Last 6 weeks): 49
Reflects downloads up to 19 Nov 2024

Cited By

  • (2024) The Generative Generic-Field Design Method Based on Design Cognition and Knowledge Reasoning. Sustainability 16:22 (9841). DOI: 10.3390/su16229841. Online publication date: 12-Nov-2024
  • (2024) Machine Learning for Multimodal Mental Health Detection: A Systematic Review of Passive Sensing Approaches. Sensors 24:2 (348). DOI: 10.3390/s24020348. Online publication date: 6-Jan-2024
  • (2024) A Topic Modeling Based on Prompt Learning. Electronics 13:16 (3212). DOI: 10.3390/electronics13163212. Online publication date: 14-Aug-2024
  • (2024) A PTM-Based Framework for Enhanced User Requirement Classification in Product Design. Electronics 13:8 (1458). DOI: 10.3390/electronics13081458. Online publication date: 12-Apr-2024
  • (2024) From Customer’s Voice to Decision-Maker Insights: Textual Analysis Framework for Arabic Reviews of Saudi Arabia’s Super App. Applied Sciences 14:16 (6952). DOI: 10.3390/app14166952. Online publication date: 8-Aug-2024
  • (2024) A Survey on Event Tracking in Social Media Data Streams. Big Data Mining and Analytics 7:1 (217-243). DOI: 10.26599/BDMA.2023.9020021. Online publication date: Mar-2024
  • (2024) Long COVID Discourse in Canada, the US, and Europe: Analysis Using Topic Modeling and Sentiment Analysis on Twitter Data (Preprint). Journal of Medical Internet Research. DOI: 10.2196/59425. Online publication date: 12-Apr-2024
  • (2024) Designing an information system for integrated topic analysis of social media big data. Ontology of Designing 14:1 (55-70). DOI: 10.18287/2223-9537-2024-14-1-55-70. Online publication date: 1-Mar-2024
  • (2024) EMFSA: Emoji-based multifeature fusion sentiment analysis. PLOS ONE 19:9 (e0310715). DOI: 10.1371/journal.pone.0310715. Online publication date: 19-Sep-2024
  • (2024) Breaking the spiral of silence: News and social media dynamics on sexual abuse scandal in the Japanese entertainment industry. PLOS ONE 19:6 (e0306104). DOI: 10.1371/journal.pone.0306104. Online publication date: 27-Jun-2024