Topic Modeling of Short Texts Using Anchor Words

Authors:

Florian Steuber,

Mirco Schoenfeld,

Gabi Dreo Rodosek

WIMS 2020: Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics

Pages 210 - 219

https://doi.org/10.1145/3405962.3405968

Published: 24 August 2020

Abstract

We present Archetypal LDA, or A-LDA for short, a topic model tailored to short texts containing "semantic anchors", which convey a certain meaning or implicitly build on discussions beyond their mere presence. A-LDA extends Latent Dirichlet Allocation (LDA) by guiding topic inference with these semantic anchors as seed words. We identify the seed words from the documents in an unsupervised manner and evaluate their co-occurrences using archetypal analysis, a geometric approximation problem that aims to find k points that best approximate the data set's convex hull. These so-called archetypes are treated as latent topics and used to guide the LDA. We demonstrate the effectiveness of our approach on Twitter data, where the semantic anchor words are the hashtags that users assign to tweets. In direct comparison to LDA, A-LDA achieves 10-13% better results. We find that representing topics in terms of the hashtags corresponding to the calculated archetypes alone already yields interpretable topics, and that the model's performance peaks for seed confidence values ranging from 0.7 to 0.9.
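The archetypal-analysis step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a symmetric hashtag co-occurrence matrix built from a handful of invented tweets and fits the archetypes with a simple alternating projected-gradient scheme. The function names `project_rows_to_simplex` and `archetypal_analysis`, the toy data, and the choice of k = 2 are all illustrative assumptions. The subsequent seeding of LDA is not shown.

```python
import numpy as np


def project_rows_to_simplex(M):
    """Euclidean projection of each row of M onto the probability simplex."""
    n, d = M.shape
    U = np.sort(M, axis=1)[:, ::-1]            # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0
    ks = np.arange(1, d + 1)
    rho = (U - css / ks > 0).sum(axis=1)       # size of the support of the projection
    theta = css[np.arange(n), rho - 1] / rho   # per-row shift
    return np.maximum(M - theta[:, None], 0.0)


def archetypal_analysis(X, k, n_iter=500, seed=0):
    """Approximately minimise ||X - A B X||_F^2 with rows of A and B on the simplex.

    Returns A (n x k), B (k x n) and the archetypes Z = B X (k x d).
    Alternating projected gradient with step 1/L; an illustrative solver only.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = project_rows_to_simplex(rng.random((n, k)))
    B = project_rows_to_simplex(rng.random((k, n)))
    G = X @ X.T                                # Gram matrix of the data rows
    g_norm = np.linalg.norm(G, 2)
    for _ in range(n_iter):
        # Gradient step in A with B fixed: grad = A (B G B^T) - G B^T
        BGBt = B @ G @ B.T
        step_a = 1.0 / (np.linalg.norm(BGBt, 2) + 1e-12)
        A = project_rows_to_simplex(A - step_a * (A @ BGBt - G @ B.T))
        # Gradient step in B with A fixed: grad = A^T A B G - A^T G
        AtA = A.T @ A
        step_b = 1.0 / (np.linalg.norm(AtA, 2) * g_norm + 1e-12)
        B = project_rows_to_simplex(B - step_b * (AtA @ B @ G - A.T @ G))
    return A, B, B @ X


# Toy corpus (invented): each tweet is reduced to its hashtags.
tweets = [
    ["#nba", "#lakers"], ["#nba", "#playoffs"], ["#lakers", "#playoffs"],
    ["#election", "#vote"], ["#vote", "#debate"], ["#election", "#debate"],
]
tags = sorted({t for tw in tweets for t in tw})
idx = {t: i for i, t in enumerate(tags)}

# Hashtag-by-hashtag co-occurrence counts (assumed representation).
X = np.zeros((len(tags), len(tags)))
for tw in tweets:
    for a in tw:
        for b in tw:
            X[idx[a], idx[b]] += 1.0

A, B, Z = archetypal_analysis(X, k=2)
for j, z in enumerate(Z):
    top = [tags[i] for i in np.argsort(z)[::-1][:3]]
    print(f"archetype {j}: {top}")
```

Each row of `Z` is a convex combination of hashtag rows, so its largest coordinates indicate which hashtags characterise that archetype. In a pipeline like the one the abstract outlines, such hashtags would be candidates for seed words.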

Index Terms

Topic Modeling of Short Texts Using Anchor Words
1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms

Published In

WIMS 2020: Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics

June 2020

279 pages

ISBN: 9781450375429

DOI: 10.1145/3405962

Conference Chairs:
Richard Chbeir, University Pau & Pays Adour
Yannis Manolopoulos, Open University of Cyprus, Cyprus
Rajendra Akerkar, Western Norway Research Institute, Norway
Jolanta Mizera-Pietraszko, Military University of Land Forces, Poland

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WIMS 2020

WIMS 2020: The 10th International Conference on Web Intelligence, Mining and Semantics

June 30 - July 3, 2020

Biarritz, France

Acceptance Rates

WIMS 2020 Paper Acceptance Rate: 35 of 63 submissions, 56%

Overall Acceptance Rate: 140 of 278 submissions, 50%

Bibliometrics

Article Metrics

Total Citations: 5
Total Downloads: 239
Downloads (last 12 months): 56
Downloads (last 6 weeks): 7

Reflects downloads up to 13 Nov 2024

Cited By

Cogo F, Xia X, Hassan A (2023). Assessing the Alignment between the Information Needs of Developers and the Documentation of Programming Languages: A Case Study on Rust. ACM Transactions on Software Engineering and Methodology 32:2 (1-48). Online publication date: 4-Apr-2023. https://dl.acm.org/doi/10.1145/3546945

Sychev A (2023). Diagnostics of the Topic Model for a Collection of Text Messages Based on Hierarchical Clustering of Terms. Lobachevskii Journal of Mathematics 44:1 (219-226). Online publication date: 17-May-2023. https://doi.org/10.1134/S1995080223010390

A. A, Robledo S, Zuluaga M (2023). Topic Modeling: Perspectives From a Literature Review. IEEE Access 11 (4066-4078). Online publication date: 2023. https://doi.org/10.1109/ACCESS.2022.3232939

Zhu X, Wang Y, Kochupillai M, Werner M, Haberle M, Hoffmann E, Taubenbock H, Tuia D, Levering A, Jacobs N, Kruspe A, Abdulahhad K (2022). Geoinformation Harvesting From Social Media Data: A community remote sensing approach. IEEE Geoscience and Remote Sensing Magazine 10:4 (150-180). Online publication date: Dec-2022. https://doi.org/10.1109/MGRS.2022.3219584

Steuber F, Schneider S, Schoenfeld M (2022). Embedding Semantic Anchors to Guide Topic Models on Short Text Corpora. Big Data Research 27:C. Online publication date: 28-Feb-2022. https://dl.acm.org/doi/10.1016/j.bdr.2021.100293
