Research article
DOI: 10.1145/3405962.3405968

Topic Modeling of Short Texts Using Anchor Words

Published: 24 August 2020

Abstract

We present Archetypal LDA, or A-LDA for short, a topic model tailored to short texts that contain "semantic anchors": words that convey a certain meaning or implicitly build on discussions beyond their mere presence. A-LDA extends Latent Dirichlet Allocation (LDA) by using these semantic anchors as seed words that guide topic inference. We identify the seed words from the documents in an unsupervised manner and evaluate their co-occurrences using archetypal analysis, a geometric approximation problem that seeks the k points that best approximate the convex hull of the data set. These so-called archetypes are treated as latent topics and used to guide the LDA. We demonstrate the effectiveness of our approach on Twitter, where the semantic anchor words are the hashtags that users assign to tweets. In direct comparison with LDA, A-LDA achieves 10-13% better results. We find that representing topics by the hashtags corresponding to the calculated archetypes alone already yields interpretable topics, and that the model's performance peaks for seed confidence values between 0.7 and 0.9.
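
The following is a minimal sketch of the first two steps the abstract describes: collecting hashtag co-occurrences and approximating the convex hull of the resulting profiles with k archetypes. It is an illustration under stated assumptions, not the authors' implementation. The toy tweets, the function names (`hashtag_cooccurrence`, `project_rows_to_simplex`, `archetypal_analysis`), the row normalisation, and the choice of alternating projected gradient descent as the solver are all assumptions made for this example.

```python
import re
from itertools import combinations

import numpy as np

HASHTAG = re.compile(r"#(\w+)")


def hashtag_cooccurrence(tweets):
    """Return (sorted hashtag vocabulary, symmetric co-occurrence counts)."""
    tag_sets = [sorted({t.lower() for t in HASHTAG.findall(tw)}) for tw in tweets]
    vocab = sorted({t for tags in tag_sets for t in tags})
    index = {t: i for i, t in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for tags in tag_sets:
        for a, b in combinations(tags, 2):
            counts[index[a], index[b]] += 1
            counts[index[b], index[a]] += 1
    return vocab, counts


def project_rows_to_simplex(M):
    """Euclidean projection of every row of M onto the probability simplex."""
    n, d = M.shape
    U = np.sort(M, axis=1)[:, ::-1]  # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0
    rho = (U - css / np.arange(1, d + 1) > 0).sum(axis=1)
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(M - theta[:, None], 0.0)


def archetypal_analysis(X, k, n_iter=200, seed=0):
    """Minimise 0.5 * ||X - A @ B @ X||_F^2 with row-stochastic A and B.

    Rows of Z = B @ X are the archetypes (convex combinations of data rows);
    rows of A express each data row as a convex combination of archetypes.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = rng.dirichlet(np.ones(k), size=n)  # n x k, rows sum to 1
    B = rng.dirichlet(np.ones(n), size=k)  # k x n, rows sum to 1
    xxt_norm = np.linalg.norm(X @ X.T, 2)
    history = []
    for _ in range(n_iter):
        Z = B @ X
        history.append(0.5 * np.linalg.norm(X - A @ Z) ** 2)
        # Projected gradient step on A; step size is 1 / Lipschitz constant.
        step_a = 1.0 / (np.linalg.norm(Z @ Z.T, 2) + 1e-12)
        A = project_rows_to_simplex(A - step_a * (A @ Z - X) @ Z.T)
        # Projected gradient step on B, using the updated A.
        step_b = 1.0 / (np.linalg.norm(A.T @ A, 2) * xxt_norm + 1e-12)
        B = project_rows_to_simplex(B - step_b * A.T @ (A @ B @ X - X) @ X.T)
    history.append(0.5 * np.linalg.norm(X - A @ B @ X) ** 2)
    return B @ X, A, B, history


# Invented toy tweets, used only to exercise the functions above.
tweets = [
    "Great match tonight #football #worldcup",
    "Penalty drama again #worldcup #football #var",
    "New phone launch #tech #smartphone",
    "Camera comparison #smartphone #tech #review",
    "Election night coverage #politics #vote",
    "Polls are closing #vote #politics",
]
vocab, counts = hashtag_cooccurrence(tweets)
# Describe each hashtag by its row-normalised co-occurrence profile.
X = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
Z, A, B, history = archetypal_analysis(X, k=3)
for j, weights in enumerate(B):
    top = [vocab[i] for i in np.argsort(weights)[::-1][:3]]
    print(f"archetype {j}: {top}")
```

In this sketch, the hashtags with the largest weights in each row of `B` are the ones that make up the corresponding archetype, so they could plausibly serve as seed words for a seeded LDA run. That seeding step, and the seed confidence parameter mentioned in the abstract, are not shown here.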

    Published In

    WIMS 2020: Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics
    June 2020
    279 pages
    ISBN: 9781450375429
    DOI: 10.1145/3405962
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. archetypal analysis
    2. data mining
    3. short text
    4. text mining
    5. topic modeling

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WIMS 2020

    Acceptance Rates

    WIMS 2020 paper acceptance rate: 35 of 63 submissions (56%)
    Overall acceptance rate: 140 of 278 submissions (50%)
