DOI: 10.1145/3132847.3132942
research-article

A Topic Model Based on Poisson Decomposition

Published: 06 November 2017

Abstract

Choosing appropriate statistical distributions for modeling text corpora is important for accurately estimating their numerical characteristics. Based on a statistical test that supports the claim that the data conform to a Poisson distribution, we propose the Poisson decomposition model (PDM), a statistical model for the count data of text corpora that directly captures each document's multidimensional numerical characteristics on topics. In PDM, each topic is represented as a parameter vector of a multidimensional Poisson distribution, which can easily be normalized to multinomial term probabilities, and each document is represented as measurements on topics and is thereby reduced to a measurement vector on topics. We use gradient descent methods and a sampling algorithm for parameter estimation, and we carry out extensive experiments on the topics produced by our models. The results demonstrate that, compared with PLSI and LDA, our approach extracts more coherent topics and that PDM-based features are competitive in document clustering.
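The abstract does not give PDM's equations, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that the count of term v in document d is Poisson with rate sum_k theta[d, k] * lam[k, v], where each row of lam is a topic's Poisson rate vector and each row of theta is a document's measurement vector on topics. The names theta, lam, normalize_topics, poisson_log_likelihood and gradient_step are hypothetical, and the projected gradient update stands in for whichever gradient method the paper actually uses.

```python
import numpy as np


def normalize_topics(lam):
    """Normalize each topic's Poisson rate vector (one row) into term probabilities."""
    return lam / lam.sum(axis=1, keepdims=True)


def poisson_log_likelihood(counts, theta, lam, eps=1e-8):
    """Poisson log-likelihood of the counts, omitting the constant -log(count!) term.

    Assumed model (not taken from the paper): rate = theta @ lam.
    """
    rate = theta @ lam + eps
    return float(np.sum(counts * np.log(rate) - rate))


def gradient_step(counts, theta, lam, lr=1e-3, eps=1e-8):
    """One projected gradient-ascent step on the log-likelihood above."""
    rate = theta @ lam + eps
    ratio = counts / rate - 1.0          # derivative of the log-likelihood w.r.t. rate
    grad_theta = ratio @ lam.T           # shape: documents x topics
    grad_lam = theta.T @ ratio           # shape: topics x terms
    # Clip to keep every parameter strictly positive (the projection step).
    new_theta = np.maximum(theta + lr * grad_theta, eps)
    new_lam = np.maximum(lam + lr * grad_lam, eps)
    return new_theta, new_lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_docs, n_topics, n_terms = 20, 3, 50

    # Synthetic corpus drawn from the assumed model.
    true_theta = rng.gamma(2.0, 1.0, size=(n_docs, n_topics))
    true_lam = rng.gamma(2.0, 1.0, size=(n_topics, n_terms))
    counts = rng.poisson(true_theta @ true_lam)

    # Random positive initialization, then a few hundred small steps.
    theta = rng.gamma(2.0, 1.0, size=(n_docs, n_topics))
    lam = rng.gamma(2.0, 1.0, size=(n_topics, n_terms))
    print("log-likelihood before:", poisson_log_likelihood(counts, theta, lam))
    for _ in range(200):
        theta, lam = gradient_step(counts, theta, lam)
    print("log-likelihood after: ", poisson_log_likelihood(counts, theta, lam))

    # Each topic's rates normalize to a multinomial over terms.
    print("row sums of term probabilities:", normalize_topics(lam).sum(axis=1))
```

In this sketch each row of theta plays the role of a document's feature vector for a downstream task such as clustering. The fixed step size and the clipping are simplifications chosen for brevity.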

Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN: 9781450349185
DOI: 10.1145/3132847
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. poisson decomposition
  2. statistical testing
  3. text classification
  4. topic coherence
  5. topic model

Qualifiers

  • Research-article

Funding Sources

  • the National Natural Science Foundation of China
  • Australian Research Council Discovery Project

Conference

CIKM '17
Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2021) Effective interrelation of Bayesian nonparametric document clustering and embedded-topic modeling. Knowledge-Based Systems 234:C. DOI: 10.1016/j.knosys.2021.107591. Online publication date: 25-Dec-2021.
  • (2021) Lifelong topic modeling with knowledge-enhanced adversarial network. World Wide Web. DOI: 10.1007/s11280-021-00984-2. Online publication date: 23-Dec-2021.
  • (2019) Searching Activity Trajectories with Semantics. Journal of Computer Science and Technology 34:4 (775-794). DOI: 10.1007/s11390-019-1942-8. Online publication date: 19-Jul-2019.
  • (2018) Domestic Violence Crisis Identification From Facebook Posts Based on Deep Learning. IEEE Access 6 (54075-54085). DOI: 10.1109/ACCESS.2018.2871446. Online publication date: 2018.
  • (2018) Sentence-Based Topic Modeling Using Lexical Analysis. Emerging Technologies in Data Mining and Information Security (487-494). DOI: 10.1007/978-981-13-1501-5_42. Online publication date: 2-Sep-2018.
  • (2018) Text Mining and Real-Time Analytics of Twitter Data: A Case Study of Australian Hay Fever Prediction. Health Information Science (134-145). DOI: 10.1007/978-3-030-01078-2_12. Online publication date: 22-Sep-2018.
