Group matrix factorization for scalable topic modeling

Authors:

Hang Li

SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Pages 375 - 384

https://doi.org/10.1145/2348283.2348335

Published: 12 August 2012

Abstract

Topic modeling can reveal the latent structure of text data and is useful for knowledge discovery, search relevance ranking, document classification, and so on. One of the major challenges in topic modeling is to deal with large datasets and large numbers of topics in real-world applications. In this paper, we investigate techniques for scaling up the non-probabilistic topic modeling approaches such as RLSI and NMF. We propose a general topic modeling method, referred to as Group Matrix Factorization (GMF), to enhance the scalability and efficiency of the non-probabilistic approaches. GMF assumes that the text documents have already been categorized into multiple semantic classes, and there exist class-specific topics for each of the classes as well as shared topics across all classes. Topic modeling is then formalized as a problem of minimizing a general objective function with regularizations and/or constraints on the class-specific topics and shared topics. In this way, the learning of class-specific topics can be conducted in parallel, and thus the scalability and efficiency can be greatly improved. We apply GMF to RLSI and NMF, obtaining Group RLSI (GRLSI) and Group NMF (GNMF) respectively. Experiments on a Wikipedia dataset and a real-world web dataset, each containing about 3 million documents, show that GRLSI and GNMF can greatly improve RLSI and NMF in terms of scalability and efficiency. The topics discovered by GRLSI and GNMF are coherent and have good readability. Further experiments on a search relevance dataset, containing 30,000 labeled queries, show that the use of topics learned by GRLSI and GNMF can significantly improve search relevance.
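The grouping idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact objective or update rules, so everything below is an assumption made for illustration: a plain squared reconstruction loss with nonnegativity constraints (an NMF-style variant with no extra regularization), multiplicative updates in the style of Lee and Seung, and hypothetical names such as `group_nmf`, `update_class`, and `update_shared`. Each class p has a term-by-document matrix `D_p` that is approximated by `[U_shared, U_p] @ V_p`, where `U_shared` holds the shared topics and `U_p` the class-specific ones.

```python
import numpy as np

EPS = 1e-9  # guards against division by zero in the multiplicative updates


def update_class(D, U_shared, U_class, V):
    """Update one class's document codes and class-specific topics.

    The shared topics are held fixed, so this step touches only data
    belonging to a single class.
    """
    k_s = U_shared.shape[1]
    W = np.hstack([U_shared, U_class])
    V = V * (W.T @ D) / (W.T @ W @ V + EPS)
    V_c = V[k_s:]  # rows of V that weight the class-specific topics
    U_class = U_class * (D @ V_c.T) / ((W @ V) @ V_c.T + EPS)
    return U_class, V


def update_shared(Ds, U_shared, U_classes, Vs):
    """Update the shared topics by pooling statistics from every class."""
    k_s = U_shared.shape[1]
    num = np.zeros_like(U_shared)
    den = np.zeros_like(U_shared)
    for D, U_c, V in zip(Ds, U_classes, Vs):
        V_s = V[:k_s]  # rows of V that weight the shared topics
        recon = np.hstack([U_shared, U_c]) @ V
        num += D @ V_s.T
        den += recon @ V_s.T
    return U_shared * num / (den + EPS)


def loss(Ds, U_shared, U_classes, Vs):
    """Sum of squared reconstruction errors over all classes."""
    return sum(
        np.linalg.norm(D - np.hstack([U_shared, U_c]) @ V) ** 2
        for D, U_c, V in zip(Ds, U_classes, Vs)
    )


def group_nmf(Ds, k_shared, k_class, n_iter=50, seed=0):
    """Alternate per-class updates with a pooled shared-topic update."""
    rng = np.random.default_rng(seed)
    n_terms = Ds[0].shape[0]
    U_shared = rng.random((n_terms, k_shared))
    U_classes = [rng.random((n_terms, k_class)) for _ in Ds]
    Vs = [rng.random((k_shared + k_class, D.shape[1])) for D in Ds]
    history = [loss(Ds, U_shared, U_classes, Vs)]
    for _ in range(n_iter):
        # Each call depends on one class only, so these calls could be
        # dispatched to separate workers; a plain loop keeps the sketch short.
        results = [
            update_class(D, U_shared, U_c, V)
            for D, U_c, V in zip(Ds, U_classes, Vs)
        ]
        U_classes = [r[0] for r in results]
        Vs = [r[1] for r in results]
        U_shared = update_shared(Ds, U_shared, U_classes, Vs)
        history.append(loss(Ds, U_shared, U_classes, Vs))
    return U_shared, U_classes, Vs, history
```

Calling `group_nmf([D_1, D_2], k_shared=2, k_class=3)` on two nonnegative matrices with the same number of rows (terms) returns the shared topics, one class-specific topic matrix per class, the per-class document codes, and the loss after each iteration. The point of the sketch is the structure of the loop: only `update_shared` needs information from all classes, while the per-class step is independent across classes, which is the property the abstract credits for the gain in scalability.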

Index Terms

Group matrix factorization for scalable topic modeling
1. Information systems
  1. Information retrieval
    1. Document representation

Published In

SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

August 2012

1236 pages

ISBN: 9781450314725

DOI: 10.1145/2348283

General Chair: William Hersh (Oregon Health & Science University, USA)

Program Chairs: Jamie Callan (Carnegie Mellon University, USA), Yoelle Maarek (Yahoo! Research, Israel), Mark Sanderson (Royal Melbourne Institute of Technology, Australia)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '12

Sponsor:

SIGIR

SIGIR '12: The 35th International ACM SIGIR conference on research and development in Information Retrieval

August 12 - 16, 2012

Portland, Oregon, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
