DOI: 10.1145/2348283.2348335
research-article

Group matrix factorization for scalable topic modeling

Published: 12 August 2012

Abstract

Topic modeling can reveal the latent structure of text data and is useful for knowledge discovery, search relevance ranking, document classification, and similar tasks. A major challenge in real-world applications is handling large datasets and large numbers of topics. In this paper, we investigate techniques for scaling up non-probabilistic topic modeling approaches such as Regularized Latent Semantic Indexing (RLSI) and Non-negative Matrix Factorization (NMF). We propose a general topic modeling method, Group Matrix Factorization (GMF), to improve the scalability and efficiency of these approaches. GMF assumes that the text documents have already been categorized into multiple semantic classes, and that each class has its own class-specific topics in addition to topics shared across all classes. Topic modeling is then formalized as minimizing a general objective function with regularization and/or constraints on the class-specific and shared topics. Under this formulation, the class-specific topics of different classes can be learned in parallel, which greatly improves scalability and efficiency. We apply GMF to RLSI and NMF, obtaining Group RLSI (GRLSI) and Group NMF (GNMF), respectively. Experiments on a Wikipedia dataset and a real-world web dataset, each containing about 3 million documents, show that GRLSI and GNMF greatly improve on RLSI and NMF in scalability and efficiency. The topics discovered by GRLSI and GNMF are coherent and readable. Further experiments on a search relevance dataset containing 30,000 labeled queries show that using topics learned by GRLSI and GNMF can significantly improve search relevance.
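The split into shared and class-specific topics can be illustrated with a small NumPy sketch. This is not the algorithm from the paper: it assumes an unregularized squared-error objective, sum over classes p of ||D_p - [U0, U_p] V_p||_F^2 with non-negativity constraints, and substitutes plain multiplicative updates in the style of Lee and Seung for whatever optimizer and regularizers the authors use. The names `group_nmf`, `reconstruction_error`, `k_shared`, and `k_class` are hypothetical.

```python
import numpy as np


def group_nmf(docs_by_class, k_shared, k_class, n_iter=50, eps=1e-9, seed=0):
    """Illustrative sketch: factor each class matrix D_p (terms x docs)
    as [U0, U_p] @ V_p with all factors non-negative.

    U0 holds topics shared by every class, U_p holds topics specific to
    class p. Assumed objective (no regularization): sum of squared errors.
    """
    rng = np.random.default_rng(seed)
    n_terms = docs_by_class[0].shape[0]
    U0 = rng.random((n_terms, k_shared))
    Us = [rng.random((n_terms, k_class)) for _ in docs_by_class]
    Vs = [rng.random((k_shared + k_class, D.shape[1])) for D in docs_by_class]

    for _ in range(n_iter):
        # Class-level step: with U0 held fixed, each pass reads and writes
        # only class p's data, so these passes could run in parallel.
        for p, D in enumerate(docs_by_class):
            W = np.hstack([U0, Us[p]])
            Vs[p] *= (W.T @ D) / (W.T @ W @ Vs[p] + eps)
            Vp = Vs[p][k_shared:]  # coefficients on class-specific topics
            Us[p] *= (D @ Vp.T) / (W @ Vs[p] @ Vp.T + eps)

        # Shared step: sum per-class statistics (each terms x k_shared)
        # and update the shared topics once.
        num = np.zeros_like(U0)
        den = np.zeros_like(U0)
        for p, D in enumerate(docs_by_class):
            W = np.hstack([U0, Us[p]])
            V0 = Vs[p][:k_shared]  # coefficients on shared topics
            num += D @ V0.T
            den += W @ Vs[p] @ V0.T
        U0 *= num / (den + eps)
    return U0, Us, Vs


def reconstruction_error(docs_by_class, U0, Us, Vs):
    """Total squared Frobenius error over all classes."""
    return sum(
        np.linalg.norm(D - np.hstack([U0, U]) @ V) ** 2
        for D, U, V in zip(docs_by_class, Us, Vs)
    )


# Toy usage: two classes over a 30-term vocabulary (random data).
demo_rng = np.random.default_rng(1)
demo_docs = [demo_rng.random((30, 20)), demo_rng.random((30, 25))]
U0, Us, Vs = group_nmf(demo_docs, k_shared=2, k_class=3)
print(reconstruction_error(demo_docs, U0, Us, Vs))
```

In this sketch the only cross-class communication is the pair of small matrices accumulated in the shared step, which is the property that makes a per-class parallel implementation plausible; a real system would also need the regularization or constraints the abstract mentions.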


    Published In

    SIGIR '12: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval
    August 2012
    1236 pages
    ISBN: 9781450314725
    DOI: 10.1145/2348283

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. large scale
    2. matrix factorization
    3. topic modeling

    Qualifiers

    • Research-article

    Conference

    SIGIR '12

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
