DOI: 10.1145/2736277.2741682

A Scalable Asynchronous Distributed Algorithm for Topic Modeling

Published: 18 May 2015

Abstract

Learning meaningful topic models from massive document collections, which contain millions of documents and billions of tokens, is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both problems. To handle a large number of topics, we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time; moreover, when topic counts change, it can be updated in O(log T) time. To distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of Yun et al. (2014). We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems involving millions of documents, billions of words, and thousands of topics.
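The Fenwick-tree sampler the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: a binary indexed tree over item weights supporting O(log T) weight updates and O(log T) multinomial sampling by descending the implicit tree. The class name `FenwickSampler` and its interface are hypothetical.

```python
import random

class FenwickSampler:
    """Fenwick (binary indexed) tree over nonnegative weights.

    Supports O(log T) weight updates and O(log T) sampling of an
    index with probability proportional to its weight.
    """

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-indexed internal array
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, delta):
        """Add delta to the weight of item i in O(log T)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)  # climb to the next node covering index i

    def total(self):
        """Sum of all weights, i.e. the prefix sum up to n."""
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self, u=None):
        """Return index i with probability weight[i] / total(), in O(log T).

        Descends the implicit tree to find the smallest index whose
        cumulative weight exceeds u * total().
        """
        if u is None:
            u = random.random()
        target = u * self.total()
        pos = 0
        bitmask = 1
        while bitmask * 2 <= self.n:
            bitmask *= 2
        while bitmask > 0:
            nxt = pos + bitmask
            if nxt <= self.n and self.tree[nxt] < target:
                target -= self.tree[nxt]
                pos = nxt
            bitmask >>= 1
        return pos  # 0-indexed item
```

In a collapsed Gibbs sampler for LDA, `update` would be called twice per token (decrement the old topic's weight, increment the new one's), keeping both sampling and maintenance logarithmic in the number of topics T.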

References

[1] A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. In NIPS, pages 81--88, 2008.
[2] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In UAI, pages 27--34, 2009.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, Jan. 2003.
[4] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
[5] P. M. Fenwick. A new data structure for cumulative frequency tables. Software: Practice and Experience, 24(3):327--336, 1994.
[6] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In KDD, 2011.
[7] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012.
[8] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101:5228--5235, 2004.
[9] G. Heinrich. Parameter estimation for text analysis. Technical report, 2008.
[10] A. Ihler and D. Newman. Understanding errors in approximate distributed latent Dirichlet allocation. IEEE TKDE, 24(5):952--960, May 2012.
[11] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In KDD, 2014.
[12] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.
[13] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. Journal of Machine Learning Research, 10:1801--1828, 2009.
[14] M. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.
[15] B. Recht and C. Re. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201--226, June 2013.
[16] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1):703--710, 2010.
[17] P. Snyder. tmpfs: A virtual memory file system. In Proceedings of the Autumn 1990 European UNIX Users' Group Conference, 1990.
[18] M. D. Vose. A linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering, 17(9):972--975, 1991.
[19] A. J. Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software, 3(3):253--256, Sept. 1977.
[20] Y. Wang, H. Bai, M. Stanton, W. Chen, and E. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In International Conference on Algorithmic Aspects in Information and Management, 2009.
[21] C. K. Wong and M. C. Easton. An efficient method for weighted sampling without replacement. SIAM Journal on Computing, 9(1):111--113, 1980.
[22] F. Yan, N. Xu, and Y. Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, pages 2134--2142, 2009.
[23] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In KDD, 2009.
[24] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin. LibShortText: A library for short-text classification and analysis. Technical report, 2013.
[25] H. Yun, H.-F. Yu, C.-J. Hsieh, S. V. N. Vishwanathan, and I. S. Dhillon. NOMAD: Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. Proceedings of the VLDB Endowment, 7(11):975--986, 2014.



Published In

WWW '15: Proceedings of the 24th International Conference on World Wide Web
May 2015
1460 pages
ISBN:9781450334693

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland


Author Tags

  1. sampling
  2. scalability
  3. topic models

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

WWW '15 (Sponsor: IW3C2)

Acceptance Rates

WWW '15 Paper Acceptance Rate: 131 of 929 submissions, 14%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%


Cited By

  • Improving RAG Quality for Large Language Models with Topic-Enhanced Reranking. Artificial Intelligence Applications and Innovations, pages 74--87, 2024. DOI: 10.1007/978-3-031-63215-0_6
  • ezLDA: Efficient and Scalable LDA on GPUs. IEEE Access, 11:100165--100179, 2023. DOI: 10.1109/ACCESS.2023.3315239
  • Deterministic Inference of Topic Models via Maximal Latent State Replication. IEEE Transactions on Knowledge and Data Engineering, 34(4):1684--1695, 2022. DOI: 10.1109/TKDE.2020.3000559
  • PGeoTopic: A Distributed Solution for Mining Geographical Topic Models. IEEE Transactions on Knowledge and Data Engineering, 34(2):881--893, 2022. DOI: 10.1109/TKDE.2020.2989142
  • Distributed Latent Dirichlet Allocation on Streams. ACM Transactions on Knowledge Discovery from Data, 16(1):1--20, 2021. DOI: 10.1145/3451528
  • Sys-TM: A Fast and General Topic Modeling System. IEEE Transactions on Knowledge and Data Engineering, 33(6):2790--2802, 2021. DOI: 10.1109/TKDE.2019.2956518
  • Enabling Efficient and Scalable Service Search in IoT With Topic Modeling: An Evaluation. IEEE Access, 9:53452--53465, 2021. DOI: 10.1109/ACCESS.2021.3071009
  • Dynamic Parameter Allocation in Parameter Servers. Proceedings of the VLDB Endowment, 13(12):1877--1890, 2020. DOI: 10.14778/3407790.3407796
  • SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs. IEEE Transactions on Parallel and Distributed Systems, 31(9):2112--2124, 2020. DOI: 10.1109/TPDS.2020.2979702
  • Accelerating Stochastic Gradient Descent Based Matrix Factorization on FPGA. IEEE Transactions on Parallel and Distributed Systems, 31(8):1897--1911, 2020. DOI: 10.1109/TPDS.2020.2974744
