DOI: 10.1145/2736277.2741682

A Scalable Asynchronous Distributed Algorithm for Topic Modeling

Published: 18 May 2015

Abstract

Learning meaningful topic models from massive document collections, which contain millions of documents and billions of tokens, is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both problems. To handle a large number of topics, we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time; moreover, when topic counts change, it can be updated in O(log T) time. To distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of Yun et al. (2014). We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems involving millions of documents, billions of words, and thousands of topics.
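The Fenwick-tree sampler the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: a binary indexed tree over item weights supporting O(log T) weight updates and O(log T) multinomial sampling by descending the implicit tree. The class name `FenwickSampler` and its interface are hypothetical.

```python
import random

class FenwickSampler:
    """Fenwick (binary indexed) tree over nonnegative weights.

    Supports O(log T) weight updates and O(log T) sampling of an
    index with probability proportional to its weight.
    """

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-indexed internal array
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, delta):
        """Add delta to the weight of item i in O(log T)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)  # climb to the next node covering index i

    def total(self):
        """Sum of all weights, i.e. the prefix sum up to n."""
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self, u=None):
        """Return index i with probability weight[i] / total(), in O(log T).

        Descends the implicit tree to find the smallest index whose
        cumulative weight exceeds u * total().
        """
        if u is None:
            u = random.random()
        target = u * self.total()
        pos = 0
        bitmask = 1
        while bitmask * 2 <= self.n:
            bitmask *= 2
        while bitmask > 0:
            nxt = pos + bitmask
            if nxt <= self.n and self.tree[nxt] < target:
                target -= self.tree[nxt]
                pos = nxt
            bitmask >>= 1
        return pos  # 0-indexed item
```

In a collapsed Gibbs sampler for LDA, `update` would be called twice per token (decrement the old topic's weight, increment the new one's), keeping both sampling and maintenance logarithmic in the number of topics T.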

References

[1] A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. In NIPS, pages 81--88, 2008.
[2] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In UAI, pages 27--34, 2009.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, Jan. 2003.
[4] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
[5] P. M. Fenwick. A new data structure for cumulative frequency tables. Software: Practice and Experience, 24(3):327--336, 1994.
[6] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In KDD, 2011.
[7] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012.
[8] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101:5228--5235, 2004.
[9] G. Heinrich. Parameter estimation for text analysis. Technical report, 2008.
[10] A. Ihler and D. Newman. Understanding errors in approximate distributed latent Dirichlet allocation. IEEE TKDE, 24(5):952--960, May 2012.
[11] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In KDD, 2014.
[12] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.
[13] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. Journal of Machine Learning Research, 10:1801--1828, 2009.
[14] M. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.
[15] B. Recht and C. Re. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201--226, June 2013.
[16] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1):703--710, 2010.
[17] P. Snyder. tmpfs: A virtual memory file system. In Proceedings of the Autumn 1990 European UNIX Users' Group Conference, 1990.
[18] M. D. Vose. A linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering, 17(9):972--975, 1991.
[19] A. J. Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software, 3(3):253--256, Sept. 1977.
[20] Y. Wang, H. Bai, M. Stanton, W. Chen, and E. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In International Conference on Algorithmic Aspects in Information and Management, 2009.
[21] C. K. Wong and M. C. Easton. An efficient method for weighted sampling without replacement. SIAM Journal on Computing, 9(1):111--113, 1980.
[22] F. Yan, N. Xu, and Y. Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, pages 2134--2142, 2009.
[23] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In KDD, 2009.
[24] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin. LibShortText: A library for short-text classification and analysis. Technical report, 2013.
[25] H. Yun, H.-F. Yu, C.-J. Hsieh, S. V. N. Vishwanathan, and I. S. Dhillon. NOMAD: Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. Proceedings of the VLDB Endowment, 7(11):975--986, 2014.



Published In

WWW '15: Proceedings of the 24th International Conference on World Wide Web
May 2015
1460 pages
ISBN:9781450334693

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland


Author Tags

  1. sampling
  2. scalability
  3. topic models

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

WWW '15 (Sponsor: IW3C2)

Acceptance Rates

WWW '15 Paper Acceptance Rate: 131 of 929 submissions, 14%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%


Cited By

  • Improving RAG Quality for Large Language Models with Topic-Enhanced Reranking. Artificial Intelligence Applications and Innovations, pages 74--87, 2024. DOI: 10.1007/978-3-031-63215-0_6
  • ezLDA: Efficient and Scalable LDA on GPUs. IEEE Access, 11:100165--100179, 2023. DOI: 10.1109/ACCESS.2023.3315239
  • Deterministic Inference of Topic Models via Maximal Latent State Replication. IEEE Transactions on Knowledge and Data Engineering, 34(4):1684--1695, 2022. DOI: 10.1109/TKDE.2020.3000559
  • PGeoTopic: A Distributed Solution for Mining Geographical Topic Models. IEEE Transactions on Knowledge and Data Engineering, 34(2):881--893, 2022. DOI: 10.1109/TKDE.2020.2989142
  • Distributed Latent Dirichlet Allocation on Streams. ACM Transactions on Knowledge Discovery from Data, 16(1):1--20, 2021. DOI: 10.1145/3451528
  • Sys-TM: A Fast and General Topic Modeling System. IEEE Transactions on Knowledge and Data Engineering, 33(6):2790--2802, 2021. DOI: 10.1109/TKDE.2019.2956518
  • Enabling Efficient and Scalable Service Search in IoT With Topic Modeling: An Evaluation. IEEE Access, 9:53452--53465, 2021. DOI: 10.1109/ACCESS.2021.3071009
  • Dynamic Parameter Allocation in Parameter Servers. Proceedings of the VLDB Endowment, 13(12):1877--1890, 2020. DOI: 10.14778/3407790.3407796
  • SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs. IEEE Transactions on Parallel and Distributed Systems, 31(9):2112--2124, 2020. DOI: 10.1109/TPDS.2020.2979702
  • Accelerating Stochastic Gradient Descent Based Matrix Factorization on FPGA. IEEE Transactions on Parallel and Distributed Systems, 31(8):1897--1911, 2020. DOI: 10.1109/TPDS.2020.2974744
