Abstract
Learning topic information from large-scale unstructured text has attracted extensive attention from both academia and industry. Topic models, such as LDA and its variants, are a popular machine learning technique for discovering such latent structure. Among them, the online variational hierarchical Dirichlet process (onlineHDP) is a promising candidate for dynamically processing streaming text: instead of being fixed in advance, the number of topics in onlineHDP is inferred from the corpus as training proceeds. However, when dealing with large-scale streaming data, onlineHDP still suffers from limited model capacity. To this end, we propose a distributed version of the onlineHDP algorithm, named DistHDP. The training task is split into many sub-batch tasks that are distributed across multiple worker nodes, so that the whole training process is accelerated, and model convergence is guaranteed through a distributed variational inference algorithm. Extensive experiments on several real-world datasets demonstrate the usability and scalability of the proposed algorithm.
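The paper's implementation is not reproduced on this page, so the following is only a minimal sketch of the data-parallel scheme the abstract describes: each worker runs local inference on its sub-batch and returns sufficient statistics, and a driver merges them into one stochastic natural-gradient update of the global topic-word parameter lambda. All names (local_stats, train_step), the placeholder E-step, and the constants are illustrative assumptions, not the authors' DistHDP code.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# Hypothetical data-parallel sketch of the scheme the abstract describes:
# workers compute sufficient statistics on their sub-batches; the driver
# merges them and takes one stochastic natural-gradient step on the global
# topic-word variational parameter lambda. Constants are assumptions.

V, K = 1_000, 50          # vocabulary size and topic truncation (assumed)
ETA = 0.01                # topic Dirichlet prior (assumed)
D = 100_000               # assumed corpus size used to rescale a batch
TAU0, KAPPA = 64.0, 0.7   # step-size schedule rho_t = (TAU0 + t) ** -KAPPA

def local_stats(args):
    """Worker task: expected topic-word counts for one sub-batch.
    A real worker would run per-document variational inference; the
    uniform responsibilities below are only a placeholder E-step."""
    docs, lam = args
    stats = np.zeros_like(lam)                      # shape (K, V)
    for word_ids, counts in docs:                   # word_ids assumed unique
        phi = np.full((len(word_ids), K), 1.0 / K)  # placeholder phi_{dnk}
        stats[:, word_ids] += (phi * counts[:, None]).T
    return stats

def train_step(t, lam, sub_batches, batch_size):
    """Driver: scatter sub-batches, merge worker statistics, and apply
    lambda <- (1 - rho) * lambda + rho * lambda_hat."""
    with ProcessPoolExecutor() as pool:
        parts = pool.map(local_stats, [(b, lam) for b in sub_batches])
    stats = sum(parts)
    rho = (TAU0 + t) ** (-KAPPA)
    lam_hat = ETA + (D / batch_size) * stats        # rescale batch to corpus
    return (1.0 - rho) * lam + rho * lam_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lam = rng.gamma(1.0, 1.0, size=(K, V))          # initial lambda
    docs = [(rng.choice(V, size=8, replace=False),
             rng.integers(1, 4, size=8).astype(float)) for _ in range(64)]
    sub_batches = [docs[i::4] for i in range(4)]    # split across 4 workers
    lam = train_step(t=0, lam=lam, sub_batches=sub_batches,
                     batch_size=len(docs))
    print("updated lambda:", lam.shape)
```

One appeal of this style of scheme is that workers exchange only K x V statistic matrices rather than raw documents, so per-step communication cost is independent of the sub-batch size.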
Change history
20 August 2019
The book was inadvertently published with an uncorrected version. The following corrections should have been carried out before publication:

1. Page 42: The sentence “By computing the natural gradient, corpus-level parameters are updated according to Eq. (Error! Reference source not found)–Eq. (17).” should read “By computing the natural gradient, corpus-level parameters are updated according to Eqs. (17)–(19).”
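For context only: the corrected sentence concerns the corpus-level natural-gradient updates, whose Eqs. (17)–(19) are not reproduced on this page. In the standard onlineHDP formulation (Wang, Paisley, and Blei, 2011), such updates take the stochastic form below; the notation (step size ρ_t, topic-word parameter λ, mini-batch S_t, corpus size D) follows that standard formulation and is an assumption, not a quotation of the paper.

```latex
% Assumed form of the corpus-level natural-gradient updates, following
% Wang, Paisley & Blei (2011); the paper's Eqs. (17)-(19) may differ in detail.
\[
  \rho_t = (\tau_0 + t)^{-\kappa}, \qquad \kappa \in (0.5, 1],
\]
\[
  \lambda_{kw}^{(t)} = (1 - \rho_t)\,\lambda_{kw}^{(t-1)}
    + \rho_t \Bigl( \eta + \frac{D}{|S_t|}
      \sum_{d \in S_t} \mathbb{E}_q\!\left[ n_{dkw} \right] \Bigr),
\]
% with analogous convex combinations for the corpus-level
% stick-breaking Beta parameters (u_k, v_k).
```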
Acknowledgments
This work is sponsored by the National Key R&D Program of China [grant number 2018YFB0204300].
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Y., Feng, D., Lu, M., Li, D. (2019). A Distributed Topic Model for Large-Scale Streaming Text. In: Douligeris, C., Karagiannis, D., Apostolou, D. (eds) Knowledge Science, Engineering and Management. KSEM 2019. Lecture Notes in Computer Science, vol 11776. Springer, Cham. https://doi.org/10.1007/978-3-030-29563-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29562-2
Online ISBN: 978-3-030-29563-9
eBook Packages: Computer Science (R0)