Research Article | Open Access | DOI: 10.1145/3292500.3330837

Scaling Multinomial Logistic Regression via Hybrid Parallelism

Published: 25 July 2019

Abstract

We study the problem of scaling Multinomial Logistic Regression (MLR) to datasets with a very large number of data points and a large number of classes. At a scale where neither the data nor the parameters fit on a single machine, we argue that simultaneous data and model parallelism (Hybrid Parallelism) is inevitable. The key challenge to achieving this form of parallelism in MLR is the log-partition function, which must be computed across all K classes for every data point, making model parallelism non-trivial. To overcome this problem, we propose a reformulation of the original objective that exploits double-separability, an attractive property that naturally leads to hybrid parallelism. Our algorithm (DS-MLR) is asynchronous and completely decentralized, requiring minimal communication across workers while keeping both data and parameter workloads partitioned. Unlike standard data-parallel approaches, DS-MLR avoids bulk synchronization by maintaining local normalization terms on each worker and accumulating them incrementally using a token-ring topology. We demonstrate the versatility of DS-MLR under various data- and model-parallelism scenarios through an empirical study on real-world datasets. In particular, to demonstrate scaling via hybrid parallelism, we created a new benchmark dataset (Reddit-Full) by pre-processing 1.7 billion Reddit user comments spanning the period 2007-2015. We used DS-MLR to solve an extreme multi-class classification problem: classifying 211 million data points into their corresponding subreddits. Reddit-Full is a massive dataset, with the data occupying 228 GB and the 44 billion parameters occupying 358 GB. To the best of our knowledge, no other existing method can handle MLR in this setting.
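For background, the regularized MLR objective the abstract refers to can be written in its standard textbook form (this is general background, not an equation reproduced from the paper). The log-sum-exp term is the log-partition function: it couples every class weight w_k to every data point x_i, which is exactly why sharding the K classes across machines is non-trivial.

```latex
% Background form of the regularized multinomial logistic regression objective.
% W = [w_1, ..., w_K] are the class weight vectors, (x_i, y_i) the N data points.
\min_{W}\;
  \frac{\lambda}{2}\lVert W\rVert_F^2
  \;-\; \frac{1}{N}\sum_{i=1}^{N}
  \Bigl[\, w_{y_i}^{\top} x_i
     \;-\; \underbrace{\log \sum_{k=1}^{K} \exp\bigl(w_k^{\top} x_i\bigr)}_{\text{log-partition function}} \,\Bigr]
```

The token-ring accumulation of local normalization terms can be illustrated with a minimal single-process sketch. Everything below (the function names, the list-of-shards simulation of workers) is hypothetical and is not the authors' DS-MLR implementation: each worker holds only its shard of the K class vectors, computes a partial sum of exp(w_k^T x_i) over that shard, and a token carrying the running per-example sums travels around the ring, so no worker ever needs all K class vectors at once.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def local_partial_sums(X, W_shard):
    """Partial normalizers: sum of exp(w . x) over this worker's class shard only."""
    return [sum(math.exp(dot(w, x)) for w in W_shard) for x in X]

def ring_accumulate(X, shards):
    """Simulate one pass of the token around the ring.

    `shards` is a list of per-worker class-weight shards that together
    cover all K classes.  The token carries one running sum per data
    point and, after visiting every worker once, holds the complete
    normalization term sum_{k=1}^{K} exp(w_k . x_i) for each example.
    """
    token = [0.0] * len(X)
    for W_shard in shards:                        # token hops worker to worker
        partial = local_partial_sums(X, W_shard)  # uses only the local shard
        token = [t + p for t, p in zip(token, partial)]
    return token

# Toy usage: 3 data points, K = 4 classes split across 2 "workers".
X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.1]]
Z = ring_accumulate(X, [W[:2], W[2:]])
log_partition = [math.log(z) for z in Z]  # per-example log-partition values
```

Once the full normalizers are available, each worker can form exact softmax gradients for its own classes and its own data shard. In the paper this accumulation is interleaved with asynchronous, decentralized updates rather than run as a blocking all-reduce; that interleaving is the part this sketch deliberately omits.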

References

[1]
R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pages 13--24. ACM, 2013.
[2]
D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1--38):3, 2011.
[3]
L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 2010.
[4]
G. Bouchard. Efficient bounds for the softmax function, applications to inference in hybrid models. 2007.
[5]
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1--122, 2011.
[6]
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, England, 2004.
[7]
C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, K. Olukotun, and A. Y. Ng. Map-reduce for machine learning on multicore. In Advances in neural information processing systems, pages 281--288, 2007.
[8]
J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, et al. The youtube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems, pages 293--296. ACM, 2010.
[9]
R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. ACM, 2011.
[10]
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[11]
S. Gopal and Y. Yang. Distributed training of large-scale logistic models. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 289--297, 2013.
[12]
J. Z. HaoChen and S. Sra. Random shuffling beats sgd after finite epochs. arXiv preprint arXiv:1806.10077, 2018.
[13]
H. Jain, Y. Prabhu, and M. Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935--944. ACM, 2016.
[14]
M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. Smola. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, 2013.
[15]
A. Nedić and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications, pages 223--264. Springer, 2001.
[16]
Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[17]
J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 2nd edition, 2006.
[18]
B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693--701, 2011.
[19]
H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400--407, 1951.
[20]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211--252, 2015.
[21]
O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46--54, 2016.
[22]
P. Watcharapichat, V. L. Morales, R. C. Fernandez, and P. Pietzuch. Ako: Decentralised deep learning with partial gradient exchange. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 84--97. ACM, 2016.
[23]
L. Xiao, A. W. Yu, Q. Lin, and W. Chen. Dscovr: Randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. arXiv preprint arXiv:1710.05080, 2017.
[24]
P. Xie, J. K. Kim, Y. Zhou, Q. Ho, A. Kumar, Y. Yu, and E. P. Xing. Distributed machine learning via sufficient factor broadcasting. CoRR, 2015.
[25]
E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: a new platform for distributed machine learning on big data. Big Data, IEEE Transactions on, 2015.
[26]
I. E.-H. Yen, X. Huang, P. Ravikumar, K. Zhong, and I. Dhillon. Pd-sparse : A primal and dual sparse approach to extreme multiclass and multilagrawal2013multoabel classification. In Proceedings of The 33rd International Conference on Machine Learning, pages 3069--3077, 2016.
[27]
H. Yun. Doubly Separable Models. PhD thesis, Purdue University West Lafayette, 2014.
[28]
H. Yun, P. Raman, and S. Vishwanathan. Ranking via robust binary classification. In Advances in Neural Information Processing Systems, 2014.
[29]
H. Yun, H.-F. Yu, C.-J. Hsieh, S. Vishwanathan, and I. Dhillon. Nomad: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. 2013.
[30]
X. Zhang, J. Liu, and Z. Zhu. Taming convergence for asynchronous stochastic gradient descent with unbounded delay in non-convex learning. arXiv preprint arXiv:1805.09470, 2018.
[31]
Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A fast parallel stochastic gradient method for matrix factorization in shared memory systems. 2013.
[32]
M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595--2603, 2010.

Published In

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
3305 pages
ISBN: 9781450362016
DOI: 10.1145/3292500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. large scale machine learning
  2. multinomial logistic regression
  3. stochastic optimization

Acceptance Rates

KDD '19 Paper Acceptance Rate: 110 of 1,200 submissions (9%)
Overall Acceptance Rate: 1,133 of 8,635 submissions (13%)

Article Metrics

  • Downloads (last 12 months): 120
  • Downloads (last 6 weeks): 26

Reflects downloads up to 22 Sep 2024
Cited By

  • (2024) LASCA: A Large-Scale Stable Customer Segmentation Approach to Credit Risk Assessment. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5006-5017. DOI: 10.1145/3637528.3671550. Online publication date: 25-Aug-2024.
  • (2022) Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3288-3298. DOI: 10.1145/3534678.3539070. Online publication date: 14-Aug-2022.
  • (2021) Just move it! Proceedings of the VLDB Endowment, 14(12), pp. 2707-2710. DOI: 10.14778/3476311.3476325. Online publication date: 28-Oct-2021.
  • (2020) Dynamic Parameter Allocation in Parameter Servers. Proceedings of the VLDB Endowment, 13(12), pp. 1877-1890. DOI: 10.14778/3407790.3407796. Online publication date: 14-Sep-2020.
  • (2020) Classification of Big Data Using Spark Framework. Proceedings of the Fourth International Conference on Microelectronics, Computing and Communication Systems, pp. 847-854. DOI: 10.1007/978-981-15-5546-6_70. Online publication date: 20-Sep-2020.
