DOI: 10.1145/2716282.2716289

Stochastic gradient descent on GPUs

Published: 07 February 2015

Abstract

Irregular algorithms such as Stochastic Gradient Descent (SGD) can benefit from the massive parallelism available on GPUs. However, unlike in data-parallel algorithms, synchronization patterns in SGD are quite complex. Furthermore, scheduling for scale-free graphs is challenging. This work examines several synchronization strategies for SGD, ranging from simple locking to conflict-free scheduling. We observe that static schedules do not yield better performance despite eliminating the need to perform conflict detection and resolution at runtime. We identify the source of the performance degradation to be the structure of certain parts of the graph (dense vs sparse). This classification can be used to devise hybrid scheduling strategies which exploit different schedules for different regions of the graph to obtain better performance. We found that the best schedule for some problems can be up to two orders of magnitude faster than the worst one. To evaluate the performance of our GPU implementation, we also compare against a CPU implementation of SGD. Dynamic schedules perform comparably to a 14-thread CPU implementation, while a static schedule performs comparably to a 6-thread CPU implementation.
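
To make the scheduling trade-off concrete, below is a minimal CUDA sketch of how an edge-locked dynamic schedule might look for matrix-completion SGD, a typical workload for this kind of study. It is illustrative only, not the authors' implementation: the kernel name sgd_edge_locked, the Edge layout, the latent dimension NUM_FACTORS, and the defer-and-retry policy for busy locks are all assumptions made for this example. Each thread owns one rating edge, tries to lock both endpoint rows, applies the update if it succeeds, and otherwise appends the edge to a retry list (spinning on a lock inside a warp risks livelock).

// A minimal sketch (not the authors' implementation) of an edge-locked
// dynamic schedule for matrix-completion SGD. P and Q hold the user and item
// latent-factor rows; user_lock/item_lock hold one int lock per row; the
// deferred array must be sized to hold num_edges entries.
#include <cuda_runtime.h>

#define NUM_FACTORS 16                        /* latent dimension K (assumed) */

struct Edge { int user; int item; float rating; };

__device__ bool try_lock(int *lock) { return atomicCAS(lock, 0, 1) == 0; }
__device__ void unlock(int *lock)   { atomicExch(lock, 0); }

__global__ void sgd_edge_locked(const Edge *edges, int num_edges,
                                float *P, float *Q,
                                int *user_lock, int *item_lock,
                                Edge *deferred, int *num_deferred,
                                float lr, float reg)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;

    Edge ed = edges[e];
    float *p = P + ed.user * NUM_FACTORS;
    float *q = Q + ed.item * NUM_FACTORS;

    // Runtime conflict detection: take both endpoint locks or defer the edge.
    if (!try_lock(&user_lock[ed.user])) {
        deferred[atomicAdd(num_deferred, 1)] = ed;
        return;
    }
    if (!try_lock(&item_lock[ed.item])) {
        unlock(&user_lock[ed.user]);
        deferred[atomicAdd(num_deferred, 1)] = ed;
        return;
    }

    // Both rows are exclusively owned: apply one SGD step.
    float err = ed.rating;
    for (int k = 0; k < NUM_FACTORS; ++k) err -= p[k] * q[k];
    for (int k = 0; k < NUM_FACTORS; ++k) {
        float pk = p[k], qk = q[k];
        p[k] = pk + lr * (err * qk - reg * pk);
        q[k] = qk + lr * (err * pk - reg * qk);
    }

    __threadfence();   // publish the updated rows before releasing the locks
    unlock(&item_lock[ed.item]);
    unlock(&user_lock[ed.user]);
}

The host would relaunch the kernel on the deferred list until it is empty, so conflict detection and resolution happen at runtime. A static, conflict-free schedule removes that machinery: the edges are grouped offline (e.g., by repeated matchings or an edge coloring) so that no two edges in a group share a user or an item, and one lock-free kernel is launched per group. A sketch, reusing Edge and NUM_FACTORS from above and with hypothetical host-side names (edges_by_color, color_size):

// Static, conflict-free alternative (again a sketch): within one color class
// no two edges share a user or an item, so no locks or atomics are needed.
__global__ void sgd_conflict_free(const Edge *edges, int num_edges,
                                  float *P, float *Q, float lr, float reg)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;

    Edge ed = edges[e];
    float *p = P + ed.user * NUM_FACTORS;
    float *q = Q + ed.item * NUM_FACTORS;

    float err = ed.rating;
    for (int k = 0; k < NUM_FACTORS; ++k) err -= p[k] * q[k];
    for (int k = 0; k < NUM_FACTORS; ++k) {
        float pk = p[k], qk = q[k];
        p[k] = pk + lr * (err * qk - reg * pk);
        q[k] = qk + lr * (err * pk - reg * qk);
    }
}

// Host side (hypothetical names): one lock-free launch per color class.
// for (int c = 0; c < num_colors; ++c)
//     sgd_conflict_free<<<(color_size[c] + 255) / 256, 256>>>(
//         edges_by_color[c], color_size[c], d_P, d_Q, lr, reg);

One plausible reading of the abstract's finding that static schedules nonetheless lose to dynamic ones is visible here: on a scale-free graph the dense regions force many small color classes, so some launches in the per-color loop carry too little work to keep the GPU busy.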

Information

Published In

GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs
February 2015
120 pages
ISBN: 9781450334075
DOI: 10.1145/2716282
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 February 2015

Author Tags

  1. Edge-coloring
  2. GPGPU
  3. Stochastic Gradient Descent

Qualifiers

  • Research-article

Conference

GPGPU-8

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%

Cited By

  • (2023) STADIA: Photonic Stochastic Gradient Descent for Neural Network Accelerators. ACM Transactions on Embedded Computing Systems, 22(5s), 1–23. DOI: 10.1145/3607920. Online publication date: 31-Oct-2023.
  • (2022) Formulating Parallel Supervised Machine Learning Designs For Anomaly-Based Network Intrusion Detection in Resource Constrained Use Cases. 2022 IEEE 19th International Conference on Mobile Ad Hoc and Smart Systems (MASS), 748–753. DOI: 10.1109/MASS56207.2022.00117. Online publication date: Oct-2022.
  • (2021) BALS: Blocked Alternating Least Squares for Parallel Sparse Matrix Factorization on GPUs. IEEE Transactions on Parallel and Distributed Systems, 32(9), 2291–2302. DOI: 10.1109/TPDS.2021.3064942. Online publication date: 1-Sep-2021.
  • (2021) Deadline-Aware Offloading for High-Throughput Accelerators. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 479–492. DOI: 10.1109/HPCA51647.2021.00048. Online publication date: Feb-2021.
  • (2021) R-JaunLab: Automatic Multi-Class Recognition of Jaundice on Photos of Subjects with Region Annotation Networks. Journal of Digital Imaging. DOI: 10.1007/s10278-021-00432-7. Online publication date: 25-Feb-2021.
  • (2020) Estimation of Constant Speed Time for Railway Vehicles by Stochastic Gradient Descent Algorithm. Sakarya University Journal of Computer and Information Sciences, 3(3), 355–365. DOI: 10.35377/saucis.03.03.805598. Online publication date: 30-Dec-2020.
  • (2020) Research of Watermelon Disease Detection Based on Deep Learning. International Journal of Pattern Recognition and Artificial Intelligence. DOI: 10.1142/S0218001421520042. Online publication date: 15-Oct-2020.
  • (2020) Accelerating Stochastic Gradient Descent Based Matrix Factorization on FPGA. IEEE Transactions on Parallel and Distributed Systems, 31(8), 1897–1911. DOI: 10.1109/TPDS.2020.2974744. Online publication date: 1-Aug-2020.
  • (2019) Fast Fine-Grained Global Synchronization on GPUs. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 793–806. DOI: 10.1145/3297858.3304055. Online publication date: 4-Apr-2019.
  • (2019) Stochastic Gradient Descent on Modern Hardware: Multi-core CPU or GPU? Synchronous or Asynchronous? 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1063–1072. DOI: 10.1109/IPDPS.2019.00113. Online publication date: May-2019.
