
Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective

Published: 29 September 2023

Abstract

Deep learning (DL) has become a key component of modern software. In the “big model” era, the rich features of DL-based software (i.e., DL software) rely substantially on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained on powerful cloud infrastructure with large datasets. Hence, training effective DL models has become a vital stage in the whole software lifecycle. When training DL models, especially big models, developers need to parallelize and distribute computation and memory resources across multiple devices (e.g., a cluster of GPUs), a practice known as distributed deep learning training, or distributed training for short. However, the unique challenges that developers encounter in the distributed training process have not been studied in the software engineering community. Given the increasingly heavy dependence of current DL-based software on distributed training, this paper aims to fill this knowledge gap and presents the first comprehensive study on developers’ issues in distributed training. To this end, we focus on popular DL frameworks that support distributed training (including TensorFlow, PyTorch, Keras, and Horovod) and analyze 1,131 real-world issues about using these frameworks that developers reported on Stack Overflow and GitHub. We construct a fine-grained taxonomy consisting of 30 categories of fault symptoms and summarize common fix patterns for different symptoms. We find that: (1) many distributed-specific faults and non-distributed-specific faults inherently share the same fault symptoms, which makes them challenging to debug; (2) most fault symptoms have frequent fix patterns; (3) about half of the faults are related to system-level configurations. Based on the results, we suggest actionable implications for research avenues that can facilitate distributed training in developing DL-based software, such as focusing on the frequent and common fix patterns when designing testing or debugging tools, developing efficient testing and debugging techniques for communication configuration together with the synthesis of network configuration analysis, designing new multi-device checkpoint-and-replay techniques to help reproduction, and designing serverless APIs for cloud platforms.
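To make the setting concrete for readers less familiar with distributed training, the sketch below shows a minimal single-node, multi-GPU data-parallel training job written against PyTorch’s DistributedDataParallel (DDP) API, one of the frameworks covered in this study. It is an illustrative sketch only: the toy linear model, the synthetic batches, the MASTER_ADDR/MASTER_PORT rendezvous values, and the checkpoint file name are placeholder assumptions, not artifacts of the paper.

    # A minimal sketch (assumptions noted above) of single-node, multi-GPU
    # data-parallel training with PyTorch DistributedDataParallel (DDP).
    # Assumes at least one CUDA GPU and the NCCL backend are available.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP

    def worker(rank, world_size):
        # Each process binds to one GPU and joins the process group;
        # MASTER_ADDR/MASTER_PORT are placeholder rendezvous settings.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = torch.nn.Linear(32, 2).cuda(rank)     # toy placeholder model
        ddp_model = DDP(model, device_ids=[rank])     # wraps model; syncs gradients via all-reduce
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.CrossEntropyLoss()

        for step in range(10):                        # placeholder loop over synthetic batches
            inputs = torch.randn(64, 32, device=f"cuda:{rank}")
            labels = torch.randint(0, 2, (64,), device=f"cuda:{rank}")
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), labels)
            loss.backward()                           # gradient all-reduce happens here
            optimizer.step()

        if rank == 0:
            # Save the underlying module's weights from one rank only, so the
            # checkpoint has no "module." prefix and is written exactly once.
            torch.save(ddp_model.module.state_dict(), "ckpt.pt")

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()        # one process per visible GPU
        mp.spawn(worker, args=(world_size,), nprocs=world_size)

As a design note, writing the checkpoint only on rank 0 and from ddp_model.module is a common convention that avoids redundant writes and the “module.” key prefix when the weights are later restored outside DDP.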


Cited By

  • (2024) Bias Behind the Wheel: Fairness Testing of Autonomous Driving Systems. ACM Transactions on Software Engineering and Methodology. DOI: 10.1145/3702989. Online publication date: 2-Nov-2024.
  • (2024) Streamlining Cloud-Native Application Development and Deployment with Robust Encapsulation. Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 847-865. DOI: 10.1145/3698038.3698552. Online publication date: 20-Nov-2024.

Published In

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 6
November 2023
949 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/3625557
  • Editor: Mauro Pezzè

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 September 2023
Online AM: 13 May 2023
Accepted: 17 April 2023
Revised: 14 February 2023
Received: 23 June 2022
Published in TOSEM Volume 32, Issue 6


Author Tags

  1. Empirical study
  2. distributed training
  3. software engineering

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • National Natural Science Fund for the Excellent Young Scientists Fund Program
  • Center for Data Space Technology and System, Peking University
  • ERC Advanced Grant

