DOI: 10.1145/3617232.3624863
Open access

Training Job Placement in Clusters with Statistical In-Network Aggregation

Published: 17 April 2024

Abstract

In-Network Aggregation (INA) offloads gradient aggregation in distributed training (DT) onto programmable switches, where switch memory can be allocated to jobs in either synchronous or statistical multiplexing mode. Statistical INA has advantages in switch memory utilization, control-plane simplicity, and management safety, but it raises the problem of cross-layer resource efficiency in job placement. This paper presents NetPack, a job placement system for clusters with statistical INA that aims to maximize the utilization of both computation and network resources. NetPack periodically batches jobs and places them into the cluster. When placing a job, NetPack runs a steady-state estimation algorithm to determine the resources available in the cluster, heuristically values each server according to its available resources (GPUs and bandwidth), and runs a dynamic programming algorithm to efficiently find the set of servers with the highest value for the job. Experiments with our NetPack prototype on production traces demonstrate that NetPack outperforms prior job placement methods by 45% in terms of average job completion time.
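
As a concrete illustration of the server-search step the abstract describes, the sketch below shows a knapsack-style dynamic program that picks the set of servers with the highest heuristic value for a job. This is a minimal sketch under our own assumptions, not NetPack's implementation: the Server fields, the linear value_server() weighting of GPUs and bandwidth, and the assumption that a chosen server contributes all of its free GPUs are illustrative.

```python
# Illustrative sketch of knapsack-style server selection; NOT NetPack's code.
# Server fields, value weights, and the "a chosen server contributes all of
# its free GPUs" assumption are ours for illustration.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    free_gpus: int       # GPUs left after the steady-state estimation step
    free_bw_gbps: float  # residual uplink bandwidth (Gbps), likewise estimated

def value_server(s: Server, gpu_w: float = 1.0, bw_w: float = 0.1) -> float:
    # Heuristic value from the two resources the paper names (GPU, bandwidth);
    # the linear form and the weights are assumptions.
    return gpu_w * s.free_gpus + bw_w * s.free_bw_gbps

def select_servers(servers: list[Server], gpus_needed: int) -> list[Server] | None:
    # 0/1-knapsack dynamic program: choose a subset of servers whose free GPUs
    # sum to exactly `gpus_needed`, maximizing the total heuristic value.
    NEG = float("-inf")
    best = [NEG] * (gpus_needed + 1)  # best[g]: max value covering g GPUs
    best[0] = 0.0
    pick: list[list[int]] = [[] for _ in range(gpus_needed + 1)]
    for i, s in enumerate(servers):
        if s.free_gpus <= 0:
            continue
        v = value_server(s)
        # Iterate g downwards so each server is used at most once.
        for g in range(gpus_needed, s.free_gpus - 1, -1):
            if best[g - s.free_gpus] + v > best[g]:
                best[g] = best[g - s.free_gpus] + v
                pick[g] = pick[g - s.free_gpus] + [i]
    if best[gpus_needed] == NEG:
        return None  # no feasible placement in this round
    return [servers[i] for i in pick[gpus_needed]]

servers = [Server("s1", 4, 100.0), Server("s2", 2, 40.0), Server("s3", 2, 25.0)]
chosen = select_servers(servers, 6)
print([s.name for s in chosen])  # ['s1', 's2']: s2's higher bandwidth beats s3
```

The backward iteration over g is what makes this a 0/1 knapsack, so each server is chosen at most once; in the paper's setting, the free-GPU and free-bandwidth figures would come from the steady-state estimation run before each placement round.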

Published In
    ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1
    April 2024
    494 pages
ISBN: 9798400703720
DOI: 10.1145/3617232
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. in-network aggregation
    2. distributed training
    3. placement

    Qualifiers

    • Research-article

    Conference

    ASPLOS '24

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Article Metrics

• Total Citations: 0
• Total Downloads: 760
• Downloads (last 12 months): 760
• Downloads (last 6 weeks): 73
    Reflects downloads up to 16 Nov 2024
