
DOI: 10.1145/3651890.3672252
Research article · Open access

MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud

Published: 04 August 2024

Abstract

Performance of collective communication is critical for distributed systems. Implementing collective communication algorithms in a library is a poor fit for a multi-tenant cloud environment: the tenant does not know the underlying physical network configuration or how other tenants use the shared cloud network, and this lack of information prevents the library from selecting an optimal algorithm. In this paper, we explore a new approach that integrates the collective communication implementation with the cloud network rather than with the applications. We introduce MCCS, or Managed Collective Communication as a Service, which exposes traditional collective communication abstractions to applications while giving the cloud provider control and flexibility over their implementation. Realizing MCCS requires overcoming several key challenges in integrating collective communication into the cloud network, including managing tenant GPU memory buffers, synchronizing changes to collective communication strategies, and supporting policies that involve cross-layer traffic optimization. Our evaluations show that MCCS improves tenant collective communication performance by up to 2.4× over a state-of-the-art collective communication library (NCCL), while adding management features including dynamic algorithm adjustment, quality of service, and network-aware traffic engineering.
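For readers unfamiliar with the algorithm choices such a library makes, the following is a minimal, self-contained simulation of ring allreduce, one classic collective algorithm. It is purely illustrative background on the general technique, not MCCS's or NCCL's implementation; the function name and structure are this sketch's own.

```python
# Minimal simulation of the ring allreduce collective (sum reduction).
# Each of n ranks starts with a length-n buffer; after the collective,
# every rank holds the element-wise sum across all ranks.

def ring_allreduce(buffers):
    n = len(buffers)
    data = [list(b) for b in buffers]  # per-rank copies, one chunk per slot

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n
    # to its ring neighbor (r + 1) mod n, which accumulates it. After
    # n - 1 steps, rank r owns the fully reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, c, val in sends:
            data[(r + 1) % n][c] += val

    # Phase 2: all-gather. Each rank forwards its completed chunk around
    # the ring, overwriting stale values, until every rank has every chunk.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, c, val in sends:
            data[(r + 1) % n][c] = val

    return data

# Three ranks, three chunks each; every rank ends with the per-chunk sums.
result = ring_allreduce([[0, 1, 2], [10, 11, 12], [20, 21, 22]])
print(result[0])  # [30, 33, 36] on every rank
```

Each rank transfers roughly 2(n-1)/n of its buffer, which makes the ring bandwidth-efficient, but its achieved throughput is bounded by the slowest ring link. That dependence on link bandwidths is precisely the kind of network information a tenant-side library lacks and a provider-side service like MCCS can exploit.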

References

[1]
Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. 2010. Hedera: Dynamic Flow Scheduling for Data Center Networks. In NSDI.
[2]
Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, and George Varghese. 2014. CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. In SIGCOMM.
[3]
Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. 2010. Data Center TCP (DCTCP). In SIGCOMM.
[4]
Sebastian Angel, Hitesh Ballani, Thomas Karagiannis, Greg O'Shea, and Eno Thereska. 2014. End-to-end Performance Isolation Through Virtual Datacenters. In OSDI.
[5]
Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. 2011. Towards Predictable Datacenter Networks. In SIGCOMM.
[6]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-shot Learners. Advances in Neural Information Processing Systems 33 (2020), 1877--1901.
[7]
Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. 2021. Synthesizing Optimal Collective Algorithms. In PPoPP.
[8]
Jingrong Chen, Hong Zhang, Wei Zhang, Liang Luo, Jeffrey Chase, Ion Stoica, and Danyang Zhuo. 2022. NetHint: White-Box Networking for Multi-Tenant Data Centers. In NSDI.
[9]
Mathijs Den Burger, Thilo Kielmann, and Henri E Bal. 2005. Balanced Multicasting: High-Throughput Communication for Grid Applications. In ACM/IEEE Conference on Supercomputing (SC).
[10]
Nadeen Gebara, Manya Ghobadi, and Paolo Costa. 2021. In-Network Aggregation for Shared Machine Learning Clusters. In MLSys.
[11]
Gloo 2023. Collective Communications Library with Various Primitives for Multi-Machine Training. https://github.com/facebookincubator/gloo. (2023).
[12]
Richard L Graham, Timothy S Woodall, and Jeffrey M Squyres. 2005. Open MPI: A Flexible High Performance MPI. In International Conference on Parallel Processing and Applied Mathematics. Springer, 228--239.
[13]
Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik. 2017. Re-Architecting Datacenter Networks and Stacks for Low Latency and High Performance. In SIGCOMM.
[14]
Olaf Hartmann, Matthias Kühnemann, Thomas Rauber, and Gudula Rünger. 2005. Adaptive Selection of Communication Methods to Optimize Collective MPI Operations. In PARCO.
[15]
intelmpi 2024. Intel MPI Library. https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html. (2024).
[16]
Vimalkumar Jeyakumar, Mohammad Alizadeh, David Mazières, Balaji Prabhakar, Albert Greenberg, and Changhoon Kim. 2013. EyeQ: Practical Network Performance Isolation at the Edge. In NSDI.
[17]
Nicholas T Karonis, Bronis R De Supinski, Ian Foster, William Gropp, Ewing Lusk, and John Bresnahan. 2000. Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance. In IPDPS.
[18]
Naga Katta, Aditi Ghag, Mukesh Hira, Isaac Keslassy, Aran Bergman, Changhoon Kim, and Jennifer Rexford. 2017. Clove: Congestion-Aware Load Balancing at the Virtual Edge. In CoNEXT.
[19]
Praveen Kumar, Nandita Dukkipati, Nathan Lewis, Yi Cui, Yaogong Wang, Chonggang Li, Valas Valancius, Jake Adriaens, Steve Gribble, Nate Foster, and Amin Vahdat. 2019. PicNIC: Predictable Virtualized NIC. In SIGCOMM.
[20]
Katrina LaCurts, Shuo Deng, Ameesh Goyal, and Hari Balakrishnan. 2013. Choreo: Network-Aware Task Placement for Cloud Applications. In IMC.
[21]
ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael Swift. 2021. ATP: In-Network Aggregation for Multi-tenant Learning. In NSDI.
[22]
Jeongkeun Lee, Yoshio Turner, Myungjin Lee, Lucian Popa, Sujata Banerjee, Joon-Myung Kang, and Puneet Sharma. 2014. Application-Driven Bandwidth Guarantees in Datacenters. In SIGCOMM.
[23]
Liang Luo, Peter West, Jacob Nelson, Arvind Krishnamurthy, and Luis Ceze. 2020. PLink: Discovering and Exploiting Locality for Accelerated Distributed Training on the Public Cloud. In MLSys.
[24]
nccl 2023. The NVIDIA Collective Communication Library (NCCL). https://developer.nvidia.com/nccl. (2023).
[25]
NVIDIA. 2023. Performance Reported by NCCL Tests. https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md. (2023).
[26]
nvidiasharp 2024. NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP). https://docs.nvidia.com/networking/display/sharpv300. (2024).
[27]
Vladimir Olteanu, Alexandru Agache, Andrei Voinescu, and Costin Raiciu. 2018. Stateless Datacenter Load-balancing with Beamer. In NSDI.
[28]
oneccl 2023. oneAPI Collective Communications Library (oneCCL). https://github.com/oneapi-src/oneCCL. (2023).
[29]
Siddharth Pal, Liangyu Zhao, Jason Fantl, Joud Khoury, Arvind Krishnamurthy, and Prithwish Basu. 2023. Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies. arXiv preprint arXiv:2309.13541 (2023).
[30]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32 (2019).
[31]
Y Peng, Y Zhu, Y Chen, Y Bao, B Yi, C Lan, C Wu, and C Guo. 2019. A Generic Communication Scheduler for Distributed DNN Training Acceleration. In SOSP.
[32]
Jelena Pješivac-Grbović, George Bosilca, Graham E. Fagg, Thara Angskun, and Jack J. Dongarra. 2007. MPI Collective Algorithm Selection and Quadtree Encoding. Parallel Comput. (2007).
[33]
Lucian Popa, Praveen Yalagandula, Sujata Banerjee, Jeffrey C. Mogul, Yoshio Turner, and Jose Renato Santos. 2013. ElasticSwitch: Practical Work-Conserving Bandwidth Guarantees for Cloud Computing. In SIGCOMM.
[34]
Costin Raiciu, Sebastien Barre, Christopher Pluntke, Adam Greenhalgh, Damon Wischik, and Mark Handley. 2011. Improving Datacenter Performance and Robustness with Multipath TCP. In SIGCOMM.
[35]
Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. 2024. CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters. In NSDI.
[36]
Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, and Tushar Krishna. 2022. Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models. In ISCA.
[37]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505--3506.
[38]
rccl 2023. ROCm Communication Collectives Library (RCCL). https://github.com/ROCm/rccl. (2023).
[39]
Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richtarik. 2021. Scaling Distributed Machine Learning with In-Network Aggregation. In NSDI.
[40]
Daniele De Sensi, Tommaso Bonato, David Saam, and Torsten Hoefler. 2024. Swing: Short-cutting Rings for Higher Bandwidth Allreduce. In NSDI.
[41]
Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh. 2023. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. In NSDI.
[42]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053 (2019).
[43]
Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
[44]
Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Nikhil Devanur, Jorgen Thelin, and Ion Stoica. 2020. Blink: Fast and Generic Collectives for Distributed ML. In MLSys.
[45]
Hong Zhang, Junxue Zhang, Wei Bai, Kai Chen, and Mosharaf Chowdhury. 2017. Resilient Datacenter Load Balancing in the Wild. In SIGCOMM.
[46]
Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion Control for Large-Scale RDMA Deployments. In SIGCOMM.


Published In

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference
August 2024
1033 pages
ISBN: 9798400706141
DOI: 10.1145/3651890
This work is licensed under a Creative Commons Attribution-NoDerivs International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. collective communication
  2. cloud computing
  3. distributed training

Qualifiers

  • Research-article

Conference

ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
August 4-8, 2024
Sydney, NSW, Australia

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%

