research-article

Hostmesh: Monitor and Diagnose Networks in Rail-optimized RoCE Clusters

Authors:

Tao Huang

APNet '24: Proceedings of the 8th Asia-Pacific Workshop on Networking

Pages 122 - 128

https://doi.org/10.1145/3663408.3663426

Published: 03 August 2024

Abstract

RoCE services are sensitive to failures and bottlenecks, which become more common as the RoCE network scales. To effectively detect and locate these problems independent of service traffic, RoCE networks require a monitoring and diagnostic system based on active probing. However, existing active probing schemes typically rely on a controller to design the probing plan for each server, an approach that is difficult to deploy and incurs high synchronization overhead in multi-tenant clusters. Fortunately, rail-optimized clusters, which improve network performance, have become more common in recent years. In these clusters, the controller is unnecessary.

In this paper, we propose Hostmesh, the first network monitoring and diagnostic system for rail-optimized RoCE clusters based solely on full-mesh probing between RDMA NICs on the same host. Hostmesh takes advantage of the properties of rail-optimized networks and does not rely on a controller to generate pinglists. We deployed Hostmesh for over three months on a multi-tenant rail-optimized RoCE cluster with hundreds of servers. During the deployment, Hostmesh effectively detected and located 8 types of problems caused by hardware failures, misconfigurations, network congestion, and intra-host bottlenecks. We also share our experience in dealing with them.
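
As a rough illustration of what a controller-free pinglist could look like, the sketch below enumerates every ordered pair of distinct RDMA NICs on a single host. It is not taken from the paper: the function name `local_pinglist` and the NIC names are hypothetical, and the probe transport itself is omitted.

```python
from itertools import permutations


def local_pinglist(nics):
    """Return every ordered (source, destination) pair of distinct local NICs.

    `nics` is a list of NIC identifiers discovered on this host. The names
    are placeholders for illustration, not identifiers used by Hostmesh.
    """
    unique = list(dict.fromkeys(nics))  # drop duplicates, keep original order
    return list(permutations(unique, 2))


if __name__ == "__main__":
    # Hypothetical host with four RDMA NICs.
    for src, dst in local_pinglist(["nic0", "nic1", "nic2", "nic3"]):
        print(f"probe {src} -> {dst}")
```

Because the list depends only on the NICs present on the local host, each server can build it without coordinating with any other server. A host with n NICs yields n(n-1) directed pairs, so an 8-NIC server would probe 56 pairs.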

Index Terms

Networks

Published In

APNet '24: Proceedings of the 8th Asia-Pacific Workshop on Networking

August 2024

230 pages

ISBN:9798400717581

DOI:10.1145/3663408

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

APNet 2024

APNet 2024: The 8th Asia-Pacific Workshop on Networking

August 3 - 4, 2024

Sydney, Australia

Acceptance Rates

APNet '24 Paper Acceptance Rate 50 of 118 submissions, 42%;

Overall Acceptance Rate 50 of 118 submissions, 42%
