research-article

CollaSFC: An Intelligent Collaborative Approach for In-network SFC Failure Detection in Data Center for AI Computing

Published: 04 August 2024

Abstract

The successful application of Large Language Models (LLMs) and Machine Learning (ML) is driving traditional data centers to transform into intelligent computing data centers characterized by low latency, high bandwidth, high reliability, and zero packet loss. The demand for immense computing power and ultra-low latency suggests that in-network computing (INC), for example in-network aggregation (INA), may be a viable solution. INA organizes switches and servers into a hierarchical structure that forms different Service Function Chains (SFCs), each comprising switches, servers, physical links, and virtual links, to accomplish model training. However, the aggregation of heavy traffic in CTCs tends to cause a sudden and drastic load increase at a specific node, which greatly raises the likelihood of node failure. To detect SFC failures in real time, we propose an in-network SFC failure detection approach based on INC. We introduce digital twins (DT) and propose a collaborative AI framework spanning the data plane and the control plane to avoid model overfitting. In addition, to reduce computing consumption, we propose the concept of "multiple SFC chains multiple models", which customizes a failure detection model for each SFC. We validate the mechanism on a BMv2-based prototype, which achieves high-accuracy failure detection with minor performance degradation.
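The idea of keeping one detection model per SFC can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the models, so a simple running z-score detector stands in for them, and the names `RunningZScoreDetector`, `check`, the SFC identifiers, and the latency metric are all hypothetical.

```python
import math
from collections import defaultdict


class RunningZScoreDetector:
    """Stand-in per-SFC detector (an assumption, not the paper's model).

    Tracks a running mean and variance with Welford's algorithm and flags
    a sample whose z-score exceeds `threshold`.
    """

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # z-score above which a sample is flagged
        self.warmup = warmup        # samples to collect before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, value):
        """Return True if `value` looks anomalous for this chain."""
        anomalous = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                anomalous = True
        if not anomalous:
            # Only normal samples update the baseline statistics.
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
        return anomalous


# One independent detector per SFC, created on first use.
detectors = defaultdict(RunningZScoreDetector)


def check(sfc_id, latency_us):
    """Route a measurement (hypothetical metric: latency) to its chain's detector."""
    return detectors[sfc_id].observe(latency_us)
```

Because each chain keeps its own baseline, a latency that is normal for one SFC can still be flagged on another whose typical latency is much lower, which is the intuition behind customizing a model per chain.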



    Information & Contributors

    Information

    Published In

    NAIC '24: Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing
    August 2024
    89 pages
    ISBN:9798400707131
    DOI:10.1145/3672198
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Failure Detection
    2. Intelligent Computing Data Center
    3. Service Function Chains

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Beijing Jiaotong University

    Conference

    ACM SIGCOMM '24
    Sponsor:
    ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
    August 4 - 8, 2024
    Sydney, NSW, Australia

    Acceptance Rates

    NAIC '24 Paper Acceptance Rate 13 of 22 submissions, 59%;
    Overall Acceptance Rate 13 of 22 submissions, 59%

