NetPilot: automating datacenter network failure mitigation

Published: 13 August 2012

Abstract

Driven by soaring demand for always-on, fast-response online services, modern datacenter networks have recently undergone tremendous growth. These networks often rely on commodity hardware to reach immense scale while keeping capital expenses in check. The downside is that commodity devices are prone to failures, raising a formidable challenge for network operators: handling these failures promptly with minimal disruption to the hosted services.
Recent research efforts have focused on automatic failure localization. Yet resolving failures still requires significant human intervention, resulting in prolonged recovery times. Unlike previous work, NetPilot aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do: by deactivating or restarting suspected offending components. NetPilot circumvents the need to know the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot consists of an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.
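
The trial-and-error loop described in the abstract can be illustrated with a minimal Python sketch. This is not NetPilot's implementation: the `Action` type, the `mitigate` function, the numeric impact score, and the `apply`, `rollback`, and `failure_persists` callbacks are all hypothetical names introduced here. The sketch assumes a planner has already ordered the candidate actions and an impact estimator has already scored each one.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical mitigation action, e.g. deactivating one suspected component."""
    target: str        # component to deactivate or restart
    est_impact: float  # assumed score: 0.0 (no disruption) to 1.0 (severe)


def mitigate(
    candidates: Iterable[Action],
    impact_limit: float,
    apply: Callable[[Action], None],
    rollback: Callable[[Action], None],
    failure_persists: Callable[[], bool],
) -> Optional[Action]:
    """Try candidate actions in the given (planner-chosen) order.

    Actions whose estimated impact exceeds impact_limit are skipped.
    An action that does not clear the failure is rolled back before
    the next one is tried. Returns the action that worked, or None.
    """
    for action in candidates:
        if action.est_impact > impact_limit:
            continue  # too disruptive to attempt
        apply(action)
        if not failure_persists():
            return action  # failure mitigated; root cause still unknown
        rollback(action)  # wrong suspect: restore it and keep trying
    return None
```

Note that the loop never identifies a root cause: it stops as soon as one action makes the failure symptoms disappear, which is the distinction the abstract draws between mitigating and resolving a failure.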

Published In

ACM SIGCOMM Computer Communication Review, Volume 42, Issue 4
Special October issue SIGCOMM '12
October 2012
538 pages
ISSN: 0146-4833
DOI: 10.1145/2377677

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. automated failure mitigation
2. datacenter networks