DOI:10.1145/2504730.2504737
research-article

Demystifying the dark side of the middle: a field study of middlebox failures in datacenters

Published: 23 October 2013

Abstract

Network appliances or middleboxes such as firewalls, intrusion detection and prevention systems (IDPS), load balancers, and VPNs form an integral part of datacenters and enterprise networks. Realizing their importance and shortcomings, the research community has proposed software implementations, policy-aware switching, consolidation appliances, moving middlebox processing to VMs and end hosts, and even offloading it to the cloud. While such efforts can use middlebox failure characteristics to improve their reliability, management, and cost-effectiveness, little has been reported on these failures in the field. In this paper, we make one of the first attempts to perform a large-scale empirical study of middlebox failures over two years in a service provider network comprising thousands of middleboxes across tens of datacenters. We find that middlebox failures are prevalent and that they can significantly impact hosted services. Several of our findings differ in key aspects from commonly held views: (1) Most failures are grey, dominated by connectivity errors and link flaps that exhibit intermittent connectivity, (2) Hardware faults and overload problems are present but they are not in the majority, (3) Middleboxes experience a variety of misconfigurations such as incorrect rules, VLAN misallocation, and mismatched keys, and (4) Middlebox failover is ineffective in about 33% of the cases for load balancers and firewalls due to configuration bugs, faulty failovers, and software version mismatches. Finally, we analyze current middlebox proposals based on our study and discuss directions for future research.

Supplementary Material

PDF File (crimc076.pdf)
Consolidated Review of Demystifying the Dark Side of the Middle: A Field Study of Middlebox Failures in Datacenters


    Information & Contributors

    Information

    Published In

    IMC '13: Proceedings of the 2013 conference on Internet measurement conference
    October 2013
    480 pages
    ISBN:9781450319539
    DOI:10.1145/2504730
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. datacenters
    2. middleboxes
    3. network reliability

    Qualifiers

    • Research-article

    Conference

    IMC'13
    IMC'13: Internet Measurement Conference
    October 23 - 25, 2013
    Barcelona, Spain

    Acceptance Rates

    IMC '13 Paper Acceptance Rate 42 of 178 submissions, 24%;
    Overall Acceptance Rate 277 of 1,083 submissions, 26%
