DOI: 10.1145/2523616.2523638

When the network crumbles: an empirical study of cloud network failures and their impact on services

Published: 01 October 2013

Abstract

The growing demand for always-on and low-latency cloud services is driving the creation of globally distributed datacenters. A major factor affecting service availability is the reliability of the network, both inside the datacenters and across the wide-area links connecting them. While several research efforts focus on building scale-out datacenter networks, little has been reported on real network failures and how they impact geo-distributed services. This paper makes one of the first attempts to characterize intra-datacenter and inter-datacenter network failures from a service perspective. We describe a large-scale study analyzing and correlating failure events over three years across multiple datacenters and thousands of network elements such as Access routers, Aggregation switches, Top-of-Rack switches, and long-haul links. Our study reveals several important findings on (a) the availability of network domains, (b) root causes, (c) service impact, (d) effectiveness of repairs, and (e) modeling failures. Finally, we outline steps based on existing network mechanisms to improve service availability.

      Published In

      SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing
      October 2013
      427 pages
      ISBN: 9781450324281
      DOI: 10.1145/2523616

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cloud services
      2. datacenters
      3. inter-datacenter links
      4. network reliability

      Qualifiers

      • Research-article

      Conference

      SOCC '13: ACM Symposium on Cloud Computing
      October 1 - 3, 2013
      Santa Clara, California

      Acceptance Rates

      SOCC '13 Paper Acceptance Rate 23 of 114 submissions, 20%;
      Overall Acceptance Rate 169 of 722 submissions, 23%
