Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Published: 03 October 2018

Abstract

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, including disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults convert from one form to another, the chains of cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

      Published In

      ACM Transactions on Storage, Volume 14, Issue 3
      Special Issue on FAST 2018 and Regular Papers
      August 2018
      210 pages
      ISSN:1553-3077
      EISSN:1553-3093
      DOI:10.1145/3282875
      Editor: Sam H. Noh
      Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 03 October 2018
      Accepted: 01 July 2018
      Received: 01 June 2018
      Published in TOS Volume 14, Issue 3

      Author Tags

      1. Hardware fault
      2. fail-slow
      3. fail-stutter
      4. jitter
      5. limpware
      6. performance

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • DOE Office of Science User Facility
      • NSF

      Article Metrics

      • Downloads (last 12 months): 380
      • Downloads (last 6 weeks): 45
      Reflects downloads up to 19 Sep 2024

      Cited By

      • (2024) Asymmetric RAID: Rethinking RAID for SSD Heterogeneity. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, 101-107. DOI: 10.1145/3655038.3665952. Online publication date: 8-Jul-2024.
      • (2024) Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay. 2024 IEEE Symposium on Security and Privacy (SP), 1939-1955. DOI: 10.1109/SP54263.2024.00109. Online publication date: 19-May-2024.
      • (2024) DRACO: Distributed Resource-aware Admission Control for large-scale, multi-tier systems. Journal of Parallel and Distributed Computing 192, 104935. DOI: 10.1016/j.jpdc.2024.104935. Online publication date: Oct-2024.
      • (2023) PERSEUS. Proceedings of the 21st USENIX Conference on File and Storage Technologies, 49-63. DOI: 10.5555/3585938.3585942. Online publication date: 21-Feb-2023.
      • (2023) Analog-to-digital conversion of information archived in display holograms: I. discussion. Journal of the Optical Society of America A 40:4, B47. DOI: 10.1364/JOSAA.478498. Online publication date: 31-Mar-2023.
      • (2023) From Missteps to Milestones: A Journey to Practical Fail-Slow Detection. ACM Transactions on Storage 19:4, 1-28. DOI: 10.1145/3617690. Online publication date: 1-Nov-2023.
      • (2023) Detection Is Better Than Cure: A Cloud Incidents Perspective. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1891-1902. DOI: 10.1145/3611643.3613898. Online publication date: 30-Nov-2023.
      • (2023) Understanding Silent Data Corruptions in a Large Production CPU Population. Proceedings of the 29th Symposium on Operating Systems Principles, 216-230. DOI: 10.1145/3600006.3613149. Online publication date: 23-Oct-2023.
      • (2023) Software Fault Tolerance in Real-Time Systems: Identifying the Future Research Questions. ACM Computing Surveys 55:14s, 1-30. DOI: 10.1145/3589950. Online publication date: 17-Jul-2023.
      • (2023) The Performance of Distributed Applications: A Traffic Shaping Perspective. Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering, 207-220. DOI: 10.1145/3578244.3583733. Online publication date: 15-Apr-2023.
