Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

ACM Transactions on Storage (TOS), Volume 14, Issue 3

Article No.: 23, Pages 1 - 26

https://doi.org/10.1145/3242086

Published: 03 October 2018

Abstract

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, the chains of cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

Index Terms

Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
  2. Real-time systems

Published In

ACM Transactions on Storage Volume 14, Issue 3

Special Issue on FAST 2018 and Regular Papers

August 2018

210 pages

ISSN:1553-3077

EISSN:1553-3093

DOI:10.1145/3282875

Editor:
Sam H. Noh
Ulsan National Institute of Science and Technology, Ulsan, Korea

Copyright © 2018 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 October 2018

Accepted: 01 July 2018

Received: 01 June 2018

Published in TOS Volume 14, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

DOE Office of Science User Facility
NSF
