From Missteps to Milestones: A Journey to Practical Fail-Slow Detection

Published: 01 November 2023

Abstract

Emerging “fail-slow” failures plague both software and hardware: the victim components keep functioning, but with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus uses a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of individual drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the node-level 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces and use it to analyze the root causes of fail-slow drives, which span ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow research.
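
The abstract does not describe the internals of the regression model, so the following is only a minimal sketch of what drive-level, regression-based fail-slow detection could look like, not Perseus itself. It assumes that each drive reports (throughput, latency) samples, that latency is roughly linear in throughput for healthy drives in the same node, and that a drive is suspicious when its latency stays well above the curve fitted on its peers. The function names, the 2.0 threshold, and the 0.5 fraction are hypothetical choices made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the throughput values are not all identical
    return slope, my - slope * mx


def slowdown_ratios(samples, target):
    """Ratio of observed latency to the latency predicted from peer drives.

    samples maps a drive id to a list of (throughput, latency) pairs.
    The target drive is left out of the fit so it cannot mask itself.
    """
    peers = [p for d, pts in samples.items() if d != target for p in pts]
    slope, intercept = fit_line([t for t, _ in peers], [l for _, l in peers])
    return [lat / (slope * thr + intercept) for thr, lat in samples[target]]


def flag_fail_slow(samples, threshold=2.0, min_fraction=0.5):
    """Flag drives whose slowdown ratio exceeds the threshold often enough."""
    flagged = []
    for drive in samples:
        ratios = slowdown_ratios(samples, drive)
        slow = sum(r > threshold for r in ratios) / len(ratios)
        if slow >= min_fraction:
            flagged.append(drive)
    return flagged


# Hypothetical node: three healthy drives and one roughly 3x slower drive.
samples = {
    "d0": [(10, 6.0), (20, 11.0), (30, 16.0), (40, 21.0)],
    "d1": [(10, 6.2), (20, 10.8), (30, 16.1), (40, 21.3)],
    "d2": [(10, 5.9), (20, 11.2), (30, 15.8), (40, 20.9)],
    "d3": [(10, 18.0), (20, 33.0), (30, 48.0), (40, 63.0)],
}
print(flag_fail_slow(samples))  # ['d3']
```

Fitting on peers only is a deliberate simplification: a slow drive left in the fit would pull the curve upward and shrink its own slowdown ratio. A production detector would also need to handle workload imbalance and transient spikes, which this sketch ignores.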

Published In

ACM Transactions on Storage, Volume 19, Issue 4
November 2023
238 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3626486

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 November 2023
Online AM: 11 September 2023
Accepted: 13 August 2023
Received: 09 May 2023
Published in TOS Volume 19, Issue 4

Author Tags

  1. Fail-slow failures
  2. machine learning
  3. datasets
  4. root cause reasoning

Qualifiers

  • Research-article

Funding Sources

  • NSFC
  • Alibaba Innovation Research (AIR)
  • National Key R&D Program of China
  • Program of Hunan Postdoc Innovation
  • Program of Shanghai Academic Research Leader
