Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

Authors: Bianca Schroeder, Garth A. Gibson

FAST '07: Proceedings of the 5th USENIX conference on File and Storage Technologies

Pages 1 - es

Published: 13 February 2007

Abstract

Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million.

In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. The data cover about 100,000 disks, some for an entire lifetime of five years, and include drives with SCSI, FC, and SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%.
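The 0.88% figure follows from dividing the hours in a year by the datasheet MTTF. The sketch below reproduces that arithmetic. The function names are illustrative, and the second function is an added assumption, not something the abstract states: it computes the one-year failure probability if drive lifetimes were exponentially distributed.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def nominal_afr(mttf_hours: float) -> float:
    """Approximate annual failure rate as expected failures per drive-year."""
    return HOURS_PER_YEAR / mttf_hours


def exponential_afr(mttf_hours: float) -> float:
    """One-year failure probability, assuming an exponential lifetime (an assumption)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


# The two datasheet MTTF bounds quoted in the abstract
for mttf in (1_000_000, 1_500_000):
    print(
        f"MTTF {mttf:>9,} h: "
        f"ratio {nominal_afr(mttf):.2%}, exponential {exponential_afr(mttf):.2%}"
    )
```

For MTTF values this large the two estimates are nearly identical, so the simple ratio is a reasonable reading of the datasheet number.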

We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF.
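To make the gap concrete, the following sketch converts annual rates into expected replacements per year. The fleet size of 10,000 disks is a hypothetical value chosen for illustration, not a figure from the paper. The rates are the ones quoted above.

```python
def expected_replacements(num_disks: int, annual_rate: float) -> float:
    """Expected disk replacements per year for a fleet at a given annual rate."""
    return num_disks * annual_rate


FLEET_SIZE = 10_000      # hypothetical fleet size, chosen only for illustration
DATASHEET_AFR = 0.0088   # upper bound implied by a 1,000,000-hour MTTF

rates = {
    "datasheet (0.88%)": DATASHEET_AFR,
    "field, low (2%)": 0.02,
    "field, high (4%)": 0.04,
    "worst observed (13%)": 0.13,
}

for label, rate in rates.items():
    count = expected_replacements(FLEET_SIZE, rate)
    ratio = rate / DATASHEET_AFR
    print(f"{label:<22} {count:7.0f} replacements/yr ({ratio:4.1f}x datasheet)")
```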

We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wearout degradation. That is, replacement rates in our data grew steadily with age, an effect often assumed not to set in until after a nominal lifetime of 5 years.

Interestingly, we observe little difference in replacement rates between SCSI, FC, and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component-specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks.

Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
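One simple way to check a series of inter-replacement times against the exponential assumption is to compute its squared coefficient of variation, which equals 1 for an exponential distribution, and its lag-1 autocorrelation, which is near 0 for independent samples. The sketch below is a generic illustration under stated assumptions, not necessarily the authors' exact procedure. Both series are synthetic, and the "bursty" generator is a made-up example of correlated gaps.

```python
import random
import statistics


def squared_cv(samples):
    """Squared coefficient of variation: variance / mean^2 (1 for an exponential)."""
    mean = statistics.fmean(samples)
    return statistics.pvariance(samples, mu=mean) / mean ** 2


def lag_autocorrelation(samples, lag=1):
    """Sample autocorrelation at the given lag (near 0 for independent samples)."""
    n = len(samples)
    mean = statistics.fmean(samples)
    denom = sum((x - mean) ** 2 for x in samples)
    num = sum((samples[i] - mean) * (samples[i + lag] - mean) for i in range(n - lag))
    return num / denom


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic baseline: independent exponential gaps with a mean of 100 hours
baseline = [rng.expovariate(1 / 100) for _ in range(20_000)]

# Synthetic bursty series: the mean gap alternates between 180 and 20 hours
# in blocks of 200 samples, so neighbouring gaps tend to be similar
bursty = []
for block in range(100):
    mean_gap = 20 if block % 2 else 180
    bursty.extend(rng.expovariate(1 / mean_gap) for _ in range(200))

for name, series in (("baseline", baseline), ("bursty", bursty)):
    print(f"{name:<9} C^2 = {squared_cv(series):.2f}, "
          f"lag-1 autocorrelation = {lag_autocorrelation(series):.2f}")
```

A squared coefficient of variation well above 1, or an autocorrelation clearly away from 0, is evidence against modeling the gaps as independent exponential samples.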

Published In

FAST '07: Proceedings of the 5th USENIX conference on File and Storage Technologies

February 2007

61 pages

Sponsors

USENIX Association

Publisher

USENIX Association

United States
