DOI: 10.5555/1267903.1267904

Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

Published: 13 February 2007

Abstract

Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million.
In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%.
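
The 0.88% figure follows from dividing the hours in a year by the datasheet MTTF. The sketch below reproduces that arithmetic; it assumes drives are powered on for all 8,760 hours of a year and that the failure rate is constant, and the function name `nominal_afr` is ours, not the paper's.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes drives run continuously


def nominal_afr(mttf_hours: float) -> float:
    """Approximate annual failure rate (as a fraction) implied by a datasheet MTTF.

    Assumes a constant failure rate, so the expected number of failures
    per drive-year is simply (hours per year) / MTTF.
    """
    return HOURS_PER_YEAR / mttf_hours


if __name__ == "__main__":
    for mttf in (1_000_000, 1_500_000):
        # Prints 0.88% for 1,000,000 hours and 0.58% for 1,500,000 hours.
        print(f"MTTF {mttf:>9,} h -> nominal AFR {nominal_afr(mttf):.2%}")
```

The lower MTTF of 1,000,000 hours gives the upper bound of about 0.88% quoted above.
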
We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF.
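
To put the field rates on the same scale as a datasheet, one can invert the conversion and ask what MTTF a given annual replacement rate would correspond to. This is only a rough illustration: it treats every replacement as a failure and assumes a constant rate over the year, and `implied_mttf_hours` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes drives run continuously


def implied_mttf_hours(annual_replacement_rate: float) -> float:
    """MTTF (in hours) that would produce the given annual rate (a fraction).

    Illustrative only: equates replacements with failures and assumes
    a constant rate over the year.
    """
    if annual_replacement_rate <= 0:
        raise ValueError("rate must be positive")
    return HOURS_PER_YEAR / annual_replacement_rate


if __name__ == "__main__":
    # Rates mentioned in the abstract: 2%, 4%, and the 13% maximum.
    for rate in (0.02, 0.04, 0.13):
        print(f"ARR {rate:.0%} -> implied MTTF {implied_mttf_hours(rate):,.0f} h")
```

Under these assumptions a 2% annual replacement rate corresponds to roughly 438,000 hours, well under half of the lowest datasheet value of 1,000,000 hours.
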
We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wearout degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years.
Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks.
Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
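
Two simple diagnostics illustrate what such a check can look like. For exponentially distributed gaps the squared coefficient of variation equals 1, and for independent gaps the lag-k autocorrelation is close to 0. The sketch below computes both for a sequence of gaps between replacements; it is not the paper's methodology, the function names are ours, and the sample values are made up purely to exercise the code.

```python
from statistics import mean, pvariance


def squared_cv(gaps):
    """Squared coefficient of variation: variance / mean^2.

    Equals 1 for an exponential distribution; values well above 1
    indicate more variability than an exponential model allows.
    """
    m = mean(gaps)
    return pvariance(gaps, m) / (m * m)


def autocorrelation(gaps, lag=1):
    """Sample autocorrelation of the gap sequence at the given lag.

    Near 0 for independent gaps; requires a non-constant sequence
    longer than `lag`.
    """
    m = mean(gaps)
    total = sum((x - m) ** 2 for x in gaps)
    cov = sum((gaps[i] - m) * (gaps[i + lag] - m) for i in range(len(gaps) - lag))
    return cov / total


if __name__ == "__main__":
    # Hypothetical times between replacements, in hours (not real data).
    gaps = [5, 7, 6, 40, 45, 38, 4, 6, 5, 42]
    print(f"squared CV:      {squared_cv(gaps):.2f}")
    print(f"lag-1 autocorr.: {autocorrelation(gaps, lag=1):.2f}")
```

On real replacement logs these two numbers only give a first impression; detecting long-range dependence requires looking at many lags or a dedicated estimator.
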


Published In

FAST '07: Proceedings of the 5th USENIX conference on File and Storage Technologies
February 2007
61 pages

Sponsors

  • USENIX Association

Publisher

USENIX Association

United States
