Beyond MTTDL: A Closed-Form RAID 6 Reliability Equation

Published: 01 March 2014

Abstract

We introduce a new closed-form equation for estimating the number of data-loss events for a redundant array of inexpensive disks in a RAID-6 configuration. The equation expresses operational failures, their restorations, latent (sector) defects, and disk media scrubbing by time-based distributions that can represent non-homogeneous Poisson processes. It uses two-parameter Weibull distributions, which allow the distributions to take on many different shapes, modeling increasing, decreasing, or constant occurrence rates. This article focuses on the statistical basis of the equation. It also presents time-based distributions of the four processes based on an extensive analysis of field data collected over several years from tens of thousands of commercially available systems with hundreds of thousands of disk drives. Our results for RAID-6 groups of size 16 indicate that the closed-form expression yields much more accurate results than the MTTDL reliability equation and matches computationally intensive Monte Carlo simulations.
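
The abstract's point about Weibull shapes can be illustrated with a minimal sketch. It is not the article's closed-form equation: the function names and the shape and scale values below are hypothetical and chosen only to show how the shape parameter switches the occurrence rate between decreasing, constant, and increasing. The expected-event count assumes a power-law non-homogeneous Poisson process, whose intensity has the same form as the Weibull hazard rate.

```python
import math


def weibull_hazard(t, shape, scale):
    """Two-parameter Weibull hazard rate: (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_cdf(t, shape, scale):
    """Probability that the first event has occurred by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def expected_events(t, shape, scale):
    """Expected event count by time t for a power-law NHPP: (t/scale)**shape."""
    return (t / scale) ** shape


if __name__ == "__main__":
    scale_hours = 100_000.0  # hypothetical characteristic life, not from the article
    for shape in (0.8, 1.0, 1.2):  # hypothetical shape parameters
        early = weibull_hazard(1_000.0, shape, scale_hours)
        late = weibull_hazard(50_000.0, shape, scale_hours)
        if math.isclose(early, late):
            trend = "constant"
        elif early > late:
            trend = "decreasing"
        else:
            trend = "increasing"
        print(f"shape={shape}: rate at 1,000 h={early:.3e}, "
              f"at 50,000 h={late:.3e} ({trend})")
```

A shape of exactly 1 reduces the Weibull to the exponential distribution, which is the constant-rate assumption behind MTTDL-style calculations; shapes below or above 1 give the decreasing or increasing rates the abstract refers to.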

Information

Published In

ACM Transactions on Storage, Volume 10, Issue 2
March 2014
86 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2600090
Editor: Darrell Long

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 01 March 2014
Accepted: 01 July 2013
Revised: 01 June 2013
Received: 01 March 2013
Published in TOS Volume 10, Issue 2

Author Tags

1. RAID reliability
2. Storage systems

Qualifiers

• Research-article
• Research
• Refereed

Cited By

• (2023) Elastic RAID: Implementing RAID over SSDs with Built-in Transparent Compression. Proceedings of the 16th ACM International Conference on Systems and Storage, pp. 83-93. DOI: 10.1145/3579370.3594773. Online publication date: 5-Jun-2023
• (2023) Reliability Evaluation of Erasure-coded Storage Systems with Latent Errors. ACM Transactions on Storage 19:1, pp. 1-47. DOI: 10.1145/3568313. Online publication date: 11-Jan-2023
• (2023) End-to-end Resiliency Analysis Framework for Cloud Storage Services. 2023 IEEE 28th Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 134-141. DOI: 10.1109/PRDC59308.2023.00025. Online publication date: 24-Oct-2023
• (2023) Evaluation of Data Durability in Erasure Coding Using Peak Shift Method of Drive Failure Risk. 2023 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), pp. 1-6. DOI: 10.1109/CSDE59766.2023.10487647. Online publication date: 4-Dec-2023
• (2022) Random Versus Copyset Placement: Data-Loss Models for Proactive-Tolerance Replica-Based Data Storage. IEEE Transactions on Reliability 71:1, pp. 63-74. DOI: 10.1109/TR.2021.3105675. Online publication date: Mar-2022
• (2022) Failure Prediction Mechanism of Disk Devices Based on LSTM. 2022 2nd Asia-Pacific Conference on Communications Technology and Computer Science (ACCTCS), pp. 388-391. DOI: 10.1109/ACCTCS53867.2022.00086. Online publication date: Feb-2022
• (2021) On the Distribution Modeling of Heavy-Tailed Disk Failure Lifetime in Big Data Centers. IEEE Transactions on Reliability 70:2, pp. 507-524. DOI: 10.1109/TR.2020.3007127. Online publication date: Jun-2021
• (2021) New Design and Analysis of Error-Resilient LRCs for DSSs With Silent Disk Errors. IEEE Access 9, pp. 124463-124477. DOI: 10.1109/ACCESS.2021.3107838. Online publication date: 2021
• (2020) Reliability Equations for Cloud Storage Systems with Proactive Fault Tolerance. IEEE Transactions on Dependable and Secure Computing 17:4, pp. 782-794. DOI: 10.1109/TDSC.2018.2882512. Online publication date: 1-Jul-2020
• (2020) A Modeling Framework for Reliability of Erasure Codes in SSD Arrays. IEEE Transactions on Computers 69:5, pp. 649-665. DOI: 10.1109/TC.2019.2962691. Online publication date: 1-May-2020
