Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design

Published: 03 March 2012

Abstract

Main memory is one of the leading hardware causes of machine crashes in today's datacenters. Designing, evaluating, and modeling systems that are resilient against memory errors requires a good understanding of the underlying characteristics of DRAM errors in the field. While there have recently been a few initial studies of DRAM errors in production systems, these have been too limited in either the size of the data set or the granularity of the data to conclusively answer many of the open questions about DRAM errors. Such questions include, for example, the prevalence of soft errors compared to hard errors and the typical patterns of hard errors. In this paper, we study data on DRAM errors collected from a diverse range of production systems, covering nearly 300 terabyte-years of main memory in total. As a first contribution, we provide a detailed analytical study of DRAM error characteristics, including both hard and soft errors. We find that a large fraction of DRAM errors in the field can be attributed to hard errors, and we characterize these hard errors in detail. As a second contribution, the paper uses the results of the measurement study to identify a number of promising directions for designing more resilient systems and evaluates the potential of different protection mechanisms in light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems while sacrificing only a negligible fraction of the total DRAM in the system.

    Published In

    ACM SIGPLAN Notices  Volume 47, Issue 4
    ASPLOS '12
    April 2012
    453 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2248487

    ASPLOS XVII: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
    March 2012
    476 pages
    ISBN:9781450307598
    DOI:10.1145/2150976
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 March 2012
    Published in SIGPLAN Volume 47, Issue 4

    Author Tags

    1. DRAM errors
    2. correctable errors
    3. field study
    4. reliability
    5. uncorrectable errors

    Qualifiers

    • Research-article

    Article Metrics

    • Downloads (Last 12 months): 74
    • Downloads (Last 6 weeks): 14
    Reflects downloads up to 09 Nov 2024

    Cited By

    • (2024) Hyperdimensional computing: a framework for stochastic computation and symbolic AI. Journal of Big Data 11:1. DOI: 10.1186/s40537-024-01010-8. Online publication date: 24-Oct-2024
    • (2024) SoK: Rowhammer on Commodity Operating Systems. Proceedings of the 19th ACM Asia Conference on Computer and Communications Security (436-452). DOI: 10.1145/3634737.3656998. Online publication date: 1-Jul-2024
    • (2023) Mars Attacks! Proceedings of the 22nd ACM Workshop on Hot Topics in Networks (245-253). DOI: 10.1145/3626111.3628199. Online publication date: 28-Nov-2023
    • (2023) Predicting Future-System Reliability with a Component-Level DRAM Fault Model. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (944-956). DOI: 10.1145/3613424.3614294. Online publication date: 28-Oct-2023
    • (2023) CSI:Rowhammer – Cryptographic Security and Integrity against Rowhammer. 2023 IEEE Symposium on Security and Privacy (SP) (1702-1718). DOI: 10.1109/SP46215.2023.10179390. Online publication date: May-2023
    • (2023) Exploration of Bitflip’s Effect on Deep Neural Network Accuracy in Plaintext and Ciphertext. IEEE Micro 43:5 (24-34). DOI: 10.1109/MM.2023.3273115. Online publication date: 5-May-2023
    • (2023) Workload Failure Prediction for Data Centers. 2023 IEEE 16th International Conference on Cloud Computing (CLOUD) (479-485). DOI: 10.1109/CLOUD60044.2023.00064. Online publication date: Jul-2023
    • (2022) GPU Devices for Safety-Critical Systems: A Survey. ACM Computing Surveys 55:7 (1-37). DOI: 10.1145/3549526. Online publication date: 15-Dec-2022
    • (2022) Hyperdimensional hashing. Proceedings of the 59th ACM/IEEE Design Automation Conference (907-912). DOI: 10.1145/3489517.3530553. Online publication date: 10-Jul-2022
    • (2022) Performance and Power Estimation of STT-MRAM Main Memory with Reliable System-level Simulation. ACM Transactions on Embedded Computing Systems 21:1 (1-25). DOI: 10.1145/3476838. Online publication date: 14-Jan-2022
