DOI: 10.1145/2503210.2503257
research-article

Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults

Published: 17 November 2013

Abstract

Several recent publications confirm that faults are common in high-performance computing systems. Therefore, further attention to the faults experienced by such computing systems is warranted. In this paper, we present a study of DRAM and SRAM faults in large high-performance computing systems. Our goal is to understand the factors that influence faults in production settings.
We examine the impact of aging on DRAM, finding a marked shift from permanent to transient faults in the first two years of DRAM lifetime. We examine the impact of DRAM vendor, finding that fault rates vary by more than 4x among vendors. We examine the physical location of faults in a DRAM device and in a data center; contrary to prior studies, we find no correlations with either. Finally, we study the impact of altitude and rack placement on SRAM faults, finding that, as expected, altitude has a substantial impact on SRAM faults, and that top-of-rack placement correlates with a 20% higher fault rate.
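
Comparisons such as "more than 4x among vendors" or "20% higher fault rate" are ratios of fault rates normalized by exposure time. The sketch below shows one way such a comparison could be computed, assuming rates are expressed in FIT (faults per billion device-hours), a common reliability unit; the abstract does not state which unit the paper uses. The function names and all counts and device-hours are hypothetical and are not data from the paper.

```python
# Illustrative sketch only: every number below is made up, not taken from the paper.

BILLION_HOURS = 1e9  # FIT is defined as faults per 10^9 device-hours


def fit_rate(fault_count: int, device_hours: float) -> float:
    """Return the fault rate in FIT (faults per billion device-hours)."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return fault_count * BILLION_HOURS / device_hours


def relative_increase(rate_a: float, rate_b: float) -> float:
    """Return how much higher rate_a is than rate_b, as a fraction of rate_b."""
    if rate_b <= 0:
        raise ValueError("rate_b must be positive")
    return rate_a / rate_b - 1.0


# Hypothetical populations with equal exposure (2e9 device-hours each).
top_of_rack = fit_rate(fault_count=12, device_hours=2.0e9)
lower_rack = fit_rate(fault_count=10, device_hours=2.0e9)

increase = relative_increase(top_of_rack, lower_rack)
print(f"top of rack: {top_of_rack:.1f} FIT, lower rack: {lower_rack:.1f} FIT")
print(f"relative increase: {increase:.0%}")
```

Normalizing by device-hours before taking the ratio matters because groups being compared (vendors, rack positions, sites) rarely contain the same number of devices or the same observation period.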

Information & Contributors

Information

Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN:9781450323789
DOI:10.1145/2503210
General Chair: William Gropp
Program Chair: Satoshi Matsuoka
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SC13
Acceptance Rates

SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
