Lessons learned from memory errors observed over the lifetime of Cielo

Authors:

Kurt B. Ferreira,

Nathan DeBardeleben,

Taniya Siddiqua,

Vilas Sridharan,

Elisabeth Baseman

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

Article No.: 43, Pages 1 - 12

Published: 11 November 2018

Abstract

Maintaining the performance of high-performance computing (HPC) applications as failures increase is a major challenge for next-generation extreme-scale systems. Recent work demonstrates that hardware failures are expected to become more common. Few existing studies, however, have examined failures in the context of the entire lifetime of a single platform. In this paper, we analyze a corpus of empirical failure data collected over the entire five-year lifetime of Cielo, a leadership-class HPC system. Our analysis reveals several important findings about failures on Cielo: (i) its memory (DRAM and SRAM) exhibited no aging effects; detectable, uncorrectable errors (DUE) showed no discernible increase over its five-year lifetime; (ii) contrary to popular belief, correctable DRAM faults are not predictive of future uncorrectable DRAM faults; (iii) the majority of system down events have no identifiable hardware root cause, highlighting the need for more comprehensive logging facilities to improve failure analysis on future systems; and (iv) continued advances will be needed in order for current failure mitigation techniques to be viable on future systems. Our analysis of this corpus of empirical data provides critical analysis of, and guidance for, the deployment of extreme-scale systems.

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

November 2018

932 pages

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

IEEE Press

Qualifiers

Research-article

Conference

SC18

SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 11 - 16, 2018

Dallas, Texas

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
