DOI: 10.5555/3291656.3291714
research-article

Lessons learned from memory errors observed over the lifetime of Cielo

Published: 11 November 2018

Abstract

Maintaining the performance of high-performance computing (HPC) applications as failures increase is a major challenge for next-generation extreme-scale systems. Recent work demonstrates that hardware failures are expected to become more common. Few existing studies, however, have examined failures in the context of the entire lifetime of a single platform. In this paper, we analyze a corpus of empirical failure data collected over the entire five-year lifetime of Cielo, a leadership-class HPC system. Our analysis reveals several important findings about failures on Cielo: (i) its memory (DRAM and SRAM) exhibited no aging effects; detectable, uncorrectable errors (DUE) showed no discernible increase over its five-year lifetime; (ii) contrary to popular belief, correctable DRAM faults are not predictive of future uncorrectable DRAM faults; (iii) the majority of system down events have no identifiable hardware root cause, highlighting the need for more comprehensive logging facilities to improve failure analysis on future systems; and (iv) continued advances will be needed in order for current failure mitigation techniques to be viable on future systems. Our analysis of this corpus of empirical data provides critical analysis of, and guidance for, the deployment of extreme-scale systems.


      Published In

      SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
      November 2018
      932 pages

      Sponsors

      In-Cooperation

      • IEEE CS

      Publisher

      IEEE Press


      Qualifiers

      • Research-article

      Conference

      SC18

      Acceptance Rates

      Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
