The Path to Exascale: Code Optimizations and Hardening Solutions Reliability

Published: 15 June 2015

Abstract

Graphics Processing Units (GPUs) are now the most common general-purpose computing accelerators in High Performance Computing (HPC) systems. Their performance and energy efficiency enable extremely powerful HPC systems to be built. However, as machine scale increases, so does the reliability problem: failures on an exascale system are expected to occur every few hours.
We present data obtained at the Los Alamos Neutron Science Center and measure how algorithm optimizations and hardening strategies affect the Silent Data Corruption (SDC) and crash sensitivity of modern GPUs. We extend the reliability analysis by evaluating the Mean Executions Between Failures and Mean Workload Between Failures of the different algorithm implementations. We further explore the trade-off between reliability and performance by applying hardening strategies to already optimized codes. We show that common strategies, such as ECC and checkpoint-rollback, can be no match for strategies like Algorithm-Based Fault Tolerance and even Duplication with Comparison.
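
As an illustration of the Algorithm-Based Fault Tolerance idea named above, the following is a minimal sketch of checksum-encoded matrix multiplication in plain Python. It is not the authors' GPU implementation. The function names (`abft_matmul`, `find_faults`, `correct_single_fault`) and the absolute tolerance `tol` are assumptions made for this example, and the correction step assumes at most one corrupted element in the product.

```python
def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def abft_matmul(a, b):
    """Return the checksum-encoded product of a and b.

    A gets an extra row holding its column sums and B gets an extra column
    holding its row sums, so the last row and last column of the result are
    checksums of the ordinary product stored in the remaining cells.
    """
    a_col_checksum = a + [[sum(col) for col in zip(*a)]]
    b_row_checksum = [row + [sum(row)] for row in b]
    return matmul(a_col_checksum, b_row_checksum)


def find_faults(full, tol=1e-9):
    """Return (bad_rows, bad_cols): indices whose checksum does not match.

    tol is an assumed absolute tolerance; floating-point data would
    normally need a tolerance scaled to the magnitude of the values.
    """
    n, m = len(full) - 1, len(full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(full[i][:m]) - full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(full[i][j] for i in range(n)) - full[n][j]) > tol]
    return bad_rows, bad_cols


def correct_single_fault(full, tol=1e-9):
    """Repair one corrupted data element in place; return True if repaired."""
    bad_rows, bad_cols = find_faults(full, tol)
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        return False  # no fault, or a pattern this sketch cannot repair
    i, j = bad_rows[0], bad_cols[0]
    m = len(full[0]) - 1
    # The row checksum minus the other elements of the row gives the true value.
    full[i][j] = full[i][m] - sum(full[i][k] for k in range(m) if k != j)
    return True
```

A caller would compute `full = abft_matmul(a, b)`, run `correct_single_fault(full)` after the multiplication, and then read the product from every row and column except the last. Unlike Duplication with Comparison, which runs the computation twice and compares the results, this scheme adds only one extra row and one extra column of work.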

Cited By

  • (2022) Paralellism-Based Techniques for Slowing Down Soft Error Propagation. 2022 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), pages 1-6. DOI: 10.1109/DASC/PiCom/CBDCom/Cy55231.2022.9927870. Online publication date: 12-Sep-2022

      Published In

      FTXS '15: Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale
      June 2015
      78 pages
      ISBN:9781450335690
      DOI:10.1145/2751504
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. error-correcting codes
      2. hardening strategy
      3. neutron beam testing
      4. radiation testing

      Qualifiers

      • Research-article

      Conference

      HPDC'15
      Acceptance Rates

      FTXS '15 Paper Acceptance Rate 9 of 15 submissions, 60%;
      Overall Acceptance Rate 16 of 25 submissions, 64%
