DOI: 10.1145/3075564.3075598
Short paper

CAROL-FI: an Efficient Fault-Injection Tool for Vulnerability Evaluation of Modern HPC Parallel Accelerators

Published: 15 May 2017

Abstract

Transient faults are a major problem for large-scale HPC systems, and the mitigation of adverse fault effects needs to be highly efficient as we approach exascale. We developed a fault-injection tool (CAROL-FI) to identify the potential sources of adverse fault effects. A deeper understanding of such effects provides useful insights for designing efficient mitigation techniques, such as selective hardening of critical portions of the code.
We performed a fault-injection campaign that injected more than 67,000 faults into an Intel Xeon Phi executing six representative HPC programs. We show that selective hardening can be successfully applied to DGEMM and Hotspot, while LavaMD and NW may require complete code hardening.
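
To make the idea of a fault-injection campaign concrete, the following is a minimal sketch in Python. It is not CAROL-FI's implementation and does not reflect how the tool injects faults on the Xeon Phi. It assumes a single bit-flip fault model applied to one input element of a small matrix multiply (a stand-in for a DGEMM-like kernel) and a two-way outcome classification into masked runs and silent data corruptions. All function names (`flip_bit`, `matmul`, `classify`, `run_campaign`) and the example matrices are hypothetical.

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0..63) of its IEEE-754 double encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def matmul(a, b):
    """Naive dense matrix multiply, standing in for a DGEMM-like kernel."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def classify(golden, observed):
    """Label a run 'masked' (output unchanged) or 'sdc' (silent data corruption)."""
    for g_row, o_row in zip(golden, observed):
        for g, o in zip(g_row, o_row):
            both_nan = math.isnan(g) and math.isnan(o)
            if g != o and not both_nan:
                return "sdc"
    return "masked"


def run_campaign(trials: int, seed: int = 0):
    """Inject one random bit flip per trial into matrix `a` and count outcomes."""
    rng = random.Random(seed)
    a = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical input data
    b = [[0.5, 0.0], [0.0, 0.0]]  # zeros let some faults be masked
    golden = matmul(a, b)  # fault-free reference output
    counts = {"masked": 0, "sdc": 0}
    for _ in range(trials):
        faulty = [row[:] for row in a]
        i, j = rng.randrange(2), rng.randrange(2)
        faulty[i][j] = flip_bit(faulty[i][j], rng.randrange(64))
        counts[classify(golden, matmul(faulty, b))] += 1
    return counts


if __name__ == "__main__":
    print(run_campaign(1000))
```

In a sketch like this, the share of silent data corruptions attributed to each variable or code region is the kind of signal that could guide selective hardening: regions whose faults are mostly masked are weaker candidates for protection than regions whose faults usually corrupt the output.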



      Information

      Published In

      CF'17: Proceedings of the Computing Frontiers Conference
      May 2017
      450 pages
      ISBN:9781450344876
      DOI:10.1145/3075564
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Conference

      CF '17
      CF '17: Computing Frontiers Conference
      May 15 - 17, 2017
      Siena, Italy

      Acceptance Rates

      CF'17 Paper Acceptance Rate 43 of 87 submissions, 49%;
      Overall Acceptance Rate 273 of 785 submissions, 35%


      Cited By

      • (2024) Comparative analysis of soft-error sensitivity in LU decomposition algorithms on diverse GPUs. The Journal of Supercomputing. DOI: 10.1007/s11227-024-05925-0. Online publication date: 22-Feb-2024
      • (2023) DeBaTE-FI: A Debugger-Based Fault Injector Infrastructure for IoT Soft Error Reliability Assessment. 2023 IEEE 9th World Forum on Internet of Things (WF-IoT), 1-6. DOI: 10.1109/WF-IoT58464.2023.10539573. Online publication date: 12-Oct-2023
      • (2023) FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs. IEEE Transactions on Parallel and Distributed Systems 34:12, 3207-3223. DOI: 10.1109/TPDS.2023.3316011. Online publication date: Dec-2023
      • (2022) Evaluating the Impact of Hardware Faults on Program Execution in a Microkernel Environment. 2022 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), 149-152. DOI: 10.1109/HOST54066.2022.9840063. Online publication date: 27-Jun-2022
      • (2021) Analysis of Single Event Effects on Embedded Processor. Electronics 10:24, 3160. DOI: 10.3390/electronics10243160. Online publication date: 18-Dec-2021
      • (2021) FT-BLAS. Proceedings of the 35th ACM International Conference on Supercomputing, 127-138. DOI: 10.1145/3447818.3460364. Online publication date: 3-Jun-2021
      • (2021) Error resilience of three GMRES implementations under fault injection. The Journal of Supercomputing 78:5, 7158-7185. DOI: 10.1007/s11227-021-04148-x. Online publication date: 5-Nov-2021
      • (2019) Identifying the Most Reliable Collaborative Workload Distribution in Heterogeneous Devices. 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1325-1330. DOI: 10.23919/DATE.2019.8715107. Online publication date: Mar-2019
      • (2019) Reliability Evaluation of Mixed-Precision Architectures. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 238-249. DOI: 10.1109/HPCA.2019.00041. Online publication date: Feb-2019
      • (2019) Evaluating Compiler IR-Level Selective Instruction Duplication with Realistic Hardware Errors. 2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), 41-49. DOI: 10.1109/FTXS49593.2019.00010. Online publication date: Nov-2019
