DOI: 10.1145/3385412.3386033
research-article

Compiler-directed soft error resilience for lightweight GPU register file protection

Published: 11 June 2020

Abstract

This paper presents Penny, a compiler-directed resilience scheme for protecting GPU register files (RFs) against soft errors. Penny replaces conventional RF protection based on error correction codes (ECC) with a less expensive error detection code (EDC) combined with idempotence-based recovery. Compared to ECC protection, Penny can achieve either the same level of RF resilience at significantly lower hardware cost, or stronger resilience with the same ECC, because an ECC used solely for detection can detect multi-bit errors. GPUs lack store buffers, which causes both checkpoint storage overwriting and a high cost for checkpointing stores; to address this, Penny provides several compiler optimizations, such as storage coloring and checkpoint pruning. Across 25 benchmarks, Penny incurs only ≈3% run-time overhead on average.
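
The general idea of pairing detection-only protection with idempotence-based recovery can be illustrated with a small simulation. This is a minimal sketch and not Penny's implementation: the names `Register`, `SoftError`, `run_region`, and `demo` are hypothetical, a single parity bit stands in for the EDC, and the checkpoint storage is assumed to be error-free. The compiler optimizations named in the abstract (storage coloring, checkpoint pruning) are not modeled.

```python
class SoftError(Exception):
    """Raised when the detection code flags a corrupted register (hypothetical name)."""


def parity(word: int) -> int:
    """Even-parity bit over a 32-bit value; a 1-bit EDC used here for illustration."""
    return bin(word & 0xFFFFFFFF).count("1") & 1


class Register:
    """A 32-bit register stored together with its parity bit."""

    def __init__(self, value: int) -> None:
        self.write(value)

    def write(self, value: int) -> None:
        self.value = value & 0xFFFFFFFF
        self.check = parity(self.value)

    def read(self) -> int:
        # Detection only: a mismatch is reported, never corrected in place.
        if parity(self.value) != self.check:
            raise SoftError("parity mismatch")
        return self.value

    def flip_bit(self, bit: int) -> None:
        """Inject a fault by flipping one stored bit without updating parity."""
        self.value ^= 1 << bit


def run_region(regs, checkpoint, body, inject=None):
    """Run an idempotent region; on a detected error, restore inputs and re-execute.

    `checkpoint` maps input register names to the values saved at region entry.
    Returns the number of attempts needed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            if inject is not None and attempts == 1:
                inject(regs)  # simulate a soft error during the first attempt
            body(regs)
            return attempts
        except SoftError:
            # Recovery: reload the region's inputs from the checkpoint, then retry.
            for name, value in checkpoint.items():
                regs[name].write(value)


def region_body(regs):
    # Idempotent: reads r0 and r1, writes only r2, so re-execution is safe.
    regs["r2"].write(regs["r0"].read() * 3 + regs["r1"].read())


def demo():
    regs = {"r0": Register(7), "r1": Register(5), "r2": Register(0)}
    checkpoint = {"r0": 7, "r1": 5}
    attempts = run_region(
        regs, checkpoint, region_body, inject=lambda r: r["r0"].flip_bit(3)
    )
    return regs["r2"].read(), attempts


if __name__ == "__main__":
    print(demo())  # (26, 2): correct result after one re-execution
```

The region body never overwrites its own inputs, which is what makes re-execution from the restored checkpoint produce the same result as a fault-free run.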

PLDI ’20, June 15–20, 2020, London, UK. H. Kim, J. Zeng, Q. Liu, M. Abdel-Majeed, J. Lee, and C. Jung

Information & Contributors

Information

Published In

PLDI 2020: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2020
1174 pages
ISBN: 9781450376136
DOI: 10.1145/3385412
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ECC
  2. GPU
  3. Resilience

Qualifiers

  • Research-article

Conference

PLDI '20

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 59
  • Downloads (last 6 weeks): 8
Reflects downloads up to 22 Nov 2024

Citations

Cited By

  • (2024) Soft Error Resilience at Near-Zero Cost. Proceedings of the 38th ACM International Conference on Supercomputing, 176-187. DOI: 10.1145/3650200.3656605. Online publication date: 30-May-2024
  • (2024) ApproxDup: Developing an Approximate Instruction Duplication Mechanism for Efficient SDC Detection in GPGPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:4, 1051-1064. DOI: 10.1109/TCAD.2023.3330821. Online publication date: Apr-2024
  • (2024) Compiler-Directed Whole-System Persistence. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 961-977. DOI: 10.1109/ISCA59077.2024.00074. Online publication date: 29-Jun-2024
  • (2024) Evaluating the Soft Error Resilience of Graph Applications on GPGPUs. 2024 IEEE 10th Conference on Big Data Security on Cloud (BigDataSecurity), 84-89. DOI: 10.1109/BigDataSecurity62737.2024.00022. Online publication date: 10-May-2024
  • (2023) SweepCache: Intermittence-Aware Cache on the Cheap. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 1059-1074. DOI: 10.1145/3613424.3623781. Online publication date: 28-Oct-2023
  • (2023) Persistent Processor Architecture. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 1075-1091. DOI: 10.1145/3613424.3623772. Online publication date: 28-Oct-2023
  • (2023) Write-Light Cache for Energy Harvesting Systems. Proceedings of the 50th Annual International Symposium on Computer Architecture, 1-13. DOI: 10.1145/3579371.3589098. Online publication date: 17-Jun-2023
  • (2023) RTailor: Parameterizing Soft Error Resilience for Mixed-Criticality Real-Time Systems. 2023 IEEE Real-Time Systems Symposium (RTSS), 344-357. DOI: 10.1109/RTSS59052.2023.00037. Online publication date: 5-Dec-2023
  • (2022) Near-Zero Downtime Recovery From Transient-Error-Induced Crashes. IEEE Transactions on Parallel and Distributed Systems 33:4, 765-778. DOI: 10.1109/TPDS.2021.3096055. Online publication date: 1-Apr-2022
  • (2022) CapOS: Capacitor Error Resilience for Energy Harvesting Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41:11, 4539-4550. DOI: 10.1109/TCAD.2022.3202861. Online publication date: Nov-2022
