DOI: 10.1109/CGO.2009.14

ESoftCheck: Removal of Non-vital Checks for Fault Tolerance

Published: 22 March 2009

Abstract

As semiconductor technology scales into the deep submicron regime, the occurrence of transient or soft errors will increase. This will require new approaches to error detection. Software checking approaches are attractive because they require little hardware modification and can be easily adjusted to fit different reliability and performance requirements. Unfortunately, software checking adds a significant performance overhead. In this paper we present ESoftCheck, a set of compiler optimization techniques to determine which checks are vital, that is, the minimum number of checks that are necessary to detect an error and roll back to a correct program state. ESoftCheck identifies the vital checks in three situations: on platforms where registers are hardware-protected with parity or ECC, when there are redundant checks, and when checks appear in loops. ESoftCheck also provides knobs to trade reliability for performance based on the support for recovery and the degree of trust placed in the operations. Our experimental results on a Pentium 4 show that ESoftCheck can obtain a 27.1% performance improvement without losing fault coverage.
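The idea of deferring a check out of a loop can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's compiler pass: the function `checked_sum`, its parameters, and the bit-flip fault model are hypothetical names and choices introduced here for illustration. It assumes a duplication-based scheme in which every operation is performed on a primary and a shadow copy, and a check compares the two copies.

```python
def checked_sum(values, flip_at=None, check_every_iteration=True):
    """Sum `values` on a primary and a shadow accumulator and compare them.

    Hypothetical illustration of software checking:
    flip_at               -- iteration index at which one bit of the primary
                             accumulator is flipped (emulated transient fault)
    check_every_iteration -- if False, only the final check before the
                             result is "stored" is executed
    Returns (fault_detected, checks_executed, detection_point).
    """
    acc = 0        # primary computation
    acc_dup = 0    # redundant (shadow) computation
    checks = 0
    for i, v in enumerate(values):
        acc += v
        acc_dup += v
        if flip_at == i:
            acc ^= 1 << 3  # emulate a single-bit soft error in one copy
        if check_every_iteration:
            checks += 1
            if acc != acc_dup:
                return True, checks, i
    # Final check, placed just before the value would leave the loop
    # (for example, before a store to memory).
    checks += 1
    return acc != acc_dup, checks, len(values)


data = list(range(1, 9))
print(checked_sum(data, flip_at=2, check_every_iteration=True))
print(checked_sum(data, flip_at=2, check_every_iteration=False))
```

In this sketch both configurations detect the injected fault, because the corrupted accumulator keeps differing from its shadow copy until the final comparison. The deferred variant executes a single check instead of one per iteration, at the cost of detecting the error later, so any recovery would have to roll back further. That trade-off is one plausible reading of why the abstract ties check removal to the available support for recovery.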

Published In

CGO '09: Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
March 2009
299 pages
ISBN: 9780769535760

Publisher

IEEE Computer Society

United States

Author Tags

  1. ESoftCheck
  2. fault tolerance
  3. non-vital checks

Qualifiers

  • Article

Conference

CGO '09

Acceptance Rates

CGO '09 paper acceptance rate: 26 of 70 submissions (37%)
Overall acceptance rate: 312 of 1,061 submissions (29%)

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 1
  • Downloads (last 6 weeks): 0

Reflects downloads up to 21 Nov 2024

Cited By

  • (2024) Soft Error Resilience at Near-Zero Cost. Proceedings of the 38th ACM International Conference on Supercomputing, pp. 176-187. DOI: 10.1145/3650200.3656605. Online publication date: 30-May-2024
  • (2023) Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs. Proceedings of the 37th International Conference on Supercomputing, pp. 360-372. DOI: 10.1145/3577193.3593715. Online publication date: 21-Jun-2023
  • (2022) Trace-and-brace (TAB): bespoke software countermeasures against soft errors. Proceedings of the 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 73-85. DOI: 10.1145/3519941.3535070. Online publication date: 14-Jun-2022
  • (2021) FT-BLAS. Proceedings of the 35th ACM International Conference on Supercomputing, pp. 127-138. DOI: 10.1145/3447818.3460364. Online publication date: 3-Jun-2021
  • (2017) Towards a More Complete Understanding of SDC Propagation. Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, pp. 131-142. DOI: 10.1145/3078597.3078617. Online publication date: 26-Jun-2017
  • (2017) Trading Fault Tolerance for Performance in AN Encoding. Proceedings of the Computing Frontiers Conference, pp. 183-190. DOI: 10.1145/3075564.3075565. Online publication date: 15-May-2017
  • (2017) InCheck. Proceedings of the 54th Annual Design Automation Conference 2017, pp. 1-6. DOI: 10.1145/3061639.3062265. Online publication date: 18-Jun-2017
  • (2016) nZDC. Proceedings of the 53rd Annual Design Automation Conference, pp. 1-6. DOI: 10.1145/2897937.2898054. Online publication date: 5-Jun-2016
  • (2016) IPAS: intelligent protection against silent output corruption in scientific applications. Proceedings of the 2016 International Symposium on Code Generation and Optimization, pp. 227-238. DOI: 10.1145/2854038.2854059. Online publication date: 29-Feb-2016
  • (2014) Exploiting Narrow Data-Width to Mask Soft Errors in Register Files. Proceedings of the 33rd International Conference on Computer Safety, Reliability, and Security - Volume 8666, pp. 125-138. DOI: 10.1007/978-3-319-10506-2_9. Online publication date: 10-Sep-2014
