DOI: 10.1145/2063384.2063445
research-article

System implications of memory reliability in exascale computing

Published: 12 November 2011

Abstract

Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. Error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While there are numerous studies of ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chipkill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose to use BCH in tagged memory systems with commodity DRAMs where chipkill is impractical. Our proposed architecture achieves 2.3x better system EDP than state-of-the-art tagged memory systems.
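
The interaction between ECC strength and checkpointing that the abstract describes can be illustrated with a small back-of-the-envelope model. The sketch below is not the paper's simulation methodology. It assumes Young's classic first-order approximation for the optimum checkpoint interval, sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the mean time between failures, and it takes EDP as energy multiplied by runtime. All function names and every number in the usage example are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_runtime(work_s: float, checkpoint_cost_s: float,
                     mtbf_s: float, interval_s: float) -> float:
    """First-order runtime estimate (an assumption of this sketch).

    Overhead per unit of useful work is the checkpoint cost amortized over
    the interval, plus the expected rework after a failure (on average half
    an interval is lost per failure).
    """
    overhead = checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)
    return work_s * (1.0 + overhead)


def energy_delay_product(power_w: float, runtime_s: float) -> float:
    """EDP = energy * delay, with energy = average power * runtime."""
    return (power_w * runtime_s) * runtime_s


if __name__ == "__main__":
    # Hypothetical comparison: a stronger ECC scheme raises the MTBF seen by
    # the checkpointing layer but costs more power than a weaker one.
    work_s = 24 * 3600.0          # one day of useful computation
    checkpoint_cost_s = 60.0      # time to write one checkpoint
    schemes = {
        "weaker ECC (hypothetical)": {"mtbf_s": 2 * 3600.0, "power_w": 100.0},
        "stronger ECC (hypothetical)": {"mtbf_s": 8 * 3600.0, "power_w": 110.0},
    }
    for name, cfg in schemes.items():
        tau = young_interval(checkpoint_cost_s, cfg["mtbf_s"])
        runtime = expected_runtime(work_s, checkpoint_cost_s, cfg["mtbf_s"], tau)
        edp = energy_delay_product(cfg["power_w"], runtime)
        print(f"{name}: interval={tau:.0f} s, runtime={runtime:.0f} s, EDP={edp:.3e} J*s")
```

Because runtime enters EDP twice, a scheme that lengthens the time between failures can win on EDP even when it draws more power, and the reverse can hold when the ECC overhead itself slows memory-bound code. Which effect dominates depends on the workload, which is the trade-off the abstract reports.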


    Published In

    SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2011
    866 pages
    ISBN: 9781450307710
    DOI: 10.1145/2063384
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. BCH
    2. DRAM
    3. ECC
    4. checkpointing
    5. chipkill
    6. exascale computing
    7. memory system
    8. reliability
    9. tagged memory

    Qualifiers

    • Research-article

    Acceptance Rates

    SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Cited By

    • (2024) A characterization of soft-error sensitivity in data-parallel and model-parallel distributed deep learning. Journal of Parallel and Distributed Computing, 190:C. DOI: 10.1016/j.jpdc.2024.104879. Online publication date: 1-Aug-2024
    • (2023) Redundant Array of Independent Memory Devices. IEEE Computer Architecture Letters, 22:2 (181-184). DOI: 10.1109/LCA.2023.3334989. Online publication date: Jul-2023
    • (2021) Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03892-4. Online publication date: 9-Jun-2021
    • (2020) SERN: Modeling and Analyzing the Soft Error Reliability of Convolutional Neural Networks. Proceedings of the 2020 on Great Lakes Symposium on VLSI (445-450). DOI: 10.1145/3386263.3406938. Online publication date: 7-Sep-2020
    • (2020) WoLFRaM: Enhancing Wear-Leveling and Fault Tolerance in Resistive Memories using Programmable Address Decoders. 2020 IEEE 38th International Conference on Computer Design (ICCD) (187-196). DOI: 10.1109/ICCD50377.2020.00044. Online publication date: Oct-2020
    • (2020) Towards a Model to Estimate the Reliability of Large-Scale Hybrid Supercomputers. Euro-Par 2020: Parallel Processing (37-51). DOI: 10.1007/978-3-030-57675-2_3. Online publication date: 24-Aug-2020
    • (2019) Innovations in the Memory System. Synthesis Lectures on Computer Architecture, 14:2 (1-151). DOI: 10.2200/S00933ED1V01Y201906CAC048. Online publication date: 10-Sep-2019
    • (2019) BonVoision. Proceedings of the ACM International Conference on Supercomputing (484-496). DOI: 10.1145/3330345.3330388. Online publication date: 26-Jun-2019
    • (2019) ρ. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (659-671). DOI: 10.1145/3297858.3304045. Online publication date: 4-Apr-2019
    • (2017) Odd-ECC. Proceedings of the International Symposium on Memory Systems (96-111). DOI: 10.1145/3132402.3132443. Online publication date: 2-Oct-2017
