XED: exposing on-die error detection information for strong memory reliability

Published: 18 June 2016

Abstract

Large-granularity memory failures continue to be a critical impediment to system reliability. To make matters worse, as DRAM scales to smaller nodes, the frequency of unreliable bits in DRAM chips continues to increase. To mitigate such scaling-related failures, memory vendors are planning to equip existing DRAM chips with On-Die ECC. To maintain compatibility with memory standards, On-Die ECC is kept invisible to the memory controller.
This paper explores how to design high-reliability memory systems in the presence of On-Die ECC. We show that if On-Die ECC is not exposed to the memory system, having a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefits compared to an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM. To this end, we propose eXposed On-Die Error Detection (XED), which exposes the On-Die error detection information without requiring changes to the memory standards or incurring bandwidth overheads. When the On-Die ECC detects an error, XED transmits a pre-defined "catch-word" instead of the corrected data value. On receiving the catch-word, the memory controller uses the parity stored in the ninth chip of the ECC-DIMM to correct the faulty chip (similar to RAID-3). Our studies show that XED provides Chipkill-level reliability (172x higher than SECDED), while incurring negligible overheads, with a 21% lower execution time than Chipkill. We also show that XED can enable Chipkill systems to provide Double-Chipkill level reliability while avoiding the associated storage, performance, and power overheads.
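
The catch-word and parity mechanism described above can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the paper's implementation: the 64-bit word per chip, the `CATCH_WORD` value, and the function names `write_line`, `chip_read`, and `controller_read` are all hypothetical, and the sketch makes no attempt to treat legitimate data that happens to equal the catch-word specially.

```python
from functools import reduce
from operator import xor

# Hypothetical 64-bit catch-word; the abstract does not give a concrete value.
CATCH_WORD = 0xDEADBEEFCAFEF00D
DATA_CHIPS = 8  # assumed: 8 data chips plus 1 parity chip on the ECC-DIMM


def write_line(words):
    """Return the 9 words stored for one line: 8 data words plus their XOR parity."""
    assert len(words) == DATA_CHIPS
    return list(words) + [reduce(xor, words)]


def chip_read(stored, on_die_error_detected):
    """Model one chip's response: send the catch-word when on-die ECC flags an error."""
    return CATCH_WORD if on_die_error_detected else stored


def controller_read(responses):
    """Recover the 8 data words from 9 chip responses, rebuilding at most one flagged chip."""
    flagged = [i for i, w in enumerate(responses) if w == CATCH_WORD]
    if len(flagged) > 1:
        raise ValueError("more than one chip flagged; a single parity word cannot rebuild both")
    words = list(responses)
    if flagged:
        bad = flagged[0]
        # RAID-3 style erasure correction: XOR of the other eight words yields the missing one.
        words[bad] = reduce(xor, (w for i, w in enumerate(responses) if i != bad))
    return words[:DATA_CHIPS]


data = [0x1111 * (i + 1) for i in range(DATA_CHIPS)]
stored = write_line(data)
# Chip 3 detects an on-die error and sends the catch-word instead of data.
responses = [chip_read(w, i == 3) for i, w in enumerate(stored)]
recovered = controller_read(responses)
```

Because the catch-word identifies which chip is faulty, the controller only needs to correct an erasure at a known position, which is why a single parity chip suffices in this model.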

Published In

ACM SIGARCH Computer Architecture News, Volume 44, Issue 3
ISCA'16
June 2016
730 pages
ISSN: 0163-5964
DOI: 10.1145/3007787

ISCA '16: Proceedings of the 43rd International Symposium on Computer Architecture
June 2016
756 pages
ISBN: 9781467389471
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. RAID-3
  2. chipkill
  3. double-chipkill
  4. on-die ECC

Qualifiers

  • Research-article

Cited By

  • (2024) Improving DRAM Reliability Using a High Order Error Correction Code. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:12 (4775-4785). DOI: 10.1109/TCAD.2024.3400677. Online publication date: Dec-2024
  • (2024) Revisiting row hammer: A deep dive into understanding and resolving the issue. Microelectronics Reliability 160 (115467). DOI: 10.1016/j.microrel.2024.115467. Online publication date: Sep-2024
  • (2023) How to Kill the Second Bird with One ECC: The Pursuit of Row Hammer Resilient DRAM. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (986-1001). DOI: 10.1145/3613424.3623777. Online publication date: 28-Oct-2023
  • (2023) Predicting Future-System Reliability with a Component-Level DRAM Fault Model. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (944-956). DOI: 10.1145/3613424.3614294. Online publication date: 28-Oct-2023
  • (2023) Structural Coding: A Low-Cost Scheme to Protect CNNs from Large-Granularity Memory Faults. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-17). DOI: 10.1145/3581784.3607084. Online publication date: 12-Nov-2023
  • (2023) Unity ECC: Unified Memory Protection Against Bit and Chip Errors. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-16). DOI: 10.1145/3581784.3607081. Online publication date: 12-Nov-2023
  • (2023) Construction of Cyclic Redundancy Check Codes for SDDC Decoding in DRAM Systems. IEEE Transactions on Circuits and Systems II: Express Briefs 70:2 (736-740). DOI: 10.1109/TCSII.2022.3175066. Online publication date: Feb-2023
  • (2023) Review of Memory RAS for Data Centers. IEEE Access 11 (124782-124796). DOI: 10.1109/ACCESS.2023.3329984. Online publication date: 2023
  • (2022) ECMO: ECC Architecture Reusing Content-Addressable Memories for Obtaining High Reliability in DRAM. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 30:6 (781-793). DOI: 10.1109/TVLSI.2022.3153894. Online publication date: Jun-2022
  • (2022) Understanding RowHammer Under Reduced Wordline Voltage: An Experimental Study Using Real DRAM Devices. 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (475-487). DOI: 10.1109/DSN53405.2022.00054. Online publication date: Jun-2022