research-article

A case for exploiting subarray-level parallelism (SALP) in DRAM

Published: 09 June 2012

Abstract

Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of off-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low-cost approach. To this end, we propose three new mechanisms that overlap the latencies of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing few global peripheral structures.
Our proposed mechanisms (SALP-1, SALP-2, and MASA) mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank. SALP-1 requires no changes to the existing DRAM structure and only needs reinterpretation of some DRAM timing parameters. SALP-2 and MASA require only modest changes (<0.15% area overhead) to the DRAM peripheral structures, which are much less design-constrained than the DRAM core. Evaluations show that all our schemes significantly improve performance for both single-core and multi-core systems. Our schemes also interact positively with application-aware memory request scheduling in multi-core systems.
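
The idea of overlapping latency components across subarrays can be illustrated with a toy timing model. The sketch below is not the paper's mechanism or simulator. It assumes, purely for illustration, that a bank access consists of three equal-length components (activate, column access, precharge), that every request targets a different row in the same bank, and that the precharge of one subarray can be hidden behind the activation of a different subarray. The cycle counts and the names `Request`, `serialized_latency`, and `overlapped_latency` are hypothetical.

```python
from dataclasses import dataclass

# Illustrative latency components in DRAM cycles. These values are assumptions
# chosen for the example, not figures taken from the paper.
T_ACTIVATE = 11   # open a row into the subarray's row buffer
T_COLUMN = 11     # column access once the row is open
T_PRECHARGE = 11  # close the row; assumed no longer than T_ACTIVATE


@dataclass(frozen=True)
class Request:
    bank: int
    subarray: int
    row: int


def serialized_latency(requests):
    """Baseline: same-bank requests run activate, access, precharge back to back."""
    return len(requests) * (T_ACTIVATE + T_COLUMN + T_PRECHARGE)


def overlapped_latency(requests):
    """Toy model: if the next request targets a different subarray, the current
    request's precharge proceeds in parallel with the next activation and adds
    no time to the total."""
    total = 0
    for i, req in enumerate(requests):
        total += T_ACTIVATE + T_COLUMN
        nxt = requests[i + 1] if i + 1 < len(requests) else None
        hidden = nxt is not None and nxt.subarray != req.subarray
        if not hidden:
            total += T_PRECHARGE
    return total


if __name__ == "__main__":
    different = [Request(bank=0, subarray=0, row=5), Request(bank=0, subarray=1, row=9)]
    same = [Request(bank=0, subarray=0, row=5), Request(bank=0, subarray=0, row=9)]
    print("different subarrays:", serialized_latency(different), "->", overlapped_latency(different))
    print("same subarray:      ", serialized_latency(same), "->", overlapped_latency(same))
```

Under these assumptions, two requests to different subarrays of one bank finish in 55 cycles instead of 66, while two requests to the same subarray see no benefit. This matches the abstract's framing that the gain comes only from requests that go to different subarrays within the same bank.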


    Published In

    ACM SIGARCH Computer Architecture News, Volume 40, Issue 3
    ISCA '12
    June 2012
    559 pages
    ISSN: 0163-5964
    DOI: 10.1145/2366231

    ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture
    June 2012
    584 pages
    ISBN: 9781450316422

    Publisher

    Association for Computing Machinery

    New York, NY, United States
