research-article

A case for exploiting subarray-level parallelism (SALP) in DRAM

Authors:

Vivek Seshadri,

Onur Mutlu

ACM SIGARCH Computer Architecture News, Volume 40, Issue 3

Pages 368 - 379

https://doi.org/10.1145/2366231.2337202

Published: 09 June 2012

Abstract

Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of off-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low-cost approach. To this end, we propose three new mechanisms that overlap the latencies of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing few global peripheral structures.
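The serialization problem described above can be illustrated with a toy model. The sketch below is not taken from the paper: the function name `total_service_time` and the 50-cycle access latency are hypothetical, chosen only to show that two requests to different banks finish in one access time while two requests to the same bank take two.

```python
# Hypothetical access latency in arbitrary cycles (an assumption,
# not a DRAM datasheet value).
ACCESS_LATENCY = 50

def total_service_time(request_banks, access_latency=ACCESS_LATENCY):
    """Time to finish all requests when each bank serves its own
    requests one at a time and different banks work in parallel."""
    busy_time = {}
    for bank in request_banks:
        # Requests to the same bank queue up behind each other.
        busy_time[bank] = busy_time.get(bank, 0) + access_latency
    # Banks run concurrently, so the slowest bank sets the finish time.
    return max(busy_time.values(), default=0)

different_banks = total_service_time([0, 1])  # one request per bank
same_bank = total_service_time([0, 0])        # both requests hit bank 0
print(different_banks, same_bank)
```

In this simplified model the same-bank case takes twice as long, which is the cost the proposed mechanisms aim to reduce without adding banks.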

Our proposed mechanisms (SALP-1, SALP-2, and MASA) mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank. SALP-1 requires no changes to the existing DRAM structure and only needs reinterpretation of some DRAM timing parameters. SALP-2 and MASA require only modest changes (< 0.15% area overhead) to the DRAM peripheral structures, which are much less design-constrained than the DRAM core. Evaluations show that all our schemes significantly improve performance for both single-core systems and multi-core systems. Our schemes also interact positively with application-aware memory request scheduling in multi-core systems.
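The general idea of overlapping latency components across subarrays can be sketched as follows. This is a rough illustration built only on the abstract's description, and it does not model SALP-1, SALP-2, or MASA specifically. It assumes each access splits into a phase private to one subarray and a phase on the shared global structures; the function `bank_finish_time` and the 35-cycle and 15-cycle values are hypothetical.

```python
def bank_finish_time(subarrays, private=35, shared=15, overlap=True):
    """Finish time for an in-order queue of requests to one bank.

    subarrays: subarray index targeted by each request.
    private:   cycles spent inside the subarray (hypothetical value).
    shared:    cycles spent on shared global structures (hypothetical).
    overlap:   if False, the whole bank is busy for every request.
    """
    subarray_free = {}  # time at which each subarray becomes idle
    shared_free = 0     # time at which the shared structures become idle
    bank_free = 0       # time at which the previous request finished
    for s in subarrays:
        # Without overlap, a request waits for the entire bank; with
        # overlap, it waits only for its own subarray.
        start = subarray_free.get(s, 0) if overlap else bank_free
        private_done = start + private
        # The shared phase is always serialized across requests.
        done = max(private_done, shared_free) + shared
        subarray_free[s] = done
        shared_free = done
        bank_free = done
    return bank_free

print(bank_finish_time([0, 1], overlap=False))  # serialized baseline
print(bank_finish_time([0, 1], overlap=True))   # different subarrays
print(bank_finish_time([0, 0], overlap=True))   # same subarray
```

Under these assumptions, two requests to different subarrays finish sooner than the serialized baseline because only the shared phase is serialized, while two requests to the same subarray see no benefit.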

Information & Contributors

Information

Published In

ACM SIGARCH Computer Architecture News Volume 40, Issue 3

ISCA '12

June 2012

559 pages

ISSN:0163-5964

DOI:10.1145/2366231

ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture
June 2012
584 pages
ISBN:9781450316422
General Chair: Shih-Lien Lu (Intel)
Program Chair: Josep Torrellas (University of Illinois)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
