DOI: 10.1145/3369583.3392670

PAC: Paged Adaptive Coalescer for 3D-Stacked Memory

Published: 23 June 2020

Abstract

Many contemporary data-intensive applications exhibit irregular and highly concurrent memory access patterns that challenge the performance of conventional memory systems. Driven by the growing need for high-bandwidth, low-latency memory, 3D-stacked memory devices such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) were designed to provide significantly higher throughput than standard JEDEC DDR devices. However, existing memory interfaces and coalescing models, designed for conventional DDR devices, cannot fully exploit the bandwidth potential of these new 3D-stacked memory devices. To remedy this disparity, we introduce a novel Paged Adaptive Coalescer (PAC) infrastructure with a scalable coalescing network for 3D-stacked memory. We present the design and simulated implementation of this approach on RISC-V embedded cores with attached HMC devices. Extensive evaluations show that the proposed PAC methodology yields an average coalescing efficiency of 56.01%, and that PAC reduces bank conflicts and power consumption by 85.16% and 59.21%, respectively. Overall, PAC achieves an average performance gain of 14.35% (and up to 26.06%) across 14 test suites. These results showcase the potential of the PAC methodology for architecture design targeting increasingly critical data-intensive algorithms and applications.

Supplementary Material

MP4 File (3369583.3392670.mp4)
Driven by the growing need for high-bandwidth, low-latency memory, 3D-stacked memory devices were designed to provide significantly higher throughput than standard JEDEC DDR devices. However, existing memory interfaces and coalescing models, designed for conventional DDR devices, cannot fully exploit the bandwidth potential of these new devices. To remedy this disparity, we introduce a new Paged Adaptive Coalescer (PAC) methodology and associated design for effectively performing dynamic memory coalescing (DMC) on emerging 3D-stacked memory. PAC is designed with a pipelined coalescing network that aggregates memory requests at the granularity of physical pages to enhance the bandwidth utilization of the 3D-stacked memory. It also extends the miss status holding registers (MSHRs) to adaptively merge requests of flexible sizes. In addition to the PAC design, we also present our simulated implementation and evaluation of this work.
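The idea described above, bucketing outstanding requests by physical page and merging adjacent requests within a page into wider accesses (loosely mimicking flexible-size MSHR merging), can be illustrated with a minimal sketch. This is not the paper's implementation: the page size, the `max_span` cap, and the merge policy are illustrative assumptions.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed physical page size for illustration


def coalesce_requests(requests, max_span=256):
    """Toy page-granularity coalescer.

    `requests` is a list of (address, size) memory accesses. Requests are
    first bucketed by physical page, then adjacent or overlapping requests
    within a page are merged into wider accesses, subject to an assumed
    device limit `max_span` on the merged request size.
    """
    pages = defaultdict(list)
    for addr, size in requests:
        pages[addr // PAGE_SIZE].append((addr, size))

    coalesced = []
    for _, reqs in sorted(pages.items()):
        reqs.sort()
        cur_start, cur_end = reqs[0][0], reqs[0][0] + reqs[0][1]
        for addr, size in reqs[1:]:
            # Merge if the next request touches the current span and the
            # merged span stays within the size cap; otherwise emit.
            if addr <= cur_end and max(cur_end, addr + size) - cur_start <= max_span:
                cur_end = max(cur_end, addr + size)
            else:
                coalesced.append((cur_start, cur_end - cur_start))
                cur_start, cur_end = addr, addr + size
        coalesced.append((cur_start, cur_end - cur_start))
    return coalesced
```

For example, `coalesce_requests([(0, 16), (16, 16), (48, 16), (4096, 32)])` merges the two contiguous 16-byte reads into one 32-byte access, leaves the non-adjacent request at 48 alone, and keeps the request in the next page separate, yielding `[(0, 32), (48, 16), (4096, 32)]`.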




Published In

HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
June 2020
246 pages
ISBN: 9781450370523
DOI: 10.1145/3369583

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. 3D-stacked memory
  2. data-intensive computing
  3. memory coalescing

Qualifiers

  • Research-article

Conference

HPDC '20

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 147
  • Downloads (last 12 months): 7
  • Downloads (last 6 weeks): 0

Reflects downloads up to 19 Nov 2024
