Stash: have your scratchpad and cache it too

Published: 13 June 2015
DOI: 10.1145/2749469.2750374

Abstract

Heterogeneous systems rely on specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems use specialized memories (e.g., scratchpads and FIFOs) to handle targeted data more efficiently. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system of CPUs and GPUs with scratchpads and caches.
We introduce a new memory organization, the stash, which combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and energy efficiency than both a scratchpad and a cache, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks that exploit new use cases (e.g., reuse across GPU compute kernels), the stash reduces execution cycles by an average of 27% and 13% relative to scratchpads and caches, respectively, and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the stash's new features, it reduces cycles by 10% and 12% on average (max 22% and 31%) relative to scratchpads and caches, respectively, and energy by 16% and 32% on average (max 30% and 51%).
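The contrast is easiest to see in code. The sketch below, written against standard CUDA, shows the conventional scratchpad pattern whose explicit global-to-private copies the stash is meant to eliminate; the trailing comments describe how the same tile would be handled by a stash, based only on the properties stated in the abstract. The stash_map call and __stash__ qualifier in those comments are hypothetical illustrations, not an API from the paper or from CUDA.

    // Baseline: explicit global <-> scratchpad copies using standard CUDA shared memory.
    // Each thread block stages a TILE-sized chunk of 'data', scales it, and writes it back.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define TILE 256

    __global__ void scale_tiles_scratchpad(float *data, int n, float alpha) {
        __shared__ float tile[TILE];                 // scratchpad: directly addressed, private to the block
        int gid = blockIdx.x * TILE + threadIdx.x;   // this thread's element in global memory

        if (gid < n) tile[threadIdx.x] = data[gid];  // explicit copy-in: global -> scratchpad
        __syncthreads();                             // barrier, as in the typical tiling pattern

        if (gid < n) tile[threadIdx.x] *= alpha;     // compute on the private copy

        if (gid < n) data[gid] = tile[threadIdx.x];  // explicit copy-out: scratchpad -> global
    }

    // With a stash, the per-element copy-in/copy-out above would disappear: the tile
    // would be mapped once onto a region of the global address space and then accessed
    // with the same cheap, tag- and TLB-free direct addressing, while data movement is
    // implicit and the data stays globally visible. Hypothetical sketch only --
    // '__stash__' and 'stash_map' are not real CUDA constructs:
    //
    //   __stash__ float tile[TILE];
    //   stash_map(tile, data + blockIdx.x * TILE, TILE);  // one mapping, no staging copies
    //   if (gid < n) tile[threadIdx.x] *= alpha;          // direct stash access, globally visible
    //

    int main() {
        const int n = 1 << 20;
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));
        scale_tiles_scratchpad<<<(n + TILE - 1) / TILE, TILE>>>(d_data, n, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        printf("kernel finished\n");
        return 0;
    }

The per-element copy-in and copy-out lines are exactly the inefficient data movement between the global and private spaces that the abstract refers to; with a stash, the same directly addressed accesses would instead get implicit data movement and global visibility.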



Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469
Publisher: Association for Computing Machinery, New York, NY, United States
