Stash: have your scratchpad and cache it too

Published: 13 June 2015
DOI: 10.1145/2749469.2750374

Abstract

Heterogeneous systems rely on specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems use specialized memories (e.g., scratchpads and FIFOs) to handle targeted data more efficiently. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system of CPUs and GPUs with scratchpads and caches.
We introduce a new memory organization, the stash, which combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and energy efficiency than both a scratchpad and a cache, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks that exploit new use cases (e.g., reuse across GPU compute kernels), the stash reduces execution cycles by an average of 27% and 13% relative to scratchpads and caches, respectively, and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the stash's new features, it reduces cycles by 10% and 12% on average (max 22% and 31%) relative to scratchpads and caches, respectively, and energy by 16% and 32% on average (max 30% and 51%).
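The contrast is easiest to see in code. The sketch below, written against standard CUDA, shows the conventional scratchpad pattern whose explicit global-to-private copies the stash is meant to eliminate; the trailing comments describe how the same tile would be handled by a stash, based only on the properties stated in the abstract. The stash_map call and __stash__ qualifier in those comments are hypothetical illustrations, not an API from the paper or from CUDA.

    // Baseline: explicit global <-> scratchpad copies using standard CUDA shared memory.
    // Each thread block stages a TILE-sized chunk of 'data', scales it, and writes it back.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define TILE 256

    __global__ void scale_tiles_scratchpad(float *data, int n, float alpha) {
        __shared__ float tile[TILE];                 // scratchpad: directly addressed, private to the block
        int gid = blockIdx.x * TILE + threadIdx.x;   // this thread's element in global memory

        if (gid < n) tile[threadIdx.x] = data[gid];  // explicit copy-in: global -> scratchpad
        __syncthreads();                             // barrier, as in the typical tiling pattern

        if (gid < n) tile[threadIdx.x] *= alpha;     // compute on the private copy

        if (gid < n) data[gid] = tile[threadIdx.x];  // explicit copy-out: scratchpad -> global
    }

    // With a stash, the per-element copy-in/copy-out above would disappear: the tile
    // would be mapped once onto a region of the global address space and then accessed
    // with the same cheap, tag- and TLB-free direct addressing, while data movement is
    // implicit and the data stays globally visible. Hypothetical sketch only --
    // '__stash__' and 'stash_map' are not real CUDA constructs:
    //
    //   __stash__ float tile[TILE];
    //   stash_map(tile, data + blockIdx.x * TILE, TILE);  // one mapping, no staging copies
    //   if (gid < n) tile[threadIdx.x] *= alpha;          // direct stash access, globally visible
    //

    int main() {
        const int n = 1 << 20;
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));
        scale_tiles_scratchpad<<<(n + TILE - 1) / TILE, TILE>>>(d_data, n, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        printf("kernel finished\n");
        return 0;
    }

The per-element copy-in and copy-out lines are exactly the inefficient data movement between the global and private spaces that the abstract refers to; with a stash, the same directly addressed accesses would instead get implicit data movement and global visibility.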



Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469
Publisher: Association for Computing Machinery, New York, NY, United States
