
DOI: 10.1145/3528416.3530229

REMOC: efficient request managements for on-chip memories of GPUs

Published: 17 May 2022

Abstract

The on-chip memories of a GPU, including the register file, shared memory, and L1 cache, provide high-bandwidth, low-latency storage for temporary data. The capacity of the L1 cache can be increased by repurposing, as cache-lines, register-file and shared-memory space that is either unassigned to any warp or thread block, or released when warps and thread blocks finish. In this paper, we propose two techniques for managing requests to on-chip memories that improve the efficiency of an L1 cache built on such register and shared-memory borrowing. First, we develop a data-transferring policy, triggered when cache-lines are recalled by the first register or shared-memory access of a newly launched warp, that prevents the data locality held in those lines from being destroyed. Second, we design a parallel issue scheme that exploits the parallelism among an instruction's requests to the register file, shared memory, and L1 cache, decreasing processing latency and thereby increasing instruction throughput. Experimental results demonstrate that our approach improves performance by 15% over prior work.
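
To make the two mechanisms concrete, the sketches below model them in plain Python. Both are assumptions reconstructed from the abstract alone, not the paper's actual design: every class, name, capacity, and latency figure is hypothetical and chosen only for illustration.

The first sketch models the data-transferring policy. The L1 cache borrows unassigned or released register-file/shared-memory space as extra cache-lines; when the first register or shared-memory access of a newly launched warp recalls that space, the policy migrates the live data into the regular L1 lines rather than discarding it, so the locality captured in the borrowed lines survives the recall.

```python
# Hypothetical sketch of the recall-triggered data-transferring policy;
# names and structure are assumptions, not the paper's implementation.
from collections import OrderedDict

class BorrowedLineL1:
    def __init__(self, regular_lines, borrowed_lines):
        self.regular = OrderedDict()   # tag -> data, kept in LRU order
        self.borrowed = {}             # tag -> data, lines lent by RF/SMEM
        self.regular_capacity = regular_lines
        self.borrowed_capacity = borrowed_lines

    def _fill_regular(self, tag, data):
        if len(self.regular) >= self.regular_capacity:
            self.regular.popitem(last=False)   # evict the LRU regular line
        self.regular[tag] = data

    def access(self, tag, data=None):
        """Look up a line; on a miss, prefer filling a borrowed line."""
        if tag in self.regular:
            self.regular.move_to_end(tag)      # refresh LRU position
            return self.regular[tag]
        if tag in self.borrowed:
            return self.borrowed[tag]
        if len(self.borrowed) < self.borrowed_capacity:
            self.borrowed[tag] = data          # use the lent space first
        else:
            self._fill_regular(tag, data)
        return data

    def recall_borrowed(self):
        """Triggered by the first RF/SMEM access of a newly launched warp:
        the lent space must be returned, but its data is transferred into
        the regular lines instead of dropped, preserving locality."""
        for tag, data in self.borrowed.items():
            self._fill_regular(tag, data)
        self.borrowed.clear()
        self.borrowed_capacity = 0             # space handed back to RF/SMEM
```

The second sketch shows only the latency intuition behind the parallel issue scheme: an instruction's requests to the register file, shared memory, and L1 cache are independent, so issuing them in parallel bounds the instruction's request-processing latency by the slowest path instead of the sum of all three. The cycle counts below are made up for illustration.

```python
def serial_issue(latencies):
    return sum(latencies)   # baseline: requests handled one after another

def parallel_issue(latencies):
    return max(latencies)   # overlapped: limited by the slowest path

# Illustrative cycle counts: RF = 2, shared memory = 4, L1 hit = 8.
assert serial_issue([2, 4, 8]) == 14
assert parallel_issue([2, 4, 8]) == 8
```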



    Published In

    CF '22: Proceedings of the 19th ACM International Conference on Computing Frontiers
    May 2022, 321 pages
    ISBN: 9781450393386
    DOI: 10.1145/3528416

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. GPU
    2. cache
    3. register file
    4. shared memory

    Qualifiers

    • Research-article

    Funding Sources

    • Fundamental Research Funds for the Central Universities of Civil Aviation University of China

    Conference

    CF '22

    Acceptance Rates

    Overall acceptance rate: 273 of 785 submissions (35%)
