
DOI: 10.1145/3528416.3530229

REMOC: efficient request managements for on-chip memories of GPUs

Published: 17 May 2022

Abstract

The on-chip memories of a GPU, including the register file, shared memory, and L1 cache, provide high-bandwidth, low-latency storage for temporary data. The capacity of the L1 cache can be increased by repurposing, as cache-lines, register-file and shared-memory space that is either unassigned to any warp or thread block, or released when warps and thread blocks finish. In this paper, we propose two techniques for managing requests to on-chip memories that improve the efficiency of an L1 cache built on such register and shared-memory borrowing. First, we develop a data-transferring policy, triggered when cache-lines are recalled by the first register or shared-memory access of a newly launched warp, that prevents the data locality held in those lines from being destroyed. Second, we design a parallel issue scheme that exploits the parallelism among an instruction's requests to the register file, shared memory, and L1 cache, decreasing processing latency and thereby increasing instruction throughput. Experimental results demonstrate that our approach improves performance by 15% over prior work.
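
To make the two mechanisms concrete, the sketches below model them in plain Python. Both are assumptions reconstructed from the abstract alone, not the paper's actual design: every class, name, capacity, and latency figure is hypothetical and chosen only for illustration.

The first sketch models the data-transferring policy. The L1 cache borrows unassigned or released register-file/shared-memory space as extra cache-lines; when the first register or shared-memory access of a newly launched warp recalls that space, the policy migrates the live data into the regular L1 lines rather than discarding it, so the locality captured in the borrowed lines survives the recall.

```python
# Hypothetical sketch of the recall-triggered data-transferring policy;
# names and structure are assumptions, not the paper's implementation.
from collections import OrderedDict

class BorrowedLineL1:
    def __init__(self, regular_lines, borrowed_lines):
        self.regular = OrderedDict()   # tag -> data, kept in LRU order
        self.borrowed = {}             # tag -> data, lines lent by RF/SMEM
        self.regular_capacity = regular_lines
        self.borrowed_capacity = borrowed_lines

    def _fill_regular(self, tag, data):
        if len(self.regular) >= self.regular_capacity:
            self.regular.popitem(last=False)   # evict the LRU regular line
        self.regular[tag] = data

    def access(self, tag, data=None):
        """Look up a line; on a miss, prefer filling a borrowed line."""
        if tag in self.regular:
            self.regular.move_to_end(tag)      # refresh LRU position
            return self.regular[tag]
        if tag in self.borrowed:
            return self.borrowed[tag]
        if len(self.borrowed) < self.borrowed_capacity:
            self.borrowed[tag] = data          # use the lent space first
        else:
            self._fill_regular(tag, data)
        return data

    def recall_borrowed(self):
        """Triggered by the first RF/SMEM access of a newly launched warp:
        the lent space must be returned, but its data is transferred into
        the regular lines instead of dropped, preserving locality."""
        for tag, data in self.borrowed.items():
            self._fill_regular(tag, data)
        self.borrowed.clear()
        self.borrowed_capacity = 0             # space handed back to RF/SMEM
```

The second sketch shows only the latency intuition behind the parallel issue scheme: an instruction's requests to the register file, shared memory, and L1 cache are independent, so issuing them in parallel bounds the instruction's request-processing latency by the slowest path instead of the sum of all three. The cycle counts below are made up for illustration.

```python
def serial_issue(latencies):
    return sum(latencies)   # baseline: requests handled one after another

def parallel_issue(latencies):
    return max(latencies)   # overlapped: limited by the slowest path

# Illustrative cycle counts: RF = 2, shared memory = 4, L1 hit = 8.
assert serial_issue([2, 4, 8]) == 14
assert parallel_issue([2, 4, 8]) == 8
```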



    Published In

    CF '22: Proceedings of the 19th ACM International Conference on Computing Frontiers
    May 2022, 321 pages
    ISBN: 9781450393386
    DOI: 10.1145/3528416

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. GPU
    2. cache
    3. register file
    4. shared memory

    Qualifiers

    • Research-article

    Funding Sources

    • Fundamental Research Funds for the Central Universities of Civil Aviation University of China

    Conference

    CF '22

    Acceptance Rates

    Overall acceptance rate: 273 of 785 submissions (35%)
