Warped-MC: An Efficient Memory Controller Scheme for Massively Parallel Processors

Published: 13 September 2023 · DOI: 10.1145/3605573.3605645

Abstract

The performance of a GPU's external memory is becoming increasingly critical, since a modern GPU runs thousands of concurrent threads that demand a huge volume of data. To utilize resources in the memory hierarchy more efficiently, a GPU employs a memory coalescing scheme that reduces the number of demand requests created by a group of threads (i.e., a warp). However, memory coalescing does not work well for applications that exhibit irregular memory access patterns, so a single warp can generate multiple memory transactions. Since memory requests are serviced by different hierarchy levels and/or memory partitions, multiple outstanding requests from a single warp experience divergent fetch latencies. Because the execution time of a load warp is determined by its slowest memory transaction, memory latency divergence within a warp is a critical performance factor for load warps.
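To make the coalescing behavior concrete, the following C++ sketch (not taken from the paper; the 128-byte transaction granularity and the two access patterns are illustrative assumptions) groups the per-thread addresses of one 32-thread warp into distinct memory transactions, showing how a unit-stride pattern needs a single transaction while an irregular pattern needs many.

    // Hypothetical sketch: coalescing one warp's per-thread load addresses into
    // 128-byte memory transactions (granularity assumed for illustration).
    #include <cstdint>
    #include <iostream>
    #include <set>
    #include <vector>

    constexpr uint64_t kLineBytes = 128;  // assumed coalescing granularity
    constexpr int kWarpSize = 32;

    // Distinct 128-byte lines touched by one warp-level load instruction.
    std::set<uint64_t> coalesce(const std::vector<uint64_t>& thread_addrs) {
        std::set<uint64_t> lines;
        for (uint64_t addr : thread_addrs) lines.insert(addr / kLineBytes);
        return lines;
    }

    int main() {
        std::vector<uint64_t> regular, irregular;
        for (int t = 0; t < kWarpSize; ++t) {
            regular.push_back(0x1000 + 4 * t);          // consecutive 4-byte words
            irregular.push_back(0x1000 + 4096ull * t);  // each thread hits a new line
        }
        std::cout << "regular warp:   " << coalesce(regular).size() << " transaction(s)\n";
        std::cout << "irregular warp: " << coalesce(irregular).size() << " transactions\n";
        return 0;
    }

In the regular case all 32 accesses fall into one 128-byte line; in the irregular case the same warp produces 32 separate transactions, which may be serviced by different partitions and return at different times.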
In this paper, we propose a warp-aware memory controller scheme, called Warped-MC, to mitigate memory latency divergence. Based on an in-depth analysis, we reveal that the memory latency divergence within a warp is mainly caused by GPU memory controllers. While the conventional FR-FCFS memory controller maximizes the effective bandwidth of DRAM channels, its scheduling policy can exacerbate the memory latency divergence of a warp. Warped-MC employs a warp-aware scheduling scheme to alleviate this divergence, thereby tackling the long tail of load warp execution time and improving the performance of memory-intensive applications. We implement Warped-MC on GPGPU-Sim configured with a modern GPU architecture, and our evaluation results show that Warped-MC improves the performance of memory-intensive applications by 8.9% on average, with a maximum of 45.8%.
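The scheduling idea can likewise be sketched in a few lines. The C++ fragment below is a simplified illustration under assumed structures (a Request record, a per-bank queue, and a per-warp outstanding-request counter), not the paper's actual Warped-MC algorithm: like FR-FCFS it prefers row hits, but among otherwise equal candidates it services the request whose warp has the fewest loads still pending, so a warp's last straggling transaction is not starved by row hits from other warps.

    // Hypothetical, simplified per-bank scheduler: FR-FCFS extended with a
    // warp-aware tie-break. Not the paper's exact policy.
    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    struct Request {
        uint64_t row;      // DRAM row targeted by this request
        int warp_id;       // warp that issued the load
        uint64_t arrival;  // arrival order, for FCFS tie-breaking
    };

    struct WarpAwareScheduler {
        std::deque<Request> queue;                 // pending requests for one bank
        std::unordered_map<int, int> outstanding;  // warp_id -> requests in flight

        // Choose the next request given the bank's currently open row.
        std::deque<Request>::iterator pick(uint64_t open_row) {
            auto best = queue.end();
            for (auto it = queue.begin(); it != queue.end(); ++it) {
                if (best == queue.end()) { best = it; continue; }
                bool hit = (it->row == open_row), best_hit = (best->row == open_row);
                if (hit != best_hit) {
                    if (hit) best = it;               // FR: row hits go first
                } else if (outstanding[it->warp_id] != outstanding[best->warp_id]) {
                    if (outstanding[it->warp_id] < outstanding[best->warp_id])
                        best = it;                    // warp-aware: finish warps sooner
                } else if (it->arrival < best->arrival) {
                    best = it;                        // FCFS among remaining ties
                }
            }
            return best;
        }
    };

Servicing the request that completes a warp first shortens the slowest transaction that the warp must wait on, which is the long-tail effect the abstract targets.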

Published In

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
August 2023
858 pages
ISBN: 9798400708435
DOI: 10.1145/3605573

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 September 2023

Author Tags

  1. GPU Architecture
  2. Memory Controller
  3. Memory System

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2023
ICPP 2023: 52nd International Conference on Parallel Processing
August 7 - 10, 2023
Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
