Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3307650.3322230acmconferencesArticle/Chapter ViewAbstractPublication PagesiscaConference Proceedingsconference-collections
research-article
Public Access

MGPUSim: enabling multi-GPU performance modeling and optimization

Published: 22 June 2019 Publication History

Abstract

The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next-generation multi-GPU system designs.
In this work, we present MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator, based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with in-built support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from the actual GPU hardware. We also achieve a 3.5× and a 2.5× average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation.
We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to both avoid the complexity of multi-GPU programming, while precisely controlling data placement in the multi-GPU memory. In the second design study, we propose <u>P</u>rogressive P<u>a</u>ge <u>S</u>plitting M<u>i</u>gration (PASI), a customized multi-GPU memory management system enabling the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6× (geometric mean), and PASI can improve the system performance by 2.6× (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.

References

[1]
AMD. 2015. AMD Radeon R9 Series Gaming Graphics Cards with High-Bandwidth Memory.
[2]
AMD. 2016. Graphics Core Next Architecture, Generation 3, Reference Guide. (2016).
[3]
AMD. 2017. AMD APP SDK 3.0 Getting Started. (2017).
[4]
AMD. 2017. Vega Instruction Set Architecture, Reference Guide. (2017).
[5]
AMD. 2018. Radeon Compute Profiler. https://github.com/GPUOpen-Tools/RCP
[6]
AMD. 2018. Radeon Instinct MI60 Accelerator. https://www.amd.com/en/products/professional-graphics/instinct-mi60
[7]
Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and David Nellans. 2017. MCM-GPU: Multi-chip-module GPUs for continued performance scalability. ACM SIGARCH Computer Architecture News 45, 2 (2017), 320--332.
[8]
Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J Rossbach, and Onur Mutlu. 2017. Mosaic: a GPU memory manager with application-transparent support for multiple page sizes. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 136--150.
[9]
Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J Rossbach, and Onur Mutlu. 2018. Mask: Redesigning the gpu memory hierarchy to support multi-application concurrency. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 503--518.
[10]
Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on. IEEE, 163--174.
[11]
Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678 (2016).
[12]
Cen Chen, Kenli Li, Aijia Ouyang, Zhuo Tang, and Keqin Li. 2017. Gpu-accelerated parallel hierarchical extreme learning machine on flink for big data. IEEE Transactions on Systems, Man, and Cybernetics: Systems 47, 10 (2017), 2740--2753.
[13]
Chia Chen Chou, Aamer Jaleel, and Moinuddin K Qureshi. 2014. Cameo: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 1--12.
[14]
Sylvain Collange, David Defour, and David Parello. 2009. Barra, a parallel functional GPGPU simulator. (2009).
[15]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee, 248--255.
[16]
Shi Dong, Xiang Gong, Yifan Sun, Trinayan Baruah, and David Kaeli. 2018. Characterizing the microarchitectural implications of a convolutional neural network (cnn) execution on gpus. In Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering. ACM, 96--106.
[17]
Richard M Fujimoto. 1990. Parallel discrete event simulation. Commun. ACM 33, 10 (1990), 30--53.
[18]
Nitin A Gawande, Jeff A Daily, Charles Siegel, Nathan R Tallent, and Abhinav Vishnu. 2018. Scaling deep learning workloads: NVIDIA DGX-1/Pascal and intel knights landing. Future Generation Computer Systems (2018).
[19]
Anthony Gutierrez, Bradford M Beckmann, Alexandru Dutu, Joseph Gross, Michael LeBeane, John Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, et al. 2018. Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on. IEEE, 608--619.
[20]
Wen-mei Hwu. 2015. Heterogeneous System Architecture: A new compute platform infrastructure. Morgan Kaufmann.
[21]
CL Jermain, GE Rowlands, RA Buhrman, and DC Ralph. 2016. GPU-accelerated micromagnetic simulations using cloud computing. Journal of Magnetism and Magnetic Materials 401 (2016), 320--322.
[22]
Hai Jiang, Yi Chen, Zhi Qiao, Tien-Hsiung Weng, and Kuan-Ching Li. 2015. Scaling up MapReduce-based big data processing on multi-GPU systems. Cluster Computing 18, 1 (2015), 369--383.
[23]
Adwait Jog, Onur Kayiran, Tuba Kesten, Ashutosh Pattnaik, Evgeny Bolotin, Niladrish Chatterjee, Stephen W Keckler, Mahmut T Kandemir, and Chita R Das. 2015. Anatomy of gpu memory system for multi-application execution. In Proceedings of the 2015 International Symposium on Memory Systems. ACM, 223--234.
[24]
David Kanter. 2015. Graphics processing requirements for enabling immersive vr. AMD White Paper (2015).
[25]
Gwangsun Kim, Minseok Lee, Jiyun Jeong, and John Kim. 2014. Multi-GPU system design with memory networks. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on. IEEE, 484--495.
[26]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097--1105.
[27]
Michael LeBeane, Khaled Hamidouche, Brad Benton, Mauricio Breternitz, Steven K Reinhardt, and Lizy K John. 2017. GPU triggered networking for intra-kernel communications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 22.
[28]
Michael LeBeane, Brandon Potter, Abhisek Pan, Alexandru Dutu, Vinay Agarwala, Wonchan Lee, Deepak Majeti, Bibek Ghimire, Eric Van Tassell, Samuel Wasmundt, et al. 2016. Extended task queuing: Active messages for heterogeneous systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 80.
[29]
Sangpil Lee and Won Woo Ro. 2013. Parallel GPU architecture simulation framework exploiting work allocation unit parallelism. In Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. IEEE, 107--117.
[30]
Sangpil Lee and Won Woo Ro. 2016. Parallel gpu architecture simulation framework exploiting architectural-level parallelism with timing error prediction. IEEE Trans. Comput. 4 (2016), 1253--1265.
[31]
Chen Li, Rachata Ausavarungnirun, Christopher J. Rossbach, Youtao Zhang, Onur Mutlu, Yang Guo, and Jun Yang. 2019. A Framework for Memory Oversubscription Management in Graphics Processing Units. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '19). ACM, New York, NY, USA, 49--63.
[32]
Geetika Malhotra, Seep Goel, and Smruti R Sarangi. 2014. Gputejas: A parallel simulator for gpu architectures. In High Performance Computing (HiPC), 2014 21st International Conference on. IEEE, 1--10.
[33]
Mitesh R Meswani, Gabriel H Loh, Sergey Blagodurov, David Roberts, John Slice, and Mike Ignatowski. 2014. Toward efficient programmer-managed two-level memory hierarchies in exascale computers. In 2014 Hardware-Software Co-Design for High Performance Computing. IEEE, 9--16.
[34]
Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, Aamer Jaleel, Alex Ramirez, and David Nellans. 2017. Beyond the socket: NUMA-aware GPUs. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 123--135.
[35]
Saiful Mojumder, Marcia Louis, Yifan Sun, Amir Kavyan Ziabari, José L. Abellán, John Kim, David Kaeli, and Ajay Joshi. 2018. Profiling DNN Workloads on a Volta-based DGX-1 System. In Proceedings of the International Symposium on Workload Characterization (IISWC'18). IEEE.
[36]
NVIDIA. 2010. CUDA Programming guide.
[37]
NVIDIA. 2018. Developing a Linux Kernel Module using GPUDirect RDMA. (2018).
[38]
NVIDIA. 2018. NVIDIA DGX-2. (2018). https://www.nvidia.com/en-us/data-center/dgx-2/
[39]
NVIDIA. 2018. NVIDIA TITAN RTX. https://www.nvidia.com/en-us/titan/titan-rtx/
[40]
Open Source Initiative. {n. d.}. The MIT Licence.
[41]
Rajat Raina, Anand Madhavan, and Andrew Y Ng. 2009. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th annual international conference on machine learning. ACM, 873--880.
[42]
Ahmed Sanaullah, Saiful A Mojumder, Kathleen M Lewis, and Martin C Herbordt. 2016. GPU-accelerated charge mapping. In HPEC. 1--7.
[43]
Tom Sercu, Christian Puhrsch, Brian Kingsbury, and Yann LeCun. 2016. Very deep multilingual convolutional neural networks for LVCSR. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 4955--4959.
[44]
Daniel J Sorin, Mark D Hill, and David A Wood. 2011. A primer on memory consistency and cache coherence. Synthesis Lectures on Computer Architecture 6, 3 (2011), 1--212.
[45]
John E Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in science & engineering 12, 3 (2010), 66--73.
[46]
Yifan Sun, Xiang Gong, Amir Kavyan Ziabari, Leiming Yu, Xiangyu Li, Saoni Mukherjee, Carter McCardwell, Alejandro Villegas, and David Kaeli. 2016. Heteromark, a benchmark suite for CPU-GPU collaborative computing. In 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 1--10.
[47]
Yifan Sun, Saoni Mukherjee, Trinayan Baruah, Shi Dong, Julian Gutierrez, Prannoy Mohan, and David Kaeli. 2018. Evaluating performance tradeoffs on the radeon open compute platform. In 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 209--218.
[48]
The Go Authors. 2009. Effective Go. (2009).
[49]
Josep Torrellas. 1999. Cache-Only Memory Architecture. In IEEE Computer Magazine.
[50]
Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. 2012. Multi2Sim: a simulation framework for CPU-GPU computing. In Parallel Architectures and Compilation Techniques (PACT), 2012 21st International Conference on. IEEE, 335--344.
[51]
Jan Vesely, Arkaprava Basu, Abhishek Bhattacharjee, Gabriel H Loh, Mark Oskin, and Steven K Reinhardt. 2018. Generic system calls for GPUs. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 843--856.
[52]
Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B Gibbons, and Onur Mutlu. 2018. The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs. ISCA.
[53]
Jun Wang, Eric Papenhausen, Bing Wang, Sungsoo Ha, Alla Zelenyuk, and Klaus Mueller. 2017. Progressive clustering of big data with GPU acceleration and visualization. In Scientific Data Summit (NYSDS), 2017 New York. IEEE, 1--9.
[54]
Siyue Wang, Xiao Wang, Shaokai Ye, Pu Zhao, and Xue Lin. 2018. Defending DNN Adversarial Attacks with Pruning and Logits Augmentation. In 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 1144--1148.
[55]
James Whitney, Chandler Giford, and Maria Pantoja. 2018. Distributed execution of communicating sequential process-style concurrency: Golang case study. The Journal of Supercomputing (2018), 1--14.
[56]
Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, and Gang Sun. 2015. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876 (2015).
[57]
Vinson Young, Aamer Jaleel, Evgeny Bolotin, Eiman Ebrahimi, David Nellans, and Villa Oreste. 2018. Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture. ACM.
[58]
Amir Kavyan Ziabari, Yifan Sun, Yenai Ma, Dana Schaa, José L Abellán, Rafael Ubal, John Kim, Ajay Joshi, and David Kaeli. 2016. UMH: A hardware-based unified memory hierarchy for systems with multiple discrete GPUs. ACM Transactions on Architecture and Code Optimization (TACO) 13, 4 (2016), 35.

Cited By

View all
  • (2024)Revealing Hidden Features of Chaotic Systems Using High-Performance Bifurcation Analysis Tools Based on CUDA TechnologyInternational Journal of Bifurcation and Chaos10.1142/S0218127424501347Online publication date: 27-Aug-2024
  • (2024)LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00082(1080-1096)Online publication date: 29-Jun-2024
  • (2024)GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00085(1080-1094)Online publication date: 2-Mar-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
June 2019
849 pages
ISBN:9781450366694
DOI:10.1145/3307650
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 June 2019

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. memory management
  2. multi-GPU systems
  3. simulation

Qualifiers

  • Research-article

Funding Sources

Conference

ISCA '19
Sponsor:

Acceptance Rates

ISCA '19 Paper Acceptance Rate 62 of 365 submissions, 17%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%

Upcoming Conference

ISCA '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1,355
  • Downloads (Last 6 weeks)209
Reflects downloads up to 29 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Revealing Hidden Features of Chaotic Systems Using High-Performance Bifurcation Analysis Tools Based on CUDA TechnologyInternational Journal of Bifurcation and Chaos10.1142/S0218127424501347Online publication date: 27-Aug-2024
  • (2024)LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00082(1080-1096)Online publication date: 29-Jun-2024
  • (2024)GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00085(1080-1094)Online publication date: 2-Mar-2024
  • (2024)Supporting Secure Multi-GPU Computing with Dynamic and Batched Metadata Management2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00025(204-217)Online publication date: 2-Mar-2024
  • (2024)Nearest data processing in GPUSustainable Computing: Informatics and Systems10.1016/j.suscom.2024.10104744(101047)Online publication date: Dec-2024
  • (2024)Performance analysis and modeling for quantum computing simulation on distributed GPU platformsQuantum Information Processing10.1007/s11128-024-04580-x23:11Online publication date: 6-Nov-2024
  • (2023)Efficient Simulation of Volumetric Deformable Objects in Unity3D: GPU-Accelerated Position-Based DynamicsElectronics10.3390/electronics1210222912:10(2229)Online publication date: 14-May-2023
  • (2023)Photon: A Fine-grained Sampled Simulation Methodology for GPU WorkloadsProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3613424.3623773(1227-1241)Online publication date: 28-Oct-2023
  • (2023)GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic EncryptionProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3613424.3614279(670-684)Online publication date: 28-Oct-2023
  • (2023)Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN WorkloadsProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3613424.3614277(380-394)Online publication date: 28-Oct-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media