Dymaxion: optimizing memory access patterns for heterogeneous systems

Authors:

Jeremy W. Sheaffer,

Kevin Skadron

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 13, Pages 1 - 11

https://doi.org/10.1145/2063384.2063401

Published: 12 November 2011

Abstract

Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal performance for programs designed with a CPU memory interface (or no particular memory interface at all!) in mind. This implies that application performance is highly sensitive to irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems.

In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA's CUDA API with the addition of memory layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfer. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we achieve a 3.3x speedup on GPU kernels and a 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.
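The two mechanisms named in the abstract, layout remapping and index transformation, can be illustrated with a small host-side sketch. This is not Dymaxion's API: the function names `remap_row_to_column_major` and `transform_index` are hypothetical, the sketch is written in plain Python rather than CUDA, and it assumes the simplest case of converting a flat row-major matrix to column-major order. It leaves out the chunking and PCI-E overlap that the paper uses to hide the remapping cost.

```python
def remap_row_to_column_major(data, rows, cols):
    """Return a copy of a flat row-major array, reordered column-major.

    Hypothetical helper for illustration; it stands in for the
    "memory layout remapping" step described in the abstract.
    """
    out = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # Element (r, c) moves from r*cols + c to c*rows + r.
            out[c * rows + r] = data[r * cols + c]
    return out


def transform_index(index, rows, cols):
    """Map a flat row-major index to its position in the remapped copy.

    Hypothetical helper; it stands in for the "index transformation"
    that lets existing code keep using its original indices.
    """
    r, c = divmod(index, cols)
    return c * rows + r


rows, cols = 2, 3
original = [0, 1, 2, 3, 4, 5]  # logical 2x3 matrix stored row-major
remapped = remap_row_to_column_major(original, rows, cols)

# Code written against the original layout reads the same values
# through the transformed index.
recovered = [remapped[transform_index(i, rows, cols)] for i in range(rows * cols)]
```

In this sketch the calling code still reasons in terms of the original row-major index, and only the lookup goes through `transform_index`. That is consistent with the abstract's claim that existing programs need only minimal modification.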

Index Terms

1. Hardware
  1. Communication hardware, interfaces and storage

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN:9781450307710

DOI:10.1145/2063384

Conference Chair: Scott Lathrop, University of Chicago

Program Chairs: Jim Costa, Sandia National Laboratories; William Kramer, National Center for Supercomputing Applications

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Division of Computer and Network Systems

Conference

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Seattle, Washington

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
