DOI: 10.1145/2063384.2063401
research-article

Dymaxion: optimizing memory access patterns for heterogeneous systems

Published: 12 November 2011

Abstract

Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal performance for programs designed with a CPU memory interface (or no particular memory interface at all!) in mind. This implies that application performance is highly sensitive to irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems.
In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA's CUDA API with the addition of memory layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfer. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we are able to achieve a 3.3x speedup on GPU kernels and a 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.
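
As a rough illustration of the two ideas named in the abstract, layout remapping and index transformation, the sketch below converts a flat row-major array to column-major and provides the matching index function, so code can keep using logical (row, column) coordinates. The abstract does not show Dymaxion's API, so every name here (`remap_row_to_column_major`, `transform_index`, `remap_in_chunks`) is hypothetical, the row-to-column transform is only one assumed example of a remapping, and plain Python stands in for CUDA. The chunked variant only mimics the structure that would allow remapping to overlap with PCI-E transfer; it performs no transfer and no concurrency.

```python
def remap_row_to_column_major(data, rows, cols):
    """Copy a flat row-major array into a new flat column-major array."""
    assert len(data) == rows * cols
    out = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            out[c * rows + r] = data[r * cols + c]
    return out


def transform_index(r, c, rows):
    """Map a logical (row, col) coordinate to its offset in the remapped array."""
    return c * rows + r


def remap_in_chunks(data, rows, cols, chunk_rows):
    """Remap the same array one block of rows at a time.

    Hypothetical stand-in for chunking: in a real system each chunk could be
    remapped while another chunk is in flight over PCI-E. Here the chunks are
    simply processed one after another.
    """
    assert len(data) == rows * cols and chunk_rows > 0
    out = [None] * (rows * cols)
    for start in range(0, rows, chunk_rows):
        stop = min(start + chunk_rows, rows)
        for r in range(start, stop):
            for c in range(cols):
                out[transform_index(r, c, rows)] = data[r * cols + c]
    return out


rows, cols = 2, 3
row_major = [0, 1, 2, 3, 4, 5]  # logical matrix [[0, 1, 2], [3, 4, 5]]
col_major = remap_row_to_column_major(row_major, rows, cols)
chunked = remap_in_chunks(row_major, rows, cols, chunk_rows=1)
```

With this layout, workers that each own one row and step through the columns together touch adjacent offsets at every step, which is the kind of access pattern GPU memory systems can coalesce. Whether a remapping pays off depends on its cost relative to the kernel time saved, which is why the abstract stresses hiding that cost behind the transfer.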

    Published In

    SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2011
    866 pages
    ISBN: 9781450307710
    DOI: 10.1145/2063384
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPGPU
    2. heterogeneous computer architectures
    3. latency hiding
    4. memory access and data layout

    Acceptance Rates

    SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Cited By

    • (2024) Using Evolutionary Algorithms to Find Cache-Friendly Generalized Morton Layouts for Arrays. Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, pp. 83-94. DOI: 10.1145/3629526.3645034. Online publication date: 7-May-2024
    • (2023) Extension VM: Interleaved Data Layout in Vector Memory. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3631528. Online publication date: 7-Nov-2023
    • (2022) SHAPE. Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9. DOI: 10.1145/3508352.3549409. Online publication date: 30-Oct-2022
    • (2022) Software-defined address mapping: a case on 3D memory. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 70-83. DOI: 10.1145/3503222.3507774. Online publication date: 28-Feb-2022
    • (2021) HAM: Hotspot-Aware Manager for Improving Communications With 3D-Stacked Memory. IEEE Transactions on Computers, 70:6, pp. 833-848. DOI: 10.1109/TC.2021.3066982. Online publication date: 1-Jun-2021
    • (2021) Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 139-152. DOI: 10.1109/ISCA52012.2021.00020. Online publication date: 14-Jun-2021
    • (2020) Evaluating Gather and Scatter Performance on CPUs and GPUs. Proceedings of the International Symposium on Memory Systems, pp. 209-222. DOI: 10.1145/3422575.3422794. Online publication date: 28-Sep-2020
    • (2020) Fulcrum: A Simplified Control and Access Mechanism Toward Flexible and Practical In-Situ Accelerators. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 556-569. DOI: 10.1109/HPCA47549.2020.00052. Online publication date: Feb-2020
    • (2019) MAC. Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3337821.3337867. Online publication date: 5-Aug-2019
    • (2019) qCUDA: GPGPU Virtualization for High Bandwidth Efficiency. 2019 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pp. 95-102. DOI: 10.1109/CloudCom.2019.00025. Online publication date: Dec-2019
