DOI: 10.1145/2503210.2503268

A large-scale cross-architecture evaluation of thread-coarsening

Published: 17 November 2013

Abstract

OpenCL has become the de facto data-parallel programming model for parallel devices in today's high-performance supercomputers. OpenCL was designed to guarantee program portability across hardware from different vendors. However, achieving good performance is hard: it requires manual tuning of the program and expert knowledge of each target device.
In this paper we consider a data-parallel compiler transformation, thread-coarsening, and evaluate its effects across a range of devices by developing a source-to-source OpenCL compiler based on LLVM. We thoroughly evaluate this transformation on 17 benchmarks and five platforms with different coarsening parameters, giving over 43,000 different experiments. We achieve speedups of over 9x on individual applications and average speedups ranging from 1.15x on the Nvidia Kepler GPU to 1.50x on the AMD Cypress GPU. Finally, we use statistical regression to analyse and explain program performance in terms of hardware-based performance counters.
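
To make the transformation concrete, here is a minimal Python sketch of what thread-coarsening does to a kernel. It is not the paper's LLVM-based compiler. It simulates an OpenCL-style launch by calling a kernel once per work-item id; the coarsened version merges the work of `factor` consecutive work-items into one and is launched with proportionally fewer work-items. The kernel (`axpy`), the helper names, and the consecutive index mapping are all illustrative assumptions; a real implementation may map indices differently.

```python
def launch(kernel, global_size, *args):
    """Simulate an OpenCL-style launch: run the kernel once per work-item id."""
    for gid in range(global_size):
        kernel(gid, *args)


def axpy(gid, a, x, y, out):
    # Original kernel: each work-item computes exactly one output element.
    out[gid] = a * x[gid] + y[gid]


def axpy_coarsened(gid, factor, a, x, y, out):
    # Coarsened kernel: one work-item does the work of `factor` original
    # work-items. The consecutive index mapping is an assumption made here.
    for k in range(factor):
        i = gid * factor + k
        out[i] = a * x[i] + y[i]


def run_original(a, x, y):
    out = [0] * len(x)
    launch(axpy, len(x), a, x, y, out)
    return out


def run_coarsened(a, x, y, factor):
    if len(x) % factor != 0:
        raise ValueError("problem size must be a multiple of the coarsening factor")
    out = [0] * len(x)
    # Fewer work-items are launched, each doing more work.
    launch(axpy_coarsened, len(x) // factor, factor, a, x, y, out)
    return out


if __name__ == "__main__":
    x = list(range(16))
    y = [1] * 16
    # Coarsening changes how work is distributed, not the result.
    assert run_original(2, x, y) == run_coarsened(2, x, y, factor=4)
```

Both versions compute the same output. What changes is the number of work-items launched and the amount of work each one performs, which is the kind of trade-off a coarsening parameter controls.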

    Published In

    SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
    November 2013
    1123 pages
    ISBN: 9781450323789
    DOI: 10.1145/2503210
    • General Chair: William Gropp
    • Program Chair: Satoshi Matsuoka

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. GPU
    2. OpenCL
    3. regression trees
    4. thread coarsening

    Qualifiers

    • Research-article

    Conference

    SC13
    Acceptance Rates

    SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 14
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 21 Sep 2024


    Cited By

    • (2024) Retargeting and Respecializing GPU Workloads for Performance Portability. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 119-132. DOI: 10.1109/CGO57630.2024.10444828. Online publication date: 2-Mar-2024.
    • (2023) Transpilers: A Systematic Mapping Review of Their Usage in Research and Industry. Applied Sciences 13:6 (3667). DOI: 10.3390/app13063667. Online publication date: 13-Mar-2023.
    • (2022) Advanced Tutorial. Proceedings of the Winter Simulation Conference, pp. 268-282. DOI: 10.5555/3586210.3586232. Online publication date: 11-Dec-2022.
    • (2022) Packet Processing Algorithm Identification using Program Embeddings. Proceedings of the 6th Asia-Pacific Workshop on Networking, pp. 76-82. DOI: 10.1145/3542637.3542649. Online publication date: 1-Jul-2022.
    • (2022) Advanced Tutorial: Parallel and Distributed Methods for Scalable Discrete Simulation. 2022 Winter Simulation Conference (WSC), pp. 268-282. DOI: 10.1109/WSC57314.2022.10015291. Online publication date: 11-Dec-2022.
    • (2022) Optimal Launch Bound Selection in CPU-GPU Hybrid Graph Applications with Deep Learning. 2022 IEEE 13th International Green and Sustainable Computing Conference (IGSC), pp. 1-7. DOI: 10.1109/IGSC55832.2022.9969364. Online publication date: 24-Oct-2022.
    • (2021) Highly Concurrent Latency-tolerant Register Files for GPUs. ACM Transactions on Computer Systems 37:1-4, pp. 1-36. DOI: 10.1145/3419973. Online publication date: 4-Jan-2021.
    • (2021) Achieving diverse redundancy for GPU Kernels. IEEE Transactions on Emerging Topics in Computing, pp. 1-1. DOI: 10.1109/TETC.2021.3101922. Online publication date: 2021.
    • (2021) Exploring Thread Coarsening on FPGA. 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 436-441. DOI: 10.1109/HiPC53243.2021.00062. Online publication date: Dec-2021.
    • (2021) Efficient execution of graph algorithms on CPU with SIMD extensions. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 262-276. DOI: 10.1109/CGO51591.2021.9370326. Online publication date: 27-Feb-2021.
