research-article

Architecture-aware optimization targeting multithreaded stream computing

Authors:

Byunghyun Jang,

David Kaeli

GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units

Pages 62 - 70

https://doi.org/10.1145/1513895.1513903

Published: 08 March 2009

Abstract

Optimizing program execution targeted for Graphics Processing Units (GPUs) can be very challenging. Efficiently mapping serial code to a GPU or stream processing platform is time-consuming and is greatly hampered by a lack of detail about the underlying hardware. Programmers are left to resort to trial and error to produce optimized code.

The recent publication of the underlying instruction set architecture (ISA) of the AMD/ATI GPU has allowed researchers to begin to propose aggressive optimizations. In this work, we present an optimization methodology that utilizes this information to accelerate programs on AMD/ATI GPUs. We first define the optimization spaces that guide our work. We then start from disassembled machine code and collect program statistics provided by the AMD Graphics Shader Analyzer (GSA) profiling toolset. We explore optimizations targeting three different computing resources: 1) ALUs, 2) fetch bandwidth, and 3) thread usage, and present optimization techniques that consider how to better utilize each resource.

We demonstrate the effectiveness of our proposed optimization approach on an AMD Radeon HD3870 GPU using the Brook+ stream programming language. We describe our optimizations using two commonly used GPGPU applications that present very different program characteristics and optimization spaces: matrix multiplication and back-projection for medical image reconstruction. Our results show that optimized code can improve performance by 1.45x to 6.7x compared to unoptimized code run on the same GPU platform. Our optimized implementations achieve speedups of 882x (matrix multiply) and 19x (back-projection) over serial implementations run on a 2.66 GHz Intel Core 2 Duo with 2 GB of main memory.
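
To make the trade-off between ALU work and fetch bandwidth concrete, the following is a minimal sketch in Python, not the paper's Brook+ kernels. It assumes a hypothetical scheme in which each thread computes a small tile of the output matrix and reuses every fetched value across that tile; the names `matmul_tiled` and `alu_fetch_ratio` are illustrative and do not come from the paper. The code simply counts fetches and multiply-adds on the CPU to show how the ALU-to-fetch ratio changes with the tile shape.

```python
"""Illustrative model only: counts memory fetches and multiply-adds for a
matrix multiply in which each (hypothetical) thread computes a tile of
outputs. This is not the paper's Brook+ kernel."""


def matmul_tiled(a, b, tile_rows, tile_cols):
    """Multiply a (n x k) by b (k x m), one modeled 'thread' per output tile.

    Returns (product, fetch_count, mad_count).
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    fetches = 0
    mads = 0
    for i0 in range(0, n, tile_rows):
        for j0 in range(0, m, tile_cols):
            rows = range(i0, min(i0 + tile_rows, n))
            cols = range(j0, min(j0 + tile_cols, m))
            for p in range(k):
                # Fetch a slice of column p of a and a slice of row p of b,
                # then reuse each fetched value across the whole tile.
                a_vals = [a[i][p] for i in rows]
                b_vals = [b[p][j] for j in cols]
                fetches += len(a_vals) + len(b_vals)
                for ti, i in enumerate(rows):
                    for tj, j in enumerate(cols):
                        c[i][j] += a_vals[ti] * b_vals[tj]
                        mads += 1
    return c, fetches, mads


def alu_fetch_ratio(n, tile_rows, tile_cols):
    """Multiply-adds per fetch for an n x n product with the given tile."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    _, fetches, mads = matmul_tiled(a, b, tile_rows, tile_cols)
    return mads / fetches


if __name__ == "__main__":
    for tile in (1, 2, 4):
        print(tile, alu_fetch_ratio(8, tile, tile))
```

In this model a tile of r by c outputs needs r + c fetches for every r * c multiply-adds, so for 8 x 8 inputs the ratio is 0.5 for a 1 x 1 tile, 1.0 for 2 x 2, and 2.0 for 4 x 4. Larger tiles would also raise per-thread register use, which is the kind of interaction between fetch bandwidth and thread usage that an optimization space has to weigh; the model above does not attempt to capture that cost.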

Index Terms

Architecture-aware optimization targeting multithreaded stream computing
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Published In

GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units

March 2009

107 pages

ISBN:9781605585178

DOI:10.1145/1513895

Conference Chairs: David Kaeli (Northeastern University), Miriam Leeser (Northeastern University)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

GPGPU '09: Second Workshop on General-Purpose Computation on Graphics Processing Units

March 8, 2009

Washington, D.C., USA

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%
