research-article

General transformations for GPU execution of tree traversals

Authors:

Michael Goldfarb,

Milind Kulkarni

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 10, Pages 1 - 12

https://doi.org/10.1145/2503210.2503223

Published: 17 November 2013

Abstract

With the advent of programmer-friendly GPU computing environments, there has been much interest in offloading workloads that can exploit the high degree of parallelism available on modern GPUs. Exploiting this parallelism and optimizing for the GPU memory hierarchy are well understood for regular applications that operate on dense data structures such as arrays and matrices. However, there has been significantly less work on irregular algorithms, and less still when pointer-based dynamic data structures are involved. Recently, irregular algorithms such as Barnes-Hut and kd-tree traversals have been implemented on GPUs, yielding significant performance gains over CPU implementations. However, these implementations often rely on application-specific semantics to achieve acceptable performance. We argue that there are general-purpose techniques for implementing irregular algorithms on GPUs that exploit similarities in algorithmic structure rather than application-specific knowledge. We demonstrate these techniques on several tree traversal algorithms, achieving speedups of up to 38x over 32-thread CPU versions.

Index Terms

General transformations for GPU execution of tree traversals
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Information & Contributors

Information

Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2013

1123 pages

ISBN: 9781450323789

DOI: 10.1145/2503210

General Chair: William Gropp, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Satoshi Matsuoka, Tokyo Institute of Technology, Tokyo, Japan

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC13

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC13: International Conference for High Performance Computing, Networking, Storage and Analysis

November 17 - 21, 2013

Denver, Colorado

Acceptance Rates

SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 43
Total Downloads: 537
Downloads (Last 12 months): 23
Downloads (Last 6 weeks): 2

Reflects downloads up to 14 Feb 2025

Citations

Cited By

Koparkar C, Singhal V, Gupta A, Rainey M, Vollmer M, Pelenitsyn A, Tobin-Hochstadt S, Kulkarni M, Newton R, Bond M, Lee J, Payer H (2024). Garbage Collection for Mostly Serialized Heaps. Proceedings of the 2024 ACM SIGPLAN International Symposium on Memory Management, pages 1-14. Online publication date: 20-Jun-2024.
https://dl.acm.org/doi/10.1145/3652024.3665512
Mandarapu D, Nagarajan V, Pelenitsyn A, Kulkarni M (2024). Arkade: k-Nearest Neighbor Search With Non-Euclidean Distances using GPU Ray Tracing. Proceedings of the 38th ACM International Conference on Supercomputing, pages 14-25. Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3650200.3656601
Chowdhury A, Kesserwani G, Rougé C, Richmond P (2023). GPU-parallelisation of Haar wavelet-based grid resolution adaptation for fast finite volume modelling: application to shallow water flows. Journal of Hydroinformatics, 25:4, pages 1210-1234. Online publication date: 16-Jun-2023.
https://doi.org/10.2166/hydro.2023.154
Hijma P, Heldens S, Sclocco A, van Werkhoven B, Bal H (2023). Optimization Techniques for GPU Programming. ACM Computing Surveys, 55:11, pages 1-81. Online publication date: 16-Mar-2023.
https://dl.acm.org/doi/10.1145/3570638
Nagarajan V, Kulkarni M (2023). RT-DBSCAN: Accelerating DBSCAN using Ray Tracing Hardware. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 963-973. Online publication date: May-2023.
https://doi.org/10.1109/IPDPS54959.2023.00100
Shah M, Neff R, Wu H, Minutoli M, Tumeo A, Becchi M (2022). Accelerating Random Forest Classification on GPU and FPGA. Proceedings of the 51st International Conference on Parallel Processing, pages 1-11. Online publication date: 29-Aug-2022.
https://dl.acm.org/doi/10.1145/3545008.3545067
Iannetta P, Gonnord L, Radanne G, Tilevich E, De Roover C (2021). Compiling pattern matching to in-place modifications. Proceedings of the 20th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, pages 123-129. Online publication date: 17-Oct-2021.
https://dl.acm.org/doi/10.1145/3486609.3487204
Koparkar C, Rainey M, Vollmer M, Kulkarni M, Newton R (2021). Efficient tree-traversals: reconciling parallelism and dense data representations. Proceedings of the ACM on Programming Languages, 5:ICFP, pages 1-29. Online publication date: 19-Aug-2021.
https://dl.acm.org/doi/10.1145/3473596
Zhao W, Wang W, Wang Q (2021). Optimization of cosmological N-body simulation with FMM-PM on SIMT accelerators. The Journal of Supercomputing, 78:5, pages 7186-7205. Online publication date: 5-Nov-2021.
https://doi.org/10.1007/s11227-021-04153-0
Li M, Hawrylak P, Hale J (2020). Implementing an Attack Graph Generator in CUDA. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 730-738. Online publication date: May-2020.
https://doi.org/10.1109/IPDPSW50202.2020.00128
