DOI: 10.1109/CGO51591.2021.9370321
research-article

Compiling Graph Applications for GPUs with GraphIt

Published: 17 September 2021 Publication History

Abstract

The performance of graph programs depends heavily on the algorithm, the size and structure of the input graphs, and the features of the underlying hardware. No single set of optimizations or hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which optimizations and hardware platform to use. The GraphIt programming language makes it easy for the programmer to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high-performance code for GPUs. Programmers must resort to re-implementing the entire algorithm from scratch in a low-level language with an entirely different set of abstractions and optimizations in order to achieve high performance on GPUs.
We propose G2, an extension to the GraphIt compiler framework, that achieves high performance on both CPUs and GPUs using the same algorithm specification. G2 significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining load balancing, edge traversal direction, active vertexset creation, active vertexset processing ordering, and kernel fusion optimizations. G2 also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balancing by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate G2 on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11× speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments.
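The two optimizations named in the abstract can be illustrated with a small host-side sketch. This is not G2's implementation. It assumes that ETWC carves a vertex's edge list into CTA-sized, warp-sized, and per-thread pieces, and that EdgeBlocking groups edges by the destination-vertex range they touch so that each group's random accesses stay within a cache-sized window. The names `etwc_partition` and `edge_blocking`, the CTA size of 256, and the choice to block on destinations are all hypothetical.

```python
from typing import List, Tuple

WARP_SIZE = 32   # threads per warp on NVIDIA GPUs
CTA_SIZE = 256   # assumed threads per CTA (thread block); illustrative only


def etwc_partition(degree: int,
                   cta_size: int = CTA_SIZE,
                   warp_size: int = WARP_SIZE) -> Tuple[int, int, int]:
    """Split one vertex's edge count into (cta_edges, warp_edges, thread_edges).

    Assumption: full CTA-sized chunks go to a whole CTA, full warp-sized
    chunks of the remainder go to a warp, and the leftover edges stay
    with a single thread.
    """
    cta_edges = (degree // cta_size) * cta_size
    rest = degree - cta_edges
    warp_edges = (rest // warp_size) * warp_size
    thread_edges = rest - warp_edges
    return cta_edges, warp_edges, thread_edges


def edge_blocking(edges: List[Tuple[int, int]],
                  vertices_per_block: int) -> List[Tuple[int, int]]:
    """Reorder (src, dst) edges so edges hitting the same dst range are adjacent.

    Assumption: vertices_per_block is chosen so that the per-vertex data
    for one block fits in the L2 cache. Python's sort is stable, so the
    original order is kept inside each block.
    """
    return sorted(edges, key=lambda e: e[1] // vertices_per_block)


if __name__ == "__main__":
    print(etwc_partition(300))
    print(edge_blocking([(0, 5), (1, 0), (2, 7), (3, 1)], vertices_per_block=4))
```

In this sketch a vertex with 300 edges contributes one 256-edge chunk at CTA granularity, one 32-edge chunk at warp granularity, and 12 edges handled by a single thread. The reordered edge list visits destinations 0 to 3 before destinations 4 to 7.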

Published In

CGO '21: Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization
February 2021
395 pages
ISBN: 9781728186139
  • General Chair: Jae W. Lee

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. GPUs
  2. compiler optimizations
  3. domain-specific languages
  4. graph processing

Qualifiers

  • Research-article

Conference

CGO '21
CGO '21: 19th ACM/IEEE International Symposium on Code Generation and Optimization
February 27 - March 3, 2021
Virtual Event, Republic of Korea

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%
