DOI: 10.1145/3534678.3539177
Research article · Open access

Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs

Published: 14 August 2022

Abstract

Graph neural networks (GNNs) have shown great success in learning from graph-structured data. They are widely used in applications such as recommendation, fraud detection, and search. In these domains, the graphs are typically large and heterogeneous, containing many millions or billions of vertices and edges of different types. To tackle this challenge, we develop DistDGLv2, a system that extends DistDGL for training GNNs on massive heterogeneous graphs in a mini-batch fashion, using distributed hybrid CPU/GPU training. DistDGLv2 places graph data in distributed CPU memory and performs mini-batch computation on GPUs. For ease of use, DistDGLv2 adopts an API compatible with Deep Graph Library (DGL)'s mini-batch training and heterogeneous graph APIs, which enables distributed training with almost no code modification. To ensure model accuracy, DistDGLv2 follows a synchronous training approach and allows the ego-networks forming mini-batches to include non-local vertices. To ensure data locality and load balancing, DistDGLv2 partitions heterogeneous graphs using a multi-level partitioning algorithm with min-edge cut and multiple balancing constraints. DistDGLv2 deploys an asynchronous mini-batch generation pipeline that overlaps computation and data access to fully utilize all hardware (CPU, GPU, network, PCIe). We demonstrate DistDGLv2 on various GNN workloads. Our results show that DistDGLv2 achieves 2-3× speedup over DistDGL and 18× speedup over Euler. It takes only 5-10 seconds to complete an epoch on graphs with hundreds of millions of vertices on a cluster with 64 GPUs.
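
To make the "almost no code modification" claim concrete, below is a minimal sketch (not the authors' code) of the distributed mini-batch workflow the abstract describes, using DGL's public distributed API. The file and graph names ('ip_config.txt', 'mygraph', 'parts/'), the two-layer GraphSAGE model, and all hyperparameters (fanouts, batch size, 256 hidden units, 47 classes) are illustrative assumptions, not taken from the paper; exact loader class names also vary across DGL releases.

```python
import dgl
import dgl.nn as dglnn
import torch as th
import torch.nn as nn
import torch.nn.functional as F

class SAGE(nn.Module):
    """Two-layer GraphSAGE operating on sampled mini-batch blocks."""
    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        self.conv1 = dglnn.SAGEConv(in_feats, hidden, 'mean')
        self.conv2 = dglnn.SAGEConv(hidden, n_classes, 'mean')

    def forward(self, blocks, x):
        x = F.relu(self.conv1(blocks[0], x))
        return self.conv2(blocks[1], x)

# Offline step: partition the graph with METIS (min edge cut, multiple
# balance constraints), one partition per machine, e.g.:
#   dgl.distributed.partition_graph(g, 'mygraph', num_parts=4,
#       out_path='parts/', part_method='metis', balance_edges=True)

# On every trainer: attach to the graph held in distributed CPU memory.
# (MASTER_ADDR etc. are assumed to be set by the launch script.)
dgl.distributed.initialize('ip_config.txt')        # server/client setup
th.distributed.init_process_group(backend='gloo')  # gradient sync
g = dgl.distributed.DistGraph('mygraph')

# Sample 2-hop ego-networks; sampled blocks may include non-local vertices.
train_nids = dgl.distributed.node_split(g.ndata['train_mask'])
sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10])
# DistNodeDataLoader in recent DGL; older releases used NodeDataLoader.
loader = dgl.dataloading.DistNodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True)

dev = th.device('cuda')
model = th.nn.parallel.DistributedDataParallel(
    SAGE(g.ndata['feat'].shape[1], 256, 47).to(dev))  # 47 classes: placeholder
opt = th.optim.Adam(model.parameters(), lr=1e-3)

for input_nodes, seeds, blocks in loader:
    # Features live in distributed CPU memory; computation runs on the GPU.
    x = g.ndata['feat'][input_nodes].to(dev)
    y = g.ndata['label'][seeds].to(dev)
    blocks = [b.to(dev) for b in blocks]
    loss = F.cross_entropy(model(blocks, x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The same loop runs single-machine if the DistGraph is replaced by an in-memory DGLGraph, which is the compatibility property the abstract emphasizes.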

References

[1] Euler github. https://github.com/alibaba/euler, 2020.
[2] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[3] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, July 2019.
[4] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019.
[5] Swapnil Gandhi and Anand Padmanabha Iyer. P3: Distributed deep graph learning at scale. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 551--568. USENIX Association, July 2021.
[6] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 2017.
[7] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 1025--1035, 2017.
[8] Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. OGB-LSC: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021.
[9] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
[10] Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. Improving the accuracy, scalability, and performance of graph neural networks with Roc. In I. Dhillon, D. Papailiopoulos, and V. Sze, editors, Proceedings of Machine Learning and Systems, volume 2, pages 187--198, 2020.
[11] George Karypis and Kirk Schloegel. ParMETIS 4.0: Parallel graph partitioning and sparse matrix ordering library. Technical report, Department of Computer Science, University of Minnesota, 2011. http://www.cs.umn.edu/~metis.
[12] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359--392, 1998.
[13] George Karypis and Vipin Kumar. Multilevel algorithms for multi-constraint graph partitioning. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, pages 1--13, USA, 1998.
[14] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[15] Zhiqi Lin, Cheng Li, Youshan Miao, Yunxin Liu, and Yinlong Xu. PaGraph: Scaling GNN training on large graphs via computation-aware caching. In Proceedings of the 11th ACM Symposium on Cloud Computing, pages 401--415, 2020.
[16] Tianfeng Liu, Yangrui Chen, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, and Chuanxiong Guo. BGL: GPU-efficient GNN training by optimizing graph data I/O and preprocessing. arXiv preprint arXiv:2112.08541, 2021.
[17] Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and Yafei Dai. NeuGraph: Parallel deep neural network computation on large graphs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 443--458, Renton, WA, July 2019.
[18] Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj D. Kalamkar, Nesreen K. Ahmed, and Sasikanth Avancha. DistGNN: Scalable distributed training for large-scale graph neural networks. CoRR, abs/2104.06700, 2021.
[19] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. CoRR, abs/1703.06103, 2017.
[20] John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, and Guoqing Harry Xu. Dorylus: Affordable, scalable, and accurate GNN training with distributed CPU servers and serverless threads. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 495--514. USENIX Association, July 2021.
[21] Alok Tripathy, Katherine Yelick, and Aydin Buluc. Reducing communication in graph neural network training. arXiv preprint arXiv:2005.03300, 2020.
[22] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. CoRR, abs/1710.10903, 2018.
[23] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep Graph Library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
[24] D. Randall Wilson and Tony R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429--1451, 2003.
[25] Dalong Zhang, Xin Huang, Ziqi Liu, Zhiyang Hu, Xianzheng Song, Zhibang Ge, Zhiqiang Zhang, Lin Wang, Jun Zhou, Yang Shuang, and Yuan Qi. AGL: A scalable system for industrial-purpose graph machine learning. arXiv preprint arXiv:2003.02454, 2020.
[26] Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. DistDGL: Distributed graph neural network training for billion-scale graphs. arXiv preprint arXiv:2010.05337, 2021.
[27] Rong Zhu, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai, Yong Li, and Jingren Zhou. AliGraph: A comprehensive graph neural network platform. arXiv preprint arXiv:1902.08730, 2019.
[28] Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. Layer-dependent importance sampling for training deep and large graph convolutional networks. arXiv preprint arXiv:1911.07323, 2019.



Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022, 5033 pages
ISBN: 9781450393850
DOI: 10.1145/3534678

Publisher: Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. distributed training
      2. graph neural networks


