DOI: 10.1145/3613424.3614305
Research Article | Open Access

TT-GNN: Efficient On-Chip Graph Neural Network Training via Embedding Reformation and Hardware Optimization

Published: 08 December 2023

Abstract

Training Graph Neural Networks on large graphs is challenging because of the need to store graph data and move it through the memory hierarchy. In this work, we tackle this challenge by compressing the graph embedding matrix so that model training can be carried out entirely with on-chip compute and memory resources. Specifically, we leverage the graph homophily property and use a Tensor-train representation of the graph embedding, which allows nodes with similar neighborhoods to partially share their feature representations.
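To make the embedding-sharing idea concrete, the following is a minimal NumPy sketch of a TT-matrix embedding table in the spirit of TT-Rec and tensorized embedding layers, which the paper builds on. The factor shapes, ranks, and function names are illustrative assumptions, not the configuration used in TT-GNN.

```python
import numpy as np

# Illustrative TT-matrix embedding table (hypothetical shapes and ranks).
# Factorize the table dimensions: N = n1*n2*n3 rows, D = d1*d2*d3 features.
n_factors = (8, 10, 12)   # N = 960 nodes
d_factors = (4, 4, 8)     # D = 128 features per node
ranks = (1, 16, 16, 1)    # TT ranks r0..r3, with r0 = r3 = 1

rng = np.random.default_rng(0)
# Core k has shape (r_{k-1}, n_k, d_k, r_k). Storage is the sum of core
# sizes (12,288 values here) instead of N*D = 122,880 for the dense table.
cores = [rng.standard_normal((ranks[k], n_factors[k], d_factors[k], ranks[k + 1])) * 0.1
         for k in range(3)]

def tt_embedding_row(node_id):
    """Reconstruct one row of the (virtual) N x D embedding matrix."""
    # Map the flat node id to a multi-index (i1, i2, i3); nodes whose
    # multi-indices overlap share the corresponding core slices.
    idx = np.unravel_index(node_id, n_factors)
    result = cores[0][:, idx[0], :, :]            # shape (1, d1, r1)
    for k in range(1, 3):
        slice_k = cores[k][:, idx[k], :, :]       # shape (r_{k-1}, d_k, r_k)
        # Contract over the shared rank dimension, then merge feature dims.
        result = np.einsum('adr,rek->adek', result, slice_k)
        result = result.reshape(1, -1, slice_k.shape[-1])
    return result.reshape(-1)                     # length D = d1*d2*d3

row = tt_embedding_row(123)
print(row.shape)   # (128,)
```

Because a node's embedding is assembled from per-factor core slices, nodes with similar index prefixes reuse the same partial products, which is the partial feature sharing described above.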
While the Tensor-train representation reduces the size of the graph embedding, it poses several challenges for hardware design. On one hand, the low-rank representation requires features to be decompressed before being fed to the GNN model, which introduces extra computation. On the other hand, the decompressed features may still exceed on-chip memory capacity even in the minibatch setting, causing inefficient off-chip memory accesses. We therefore propose the TT-GNN hardware accelerator with a specialized dataflow tailored for on-chip Tensor-train GNN learning. Based on the on-chip memory capacity and the training configuration, TT-GNN adaptively breaks a minibatch into smaller microbatches that fit on chip. The microbatch composition and scheduling order are designed to maximize data reuse and reduce redundant computation both across and within microbatches. To mitigate the TT computation overhead, we further propose a unified algorithm that jointly handles TT decompression during forward propagation and TT gradient derivation during backward propagation. Evaluated on a series of benchmarks, the proposed software-hardware solution outperforms existing CPU-GPU training systems in both training performance (1.55x to 4210x) and energy efficiency (2.83x to 2254x). We believe TT-GNN offers a new perspective on large-scale GNN training and makes it possible to train GNN models even under a severely constrained resource budget.
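The microbatching step can be pictured with the hedged sketch below: a sampled minibatch is greedily split into microbatches whose decompressed features would fit within an assumed on-chip buffer. The memory budget, cost model, and id-based grouping heuristic are placeholders for illustration only; the accelerator's actual scheduler also optimizes data reuse across microbatches.

```python
# Hedged sketch of minibatch-to-microbatch partitioning under a memory budget.
# All constants and the grouping heuristic are assumptions for illustration.

def split_into_microbatches(node_ids, feature_dim, bytes_per_elem=4,
                            onchip_budget_bytes=2048):
    """Greedily pack nodes into microbatches whose features fit on chip."""
    per_node_bytes = feature_dim * bytes_per_elem
    max_nodes = max(1, onchip_budget_bytes // per_node_bytes)
    # Sorting by node id groups nodes with shared TT index prefixes, so
    # partially computed core products can be reused within a microbatch.
    ordered = sorted(node_ids)
    return [ordered[i:i + max_nodes] for i in range(0, len(ordered), max_nodes)]

minibatch = [7, 959, 3, 4, 5, 123, 124, 125]
for mb in split_into_microbatches(minibatch, feature_dim=128):
    print(mb)   # two microbatches of four nodes each under this toy budget
```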



Information

Published In

MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
October 2023
1528 pages
ISBN: 9798400703294
DOI: 10.1145/3613424
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 December 2023


Author Tags

  1. Graph Neural Networks
  2. Hardware Accelerator
  3. Tensor-train Decomposition

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MICRO '23

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

