An efficient graph accelerator with parallel data conflict management
Proceedings of the 27th International Conference on Parallel Architectures …, 2018•dl.acm.org
Graph-specific computing with the support of dedicated accelerator has greatly boosted the
graph processing in both efficiency and energy. Nevertheless, their data conflict
management is still sequential when certain vertex needs a large number of conflicting
updates at the same time, leading to prohibitive performance degradation. This is
particularly true and serious for processing natural graphs. In this paper, we have the insight
that the atomic operations for the vertex updating of many graph algorithms (eg, BFS …
graph processing in both efficiency and energy. Nevertheless, their data conflict
management is still sequential when certain vertex needs a large number of conflicting
updates at the same time, leading to prohibitive performance degradation. This is
particularly true and serious for processing natural graphs. In this paper, we have the insight
that the atomic operations for the vertex updating of many graph algorithms (eg, BFS …
Graph-specific computing with the support of dedicated accelerator has greatly boosted the graph processing in both efficiency and energy. Nevertheless, their data conflict management is still sequential when certain vertex needs a large number of conflicting updates at the same time, leading to prohibitive performance degradation. This is particularly true and serious for processing natural graphs.
In this paper, we have the insight that the atomic operations for the vertex updating of many graph algorithms (e.g., BFS, PageRank, and WCC) are typically incremental and simplex. This hence allows us to parallelize the conflicting vertex updates in an accumulative manner. We architect AccuGraph, a novel graph-specific accelerator that can simultaneously process atomic vertex updates for massive parallelism while ensuring the correctness. A parallel accumulator is designed to remove the serialization in atomic protections for conflicting vertex updates through merging their results in parallel. Our implementation on Xilinx FPGA with a wide variety of typical graph algorithms shows that our accelerator achieves an average throughput by 2.36 GTEPS as well as up to 3.14x performance speedup in comparison with state-of-the-art ForeGraph (with its single-chip version).
ACM Digital Library