
Liberator: A Data Reuse Framework for Out-of-Memory Graph Computing on GPUs

Published: 01 June 2023

Abstract

Graph analytics is widely used in applications such as recommender systems, scientific computing, and data mining. Meanwhile, GPUs have become the major accelerators for such applications. However, graph sizes are growing rapidly and often exceed GPU memory capacity, incurring severe performance degradation due to frequent data transfers between main memory and the GPU. To alleviate this problem, we focus on better utilizing the data already resident in GPU memory by exploiting data reuse across iterations. In our study, we analyze the memory access patterns of graph applications in depth at different granularities. We find that the memory footprint is accessed in a roughly sequential scan without hotspots, which implies an extremely long reuse distance. Based on this observation, we propose a novel framework, called Liberator, to exploit data reuse within GPU memory. In Liberator, GPU memory is reserved for data that is potentially accessed across iterations, avoiding excessive data transfers between main memory and the GPU. For data that does not reside in GPU memory, a Merged and Aligned memory access scheme is employed to improve transfer efficiency. We further optimize the framework by processing GPU-resident data and host-resident data in parallel. We have implemented a prototype of the Liberator framework and conducted a series of performance evaluation experiments. The results show that Liberator significantly reduces data transfer overhead, achieving an average speedup of 2.7x over a state-of-the-art approach.
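The following CUDA sketch is not the paper's implementation; it only illustrates, under assumed names (sum_degrees, resident_edges, split) and a toy 50/50 data split, the general idea the abstract describes: a partition reused across iterations stays resident in device memory, while the remainder lives in pinned host memory and is read by the kernel through zero-copy accesses that consecutive threads keep merged and aligned.

// Minimal sketch, assuming a device-resident partition plus a pinned,
// zero-copy host partition; not Liberator's actual code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_degrees(const int *resident_edges, size_t resident_cnt,
                            const int *host_edges, size_t host_cnt,
                            unsigned long long *out) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    unsigned long long local = 0;
    // Device-resident partition: ordinary global-memory reads.
    for (size_t k = i; k < resident_cnt; k += stride)
        local += resident_edges[k];
    // Host-resident partition: zero-copy reads over PCIe; consecutive threads
    // touch consecutive elements, so each warp issues merged, aligned transfers.
    for (size_t k = i; k < host_cnt; k += stride)
        local += host_edges[k];
    atomicAdd(out, local);
}

int main() {
    const size_t n = 1 << 20, split = n / 2;   // assumed 50/50 split for illustration
    int *h_all = nullptr, *d_resident = nullptr, *d_mapped_base = nullptr;
    unsigned long long *d_out = nullptr, h_out = 0;

    // Pinned, mapped host buffer holding the whole edge array.
    cudaHostAlloc(&h_all, n * sizeof(int), cudaHostAllocMapped);
    for (size_t i = 0; i < n; ++i) h_all[i] = 1;

    // Reserve GPU memory only for the partition reused across iterations.
    cudaMalloc(&d_resident, split * sizeof(int));
    cudaMemcpy(d_resident, h_all, split * sizeof(int), cudaMemcpyHostToDevice);

    // The remaining data stays in pinned host memory, accessed zero-copy.
    cudaHostGetDevicePointer((void **)&d_mapped_base, h_all, 0);

    cudaMalloc(&d_out, sizeof(unsigned long long));
    cudaMemset(d_out, 0, sizeof(unsigned long long));

    sum_degrees<<<256, 256>>>(d_resident, split, d_mapped_base + split,
                              n - split, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("sum = %llu (expect %zu)\n", h_out, n);

    cudaFree(d_resident); cudaFree(d_out); cudaFreeHost(h_all);
    return 0;
}

In a fuller version of this idea, the resident and host-resident partitions could be processed by concurrent kernels or streams to overlap PCIe traffic with computation, matching the parallel-processing optimization mentioned above.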



Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 6
June 2023, 299 pages

Publisher

IEEE Press


Qualifiers

  • Research-article

