
Liberator: A Data Reuse Framework for Out-of-Memory Graph Computing on GPUs

Published: 01 June 2023

Abstract

Graph analytics is widely used in applications such as recommender systems, scientific computing, and data mining. Meanwhile, GPUs have become the major accelerators for such applications. However, graph sizes are growing rapidly and often exceed GPU memory capacity, incurring severe performance degradation due to frequent data transfers between main memory and the GPU. To alleviate this problem, we focus on better utilizing the data already resident in GPU memory by exploiting data reuse across iterations. In our study, we analyze the memory access patterns of graph applications in depth at different granularities. We find that the memory footprint is accessed in a roughly sequential scan without hotspots, which implies an extremely long reuse distance. Based on this observation, we propose a novel framework, called Liberator, to exploit data reuse within GPU memory. In Liberator, GPU memory is reserved for data that is potentially accessed across iterations, avoiding excessive data transfers between main memory and the GPU. For data that does not reside in GPU memory, a Merged and Aligned memory access scheme is employed to improve transfer efficiency. We further optimize the framework by processing GPU-resident data and host-resident data in parallel. We have implemented a prototype of the Liberator framework and conducted a series of performance evaluation experiments. The results show that Liberator significantly reduces data transfer overhead, achieving an average speedup of 2.7x over a state-of-the-art approach.
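The following CUDA sketch is not the paper's implementation; it only illustrates, under assumed names (sum_degrees, resident_edges, split) and a toy 50/50 data split, the general idea the abstract describes: a partition reused across iterations stays resident in device memory, while the remainder lives in pinned host memory and is read by the kernel through zero-copy accesses that consecutive threads keep merged and aligned.

// Minimal sketch, assuming a device-resident partition plus a pinned,
// zero-copy host partition; not Liberator's actual code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_degrees(const int *resident_edges, size_t resident_cnt,
                            const int *host_edges, size_t host_cnt,
                            unsigned long long *out) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    unsigned long long local = 0;
    // Device-resident partition: ordinary global-memory reads.
    for (size_t k = i; k < resident_cnt; k += stride)
        local += resident_edges[k];
    // Host-resident partition: zero-copy reads over PCIe; consecutive threads
    // touch consecutive elements, so each warp issues merged, aligned transfers.
    for (size_t k = i; k < host_cnt; k += stride)
        local += host_edges[k];
    atomicAdd(out, local);
}

int main() {
    const size_t n = 1 << 20, split = n / 2;   // assumed 50/50 split for illustration
    int *h_all = nullptr, *d_resident = nullptr, *d_mapped_base = nullptr;
    unsigned long long *d_out = nullptr, h_out = 0;

    // Pinned, mapped host buffer holding the whole edge array.
    cudaHostAlloc(&h_all, n * sizeof(int), cudaHostAllocMapped);
    for (size_t i = 0; i < n; ++i) h_all[i] = 1;

    // Reserve GPU memory only for the partition reused across iterations.
    cudaMalloc(&d_resident, split * sizeof(int));
    cudaMemcpy(d_resident, h_all, split * sizeof(int), cudaMemcpyHostToDevice);

    // The remaining data stays in pinned host memory, accessed zero-copy.
    cudaHostGetDevicePointer((void **)&d_mapped_base, h_all, 0);

    cudaMalloc(&d_out, sizeof(unsigned long long));
    cudaMemset(d_out, 0, sizeof(unsigned long long));

    sum_degrees<<<256, 256>>>(d_resident, split, d_mapped_base + split,
                              n - split, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("sum = %llu (expect %zu)\n", h_out, n);

    cudaFree(d_resident); cudaFree(d_out); cudaFreeHost(h_all);
    return 0;
}

In a fuller version of this idea, the resident and host-resident partitions could be processed by concurrent kernels or streams to overlap PCIe traffic with computation, matching the parallel-processing optimization mentioned above.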



Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 6
June 2023, 299 pages

Publisher

IEEE Press


Qualifiers

  • Research-article

