Ascetic: Enhancing Cross-Iterations Data Efficiency in Out-of-Memory Graph Processing on GPUs

Authors:

Pen-Chung Yew

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing

Article No.: 41, Pages 1 - 10

https://doi.org/10.1145/3472456.3472457

Published: 05 October 2021

Abstract

Graph analytics is widely used in real-world applications, and GPUs are major accelerators for such workloads. However, when a graph is significantly larger than the capacity of GPU memory, performance can degrade sharply because of the heavy overhead of moving large amounts of graph data between CPU main memory and GPU memory.

Some existing approaches exploit data locality and address the memory oversubscription problem on GPUs. However, these approaches have yet to take advantage of data reuse across iterations, because of the large data sizes in most large-graph analytics. In our studies, we have found that in most graph applications the graph traversals exhibit a roughly sequential scan over the graph data with an extremely large memory footprint. Based on this observation, we propose a novel framework, called Ascetic, that exploits temporal locality with very long reuse distances.
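To illustrate why a roughly sequential scan with a footprint larger than GPU memory defeats recency-based caching, the toy Python model below compares an LRU policy with a fixed set of chunks that is never evicted. It is only a sketch under stated assumptions, not Ascetic's implementation: the chunk granularity, the function names `lru_hits` and `pinned_hits`, and the choice to pin the first `capacity` chunks are all hypothetical.

```python
from collections import OrderedDict


def lru_hits(num_chunks, capacity, iterations):
    """Count cache hits for an LRU cache under repeated sequential scans."""
    cache = OrderedDict()
    hits = 0
    for _ in range(iterations):
        for chunk in range(num_chunks):
            if chunk in cache:
                cache.move_to_end(chunk)  # mark as most recently used
                hits += 1
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict the least recently used chunk
                cache[chunk] = True
    return hits


def pinned_hits(num_chunks, capacity, iterations):
    """Count hits when the first `capacity` chunks are loaded once and never evicted."""
    pinned = set(range(min(capacity, num_chunks)))
    hits = 0
    for it in range(iterations):
        for chunk in range(num_chunks):
            # The first iteration pays to load the pinned chunks, so it scores no hits.
            if it > 0 and chunk in pinned:
                hits += 1
    return hits


# 10 chunks scanned 5 times with room for only 4 chunks.
print(lru_hits(10, 4, 5))     # 0: every chunk is evicted before it is reused
print(pinned_hits(10, 4, 5))  # 16: 4 pinned chunks hit in each of the 4 later scans
```

In this model the reuse distance of every chunk equals the full scan length, which exceeds the cache capacity, so LRU never sees a hit, while a fixed resident set captures reuse in proportion to its share of the data.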

In Ascetic, the GPU memory is divided into a Static Region and an On-demand Region. The Static Region exploits data reuse across iterations. The On-demand Region loads the data that is requested in the current iteration of the graph traversal but is not found in the Static Region.
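The two-region split can be sketched with a small Python model that counts how many bytes would need to be copied from host to device in each iteration. This is a hedged illustration rather than the authors' code: the function name `transfer_volume`, the chunk-based bookkeeping, and the assumption that the static region is preloaded once before the first iteration are all hypothetical.

```python
def transfer_volume(active_per_iter, static_chunks, chunk_bytes):
    """Return host-to-device bytes per iteration in a toy two-region model.

    active_per_iter: iterable of sets of chunk ids requested in each iteration.
    static_chunks:   chunk ids assumed to be resident in the static region.
    chunk_bytes:     size of one chunk in bytes (assumed uniform).
    """
    static = set(static_chunks)
    volumes = []
    for active in active_per_iter:
        # Chunks already resident in the static region need no transfer;
        # the rest must pass through the on-demand region in this iteration.
        on_demand = [c for c in active if c not in static]
        volumes.append(len(on_demand) * chunk_bytes)
    return volumes


# Hypothetical access pattern over three iterations, with 4-byte chunks.
iters = [{0, 1, 2, 3}, {1, 2, 5}, {2, 3, 4, 5}]
print(transfer_volume(iters, static_chunks={0, 1, 2}, chunk_bytes=4))  # [4, 4, 12]
print(transfer_volume(iters, static_chunks=set(), chunk_bytes=4))      # [16, 12, 16]
```

The second call models a configuration with no static region, where every requested chunk is transferred in every iteration; the difference between the two results is the traffic saved by cross-iteration reuse in this toy setting.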

We have implemented a prototype of the Ascetic framework and conducted a series of experiments to evaluate its performance. The experimental results show that Ascetic significantly reduces the data transfer overhead and allows more overlapped execution between the GPU and the CPU, which leads to an average speedup of 2.0x over a state-of-the-art approach.

Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing

August 2021

927 pages

ISBN: 9781450390682

DOI: 10.1145/3472456

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Faculty startup funding of the University of Georgia
National Key Research and Development Program of China

Conference

ICPP 2021: 50th International Conference on Parallel Processing

August 9 - 12, 2021

Lemont, IL, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
