Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3404397.3404433acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicppConference Proceedingsconference-collections
research-article

Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems

Published: 17 August 2020 Publication History

Abstract

We propose, implement and evaluate a work stealing based scheduler, called HWS, for graph processing on heterogeneous CPU-FPGA systems that tightly couple the CPU and the FPGA to share system memory. HWS addresses unique concerns that arise with work stealing in the context of our target system. Our evaluation is conducted on the Intel Heterogeneous Architecture Research Platform (HARPv2), using three key processing kernels and seven real-world graphs. We show that HWS effectively balances workloads. Further, the use of HWS results in better graph processing performance compared to static scheduling and a representative of existing adaptive partitioning techniques, called HAP. Improvements vary by graph processing application, input graph and number of threads, and can be up to 100% over static scheduling, and up to 17% over HAP. We also compare to an oracle chunk self-scheduler, in which the best chunk size is known a priori for each number of threads and each input graph. HWS performs no worse than 1-3% in most cases. Finally, our graph processing throughput scales well with increasing threads. These results collectively demonstrate the effectiveness of work stealing for graph processing on our heterogeneous target platform.

References

[1]
U. A. Acar, A. Chargueraud, and M. Rainey. 2013. Scheduling Parallel Programs by Work Stealing with Private Deques. In Proc. of Symp. on Principles and Practice of Parallel Programming. 219–228.
[2]
M. E. Belviranli, L. N. Bhuyan, and R. Gupta. 2013. A Dynamic Self-Scheduling Scheme for Heterogeneous Multiprocessor Architectures. ACM Trans. Archit. Code Optim. 9, 4 (2013).
[3]
R. D. Blumofe and Ch. E. Leiserson. 1999. J. ACM 46, 5 (Sept. 1999), 720–748.
[4]
E. Chacko and S. Ranganathan. 2011. Graphs in Bioinformatics. In Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications, A. Y. Zomaya and M. Elloumi (Eds.). O’Reily, Chapter 10.
[5]
G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. 2008. Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing. In Proc. of Int’l Conf. on Parallel Processing. 536–545.
[6]
G. Dai, Y. Chi, Y. Wang, and H. Yang. 2016. FPGP: Graph Processing Framework on FPGA: A Case Study of Breadth-First Search. In Proc. of Symp. on Field-Programmable Gate Arrays. 105–110.
[7]
J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. 2009. Scalable Work Stealing. In Proc. of Conf. on High Performance Computing Networking, Storage and Analysis. 53:1–53:11.
[8]
N. Engelhardt and H. K. So. 2016. GraVF: A vertex-centric distributed graph processing framework on FPGAs. In Proc. of Int’l Conf. on Field Programmable Logic and Applications (FPL). 1–4.
[9]
M. Frigo, C.E. Leiserson, and K.H. Randall. 1998. The Implementation of the Cilk-5 Multithreaded Language. In Proc. of Conf. on Programming Language Design and Implementation. 212–223.
[10]
M. D. Galanis, A. Milidonis, G. Theodoridis, D. Soudris, and C. E. Goutis. 2005. A partitioning methodology for accelerating applications in hybrid reconfigurable platforms. In Proc. of Design, Automation and Test in Europe. 247–252 Vol. 3.
[11]
Graph 500. 2019. Graph500 Benchmarks. http://www.graph500.org
[12]
P. Gupta. 2015. Xeon+FPGA Platform for the Data Center. http://www.ece.cmu.edu/~calcm/carl/doku.php?id=pk_gupta_intel_xeon_fpga_platform_for_the_data_center
[13]
D. Hendler and N. Shavit. 2002. Non-Blocking Steal-Half Work Queues. In Proc. of Symp. on Principles of Distributed Computing. 280–289.
[14]
C. Hong, A. Sukumaran-Rajam, J. Kim, and P. Sadayappan. 2017. MultiGraph: Efficient Graph Processing on GPUs. In Proc. of Parallel Architectures and Compilation Techniques.
[15]
Intel Corp.2019. Intel Acceleration Stack for Intel Xeon CPU with FPGAs Core Cache Interface (CCI-P) Reference Manual. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl-ias-ccip.pdf
[16]
Intel Corp.2019. The Open Programmable Acceleration Engine (OPAE). https://01.org/opae
[17]
Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. 2013. Mizan: A System for Dynamic Load Balancing in Large-Scale Graph Processing. In Proc. of the European Conference on Computer Systems. 169–182.
[18]
H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social network or a news media?. In Proc. of int’l Conf. on World Wide Web. 591–600.
[19]
J. Leskovec and A. Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.
[20]
Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow. 5, 8 (2012), 716–727.
[21]
G. Malewicz, M. H. Austern, A. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proc. of Int’l Conf. on Management of Data. 135–146.
[22]
D. Merrill, M. Garland, and A. Grimshaw. 2012. Scalable GPU Graph Traversal. In Proc. of Symp. on Principles and Practice of Parallel Programming. 117–128.
[23]
R. Nakashima, H. Yoritaka, M. Yasugi, T. Hiraishi, and S. Umatani. 2019. Extending a Work-Stealing Framework with Priorities and Weights. In Proc. of Workshop on Irregular Applications: Architectures and Algorithms. 9–16.
[24]
A. Navarro, F. Corbera, A. Rodriguez, A. Vilches, and R. Asenjo. 2019. Heterogeneous Parallel_for Template for CPU—GPU Chips. J. Parallel Programming 47, 2 (April 2019), 213–233.
[25]
D. Nguyen, A. Lenharth, and K. Pingali. 2013. A Lightweight Infrastructure for Graph Analytics. In Proc. of Symp. on Operating Systems Principles. 456–471.
[26]
F. O’Brien. 2020. A Streamig Accelerator for Graph Analytics on Tightly-Coupled CPU-FPGA Systems. Master’s thesis. University of Toronto.
[27]
S. Perarnau and M. Sato. 2014. Victim Selection and Distributed Work Stealing Performance: A Case Study. In Proc. of Parallel and Distributed Processing Symp.659–668.
[28]
N. Ramanathan, J. Wickerson, F. Winterstein, and G. Constantinides. 2016. A Case for Work-Stealing on FPGAs with OpenCL Atomics. In Proc. of Int’l Symp. on Field-Programmable Gate Arrays. 48–53.
[29]
A. Rodriguez, A. Navarro, R. Asenjo, F. Corbera, R. Gran Tejero, D. Suarez Gracia, and J. Nunez-Yanez. 2019. Parallel multiprocessing and scheduling on the heterogeneous Xeon+FPGA platform. Journal of Supercomputing (06 2019).
[30]
A. Roy, I. Mihailovic, and W. Zwaenepoel. 2013. X-Stream: Edge-Centric Graph Processing Using Streaming Partitions. In Proc. of Symp. on Operating Systems Principles. 472–488.
[31]
X. Shi, Z. Zheng, Y. Zhou, H. Jin, L. He, B. Liu, and Q. Hua. 2018. Graph Processing on GPUs: A Survey. ACM Comput. Surv. 50, 6 (2018).
[32]
S. Song, M. Li, X. Zheng, M. LeBeane, J. H. Ryoo, R. Panda, A. Gerstlauer, and L. K. John. 2016. Proxy-Guided Load Balancing of Graph Processing Workloads on Heterogeneous Clusters. In Proc. of Int’l Conf. on Parallel Processing. 77–86.
[33]
P. Stutz, A. Bernstein, and W. Cohen. 2010. Signal/Collect: Graph Algorithms for the (Semantic) Web. In Proc. of Int’l Semantic Web Conference on The Semantic Web - Volume Part I. 764–780.
[34]
J. L. Tripp, A. A. Hanson, M. Gokhale, and H. Mortveit. 2005. Partitioning Hardware and Software for Reconfigurable Supercomputing Applications: A Case Study. In Proc. of Conference on Supercomputing. 27–27.
[35]
Q. D. Truong, Q. B. Truong, and T. Dkaki. 2016. Graph Methods for Social Network Analysis. In Nature of Computation and Communication, P. C. Vinh and L. Barolli (Eds.). 276–286.
[36]
A. Vilches, R. Asenjo, A. G. Navarro, F. Corbera, R. Gran Tejero, and M. Garzarán. 2015. Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips. In Proc. of the Int’l Conf. on Computational Science, Vol. 51. 140–149.
[37]
Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. 2016. Gunrock: A High-Performance Graph Processing Library on the GPU. In Proc. of Symp. on Principles and Practice of Parallel Programming. 117–128.
[38]
Y. Wang, J. C. Hoe, and E. Nurvitadhi. 2019. Processor Assisted Worklist Scheduling for FPGA Accelerated Graph Processing on a Shared-Memory Platform. In Proc. of Symp. on Field-Programmable Custom Computing Machines. 136–144.
[39]
B. Wile. 2014. CAPI is Core to POWER. http://www-03.ibm.com/linux/blogs/capi/
[40]
Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. 2019. A Comprehensive Survey on Graph Neural Networks. CoRR abs/1901.00596(2019). arxiv:1901.00596http://arxiv.org/abs/1901.00596
[41]
Xilinx Inc.2014. Zynq-7000: all programmable SoC. http://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html
[42]
Y. Xuejun, C. Haibo, C. Yungui, C. Fujie, and C. Lijie. 1990. Processor self-scheduling for parallel loops in preemptive environments. Future Generation Computer Systems 6, 1 (1990), 97–103.
[43]
J. H. C. Yeung, C. C. Tsang, K. H. Tsoi, B. S. H. Kwan, C. C. C. Cheung, A. P. C. Chan, and P. H. W. Leong. 2008. Map-reduce as a Programming Model for Custom Computing Machines. In Proc. of Symp. on Field-Programmable Custom Computing Machines. 149–159.
[44]
S. Zhou, R. Kannan, V. K. Prasanna, G. Seetharaman, and Q. Wu. 2019. HitGraph: High-throughput Graph Processing Framework on FPGA. IEEE Transactions on Parallel and Distributed Systems 30, 10 (Oct 2019).
[45]
S. Zhou and V. K. Prasanna. 2017. Accelerating Graph Analytics on CPU-FPGA Heterogeneous Platform. In Proc. of Int’l Symp. on Computer Architecture and High Performance Computing (SBAC-PAD). 137–144.

Cited By

View all
  • (2024)A disk I/O optimized system for concurrent graph processing jobsFrontiers of Computer Science10.1007/s11704-023-2361-018:3Online publication date: 22-Jan-2024
  • (2023)Efficient Data Streaming for a Tightly-Coupled Coarse-Grained Reconfigurable Array2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW59300.2023.00075(435-443)Online publication date: May-2023
  • (2023)A machine learning-based resource-efficient task scheduler for heterogeneous computer systemsThe Journal of Supercomputing10.1007/s11227-023-05266-479:14(15700-15728)Online publication date: 1-Sep-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
ICPP '20: Proceedings of the 49th International Conference on Parallel Processing
August 2020
844 pages
ISBN:9781450388160
DOI:10.1145/3404397
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 August 2020

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. CPU-FPGA systems
  2. Heterogeneous processing
  3. graph processing
  4. load balancing
  5. work stealing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

Conference

ICPP '20

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)42
  • Downloads (Last 6 weeks)1
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A disk I/O optimized system for concurrent graph processing jobsFrontiers of Computer Science10.1007/s11704-023-2361-018:3Online publication date: 22-Jan-2024
  • (2023)Efficient Data Streaming for a Tightly-Coupled Coarse-Grained Reconfigurable Array2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW59300.2023.00075(435-443)Online publication date: May-2023
  • (2023)A machine learning-based resource-efficient task scheduler for heterogeneous computer systemsThe Journal of Supercomputing10.1007/s11227-023-05266-479:14(15700-15728)Online publication date: 1-Sep-2023
  • (2022)HybriDC: A Resource-Efficient CPU-FPGA Heterogeneous Acceleration System for Lossless Data CompressionMicromachines10.3390/mi1311202913:11(2029)Online publication date: 19-Nov-2022
  • (2022)Analysis of Graph Processing in Reconfigurable Devices for Edge Computing Applications2022 25th Euromicro Conference on Digital System Design (DSD)10.1109/DSD57027.2022.00012(16-23)Online publication date: Aug-2022
  • (2021)A Streaming Accelerator for Heterogeneous CPU-FPGA Processing of Graph Applications2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW52791.2021.00014(26-35)Online publication date: Jun-2021

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media