DOI: 10.1145/3307650.3322254

SCU: a GPU stream compaction unit for graph processing

Published: 22 June 2019

Abstract

Graph processing algorithms are key in many emerging applications in areas such as machine learning and data analytics. Although the processing of large-scale graphs exhibits a high degree of parallelism, the memory access patterns tend to be highly irregular, leading to poor GPGPU efficiency due to memory divergence. To ameliorate this issue, GPGPU applications perform a stream compaction operation at each iteration of the algorithm to extract the subset of active nodes/edges, so that subsequent steps operate on the compacted dataset.
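To make the operation concrete, the sketch below shows stream compaction in its standard prefix-sum formulation, the kernel the paper identifies as inefficient on GPGPU cores. This is an illustrative serial version, not the paper's design; on a GPU the exclusive scan and the scatter would each run as data-parallel steps.

```python
def stream_compact(elements, active):
    """Gather the elements whose active flag is set into a dense array."""
    # Exclusive prefix sum of the flags gives each active element
    # its destination index in the compacted output.
    offsets = []
    total = 0
    for flag in active:
        offsets.append(total)
        total += flag
    # Scatter: each active element writes to its precomputed slot.
    out = [None] * total
    for i, flag in enumerate(active):
        if flag:
            out[offsets[i]] = elements[i]
    return out

# Example: compact the active frontier of a graph traversal.
print(stream_compact([10, 11, 12, 13, 14], [1, 0, 1, 0, 1]))  # [10, 12, 14]
```

Because every element's destination is known after the scan, the scatter is fully parallel, but the many small, data-dependent writes are exactly the kind of divergent memory traffic the paper offloads to the SCU.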
We show that GPGPU architectures are inefficient for stream compaction, and propose to offload this task to a programmable Stream Compaction Unit (SCU) tailored to the requirements of this kernel. The SCU is a small unit tightly integrated in the GPU that efficiently gathers the active nodes/edges into a compacted array in memory. Applications can make use of it through a simple API. The remaining steps of the graph-based algorithm are executed on the GPU cores, benefiting from the large amount of parallelism in the GPU; since they operate on the SCU-prepared data, they achieve greater memory coalescing and, hence, much higher efficiency. In addition, the SCU filters repeated and already-visited nodes during the compaction process, significantly reducing the GPGPU workload, and writes the compacted nodes/edges in an order that improves memory coalescing by reducing memory divergence.
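The filtering step described above can be sketched in software terms as follows. This is a hypothetical illustration of the behavior (not the paper's hardware design or API): while building the next frontier, duplicate edge targets and already-visited nodes are discarded, so the GPU kernels never process them.

```python
def compact_frontier(edge_targets, visited):
    """Build the next frontier from gathered edge targets, discarding
    duplicates and already-visited nodes during compaction."""
    frontier = []
    seen = set()  # duplicates produced within this iteration
    for node in edge_targets:
        if node in visited or node in seen:
            continue  # filtered out: repeated or already processed
        seen.add(node)
        frontier.append(node)
    visited.update(seen)
    return frontier

# Example: nodes 0 and 1 were visited earlier; node 2 appears twice.
visited = {0, 1}
print(compact_frontier([2, 1, 3, 2, 4, 0], visited))  # [2, 3, 4]
```

Performing this deduplication while the stream is being compacted means the expensive per-edge work on the GPU cores runs only once per unique unvisited node, which is where the reported workload reduction comes from.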
We evaluate the performance of a state-of-the-art GPGPU architecture extended with our SCU for a wide variety of applications. Results show that, for high-performance and low-power GPU systems respectively, the SCU achieves speedups of 1.37x and 2.32x, energy savings of 84.7% and 69%, and an area increase of 3.3% and 4.1%.




Published In

ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
June 2019
849 pages
ISBN:9781450366694
DOI:10.1145/3307650
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. GPGPU
  2. graph processing
  3. stream compaction

Qualifiers

  • Research-article

Funding Sources

  • Agencia Estatal de Investigación

Conference

ISCA '19

Acceptance Rates

ISCA '19 Paper Acceptance Rate 62 of 365 submissions, 17%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%


Article Metrics

  • Downloads (Last 12 months): 44
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 16 Nov 2024


Cited By

  • (2024) Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMV. DATE 2024, pp. 1-6. DOI: 10.23919/DATE58400.2024.10546797
  • (2024) Redzone stream compaction: removing k items from a list in parallel O(k) time. ACM Transactions on Parallel Computing, vol. 11, no. 3, pp. 1-16. DOI: 10.1145/3675782
  • (2024) WER: Maximizing Parallelism of Irregular Graph Applications through GPU Warp EqualizeR. ASP-DAC 2024, pp. 201-206. DOI: 10.1109/ASP-DAC58780.2024.10473955
  • (2022) TDGraph. Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA '22), pp. 116-129. DOI: 10.1145/3470496.3527409
  • (2022) Energy-Efficient Stream Compaction Through Filtering and Coalescing Accesses in GPGPU Memory Partitions. IEEE Transactions on Computers, vol. 71, no. 7, pp. 1711-1723. DOI: 10.1109/TC.2021.3104749
  • (2022) Design of Software-Hardware Collaborative Graph Computing Accelerator Based on RISC-V. ICICM 2022, pp. 404-410. DOI: 10.1109/ICICM56102.2022.10011254
  • (2022) Only Buffer When You Need To: Reducing On-chip GPU Traffic with Reconfigurable Local Atomic Buffers. HPCA 2022, pp. 676-691. DOI: 10.1109/HPCA53966.2022.00056
  • (2022) Irregular accesses reorder unit: improving GPGPU memory coalescing for graph-based workloads. The Journal of Supercomputing, vol. 79, no. 1, pp. 762-787. DOI: 10.1007/s11227-022-04621-1
  • (2022) Current Application Fields. Software Defined Chips, pp. 167-277. DOI: 10.1007/978-981-19-7636-0_4
  • (2021) Improving Streaming Graph Processing Performance using Input Knowledge. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1036-1050. DOI: 10.1145/3466752.3480096
