SIMD-X: programming and processing of graph algorithms on GPUs

Authors:

H. Howie Huang

USENIX ATC '19: Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference

Pages 411 - 427

Published: 10 July 2019

Abstract

With high computation power and memory bandwidth, graphics processing units (GPUs) lend themselves to accelerating data-intensive analytics, especially when such applications fit the single instruction multiple data (SIMD) model. However, graph algorithms such as breadth-first search and k-core often fail to take full advantage of GPUs because of irregularity in memory access and control flow. To address this challenge, we have developed SIMD-X, for programming and processing of single instruction multiple, complex, data on GPUs. Specifically, the new Active-Compute-Combine (ACC) model not only makes programming easier, but more importantly creates opportunities for system-level optimizations. To this end, SIMD-X utilizes just-in-time task management, which filters out inactive vertices at runtime and intelligently maps various tasks to different numbers of GPU cores in pursuit of workload balancing. In addition, SIMD-X leverages push-pull based kernel fusion that, with the help of a new deadlock-free global barrier, reduces a large number of computation kernels to very few. Using SIMD-X, a user can program a graph algorithm in tens of lines of code, while achieving 3×, 6×, 24×, and 3× speedups over Gunrock, Galois, CuSha, and Ligra, respectively.
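
The abstract names the three phases of the ACC model but does not give their signatures, so the following is only a minimal sequential sketch of how an Active-Compute-Combine loop might be structured, here applied to single-source shortest paths. The names `acc_run`, `sssp`, `active`, `compute`, `combine`, and `example_graph` are hypothetical and are not the SIMD-X API. The sketch runs on the CPU in Python; it illustrates the filter-then-update structure only, not the GPU task mapping, kernel fusion, or global barrier described above.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical adjacency-list graph type: vertex -> [(neighbor, edge weight)].
Graph = Dict[int, List[Tuple[int, int]]]
INF = float("inf")


def acc_run(graph: Graph, init: Dict[int, float],
            active: Callable, compute: Callable, combine: Callable) -> Dict[int, float]:
    """Illustrative Active-Compute-Combine driver (an assumption, not SIMD-X code)."""
    curr = dict(init)
    prev = {v: None for v in graph}  # no previous state before the first iteration
    while True:
        # Active: keep only vertices whose state changed in the last iteration.
        frontier = [v for v in graph if active(curr[v], prev[v])]
        if not frontier:
            return curr
        prev = dict(curr)
        # Compute: every active vertex proposes a value along each out-edge.
        proposals: Dict[int, List[float]] = {}
        for u in frontier:
            for v, w in graph[u]:
                proposals.setdefault(v, []).append(compute(prev[u], w))
        # Combine: merge all proposals for a destination with its current value.
        for v, vals in proposals.items():
            curr[v] = combine([curr[v]] + vals)


def sssp(graph: Graph, source: int) -> Dict[int, float]:
    """Shortest-path distances from source, expressed with the three callbacks."""
    init = {v: INF for v in graph}
    init[source] = 0
    return acc_run(
        graph,
        init,
        active=lambda cur, old: cur != INF and cur != old,  # reached and just updated
        compute=lambda dist_u, w: dist_u + w,               # relax one edge
        combine=min,                                        # keep the shortest proposal
    )


example_graph: Graph = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}
print(sssp(example_graph, 0))  # {0: 0, 1: 3, 2: 1, 3: 4}
```

In this sketch only the three small callbacks are algorithm-specific, which is one plausible reading of the abstract's claim that an algorithm fits in tens of lines. Setting every edge weight to 1 turns the same callbacks into breadth-first search levels.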


Cited By

Bowman B, Huang H (2021). Towards Next-Generation Cybersecurity with Graph AI. ACM SIGOPS Operating Systems Review 55:1 (61-67). DOI: 10.1145/3469379.3469386. Online publication date: 6-Jun-2021.
https://dl.acm.org/doi/10.1145/3469379.3469386

Brahmakshatriya A, Zhang Y, Hong C, Kamil S, Shun J, Amarasinghe S, Lee J (2021). Compiling graph applications for GPUs with graphit. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (248-261). DOI: 10.1109/CGO51591.2021.9370321. Online publication date: 27-Feb-2021.
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370321

Published In

USENIX ATC '19: Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference

July 2019

1076 pages

ISBN: 9781939133038

Program Chairs:
Dan Tsafrir (Technion-Israel Institute of Technology & VMware Research), Dahlia Malkhi (VMware Research)

Sponsors

VMware
Nutanix
NSF
Facebook
Oracle

Publisher

USENIX Association

United States
