research-article

Spatial hardware implementation for sparse graph algorithms in GraphStep

Authors:

Michael Delorimier,

Nachiket Kapre,

André DehonAuthors Info & Claims

ACM Transactions on Autonomous and Adaptive Systems (TAAS), Volume 6, Issue 3

Article No.: 17, Pages 1 - 20

https://doi.org/10.1145/2019583.2019584

Published: 29 September 2011 Publication History

Abstract

How do we develop programs that are easy to express, easy to reason about, and able to achieve high performance on massively parallel machines? To address this problem, we introduce GraphStep, a domain-specific compute model that captures algorithms that act on static, irregular, sparse graphs. In GraphStep, algorithms are expressed directly without requiring the programmer to explicitly manage parallel synchronization, operation ordering, placement, or scheduling details. Problems in the sparse graph domain are usually highly concurrent and communicate along graph edges. Exposing concurrency and communication structure allows scheduling of parallel operations and management of communication that is necessary for performance on a spatial computer. We study the performance of a semantic network application, a shortest-path application, and a max-flow/min-cut application. We introduce a language syntax for GraphStep applications. The total speedup over sequential versions of the applications studied ranges from a factor of 19 to a factor of 15,000. Spatially-aware graph optimizations (e.g., node decomposition, placement and route scheduling) delivered speedups from 3 to 30 times over a spatially-oblivious mapping.

Supplementary Material

Delorimier (delorimier.zip)

Supplemental movie, appendix, image and software files for, Spatial Hardware Implementation for Sparse Graph Algorithms in GraphStep

Download
77.13 KB

References

[1]

Agha, G. 1998. ACTORS: A model of Concurrent Computation in Distributed Systems. MIT Press, Cambridge, MA.

Digital Library

[2]

Blelloch, G. E., Chatterjee, S., Hardwick, J. C., Sipelstein, J., and Zagha, M. 1993. Implementation of a portable nested data-parallel language. In Proceedings 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 102--111.

Digital Library

[3]

Boykov, Y., Veksler, O., and Zabih, R. 1998. Markov random fields with efficient approximations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 648--655.

Digital Library

[4]

Brook Project. 2004. Brook project web page. http://brook.sourceforge.net.

[5]

Caldwell, A., Kahng, A., and Markov, I. 2000. Improved algorithms for hypergraph bipartitioning. In Proceedings of the Asia and South Pacific Design Automation Conference. 661--666.

Digital Library

[6]

Caspi, E., Chu, M., Huang, R., Weaver, N., Yeh, J., Wawrzynek, J., and DeHon, A. 2000. Stream computations organized for reconfigurable execution (SCORE): Extended abstract. In Proceedings of the International Conference on Field-Programmable Logic and Applications. Lecture Notes in Computer Science. Springer, 605--614.

Digital Library

[7]

Castanos, J. and Savage, J. 2000. Repartitioning unstructured adaptive meshes. In Proceedings of the Parallel and Distributed Processing Symposium. IEEE, 823--832.

Digital Library

[8]

Chandy, K. M. and Misra, J. 1982. Distributed computation on graphs: Shortest path algorithms. Comm. ACM 25, 11, 833--837.

Digital Library

[9]

Cormen, T., Leiserson, C., and Rivest, R. 1990. Introduction to Algorithms. MIT Press, Cambridge, MA.

Digital Library

[10]

Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the Symposium on Operating System Design and Implementation. 137--150.

Digital Library

[11]

DeHon, A. 2000. Compact, multilayer layout for butterfly fat-tree. In Proceedings of the 12th ACM Symposium on Parallel Algorithms and Architectures (SPAA'00). ACM, 206--215.

Digital Library

[12]

DeHon, A., Huang, R., and Wawrzynek, J. 2006. Stochastic spatial routing for reconfigurable networks. J. Microprocess. Microsyst. 30, 6, 301--318.

[13]

deLorimier, M. and DeHon, A. 2005. Floating-point sparse matrix-vector multiply for FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays. 75--85.

Digital Library

[14]

deLorimier, M., Kapre, N., Mehta, N., Rizzo, D., Eslick, I., Rubin, R., Uribe, T. E., Knight, Jr., T. F., and DeHon, A. 2006. GraphStep: A system architecture for sparse-graph algorithms. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE, 143--151.

Digital Library

[15]

Fahlman, S. E. 1979. NETL: A System for Representing and Using Real-World Knowledge. MIT Press, Cambridge, MA.

[16]

Habata, S., Yokokawa, M., and Kitawaki, S. 2003. The earth simulator system. NEC Res. & Develop. 44, 1, 21--26.

[17]

Hestenes, M. R. and Stiefel, E. 1952. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Stand. 49, 6, 409--436.

[18]

Hillis, W. D. 1985. The Connection Machine. MIT Press, Cambridge, MA.

Digital Library

[19]

Hillis, W. D. and Steele, G. L. 1986. Data parallel algorithms. Comm. ACM 29, 12, 1170--1183.

Digital Library

[20]

Kahn, G. 1974. The semantics of a simple language for parallel programming. In Proceedings of the IFIP CONGRESS 74. North-Holland Publishing Company, 471--475.

[21]

Kapre, N. and DeHon, A. 2009. Parallelizing sparse matrix solve for SPICE circuit simulation using FPGAs. In Proceedings of the International Conference on Field-Programmable Technology. IEEE, 190--198.

[22]

Kapre, N., Mehta, N., deLorimier, M., Rubin, R., Barnor, H., Wilson, M. J., Wrighton, M., and DeHon, A. 2006. Packet-Switched vs. time-multiplexed FPGA overlay networks. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE, 205--213.

Digital Library

[23]

Karypis, G. and Kumar, V. 1999. A fast and highly quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1.

Digital Library

[24]

Kildall, G. A. 1973. A unified approach to global program optimization. In Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL'73). ACM Press, New York, 194--206.

Digital Library

[25]

Kim, J.-T. and Moldovan, D. I. 1993. Classification and retrieval of knowledge on a parallel marker-passing architecture. IEEE Trans. Knowl. Data Engin. 5, 5, 753--761.

Digital Library

[26]

Koelbel, C. H., Loveman, D. B., Schreiber, R. S., Guy L. Steele, J., and Zosel, M. E. 1994. The High Performance Fortran Handbook. MIT Press, Cambridge, MA.

Digital Library

[27]

Kolmogorov, V. and Zabih, R. 2001. Computing visual correspondence with occlusions using graph cuts. In Proceedings of the IEEE International Conference on Computer Vision. Vol. 2. 508--515.

[28]

Landman, B. S. and Russo, R. L. 1971. On pin versus block relationship for partitions of logic circuits. IEEE Trans. Comput. 20, 1469--1479.

Digital Library

[29]

Lee, E. 2005. UC Berkley ptolemy project. http://www.ptolemy.eecs.berkeley.edu/.

[30]

Lee, E. A. and Messerschmitt, D. G. 1987. Synchronous data flow. Proc. IEEE 75, 9, 1235--1245.

[31]

Leiserson, C., Rose, F., and Saxe, J. 1983. Optimizing synchronous circuitry by retiming. In Proceedings of the 3rd Caltech Conference On VLSI.

[32]

Leiserson, C. E. 1985. Fat-Trees: Universal networks for hardware efficient supercomputing. IEEE Trans. Comput. C-34, 10, 892--901.

Digital Library

[33]

Lieberman, H. 1987. Concurrent Object-Oriented Programming in Act 1. MIT Press, Cambridge, MA.

[34]

Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. 2008. Nvidia tesla: A unified graphics and computing architecture. IEEE Micro 28, 2, 39--55.

Digital Library

[35]

Liu, H. and Singh, P. 2004. Conceptnet -- A practical commonsense reasoning tool-kit. BT Tech. J. 22, 4, 211.

Digital Library

[36]

Logemann, G., Loveland, D., and Davis, M. 1962. A machine program for theorem proving. Comm. ACM 5, 7, 394--397.

Digital Library

[37]

Marques-Silva, J. P. and Sakallah, K. A. 1999. GRASP: A search algorithm for propositional satisfiability. IEEE Trans. Comput. 48, 5, 506--521.

Digital Library

[38]

Microsystems, S. 1995. The java language environment. White paper. http://java.sun.com/docs/white/langenv/.

[39]

Pierce, B. C. 2002. Types and Programming Languages. MIT Press, Cambridge, MA.

Digital Library

[40]

Shah, N., Plishker, W., Ravindran, K., and Keutzer, K. 2004. NP-Click: A productive software development approach for network processors. IEEE Micro 24, 5, 45--54.

Digital Library

[41]

Valiant, L. G. 1990. A bridging model for parallel computation. Comm. ACM 33, 8, 103--111.

Digital Library

[42]

Wrighton, M. and DeHon, A. 2003. Hardware-assisted simulated annealing with application for fast FPGA placement. In Proceedings of the International Symposium on Field-Programmable Gate Arrays. 33--42.

Digital Library

Cited By

Deiana ATran NAgar JBlott MDi Guglielmo GDuarte JHarris PHauck SLiu MNeubauer MNgadiuba JOgrenci-Memik SPierini MAarrestad TBähr SBecker JBerthold ABonventre RMüller Bravo TDiefenthaler MDong ZFritzsche NGholami AGovorkova EGuo DHazelwood KHerwig CKhan BKim SKlijnsma TLiu YLo KNguyen TPezzullo GRasoulinezhad SRivera RScholberg KSelig JSen SStrukov DTang WThais SUnger KVilalta Rvon Krosigk BWang SWarburton T(2022)Applications and Techniques for Fast Machine Learning in ScienceFrontiers in Big Data10.3389/fdata.2022.7874215Online publication date: 12-Apr-2022
https://doi.org/10.3389/fdata.2022.787421
Engelhardt NSo H(2019)GraVF-MACM Transactions on Reconfigurable Technology and Systems10.1145/335759612:4(1-28)Online publication date: 21-Nov-2019
https://dl.acm.org/doi/10.1145/3357596
Chen XBajaj RChen YHe JHe BWong WChen D(2019)On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-Based FPGAs2019 29th International Conference on Field Programmable Logic and Applications (FPL)10.1109/FPL.2019.00020(67-73)Online publication date: Sep-2019
https://doi.org/10.1109/FPL.2019.00020
Show More Cited By

Index Terms

Spatial hardware implementation for sparse graph algorithms in GraphStep

Recommendations

Finding a maximum-weight induced k-partite subgraph of an i-triangulated graph

An i-triangulated graph is a graph in which every odd cycle has two non-crossing chords; i-triangulated graphs form a subfamily of perfect graphs. A slightly more general family of perfect graphs are clique-separable graphs. A graph is clique-separable ...
Algorithms for GA-H reduced graphs

Let GA be a hereditary family of graphs and H a hereditary family of acyclically directed family of graphs. A graph G ( V , E ) is a GA-Hreduced graph if it can be obtained from a graph GA ( V , D ) GA by deleting the edges of an edge subgraph H ( V , E ...
Complexity and Algorithms for Semipaired Domination in Graphs
Combinatorial Algorithms
Abstract
For a graph with no isolated vertices, a set is called a semipaired dominating set of G if (i) D is a dominating set of G, and (ii) D can be partitioned into two element subsets such that the vertices in each two element set are at distance at ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Autonomous and Adaptive Systems

ACM Transactions on Autonomous and Adaptive Systems Volume 6, Issue 3

September 2011

150 pages

ISSN:1556-4665

EISSN:1556-4703

DOI:10.1145/2019583

Issue’s Table of Contents

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 September 2011

Accepted: 01 June 2010

Revised: 01 April 2010

Received: 01 August 2009

Published in TAAS Volume 6, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

15
Total Citations
View Citations
428
Total Downloads

Downloads (Last 12 months)3
Downloads (Last 6 weeks)0

Reflects downloads up to 09 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Deiana ATran NAgar JBlott MDi Guglielmo GDuarte JHarris PHauck SLiu MNeubauer MNgadiuba JOgrenci-Memik SPierini MAarrestad TBähr SBecker JBerthold ABonventre RMüller Bravo TDiefenthaler MDong ZFritzsche NGholami AGovorkova EGuo DHazelwood KHerwig CKhan BKim SKlijnsma TLiu YLo KNguyen TPezzullo GRasoulinezhad SRivera RScholberg KSelig JSen SStrukov DTang WThais SUnger KVilalta Rvon Krosigk BWang SWarburton T(2022)Applications and Techniques for Fast Machine Learning in ScienceFrontiers in Big Data10.3389/fdata.2022.7874215Online publication date: 12-Apr-2022
https://doi.org/10.3389/fdata.2022.787421
Engelhardt NSo H(2019)GraVF-MACM Transactions on Reconfigurable Technology and Systems10.1145/335759612:4(1-28)Online publication date: 21-Nov-2019
https://dl.acm.org/doi/10.1145/3357596
Chen XBajaj RChen YHe JHe BWong WChen D(2019)On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-Based FPGAs2019 29th International Conference on Field Programmable Logic and Applications (FPL)10.1109/FPL.2019.00020(67-73)Online publication date: Sep-2019
https://doi.org/10.1109/FPL.2019.00020
Zhou SKannan RZeng HPrasanna VKaeli DPericàs M(2018)An FPGA framework for edge-centric graph processingProceedings of the 15th ACM International Conference on Computing Frontiers10.1145/3203217.3203233(69-77)Online publication date: 8-May-2018
https://dl.acm.org/doi/10.1145/3203217.3203233
Zhou SPrasanna V(2017)Accelerating Graph Analytics on CPU-FPGA Heterogeneous Platform2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD.2017.25(137-144)Online publication date: Oct-2017
https://doi.org/10.1109/SBAC-PAD.2017.25
Kadric ELakata DDehon A(2016)Impact of Parallelism and Memory Architecture on FPGA Communication EnergyACM Transactions on Reconfigurable Technology and Systems10.1145/28570579:4(1-23)Online publication date: 22-Aug-2016
https://dl.acm.org/doi/10.1145/2857057
Kapre NBayliss S(2016)Survey of domain-specific languages for FPGA computing2016 26th International Conference on Field Programmable Logic and Applications (FPL)10.1109/FPL.2016.7577380(1-12)Online publication date: Aug-2016
https://doi.org/10.1109/FPL.2016.7577380
Zhou SChelmis CPrasanna V(2016)High-Throughput and Energy-Efficient Graph Processing on FPGA2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)10.1109/FCCM.2016.35(103-110)Online publication date: May-2016
https://doi.org/10.1109/FCCM.2016.35
Pacher DBinna RSpecht G(2016)Optimizing large knowledge networks in spatial computersThe Knowledge Engineering Review10.1017/S026988891600018731:4(367-390)Online publication date: 7-Dec-2016
https://doi.org/10.1017/S0269888916000187
Torres-Huitzil C(2016)A Review of Image Interest Point Detectors: From Algorithms to FPGA Hardware ImplementationsImage Feature Detectors and Descriptors10.1007/978-3-319-28854-3_3(47-74)Online publication date: 23-Feb-2016
https://doi.org/10.1007/978-3-319-28854-3_3
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents