DOI: 10.5555/2691365.2691510

Multithreaded pipeline synthesis for data-parallel kernels

Published: 03 November 2014

Abstract

Pipelining is an important technique in high-level synthesis, which overlaps the execution of successive loop iterations or threads to achieve high throughput for loop/function kernels. Since existing pipelining techniques typically enforce in-order thread execution, a variable-latency operation in one thread would block all subsequent threads, resulting in considerable performance degradation. In this paper, we propose a multithreaded pipelining approach that enables context switching to allow out-of-order thread execution for data-parallel kernels. To ensure that the synthesized pipeline is complexity effective, we further propose efficient scheduling algorithms for minimizing the hardware overhead associated with context management. Experimental results show that our proposed techniques can significantly improve the effective pipeline throughput over conventional approaches while conserving hardware resources.
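
The blocking effect described in the abstract can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions, not the paper's synthesis flow or scheduling algorithm: one thread is issued per cycle, each thread performs a single variable-latency operation, the in-order pipeline holds every later thread behind a slow one, and the out-of-order pipeline is assumed to have unlimited context storage so a slow thread can be parked while later threads proceed. The function names and latency values are hypothetical.

```python
def in_order_cycles(latencies):
    """Cycles to finish all threads when a slow thread blocks its successors.

    Assumption: thread i is issued at cycle i and cannot start its
    variable-latency operation before thread i-1 has finished it.
    """
    done = 0
    for issue_cycle, latency in enumerate(latencies):
        start = max(issue_cycle, done)  # wait for issue slot and predecessor
        done = start + latency
    return done


def out_of_order_cycles(latencies):
    """Cycles to finish all threads when slow threads are switched out.

    Assumption: unlimited context storage and no contention when a
    parked thread resumes, so thread i simply finishes at i + latency.
    """
    return max(
        (issue_cycle + latency for issue_cycle, latency in enumerate(latencies)),
        default=0,
    )


if __name__ == "__main__":
    # Hypothetical workload: the second thread hits a 10-cycle operation.
    latencies = [1, 10, 1, 1]
    print("in-order:", in_order_cycles(latencies))
    print("out-of-order:", out_of_order_cycles(latencies))
```

In this model the two schemes differ only when some thread's latency exceeds one cycle; with uniform single-cycle latencies both take the same number of cycles. A real design would also have to account for finite context storage, which is the hardware overhead the abstract's scheduling algorithms aim to minimize.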



Published In

ICCAD '14: Proceedings of the 2014 IEEE/ACM International Conference on Computer-Aided Design
November 2014
801 pages
ISBN: 9781479962778
General Chair: Yao-Wen Chang

Sponsors

In-Cooperation

  • IEEE SSCS Shanghai Chapter
  • IEEE-EDS: Electronic Devices Society

Publisher

IEEE Press

Qualifiers

  • Research-article

Conference

ICCAD '14
Acceptance Rates

Overall Acceptance Rate 457 of 1,762 submissions, 26%


Cited By

  • (2018) A Scalable Approach to Exact Resource-Constrained Scheduling Based on a Joint SDC and SAT Formulation. Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 137-146. DOI: 10.1145/3174243.3174268. Online publication date: 15-Feb-2018.
  • (2016) Co-designing accelerators and SoC interfaces using gem5-aladdin. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1-12. DOI: 10.5555/3195638.3195697. Online publication date: 15-Oct-2016.
  • (2016) Efficient data supply for hardware accelerators with prefetching and access/execute decoupling. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1-12. DOI: 10.5555/3195638.3195694. Online publication date: 15-Oct-2016.
  • (2015) ElasticFlow. Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 78-85. DOI: 10.5555/2840819.2840831. Online publication date: 2-Nov-2015.
  • (2015) Area-efficient pipelining for FPGA-targeted high-level synthesis. Proceedings of the 52nd Annual Design Automation Conference, pages 1-6. DOI: 10.1145/2744769.2744801. Online publication date: 7-Jun-2015.
  • (2015) Mapping-Aware Constrained Scheduling for LUT-Based FPGAs. Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 190-199. DOI: 10.1145/2684746.2689063. Online publication date: 22-Feb-2015.
