DOI: 10.1145/2593069.2593090
research-article

An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers

Published: 01 June 2014

Abstract

High-level synthesis (HLS) tools have made significant progress in compiling high-level descriptions of computation into highly pipelined register-transfer level (RTL) specifications. Such high-throughput computation creates a correspondingly high demand for data. To prevent data accesses from becoming the bottleneck, on-chip memories are used as data reuse buffers to reduce off-chip accesses. Memory partitioning is also explored to increase memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. Prior work on memory partitioning of data reuse buffers is limited to uniform partitioning. In this paper, we perform an early-stage exploration of non-uniform memory partitioning. We use stencil computation, a popular communication-intensive application domain, as a case study to show the potential benefits of non-uniform memory partitioning. Our novel method always achieves the minimum memory size and the minimum number of memory banks, which no prior work can guarantee. We develop a generalized microarchitecture to decouple stencil accesses from computation, and an automated design flow to integrate our microarchitecture with the HLS-generated computation kernel for a complete accelerator.
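
The idea of a data reuse buffer split into unequal pieces can be illustrated with a small software model. The sketch below is not the authors' microarchitecture, which the abstract does not describe in detail. It assumes a 5-point stencil, a row-major input stream, and a simple sum as the kernel. The names `FIVE_POINT`, `reuse_fifo_sizes`, and `stencil_stream` are hypothetical. The model streams each input element once, keeps only the elements between the earliest and latest stencil access in a sliding window, and reports the gaps between adjacent accesses, which are unequal and so suggest non-uniformly sized buffer segments.

```python
from collections import deque

# 5-point stencil as (row, column) offsets; chosen here only for illustration.
FIVE_POINT = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]


def linear_offsets(offsets, width):
    """Map 2D stencil offsets to sorted row-major linear offsets."""
    return sorted(di * width + dj for di, dj in offsets)


def reuse_fifo_sizes(offsets, width):
    """Gaps between adjacent stencil accesses in stream order.

    Each gap is the depth of one hypothetical buffer segment; the gaps
    differ from each other, hence a non-uniform split of the buffer.
    """
    lin = linear_offsets(offsets, width)
    return [b - a for a, b in zip(lin, lin[1:])]


def stencil_stream(grid, offsets):
    """Sum-stencil over interior points, reading each input exactly once."""
    height, width = len(grid), len(grid[0])
    lin = linear_offsets(offsets, width)
    span = lin[-1] - lin[0] + 1              # total reuse window, in elements
    rmax = max(abs(di) for di, _ in offsets)
    cmax = max(abs(dj) for _, dj in offsets)
    window = deque(maxlen=span)              # models the on-chip reuse buffer
    out = {}
    stream = (v for row in grid for v in row)  # models the off-chip stream
    for idx, value in enumerate(stream):
        window.append(value)
        center = idx - lin[-1]               # point whose last input just arrived
        if center < 0:
            continue
        i, j = divmod(center, width)
        if rmax <= i < height - rmax and cmax <= j < width - cmax:
            base = idx - (len(window) - 1)   # linear index held in window[0]
            out[(i, j)] = sum(window[center + o - base] for o in lin)
    return out
```

For a grid with 5 columns, `reuse_fifo_sizes(FIVE_POINT, 5)` returns `[4, 1, 1, 4]`: four segments of unequal depth whose total is one less than the 11-element window. A uniform partitioning would instead have to give every bank the same depth.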


    Information & Contributors

    Information

    Published In

    DAC '14: Proceedings of the 51st Annual Design Automation Conference
    June 2014
    1249 pages
    ISBN:9781450327305
    DOI:10.1145/2593069
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    DAC '14

    Acceptance Rates

    Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

    Cited By

    • (2024) Optimizing Stencil Computation on Multi-core DSPs. Proceedings of the 53rd International Conference on Parallel Processing, pp. 679-690. DOI: 10.1145/3673038.3673062. Online publication date: 12-Aug-2024
    • (2024) Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAs. ACM Transactions on Reconfigurable Technology and Systems 17:2, pp. 1-33. DOI: 10.1145/3634920. Online publication date: 30-Apr-2024
    • (2023) Efficient Super-Resolution System With Block-Wise Hybridization and Quantized Winograd on FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42:11, pp. 3910-3924. DOI: 10.1109/TCAD.2023.3247621. Online publication date: Nov-2023
    • (2022) Auto-Partitioning Heterogeneous Task-Parallel Programs with StreamBlocks. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 398-411. DOI: 10.1145/3559009.3569659. Online publication date: 8-Oct-2022
    • (2021) Enhancing the Scalability of Multi-FPGA Stencil Computations via Highly Optimized HDL Components. ACM Transactions on Reconfigurable Technology and Systems 14:3, pp. 1-33. DOI: 10.1145/3461478. Online publication date: 12-Aug-2021
    • (2020) Exploiting Computation Reuse for Stencil Accelerators. 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1-6. DOI: 10.1109/DAC18072.2020.9218680. Online publication date: Jul-2020
    • (2020) LLVM-based automation of memory decoupling for OpenCL applications on FPGAs. Microprocessors & Microsystems 72:C. DOI: 10.1016/j.micpro.2019.102909. Online publication date: 1-Feb-2020
    • (2019) DCMI. ACM Transactions on Architecture and Code Optimization 16:4, pp. 1-24. DOI: 10.1145/3352813. Online publication date: 11-Oct-2019
    • (2019) An Efficient Memory Partitioning Approach for Multi-Pattern Data Access via Data Reuse. ACM Transactions on Reconfigurable Technology and Systems 12:1, pp. 1-22. DOI: 10.1145/3301296. Online publication date: 5-Feb-2019
    • (2018) Locality aware memory assignment and tiling. Proceedings of the 55th Annual Design Automation Conference, pp. 1-6. DOI: 10.1145/3195970.3196070. Online publication date: 24-Jun-2018
