MATCHUP: Memory abstractions for heap manipulating programs
Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015
Memory-intensive implementations often require access to an external, off-chip memory, which can substantially slow down an FPGA accelerator due to memory bandwidth limitations. Buffering frequently reused data on chip is a common approach to this problem, but optimizing the cache architecture introduces yet another complex design space. This paper presents a high-level synthesis (HLS) design aid that generates parallel, application-specific multi-scratchpad architectures, including on-chip caches. Our program analysis identifies non-overlapping memory regions, which are backed by private scratchpads, and regions shared between parallel units after parallelization, which are backed by coherent scratchpads and synchronization primitives. It also decides whether the parallelization is legal with respect to data dependencies. The novelty of this work is its focus on programs that use dynamic, pointer-based data structures and dynamic memory allocation, which, while common in software engineering, remain difficult to analyze and are beyond the scope of the overwhelming majority of HLS techniques to date. We demonstrate our technique with three case studies of applications using dynamically allocated data structures, using Xilinx Vivado HLS as an exemplary HLS tool. We show speed-ups of up to 10x after parallelizing the HLS implementations and inserting the application-specific distributed hybrid scratchpad architecture.
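To make the targeted program class concrete, the following minimal C sketch (our illustration, not code from the paper; all names and sizes are hypothetical) shows the kind of heap-manipulating kernel the abstract describes: two independently allocated linked lists form non-overlapping heap regions that the analysis could map onto private scratchpads, while the pointer-chasing traversal is exactly the access pattern that otherwise forces every read to off-chip memory and defeats conventional HLS array partitioning.

#include <stdlib.h>

/* Hypothetical example of a heap-manipulating kernel of the kind
 * MATCHUP targets; standard HLS flows cannot analyze this directly
 * because the data structure is built with dynamic allocation. */
typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Build a list of n nodes; dynamic memory allocation is what
 * conventional HLS techniques struggle to reason about. */
static node_t *build_list(int n) {
    node_t *head = NULL;
    for (int i = 0; i < n; i++) {
        node_t *p = malloc(sizeof *p);
        if (!p) exit(EXIT_FAILURE);
        p->value = i;
        p->next = head;
        head = p;
    }
    return head;
}

/* Pointer-chasing reduction: without on-chip buffering, every node
 * access would be an off-chip memory transaction. */
static int sum_list(const node_t *p) {
    int s = 0;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}

int main(void) {
    /* Two independent lists occupy disjoint heap regions, so a
     * region analysis could assign each traversal its own private
     * scratchpad and run the two traversals in parallel. */
    node_t *a = build_list(1024);
    node_t *b = build_list(1024);
    /* Each list sums 0..1023 = 523776. */
    return (sum_list(a) + sum_list(b) == 2 * 523776) ? 0 : 1;
}

In a flow like the one the abstract outlines, each list traversal would become a parallel unit served from its own on-chip scratchpad; a list shared between units would instead require the coherent scratchpads and synchronization primitives the paper mentions.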