1 Introduction
Hardware accelerators have emerged as the method of choice for implementing complex algorithms on energy-constrained devices, as exemplified by the explosion of image processing and machine learning accelerators [5, 12, 19, 39, 42]. Two common hardware targets are FPGAs [7, 23] and ASICs [28]. However, FPGAs have low efficiency because they are too fine-grained, and single-application ASICs cannot adapt to quickly evolving applications. Programmable domain-specific accelerators like those shown in Table 1 are promising, but have historically been challenging compiler targets.
A key compiler challenge is that efficient domain-specific accelerators use a different memory abstraction than CPUs and GPUs. General-purpose hardware architectures, like CPUs, issue load and store instructions to their memory system to pull in the data needed for computation, as shown in Figure 1(a). They implicitly orchestrate data movement [36]: their instructions normally contain the global address of the data, and they rely on hardware-managed caches to interpret the address and wait for the requested data to be fetched from any level of the memory hierarchy. However, by leveraging domain knowledge of the application, accelerators can perfectly prefetch data by using memory units whose software-managed memory controllers are decoupled from the compute units. Accelerators can thus overlap memory loads with computation and avoid the hardware overhead of cache-like pull memory. We refer to such memory units as push memories, since they push data to the computational units instead of waiting for a request for each piece of data.
Unlike general-purpose hardware, which has a centralized memory system, the memory systems on programmable push-memory accelerators are distributed. For instance, Figure 1(b) shows the memory hierarchy of a coarse-grained reconfigurable array (CGRA) with a multi-banked global buffer (L2) and on-chip memory tiles (L1). To fully utilize their resources, push-memory accelerators often have parallel computation units fed by a number of discrete memories, or a computation pipeline with memories and compute units interleaved. This yields a unique programming challenge: The compilation target is not a single piece of code but a set of programs (or configuration bitstreams), one for every memory controller (green ovals in Figure 1(b)) that manages data movement. Not only do the programs specify which data should be accessed (the addresses), but they must also align the timing of read and write events inside the buffers to synchronize the flow of data from buffers, through processing elements for computation, and into other buffers.
Since push memories control both temporary storage and the flow of data, they account for a large fraction of the chip area and power in push-memory accelerators, as shown in Table 1. Unfortunately, unlike the caches used by CPUs, programmable push-memory accelerators do not have a widely adopted hardware implementation for their memory systems. To minimize area and energy, these accelerators typically use their own custom implementation of push memory hardware, optimized for a specific application or class of applications. Thus, push memories require the compiler to optimize for a different memory abstraction for every application.
We address these challenges by creating a new push memory abstraction that we call a unified buffer, so named because it generalizes push memories across application domains (such as image processing and machine learning) and across reconfigurable targets (such as our custom CGRA shown in Figure 9, ready-valid CGRAs, and FPGAs). The unified buffer abstraction allows us to compile a program to a single well-defined intermediate representation (IR), perform application-specific optimization at that level, and then map to different hardware targets. This abstraction is described in more detail in Section 2. It also facilitates efficient push memory implementations that use custom circuits for address generation and control logic, which can be more sophisticated in how they use the SRAM. An example of an optimized implementation, our Physical Unified Buffer (PUB), is described in Section 3. Taken together, the abstraction lets the compiler and hardware generator create more optimized solutions.
Using our unified buffer abstraction, we have developed a compiler backend for a subset of Halide [40] that supports tensor computations, stencils with affine indexing, and lookup tables. The compiler backend, described in Section 4, targets programmable push memories. The compiler design is based on a simple observation: successful compilers refine and lower a program from a high-level description to a low-level description. This observation has a profound implication for compiling to programmable accelerators: If our target hardware contains high-level, optimized push memory primitives, then every stage in the compiler that deals with memories must also represent them at this level or higher. In particular, we propose that the representation of push memories in the compiler must combine storage, address generation, and control logic in a single structure—the unified buffer. Unified buffers serve as the interface inside the compiler between the application and the architecture. They define both the IR used by the compiler during push memory mapping and the logical behavior that hardware architects must implement. By leveraging this IR, we create a robust compiler system that supports CGRAs as well as other hardware targets.
Figure 2 shows the compiler pipeline that takes a Halide program and transforms it into a composition of physical buffers. The example program brightens and blurs an image by multiplying each pixel by 2 and averaging \(2\times 2\) boxes. There are three main steps in the compiler: scheduling, buffer extraction, and buffer mapping. Section 4.1 describes the scheduling step, which lowers Halide programs to low-level Halide IR, following user-defined schedules that determine loop tiling, loop ordering, and where intermediate values are stored [40]. We reinterpret Halide scheduling commands to optimize for a CGRA with programmable push memories. Section 4.2 describes how the buffer extraction step extracts unified buffers from Halide IR. This step uses polyhedral techniques to determine the necessary ports, to summarize the statement instances that use each port and the values they write to or read from the port, and to calculate the cycle times when the instances use each port. Furthermore, we can apply optimizations automatically at this level of abstraction. Finally, the buffer mapping step takes as input an abstract unified buffer specification and derives a correct configuration for the hardware target. We describe mapping to our PUBs and other hardware targets in Section 4.3.
Taken together, we describe the compilation from an application specification to a configuration of custom hardware. Our contributions are as follows:
• a compiler abstraction of push memories, called a unified buffer, that represents data storage, address generation, and control in the same structure;
• a hardware memory primitive, called the PUB, that implements an efficient version of unified buffers for CGRA accelerators;
• a compiler that combines polyhedral analysis with vectorization to map unified buffers into configurations of physical buffers; and
• an evaluation of compiling image processing and machine learning applications to a CGRA using our PUB.
2 The Unified Buffer Abstraction
Unified buffers separate the part of the compiler that analyzes programs to determine and optimize data movement from the part that implements the data movement by configuring physical memories. Therefore, they have two objectives: (1) provide a precise description of the requirements of each push memory at its interface, and (2) maximize opportunities for independent optimization on each side of the interface.
The first objective preserves the functionality of the application, while the second ensures its efficient implementation. Push memories are fundamentally defined by their data streams, so we need a compact representation of these streams. For this representation we use the polyhedral model [3], which provides a compact way to represent schedules and memory access patterns as integer sets and relations. Figure 3 shows the unified buffer that is generated to communicate between the brighten and blur stages of the example in Figure 2. This buffer accepts one pixel per cycle from the brighten compute kernel and delivers a \(2\times 2\) window of pixels per cycle (after a startup delay) to the blur kernel. To accommodate the required bandwidth, this unified buffer has five ports: one input port and four output ports. The data stream through each port is specified with three pieces of information:
• The iteration domain of the operations (statement instances) that use the port. The domain is defined by the bounds of the loops in the loop nest.
• The access map of the operations, which maps each iteration domain point to a value read or written on the port.
• The schedule of the operations in the iteration domain. This schedule specifies the number of unstalled cycles between reset and each operation.
The iteration domain integer set and the access map and schedule relations are implemented using the polyhedral analysis tool ISL [47]. For the input port in our example, the iteration domain is the integer set
\[
\{\,(x, y) : 0 \le x < 64,\ 0 \le y < 64\,\}.
\]
Since the brighten operation, which is the only user of the input port, is surrounded by a two-dimensional loop, the iteration domain has two index variables: an outermost index variable \(y\) and an innermost index variable \(x\).
The unified buffer does not merely specify which operations use a port. To synthesize address generation code and optimize memory sizes, it must also specify which memory locations those operations access. To specify the memory locations, each port has an access map. For example, the second output port of the brighten buffer has the access map \((x, y) \rightarrow \text{brighten}(x+1, y)\), which means the accessed value is to the right of each point in the operation's iteration space. The other output ports have different maps, thus collectively fetching the \(2\times 2\) stencil required by the blur kernel.
The polyhedral schedules used by the unified buffer map loop nests to cycle times in a hardware design. This contrasts with conventional polyhedral schedules, such as those produced by Feautrier's algorithm [11] or PLUTO [3], which map elements of the iteration domain to multidimensional timestamps. While those algorithms essentially map loop nests to loop nests, our schedules map loop nests to the number of unstalled cycles after reset at which each operation begins. To accommodate pipelined hardware designs, our schedules may map several operations to the same timestamp.
The schedule is used to calculate when reads and writes occur. In our example, the schedule for the input port is the integer function \((x, y) \rightarrow 64y + x\). It specifies that the first write to the brighten buffer input port, at coordinate \((0, 0)\), happens \(64 \cdot 0 + 0 = 0\) cycles after execution begins and that the second brighten operation, at coordinate \((1, 0)\), happens after \(64 \cdot 0 + 1 = 1\) cycle. Furthermore, the output ports emit their first value after \(65 + 64 \cdot 0 + 0 = 65\) cycles, which is how long the buffer must delay the first value to generate correctly aligned output. The internal distances are the number of cycles from when a value arrives at an input port to when it leaves an output port.
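To make the arithmetic concrete, the following stand-alone C++ sketch (illustrative, not the compiler's actual code) evaluates the input-port schedule \((x, y) \rightarrow 64y + x\) and the 65-cycle output delay quoted above:

```cpp
// Minimal sketch of the brighten buffer's port schedules; the 64-pixel
// row width and 65-cycle internal delay come from the example in the text.
#include <cstdio>

// Input-port schedule: unstalled cycle when brighten writes pixel (x, y).
int write_cycle(int x, int y) { return 64 * y + x; }

// First output port: each value is delayed 65 unstalled cycles so the
// 2x2 window arrives correctly aligned at the blur kernel.
int read_cycle(int x, int y) { return 65 + 64 * y + x; }

int main() {
    std::printf("write (0,0) at cycle %d\n", write_cycle(0, 0)); // 0
    std::printf("write (1,0) at cycle %d\n", write_cycle(1, 0)); // 1
    std::printf("read  (0,0) at cycle %d\n", read_cycle(0, 0));  // 65
}
```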
The schedules count unstalled cycles instead of clock cycles to accommodate variable-latency operations. Since our applications are statically analyzable, our schedules guarantee that data dependencies are not violated, assuming the input data are valid and the output data can be stored. However, all hardware accelerators have to deal with variable-latency operations like main memory accesses. We accommodate variable latency by counting how many cycles the buffer has not been stalled by a dynamic operation. A stall occurs when any buffer input is invalid during a write (e.g., a DRAM read is late) or when any buffer cannot be read because its destination is not ready. During a stall, the cycle count is frozen for all unified buffers in the application, keeping the data waves aligned. Once the stall condition resolves, all the cycle counters resume.
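The gating can be sketched in a few lines of C++; the structure below is illustrative rather than the hardware's actual implementation:

```cpp
// A gated (unstalled) cycle counter: the count shared by all unified
// buffers freezes while any buffer stalls, keeping data waves aligned.
struct GatedCounter {
    unsigned unstalled_cycles = 0;
    // Called once per clock; `stall` is true if any buffer input is
    // invalid during a write or any buffer output destination is not ready.
    void tick(bool stall) {
        if (!stall) ++unstalled_cycles;  // schedules count only these cycles
    }
};
```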
In our target CGRA, the interface between the accelerator and the host memory system uses latency-insensitive channels, while the memory inside the CGRA uses (gated) cycle counters. Our Halide program tiles the inputs to create execution blocks that our compiler statically schedules. All statements within a tiled execution block are assigned a timestamp consistent with the global cycle-accurate schedule. This context information ensures the scheduler adds enough buffering to allow internal compute kernel nodes to read data from multiple predecessors simultaneously, even if they are produced at different times. More details about this scheduler are given in Section 4.2. Between tiled execution blocks, we use double-buffered ready-valid channels. Thus, we only need to stall if the next tile has not been prefetched from DRAM into the accelerator, or if the previous tile's output has not been stored to DRAM before the current tile stops execution. For a CGRA implementation using ready-valid channels with buffet-style [36] memory blocks that contain dependency-checking capabilities, our compiler outputs address patterns and drops the cycle-accurate information from its schedule, letting the hardware handle execution timing and potential port conflicts.
The unified buffer interface describes the observed behavior of the memory at its interfaces, in terms of the operations in the original program. The unified buffer does not specify the internal implementation of its behavior and can be used to map to different hardware backends. Only externally visible scheduling and binding decisions are expressed. Crucially, unified buffers omit the physical capacity of the memory and the physical mapping of data to locations in memory. Thus, it is a precise specification in terms of familiar data structures for a compiler—the sets and relations of the polyhedral model—and leaves the architects considerable room to optimize the design. Next, we describe how we implemented this interface to design a high-performance, programmable push memory for image processing and machine learning applications.
4 Compiler Design
Users of our system specify their applications in Halide, a high-level domain-specific language (DSL). Halide separates the application algorithm from its schedule to isolate computation from optimizations in execution [40]. The algorithm specifies the computation of an output, while the schedule specifies the order in which the computation should be performed. Our compiler divides the problem of compiling Halide buffers to push-buffer implementations into three steps (shown in Figure 2):
(1) Scheduling leverages the Halide scheduling system, whose scheduling language controls loop transformations and which we extend with accelerator commands. We support a subset of the Halide input language that includes stencils, tensor computations, and lookup tables.
(2) Buffer extraction uses polyhedral scheduling techniques to turn the multidimensional iteration spaces of Halide loops into one-dimensional cycle times at every buffer port, thus yielding pipeline parallelism. The same step then extracts the full specification of each buffer port in the unified buffer abstraction, as shown in Figure 3.
(3) Buffer mapping maps the abstract unified buffers to physical buffers built from low-level hardware primitives, based on the chosen compiler target.
We chose to keep the Halide scheduling language for tiling instead of placing tiling in the second step (as the PLUTO scheduling algorithm does [3]). The reason is that a high-quality, general-purpose tiling algorithm for all dense linear algebra applications has not yet been found. As a result, we believe tiling is best left either to performance experts through a scheduling language or to domain-specific search procedures such as Reference [51]. Thus, we limit our use of polyhedral techniques to memory analysis and semantics-preserving loop fusion.
4.1 Scheduling
We extended the Halide scheduling language, which lets users define loop tiling but has no notion of push memories, with commands to designate the portion of an application that should be placed on the accelerator. Figure 2 shows an example. The placement is done by defining the accelerator output with hw_accelerate and each of the accelerator inputs with stream_to_accelerator. After loop tiling, the user can designate the buffers that should become push memories, as opposed to being fused with adjacent kernels, by using store_at and compute_at, along the lines of Reference [39]. Users can create a memory hierarchy by using the Halide scheduling directive in on buffers. Finally, the user can use unroll to designate loops that should be parallelized spatially rather than executed iteratively. After these scheduling directives, all subsequent optimizations and mapping are performed automatically, without user input.
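The sketch below shows what such a schedule might look like in Halide's C++ frontend for the brighten-and-blur example of Figure 2. tile, store_at, compute_at, and unroll are standard Halide scheduling calls; hw_accelerate and stream_to_accelerator are the authors' extensions, so their exact signatures here are assumptions:

```cpp
#include "Halide.h"
using namespace Halide;

void build_pipeline() {
    ImageParam input(UInt(16), 2);
    Func brighten("brighten"), blur("blur");
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    // Algorithm: brighten each pixel by 2, then average 2x2 boxes.
    brighten(x, y) = 2 * input(x, y);
    blur(x, y) = (brighten(x, y) + brighten(x + 1, y) +
                  brighten(x, y + 1) + brighten(x + 1, y + 1)) / 4;

    // Schedule: tile, then place the tiled region on the accelerator.
    blur.tile(x, y, xo, yo, xi, yi, 64, 64)
        .hw_accelerate(xi, xo);           // accelerator output (extension)
    input.stream_to_accelerator();        // accelerator input (extension)

    brighten.store_at(blur, xo)           // make brighten a push memory
            .compute_at(blur, xo);        //   rather than fusing it

    blur.unroll(xi, 2);                   // spatial, not iterative, parallelism
}
```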
Limitations to Addressing. The Halide frontend is used to specify our applications, but our system does not handle some parts of the Halide language. Our compiler represents memory operations with three parts: memory read address, memory write address, and memory data. By analyzing these parts of a memory statement, we can classify which memory statements we can optimize, which statements are supported by our PUB hardware implementation, and which statements are not supported. Halide and our abstraction can express all of these statements, but our backend compiler optimizations and hardware implementation handle a subset. Modifications to the mapper or hardware could enable and further optimize more of these use cases. Table 2 shows examples of supported and unsupported statements.
Our address controllers for the PUB handle only affine expressions of index variables. During memory categorization, some memories are identified as unsupported because their indexing is data dependent or non-affine. Our compiler identifies these unsupported cases and throws an error stating that no mapping support exists for these memory operations. The fourth use case in Table 2 shows an unsupported example in which the read address is non-affine. Histogramming, shown in the last row, has data-dependent write addresses and is likewise unsupported. Data-dependent addressing introduces read-after-write dependencies that are not statically known, which forces the compiler to be conservative during scheduling. Non-affine expressions require the hardware to increment the address by a non-constant stride. Neither case is currently supported by our compiler and hardware.
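The schematic C++ loops below (names and bounds are illustrative; the real inputs are Halide statements) contrast the three addressing categories:

```cpp
// Addressing categories seen during memory classification.
void addressing_examples(int out[64], const int in[4096], int hist[256]) {
    // Affine indexing (supported): the address is a linear function of x.
    for (int x = 0; x < 64; ++x) out[x] = in[2 * x + 1];

    // Non-affine read address (unsupported): the stride varies with x.
    for (int x = 0; x < 64; ++x) out[x] = in[x * x];

    // Data-dependent write address (unsupported): histogramming.
    for (int x = 0; x < 64; ++x) hist[in[x] & 255] += 1;
}
```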
A special, supported case of data-dependent addressing is a memory that holds constant, precomputed values. These include constant arrays and lookup tables, as shown in the second and third use cases of Table 2. The stencil taps of a Gaussian filter are known statically at compile time, so we can preload them into registers. The other variant, where the read address is data dependent, is a lookup table. Our compiler analyzes the memory indices and memory values and identifies these cases. Lookup tables are precomputed and placed into memories functioning as basic SRAMs, with the read address driven by the data-dependent calculation.
Multiple Updates. One feature of Halide is multiple update statements to a single memory. A memory can first have values stored and then have particular addresses modified multiple times. For example, a memory could be initialized to a constant, \(mem(x,y)=10\); then one column could be updated, \(mem(x,1) \mathrel {+}=in(x)\); followed by another update, \(mem(2,y) \mathrel {+}=3\). We choose to support only a single update statement for each memory in hardware so that basic accumulation is possible. Further updates are translated into their own memories, effectively converting the computation into static single-assignment form.
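Rendered as schematic C++ loops (names and bounds are illustrative), the example and its translation look as follows:

```cpp
// The first update stays with its memory; the further update is split
// into a new memory, giving a static single-assignment form.
void multiple_updates(const int in[64], int mem1[64][64]) {
    int mem0[64][64];
    // Pure definition plus the single supported update (accumulation).
    for (int y = 0; y < 64; ++y)
        for (int x = 0; x < 64; ++x) mem0[x][y] = 10;
    for (int x = 0; x < 64; ++x) mem0[x][1] += in[x];

    // The further update is translated into its own memory.
    for (int y = 0; y < 64; ++y)
        for (int x = 0; x < 64; ++x) mem1[x][y] = mem0[x][y];
    for (int y = 0; y < 64; ++y) mem1[2][y] += 3;
}
```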
One optimization for multiple updates is fusing unrolled reductions into a single statement. Our naive policy of using a memory for every update would otherwise create a memory for each update stage; a \(3\times 3\) convolution, for example, would have nine memory temporaries. Instead, we optimize such a series of reduction statements, such as a series of adds, into a single statement. This minimizes the number of memories by combining all of the compute kernels.
After these compiler passes and checks, the Halide compiler separates the Halide IR used for memories from the IR used for computation. The compute kernels are represented as graphs of operators and are used during the finishing steps. Meanwhile, the IR for memory operations is used for buffer extraction.
4.2 Buffer Extraction
The buffer extraction step analyzes the Halide IR to turn both loops and arrays into push memories expressed using the unified buffer abstraction. That is, Halide programs describe computation as operations on arrays over iteration domains defined by index variables. To compile the arrays to optimized push memories, buffer extraction analyzes array reads and writes to trace movement of values through memories. It then uses this information to distribute the control flow across the address generators in the push memories themselves.
Each static read from or write to a Halide buffer is given a unique port on the corresponding unified buffer. For each port, buffer extraction then computes an iteration domain, an access map, and a schedule. The iteration domain is the Cartesian product of the bounds of the loops surrounding the buffer access in the Halide IR and the access map is the buffer access expression. The main work of unified buffer extraction, however, is computing the cycle-accurate schedule that maps operations in the Halide program to the cycle times when they will happen in hardware.
Our cycle time scheduler exploits pipeline parallelism in two broad classes of workloads: stencil pipelines from image processing and coarse-grained pipelines from deep neural networks (DNNs). Classical image processing applications, such as Harris corner detection, consist of many stencil operations that each produce pixels from small windows of input pixels. No single stage dominates the total compute cost and every pixel in a given stage depends on a small number of pixels in prior stages, making it cheap to create dedicated hardware for executing each pair of producer and consumer stages in parallel.
In DNNs, however, a single stage containing a large compute unit, such as a systolic array, dominates the compute cost of the application. Furthermore, values produced by that stage depend on a large number of values from prior stages, making it expensive to parallelize across stages. We create pipeline parallelism for stencils and DNNs using line buffers and double buffers, respectively.
Buffer extraction detects these two pipeline types and handles them separately. It selects the scheduling policy with a simple rule: If the most compute-intensive stage in the pipeline can achieve full temporal utilization after loop fusion, then it uses a scheduling strategy tailored to stencil pipelines, which produces schedules that can be implemented efficiently using line buffers. Otherwise, if no stage can fully occupy the computation hardware after loop fusion, then it uses an algorithm tailored to DNN-style pipelines, which uses coarse-grained pipeline parallelism and double buffering to maximize utilization of the most expensive compute unit, as shown in Figure 7. Both policies use the polyhedral analysis tool ISL [47] to compute data dependencies between operations and to solve the optimization problems used in formulating the schedule.
Stencil Pipeline. If the pipeline is classified as a stencil pipeline, then we apply the scheduling algorithm described by Huff et al. [15]. This algorithm produces a cycle-accurate schedule in two stages. First, it fuses all loop nests in the application into a single perfect loop nest. Then it computes a cycle-accurate schedule for the fused, perfect loop nest at an initiation interval of one (II = 1). The fusion is done incrementally, from the outermost loop level to the innermost. The fusion procedure uses a synchronous-dataflow-style (SDF-style) constraint problem to set the relative rate and delay of each operation in a way that makes dependence distances as small and uniform as possible. Specifically, a global schedule is constructed under the constraint that a statement must start after the latest end time among all of its producer statements; the end time of a statement is its start time plus the latencies of its memory loads, its computation, and its memory store:
\[
\text{start}(s) \;\ge\; \max_{p \,\in\, \text{producers}(s)} \text{end}(p), \qquad
\text{end}(s) \;=\; \text{start}(s) + L_{\text{load}}(s) + L_{\text{compute}}(s) + L_{\text{store}}(s).
\]
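These constraints amount to a longest-path computation over producer edges. A minimal C++ sketch, assuming statements are visited in topological (producer-before-consumer) order:

```cpp
#include <algorithm>
#include <vector>

struct Stmt {
    std::vector<int> producers;            // indices of producer statements
    int load_lat, compute_lat, store_lat;  // latencies used by the end time
    int start = 0, end = 0;
};

// Each statement starts no earlier than the latest end time among its
// producers; its end time adds load, compute, and store latencies.
void schedule(std::vector<Stmt>& stmts) {
    for (auto& s : stmts) {
        for (int p : s.producers)
            s.start = std::max(s.start, stmts[p].end);
        s.end = s.start + s.load_lat + s.compute_lat + s.store_lat;
    }
}
```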
Once fusion is finished, we compute a cycle-accurate schedule for the loop nest using a standard high-level synthesis (HLS) loop scheduler [56], which sets the II of every outer loop and the delay of every operation in the innermost loop. Once the schedule is set, each unified buffer's behavior is defined at its ports, and its capacity is sized so that each pixel can be stored until it is needed.
Coarse-grained Pipeline. The DNN buffer extraction creates a schedule for a double-buffered pipeline. This pipeline is coarse grained: Operations on one tile of an image proceed sequentially but are overlapped with operations on the next tile. For example, while a computation is being performed on a tile that has already been loaded onto the accelerator, the next tile is being loaded. Buffer extraction first identifies the overlapping tile loops that form the coarse-grained pipeline. It walks from the root of the program inward and collects loop nests up to and including the innermost loop whose body is not a single perfect loop. These perfectly nested loops form the outer coarse-grained pipeline. We refer to the operations inside the pipeline as pipeline stages, but these stages are themselves typically extracted from loop nests. For instance, in the coarse-grained pipeline pseudocode shown in Figure 7, the outer coarse-grained loop (the for loop on line 1) is pipelined, and it contains three pipeline stages.
With the coarse-grained pipeline loops selected, the buffer extraction creates a cycle-accurate schedule for each pipeline stage using the same scheduler as for the stencil pipeline. It then creates the coarse-grained pipeline by laying out each pipeline stage sequentially and setting the initiation intervals (IIs) of the coarse-grained pipeline loops to the sequential execution latency. We refer to this as sequential scheduling.
Next, the compiler applies double buffering to achieve better throughput, which reduces the IIs of the coarse-grained pipeline loops to the latency of the most compute-intensive stage. For example, the latencies of the operations in the DNN pipeline in Figure 7 are 2, 4, and 2 cycles, so the schedule has a coarse-grained pipeline with \(\text{II}=4\). A standard HLS loop scheduler uses for loops as the boundaries of pipelining. Nested and imperfect loops at the boundary result in redundant pipeline flush stages at the end of each loop [39, 52]. The same reasoning applies to our coarse-grained pipeline. To reduce this extra latency in DNN pipelines, buffer extraction applies loop perfection and loop flattening. Loop perfection pushes all coarse pipeline stages under one for loop with if guards, and loop flattening merges all loops above the coarse-grained pipeline into one single, merged loop.
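The schematic C++ below (stage bodies and trip counts are illustrative) shows a coarse-grained pipeline nest before and after loop perfection:

```cpp
void load(int, int) {}     // stage stubs, illustrative
void compute(int, int) {}
void store(int, int) {}
const int T = 16;          // number of tiles, illustrative

// Before: an imperfect nest; each stage is its own inner loop, so an
// HLS-style pipeliner inserts a flush at the end of each one.
void pipeline_imperfect() {
    for (int t = 0; t < T; ++t) {
        for (int i = 0; i < 2; ++i) load(t, i);
        for (int i = 0; i < 4; ++i) compute(t, i);
        for (int i = 0; i < 2; ++i) store(t, i);
    }
}

// After loop perfection: all stages sit under one loop with if guards,
// giving the pipeliner a single perfect loop to schedule.
void pipeline_perfected() {
    for (int t = 0; t < T; ++t)
        for (int i = 0; i < 8; ++i) {      // 2 + 4 + 2 iterations
            if (i < 2)      load(t, i);
            else if (i < 6) compute(t, i - 2);
            else            store(t, i - 6);
        }
}
```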
4.3 Physical Buffer Mapping
With scheduling finished, all operations have been assigned to unstalled clock cycles (one-dimensional affine schedules), and the bandwidth of each memory is known. The next task of the compiler is to map the abstract unified buffers to implementations built out of the available physical primitives. This mapping produces the configuration bits for each physical buffer used in the design. In principle, the unified buffers can be mapped directly to physical buffers on the target accelerator. In practice, however, this is rarely possible for the following reasons:
• Limited buffer bandwidth. The physical buffers on the accelerator may not have sufficient bandwidth. For example, our PUB has only a single four-word-wide SRAM in each physical buffer, meaning that each buffer can support at most four memory operations per cycle. However, the unified buffer in our brighten example performs five memory operations per cycle, and many image processing patterns need even more bandwidth.
• Wide fetch width. The accesses in the Halide program may have a bitwidth that is narrower than the bitwidth of the underlying SRAMs. For example, accesses to the four-word-wide SRAM in the physical buffers we built are done in vectors of four 16-bit integers, with four-word data vectors buffered in the aggregator and the transpose buffer between writes and reads to the SRAM.
• Limited buffer capacity. The cycle-accurate scheduler reduces storage requirements by bringing the consumer closer to the producer, but unified buffers may still need more space than any single physical buffer provides.
Shift Register Optimization and Banking. To address the need for high bandwidth, each unified buffer must be broken into multiple smaller unified buffers to increase the number of memory ports. Our compiler uses two strategies for servicing high-bandwidth accesses: shift register optimization and banking. Shift register optimization is possible whenever the dependence distance (delay) between one port (the source) and another (the destination) is constant and the set of values that appear on the source is a superset of the values that appear on the destination. Our compiler performs an exhaustive shift register analysis that finds all opportunities to convert memory output ports into ports driven by shift registers fed from fewer memory ports. For instance, according to Figure 3, the buffer feeding the \(2\times 2\) blur kernel has four output ports, whose dependence distances to the input port are 0, 1, 64, and 65, respectively. As shown in Figure 8(a), the three nonzero delays can be implemented with two shift registers and a physical buffer that delays by 64 cycles.
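A behavioral C++ sketch of this decomposition (not the PUB's actual circuitry) models each delay as a FIFO of the corresponding depth:

```cpp
#include <cstdint>
#include <deque>

// A delay line: accepts one value per unstalled cycle and returns the
// value pushed `depth` cycles earlier.
struct DelayLine {
    size_t depth;
    std::deque<uint16_t> fifo;
    explicit DelayLine(size_t d) : depth(d), fifo(d, 0) {}
    uint16_t shift(uint16_t in) {
        fifo.push_back(in);
        uint16_t out = fifo.front();
        fifo.pop_front();
        return out;
    }
};

// Per Figure 8(a): delays {0, 1, 64, 65} served by one 64-cycle physical
// buffer plus two 1-deep shift registers.
struct BrightenBuffer {
    DelayLine d64{64}, r1{1}, r65{1};
    void cycle(uint16_t in, uint16_t out[4]) {
        out[0] = in;                 // delay 0: straight through
        out[1] = r1.shift(in);       // delay 1: shift register
        out[2] = d64.shift(in);      // delay 64: physical buffer
        out[3] = r65.shift(out[2]);  // delay 65: shift register off delay 64
    }
};
```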
After shift register optimization, the remaining ports must be serviced from banks of memory with address generators (Figure 8(b)). Our compiler uses a version of an optimal banking algorithm for stencil computations [9]. It first groups the ports with overlapping schedules, meaning they access the buffer at the same time. If they access different parts of the buffer, then our compiler banks the memory, meaning it splits the memory into sub-memories dedicated to each port. This increases bandwidth over having the ports take turns reading. The compiler performs the memory splitting by comparing the access maps of the ports to determine the block of the buffer that each needs. These blocks then become the banks. If the compiler cannot find a partition, then it falls back to exhaustive banking, creating a bank between each pair of input and output ports that access the same data.
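As an illustration only, here is a much-simplified C++ sketch of the exhaustive-banking fallback, with polyhedral access maps reduced to plain address sets:

```cpp
#include <set>
#include <vector>

using AccessSet = std::set<int>;  // addresses a port touches (simplified)
struct Bank { int in_port, out_port; };

// Create a bank for each (input, output) port pair that shares data.
std::vector<Bank> exhaustive_banking(const std::vector<AccessSet>& ins,
                                     const std::vector<AccessSet>& outs) {
    std::vector<Bank> banks;
    for (size_t i = 0; i < ins.size(); ++i)
        for (size_t o = 0; o < outs.size(); ++o)
            for (int addr : outs[o])
                if (ins[i].count(addr)) {         // ports overlap on addr
                    banks.push_back({(int)i, (int)o});
                    break;
                }
    return banks;
}
```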
Vectorization. To make efficient use of physical buffers with wide-fetch SRAMs, the access patterns of the buffers must be broken into sub-sequences with the same length as the SRAM fetch width, as shown in Figure 8(c). At each input port of the buffer, this sub-sequence is assembled serially by the aggregator (AGG). Once the aggregator is full, the sub-sequence is written to the SRAM. When the transpose buffer (TB) at an output port is empty, it receives a new sub-sequence from the SRAM, which it then sends out serially on the output port.
We can think of the introduction of the AGG, SRAM, and TB components as strip-mining the innermost loops of the original program and adding wide fetch-width loads and stores. The compiler generates the access maps and schedules at the SRAM ports and records them in the abstract unified buffer. It also adjusts the schedules of aggregator-SRAM and SRAM-transpose buffer transactions to minimize the storage requirement in the AGG and TB while respecting data dependencies and hardware resource limitations.
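A behavioral C++ sketch of the AGG and TB around a four-word-wide SRAM (names and structure are illustrative):

```cpp
#include <array>
#include <cstdint>

using Word = uint16_t;
using Vec4 = std::array<Word, 4>;  // one wide SRAM word: four 16-bit values

// Aggregator: packs four serial words into one wide SRAM write.
struct Aggregator {
    Vec4 buf{};
    int fill = 0;
    // Returns true when a full vector is ready; `out` then holds the
    // four-word sub-sequence to write to the SRAM.
    bool push(Word w, Vec4& out) {
        buf[fill++] = w;
        if (fill < 4) return false;
        out = buf;
        fill = 0;
        return true;
    }
};

// Transpose buffer: unpacks one wide SRAM read into serial outputs.
struct TransposeBuffer {
    Vec4 buf{};
    int next = 4;                       // starts empty
    bool empty() const { return next == 4; }
    void refill(const Vec4& v) { buf = v; next = 0; }
    Word pop() { return buf[next++]; }  // one word per cycle on the port
};
```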
As mentioned in Section 3.2, when the utilization of fetched words is too low, the buffer is scheduled so that its overall rate is reduced. The transpose buffer keeps local storage that helps during transposes, but it cannot compensate for applications that have no locality.
Address Linearization. The access pattern in the unified buffer abstraction supports an arbitrary number of data dimensions, but a physical buffer requires the N-dimensional addresses to be converted to a single dimension. Hence, an inner product is applied between each N-dimensional address \(\vec{a}\) and an offset vector \(\vec{o}\) that encodes the memory layout: \(\text{MEM}[a_0, a_1, \ldots, a_{N-1}] \rightarrow \text{MEM}[\sum_{i} a_{i} \cdot o_{i}]\).
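A one-function C++ sketch of this linearization; the row-major offset vector of width 64 in the usage comment is an assumed example layout:

```cpp
#include <cstddef>

// MEM[a_0, ..., a_{N-1}] -> MEM[sum_i a_i * o_i].
size_t linearize(const int a[], const int o[], int n) {
    size_t addr = 0;
    for (int i = 0; i < n; ++i) addr += (size_t)a[i] * (size_t)o[i];
    return addr;
}
// Example: a = {3, 2}, o = {1, 64} (row-major, width 64) gives 3 + 128 = 131.
```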
Chaining. To map unified buffers with higher capacity than any one physical buffer, buffer mapping chains several buffers into a single logical buffer (Figure 8(d)). Each memory tile on the CGRA is assigned a unique tile ID. Our compiler statically analyzes the access map and the schedule of the unified buffer and partitions the access map into pieces implemented by multiple chained physical buffers. The following equations transform a logical address \(a\) in the access map into a tile ID and a physical address within that memory tile, using the capacity \(C\) of the memory tile:
\[
\text{tile\_id}(a) = \left\lfloor a / C \right\rfloor, \qquad \text{addr}(a) = a \bmod C.
\]
Memory Hierarchy. Constructing a memory hierarchy involves copying from a unified buffer with a large capacity to a smaller unified buffer. On our CGRA, we refer to the large, outer memory as the global buffer, while the smaller ones are called memory tiles. The global buffer uses ready-valid signaling to connect to the processor's memory system and stalls the execution engine if a block of data has not been loaded by the time it needs to be pushed to the memory tiles. Our unified buffer abstraction is agnostic to the memory hierarchy level until the mapping stage. We thus carry hierarchy information downstream until hardware mapping, where we generate the correct configuration for each physical memory primitive.
Finishing Steps. After generating the configuration information for all the physical buffers, we map the compute kernels produced by the Halide frontend to processing elements (PEs) on the CGRA. We then place and route (PnR) this mapped graph of PEs and physical buffers on the CGRA, following the standard multi-stage approach of global PnR followed by detailed PnR, to obtain the final configuration bitstream.
5 Evaluation Methodology
To evaluate our compiler, we use it to compile the applications listed in Table 3 to CGRAs and compare the resulting performance to a Zynq UltraScale+ 7EV FPGA. The applications span stencil operations in image processing and tensor operations in deep neural networks, as found in previous Halide scheduling papers [1, 32]. To generate an FPGA bitstream, our compiler transforms the buffers in each Halide application into unified buffers and applies all optimizations. We build upon the work of Huff et al. [15] to generate synthesizable C code that we feed into Vitis HLS. Our system generates the same synthesizable C as Huff et al., which compares favorably to other competitive DSL-to-FPGA systems [15]. The HLS output is fed into Xilinx's Vivado system, which synthesizes, places, and routes the resulting design at 200 MHz. We use Vivado [50] to report resource consumption, energy use, and performance.
Our CGRA, shown in Figure 9, resembles an island-style FPGA, with LUTs replaced by PE tiles containing 16-bit integer/floating-point ALUs, and BRAMs replaced by memory (MEM) tiles with different unified buffer implementations, including our optimized PUBs. The CGRA is embedded in a full system-on-chip. It directly connects to a large multi-banked, double-buffered memory called the global buffer. The global buffer has 16 banks; each bank is 256 kB and connects to a different section of the top edge of the CGRA. The data tiles required by the CGRA are first brought into the global buffer and then streamed into the CGRA. This allows computation on the current tile in the CGRA to be overlapped with the movement of the next tile into the global buffer. The global buffer provides deterministic access latency to the CGRA and hides the non-deterministic latency of main memory. When targeting the CGRA, our compiler outputs a logical description of the design that is fed into custom mapping, placement, and routing tools designed for this CGRA. To generate power and area numbers, we created a complete Verilog design of the CGRA and used Cadence Genus and Innovus to synthesize, place, and route the MEMs and PEs of the CGRA in a 16-nm technology at 900 MHz. Power numbers are extracted from gate-level simulations.
7 Related Work
Creating tools for application-tailored accelerators is an active area of research and our system builds on these ideas.
Compiler Frameworks. Neither conventional software compilers nor existing hardware compilers are well suited to target push memories. HLS tools such as Vivado [49], LegUp [4], Catapult [27], and others [17, 24] are designed to solve scheduling and resource binding problems at a finer granularity than those seen when compiling to push memories. Their strategy works well when targeting FPGAs or ASIC technology libraries, because the architectural primitives (such as registers and LUTs) are more fine-grained than the compiler IR instructions. It does not work when compiling to programmable push-memory accelerators, where the architectural primitives are much more coarse-grained than a typical RISC instruction. Academic languages such as Spatial [20], HeteroCL [22], and Exo [16] provide a more abstract programming model for accelerators, but the user must define the memory micro-architecture. Although HeteroCL uses a unified DSL frontend to describe its memory optimizations and spatial architecture, its backend implementation still depends on separate frameworks. Exo gives the user more control to extract more performance, but at the cost of increased complexity: the user must define memory management and accelerator functionality.
While HLS tools can translate the Halide IR directly to hardware, they do not support the memory optimizations we describe. Modern HLS tools such as Vivado HLS or Catapult HLS are well suited to arithmetic mapping and to exploiting pipeline parallelism within the bodies of individual loops [25]. However, they perform limited memory [39] and cross-loop [57] optimizations. As a result, they are not good at exploiting pipeline parallelism across different loop nests in a computation and require a great deal of manual effort from users to create high-quality code for deep pipelines [22].
Push Memory Abstractions. Our unified buffer borrows from buffets [36], a buffer implementation idiom with explicit decoupled data orchestration (EDDO) that can be reused across multiple domains. Buffets are a hardware primitive, not a compiler abstraction, while our unified buffer is both. We improve productivity by using compiler techniques to extract, optimize, and map the buffers from high-level application code.
Recently, a new abstraction called hierarchical space-time (HST) [35] was proposed to capture the memory behavior of EDDO architectures, and an automated analysis and code generation framework called PolyEDDO is in progress. Unlike our unified buffer, which stays agnostic to the backend hardware, HST targets EDDO architectures that rely on the buffet controller's synchronization. It is not trivial for PolyEDDO to target other memory implementations, such as a scratchpad. Moreover, its frontend supports only perfect loop nests, while ours supports coarse-grained pipelines.
Some CGRA designs have proposed using PEs to calculate memory addresses [10, 26, 45]. However, most systems, like Plasticine [38], create dedicated addressing units associated with their push memories. Spatial [20] provides a high-level programming language for Plasticine but requires users to explicitly orchestrate data movement between different memories. SARA [55] improves upon the Plasticine compiler by scaling applications to utilize all hardware resources but leaves inter-loop optimizations to the user. Nowatzki et al. [34] propose a low-level programming model for stream dataflow accelerators. Since their memory architecture is a global scratchpad, their memory ISA includes dynamic scheduling that may not be suitable for accelerators with distributed push memories.
Domain-Specific Accelerator Generators. Other work seeks to automate FPGA and ASIC domain-specific accelerator design. Image processing accelerator generation languages such as Darkroom [13], Rigel [14], Aetherling [8], Hetero-Halide [23], HIPACC-FPGA [41], PolyMage-FPGA [7], SODA [6], and Halide-HLS [39] automatically generate FPGA implementations of image processing pipelines. These systems target either FPGAs, which have large overheads, or ASICs, which are inflexible. AutoSA [48], AutoDSE [44], and Clockwork [15] use polyhedral analysis for scheduling, but they do not consider CGRAs.
To efficiently execute DNNs, Zhang et al. [53] optimize DNN data blocking using double-buffer structures and synthesize a pipelined FPGA accelerator from Caffe [18]. DNNWeaver [43] also generates synthesizable designs automatically from Caffe, with support for more types of layer implementations. DNNBuilder [54] proposes a fine-grained, layer-based pipeline architecture with a line-buffer-based scheme to reduce FPGA on-chip memory usage. VTA [30, 31] provides a full hardware-software stack for DNN acceleration using a modified version of Halide IR. It proposes an ISA to map DNN layers onto optimized operators on its proposed FPGA accelerator. These domain-specific hardware generators reduce design effort when mapping a DNN to an accelerator. However, their frameworks rely heavily on the backend implementation. With the architectures determined, extending them to support new applications or more efficient hardware implementations would require significant development effort from domain experts.