SaS Auction is a flexible FPGA accelerator that efficiently solves the assignment problem. There are multiple reasons for using an FPGA rather than a CPU or GPU for this task. Due to its ability to implement arbitrary datapaths, an FPGA has greater potential for exploiting sparsity, which can yield designs with lower latency and lower energy consumption. In robotic applications that embed target tracking, the Auction algorithm will typically be on the critical path; reducing its latency can thus allow the robot to navigate faster. Also, for mobile battery-driven autonomous robots, energy efficiency is paramount, and FPGAs can achieve much higher energy efficiency than CPUs and GPUs [26]. Last, for some autonomous robots, like drones, the weight of the computational platform is an important parameter; in some cases, it can be the limiting factor for performance [23]. Due to reduced power consumption, using an FPGA could remove the need for a heavy heatsink.
4.1 Sparse Problems
An assignment problem is sparse if the majority of the elements in the reward matrix \(R\) are zero. If \(r_{ij} = 0\), then object \(j\) is never assigned to agent \(i\); we refer to these as non-valid object rewards. Figure 4 shows the level of sparsity found in the assignment problems in the MOT dataset. The x-axis shows the density, which is the inverse of the sparsity. Clearly, these are highly sparse assignment problems: the vast majority have less than 5% density.
Figure 5 shows how the depth of the SearchTask scales with the number of PEs. Each level of the search tree is implemented in a separate pipeline stage. Over-dimensioning the accelerator by using too many PEs has a negative performance impact: the throughput is affected by the depth of the SearchTask, as the pipeline stalls at the PE stage until the preceding agent has finished executing in the SearchTask. The number of PEs also impacts the area and the obtainable clock rate of the accelerator. However, under-dimensioning the accelerator with too few PEs is also costly and reduces the throughput, since it makes each agent span multiple pipeline stages. Generally, the optimal number of PEs for a particular problem is the number of object rewards per agent. By taking advantage of sparsity, the optimal number of PEs can be greatly reduced, thus improving throughput and latency.
4.1.1 Representing Sparsity.
There are multiple ways of representing sparse matrices [30]. A simple tuple representation [28] was chosen, where the element and its column index are stored as a tuple. The original reward matrix \(R\) resides in the off-chip DRAM, and each of its elements is represented by a 64-bit data word. The transformed vector \(C\) is the sparse representation of \(R\) and resides in the on-chip BRAM. For an architecture with \(n\) PEs, each BRAM data word \(c\) in \(C\) contains \(n\) tuples of object reward and column index as given:
\[
c = (e_1, e_2, \ldots, e_n), \qquad e_k = (r_{ij},\, j).
\]
The number of bits used to represent a single object reward is configurable. The words of the transformed BRAM vector \(C\) are padded such that the first non-zero element of each row of the original DRAM matrix \(R\) always occupies the first element \(e_1\) of a BRAM word \(c\). This means that each BRAM word only contains rewards from the same row of the original matrix. A row of the original matrix may span several words of the transformed vector \(C\). A separate array is used to map the row indices of the original matrix to indices of the transformed matrix.
In Figure 6, an example of a transformation between a problem matrix \(R\) and its sparse representation \(C\) is given. An architecture with 4 PEs is assumed for simplicity. This means that each BRAM word stores up to four valid, non-zero object rewards and their column indices. The reward matrix \(R\) is sparse, with the majority of the rewards being zero. The bottom right corner has indices (0,0). Each element \(c_i\) in the transformed vector \(C\) consists of four tuples \(e_1 \ldots e_4\). The individual tuples consist of the element value and its column index in the original matrix \(R\). The transformed vector \(C\) is not fully dense, because the first element of each row in \(R\) is aligned to the beginning of a data word in \(C\). This is done to efficiently read out an entire row of the original input matrix by reading a single BRAM data word. In the example, elements 1 and 2 of \(C\) (zero-indexed starting from the bottom) both belong to row 1 of \(R\). When the number of valid elements in a row of \(R\) exceeds the number of PEs in our architecture, the elements are spread out over multiple words. The row index of each element in \(C\) is stored in an array that maps each row of \(R\) to an element in \(C\).
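The transformation described above can be sketched in software. The following Python model (illustrative names and structure, not the hardware implementation) builds the padded tuple vector \(C\) and the row-index map for a given number of PEs, assuming every row has at least one valid reward:

```python
def to_sparse(R, n_pe):
    """Software model of the dense-to-sparse transformation: each word of C
    holds up to n_pe (reward, column) tuples, every row of R starts at the
    beginning of a fresh word, and row_map gives each row's first word."""
    C = []        # transformed vector: list of words of (value, col) tuples
    row_map = []  # row i of R -> index of its first word in C
    for row in R:
        row_map.append(len(C))
        word = []
        for col, val in enumerate(row):
            if val == 0:               # filter out non-valid rewards
                continue
            word.append((val, col))
            if len(word) == n_pe:      # word full: flush it
                C.append(word)
                word = []
        if word:                       # pad the final partial word of the row
            word += [(0, 0)] * (n_pe - len(word))
            C.append(word)
    return C, row_map
```

With 2 PEs, a row with three valid rewards spans two words, the second of which is padded, mirroring the alignment rule in the example above.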
4.1.2 Cost of Sparsity-aware Processing.
There are three major costs associated with using a sparse representation. First, the translation from dense to sparse representation must be performed. In the SaS Auction architecture, the translation is done in hardware, which incurs an area cost as opposed to the time cost of a software implementation. Second, a sparse representation removes the static relationship between PE and object price. In SoA Auction, PE0 always calculates the benefit of object 0 for the various agents. This means that each PE can store the prices locally, and thus all the prices can be accessed in parallel at low cost. To achieve parallel lookup by all PEs, SaS Auction duplicates the BRAM with object prices for each PE. Last, the sparse representation adds overhead, because two values are needed for each element: the column index and the element value. The memory needed to store a sparse matrix is given by:
\[
Mem_{sparse} = n_{max}^2 \cdot r_{sparse} \cdot \left(w_{reward} + \lceil \log_2 n_{max} \rceil\right),
\]
where \(w_{reward}\) is the number of bits needed to represent a reward, \(n_{max}\) is the maximum problem size, and \(r_{sparse}\) is the rate of non-zero elements. The memory requirements for a dense representation are as follows:
\[
Mem_{dense} = n_{max}^2 \cdot w_{reward}.
\]
For an FPGA implementation, all of \(Mem_{sparse}\), \(n_{max}\), and \(w_{reward}\) are fixed at compile-time. For a given combination of these parameters, there is an upper bound on the size and density of a problem matrix that can be stored.
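The storage trade-off can be evaluated concretely. This sketch assumes \(Mem_{sparse} = n_{max}^2\, r_{sparse}\,(w_{reward} + \lceil\log_2 n_{max}\rceil)\), i.e., the column index costs \(\lceil\log_2 n_{max}\rceil\) bits per stored element, and \(Mem_{dense} = n_{max}^2\, w_{reward}\):

```python
import math

def mem_sparse(n_max, w_reward, r_sparse):
    """Bits for the sparse form: each stored element carries the reward
    plus a column index of ceil(log2(n_max)) bits."""
    w_index = math.ceil(math.log2(n_max))
    return int(n_max * n_max * r_sparse * (w_reward + w_index))

def mem_dense(n_max, w_reward):
    """Bits for the dense form: every element stores a reward."""
    return n_max * n_max * w_reward

# At 5% density the sparse form wins despite the per-element index overhead:
print(mem_sparse(1024, 16, 0.05))  # 1363148 bits
print(mem_dense(1024, 16))         # 16777216 bits
```

For fully dense inputs (\(r_{sparse} = 1\)), the sparse form is strictly larger, which is the overhead referred to above.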
4.2 Dependency Speculation
Each iteration of the Auction algorithm consists of reading the object rewards for a specific agent and comparing them to the current object prices. An iteration may result in a bid and thus change the state of the object prices. This means that unless the processing pipeline is stalled until the current agent is done, the next agent might use stale prices when calculating its bid. We define an unresolved agent as an agent that is currently computing a bid but has yet to commit it by updating the prices. There is a data dependency between agents \(a_1\) and \(a_2\) if the output of processing \(a_1\) depends on whether it executes before or after \(a_2\). Dependencies are dynamic, i.e., two agents might be dependent at some point during the processing and independent at another. Dependencies can be divided into two categories. A strong dependency occurs if both \(a_1\) and \(a_2\) have the same most-beneficial object. In this case, \(a_1\) calculates an illegal bid if it does not wait for \(a_2\) to commit its bid first. A weak dependency occurs when \(a_2\) bids for the second-most beneficial object of \(a_1\). In this case, the order does not change which object \(a_1\) bids for, but it affects the size of the bid: it makes \(a_1\) bid more conservatively, which is equivalent to making a smaller step toward the global optimum. In Reference [36], the dependencies are handled by stalling all agents in the PE stage until one agent computes its bid, introducing expensive pipeline bubbles.
The processing stages for agent \(a_i\) through the pipeline of SoA Auction with 8 PEs are shown in Figure 7(a). The components will be described in more detail in Section 4.3. The pipeline is stalled such that only one agent at a time traverses the loop consisting of the PE, SearchTask, and AssignmentEngine stages. In the proposed SaS Auction, a basic static dependency speculation scheme is employed: it is always speculated that there are no strong dependencies between the agent currently at the PE stage and the unresolved agents that have not committed their bids yet. In the cases where there actually is a strong dependency, the agent's bid is disregarded, and it is added back to the unassigned queue. This is detected by comparing the bid to the current price; if it is lower, then a misspeculation has occurred. Figure 7(b) shows a timing diagram for SaS Auction, an implementation supporting dependency speculation, characterized by no pipeline bubbles, which provides a 9\(\times\) increase in throughput compared to the SoA Auction implementation in Reference [36]. The actual speedup is limited by the rate of misspeculations and by pipeline bubbles occurring naturally when few unassigned agents are left.
4.3 Hardware Architecture
The proposed SaS Auction is an optimized accelerator generator for efficiently solving the assignment problem. It is specifically optimized for the sparse problems found in, e.g., MOT.
Figure 8 shows an overview of the SaS Auction hardware architecture, and Figure 9 shows the speculative Auction algorithm represented as a flow-chart. A streaming dataflow architecture optimized for throughput is proposed, inspired by SoA Auction by Zhu et al. The accelerator is characterized by \(n_{PE}\), its number of PEs, which decides the width of the pipeline and the BRAM interfaces. The accelerator reads the problem matrix from the off-chip DRAM into the on-chip BRAM and initializes the queue of unassigned agents. Then, it proceeds to perform the steps of the algorithm in Figure 1: the accelerator reads one row of the problem matrix, finds the most beneficial object, and calculates the bid. Then, it updates the price and assignment vectors. This continues until there are no unassigned agents left. In the following, each component is described briefly.
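The per-agent steps just described (read a row, find the most beneficial object, compute a bid, commit it) can be modelled sequentially in software. The following Python sketch is a simplified reference model of the Auction iterations, without speculation or sparse storage; it assumes a feasible problem and uses \(\varepsilon = 1\):

```python
def auction(R, eps=1):
    """Sequential model of the Auction iterations performed by the
    accelerator. R[i][j] is the reward of object j for agent i; zero
    rewards are non-valid. Returns the object -> agent assignment.
    Assumes the problem is feasible (every agent can be assigned)."""
    n_objects = len(R[0])
    prices = [0] * n_objects
    owner = [None] * n_objects           # object j -> agent assigned to it
    unassigned = list(range(len(R)))     # queue of unassigned agents
    while unassigned:
        i = unassigned.pop(0)
        # benefits of all valid objects for agent i at the current prices
        benefits = sorted(((R[i][j] - prices[j], j)
                           for j in range(n_objects) if R[i][j] != 0),
                          reverse=True)
        best, j_star = benefits[0]
        # simplification: with a single valid object, use 0 as the runner-up
        second = benefits[1][0] if len(benefits) > 1 else 0
        bid = prices[j_star] + (best - second) + eps   # eps forces progress
        if owner[j_star] is not None:    # outbid and requeue previous owner
            unassigned.append(owner[j_star])
        owner[j_star] = i
        prices[j_star] = bid
    return owner
```

The hardware pipelines and speculates over exactly this loop; the model only fixes the semantics against which speculation must be resolved.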
(1) Register File provides the user interface of the accelerator. It is a conventional AXI4-Lite-based memory-mapped control and status register file. The user software writes the memory address of the input problem matrix, along with its dimensions, into registers residing at predefined addresses. Starting and stopping the accelerator is also achieved through the register file.
(2) Dram2Bram functions as a DMA that reads problem data from the off-chip DRAM, compresses it to the proposed sparse representation, and writes it to the on-chip BRAM. The parsing and compressing are pipelined in two stages.
(1)
Non-valid rewards, i.e., rewards that are zero, are filtered out. The valid rewards are entered into a register-based array, \(ValidR\), corresponding to a single BRAM word.
(2)
When either the array is full or a complete row of the input problem has been read, the array is written to the on-chip BRAM, and the address at which it is stored is written to the AgentRowStore.
The length of \(ValidR\), and consequently the width of the on-chip BRAM interface, is equal to \(n_\mathrm{PE}\). This allows the rewards for all the PEs to be read out in a single cycle.
(3) qUnassignedAgents and qRequestedAgents are first-in-first-out (FIFO) queues between the AssignmentEngine and the BramController. They hold the unassigned agents and the currently executing agents, respectively. The BramController dequeues agent indices from qUnassignedAgents and enqueues them on qRequestedAgents after their rewards have been requested from the BRAM. When a bid arrives from the SearchTask, the AssignmentEngine dequeues from qRequestedAgents and will potentially enqueue a new agent index on qUnassignedAgents.
(4) AgentRowStore is a BRAM storing the mapping from the row index of the original input matrix to the index of the on-chip BRAM data word containing the transformed version of that row. The Dram2Bram module writes to the AgentRowStore as it parses and compresses the input matrix. The AgentRowStore is read by the BramController before fetching rewards from the on-chip BRAM.
(5) BramController controls the requesting and the parsing of the rewards from the BRAM. It is pipelined over multiple stages.
(1)
Dequeue the index of an unassigned agent from qUnassignedAgents
(2–3)
Read the BRAM address of the agent from AgentRowStore
(4–5)
Read the rewards from BRAM and forward them to the DataMux
(6) PEs perform the actual computation as expressed in Equation (4). The PE stage is pipelined with the following stages:
(1)
Read current price of object from PriceStore
(2)
Calculate benefit of object
(3)
Forward benefit to SearchTask.
The PE performs static dependency speculation and always assumes that there are no dependencies between the current agent and any unresolved agent. Any violation, which is caused by a strong dependency, will be caught in the AssignmentEngine.
(7) PriceStore is the BRAM-based centralized storage for the object prices. Each PE has a dedicated BRAM where all the prices are stored, i.e., the prices are duplicated for each PE. The write ports of all the duplicate BRAMs are connected to the AssignmentEngine, which updates all the price copies after a bid. Each duplicate BRAM has a single read port that is connected to its PE. The AssignmentEngine also has a duplicate BRAM, which is used to resolve the dependency speculation.
(8) SearchTask is illustrated in Figure 5 and is a tree of comparators that finds the highest and second-highest benefits among the benefits outputted by the PE stage. Each level of the tree is implemented as a pipeline stage. The SearchTask can perform multiple iterations to support solving problems where the agents span multiple pipeline stages. When the highest and second-highest benefits are found, the SearchTask computes the bid as expressed in Equation (5).
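The comparator tree can be modelled as a pairwise reduction in which every tree level corresponds to one pipeline stage. This Python sketch (illustrative, not the RTL) keeps a (best, second-best) candidate pair per node:

```python
def search_task(benefits):
    """Model of the comparator tree: every level merges pairs of
    (best, second-best) candidates, one pipeline stage per level.
    Returns ((best, idx), (second, idx)) over the PE outputs."""
    NEG = float("-inf")
    # leaves: one (benefit, PE index) each, with no runner-up yet
    level = [((b, j), (NEG, -1)) for j, b in enumerate(benefits)]
    while len(level) > 1:
        nxt = []
        for k in range(0, len(level) - 1, 2):
            a1, a2 = level[k]
            b1, b2 = level[k + 1]
            # the global top two of the union are among these four candidates
            top = sorted([a1, a2, b1, b2], reverse=True)
            nxt.append((top[0], top[1]))
        if len(level) % 2:               # odd node passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

The key invariant is that the top two values of a merged subtree always lie among the four candidates of its children, so each stage only needs a small fixed amount of comparison logic.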
(9) AssignmentEngine is in charge of maintaining the prices, assignments, and detecting and handling misspeculations. The AssignmentEngine is pipelined with the following stages:
(1)
Receive the bid \(b\) for object \(o_j\) from the SearchTask and the corresponding agent index \(a_i\) from qRequestedAgents. Also read the current price of the object, \(P(o_j)\), from the PriceStore, as well as the currently assigned agent, \(objectAssignments(o_j)\).
(2)
Resolve the speculation by comparing \(P(o_j)\) with \(b\).
(3a)
If \(b \le P(o_j)\), then there has been a misspeculation caused by a strong dependency between \(a_i\) and a previous agent. In that case, \(a_i\) is enqueued back to qUnassignedAgents.
(3b)
If \(b > P(o_j)\), then both the PriceStore and ObjectAssignments are updated. The previously assigned agent, which was requested in the first stage, is enqueued to qUnassignedAgents.
Weak dependencies are not detected or handled by the AssignmentEngine.
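The resolution steps above amount to a compare-and-commit: a bid that does not exceed the current price reveals that it was computed from a stale price. A minimal Python sketch of this check (illustrative names, not the hardware):

```python
def resolve_bid(bid, obj, agent, prices, owner, unassigned):
    """Model of the AssignmentEngine's resolution step: a bid at or below
    the current price means the agent computed it from a stale price
    (strong dependency), so it is requeued instead of committed."""
    if bid <= prices[obj]:               # (3a) misspeculation detected
        unassigned.append(agent)
        return False
    if owner[obj] is not None:           # (3b) outbid the previous owner
        unassigned.append(owner[obj])
    prices[obj] = bid                    # commit: update all price copies
    owner[obj] = agent
    return True
```

Note that a misspeculated agent is simply requeued; no state is rolled back, because nothing was committed on its behalf.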
(10) Controller performs control tasks. It fills up the qUnassignedAgents FIFO when the accelerator is started, detects the end of accelerator’s operation, and initiates writing the result back to DRAM.
Figure 9 shows a flowchart model of the pipeline. It is focused on the core processing pipeline, which starts and ends with qUnassignedAgents. The interaction with the off-chip DRAM and host software, and the transformation of input problems to the sparse representation, are abstracted away in the Init/Setup process.
4.5 Performance Model for Speculation
In the following, a performance model for speculative execution in SaS Auction is derived.
4.5.1 Stalling Cost.
First, a cost model for the state-of-the-art implementation is derived. The cost of stalling execution at the PE stage depends on the number of clock cycles needed by the preceding agent to calculate and commit its bid. It can be expressed as:
\[
C_\mathrm{stall} = d_\mathrm{PE} + d_\mathrm{SearchTask} + d_\mathrm{AssignmentEngine},
\]
where \(d_\mathrm{PE}\), \(d_\mathrm{SearchTask}\), and \(d_\mathrm{AssignmentEngine}\) refer to the latencies (or depths) of the respective modules, given in Table 2. Note that these values are based on our implementation inspired by Zhu et al. [36], as no data regarding the latencies of the original implementation is available.
Clearly, the stalling cost is independent of the input problem dimension and only depends on the number of PEs. However, the total processing latency of one agent also depends on the size of the input problem dimension relative to the number of PEs. Thus, the relative cost of stalling decreases the more under-dimensioned the accelerator is.
4.5.2 Speculation Cost.
The speculation cost is defined as the additional number of clock cycles needed to process an agent if it is not stalled at the PE stage, but rather speculatively processed. This cost is a stochastic variable, as it depends on whether there is a data dependency between the agent and any unresolved agents. The goal is thus to find the expected value of the speculative cost. The misspeculation cost is first derived. It is the penalty of dispatching an agent from the PE stage if there is an unresolved strong dependency and is defined as:
\[
C_\mathrm{miss} = C_\mathrm{stall} + n_\mathrm{bubbles}.
\]
The constant cost \(C_\mathrm{stall}\) is the worst-case expected cost of redoing the loop consisting of the PE, SearchTask, and AssignmentEngine stages once the agent is ready in the PE stage again. The variable \(n_\mathrm{bubbles}\) is the number of pipeline bubbles occurring between the misspeculation and the next instant the agent is ready in the PE stage. It is estimated as:
\[
n_\mathrm{bubbles} = \max\left(0,\ d_\mathrm{AE \rightarrow PE} - n_\mathrm{unassigned}\right).
\]
If \(n_\mathrm{unassigned}\) is larger than \(d_\mathrm{AE \rightarrow PE}\), the number of clock cycles from the AssignmentEngine back to the PE stage, then there are no bubbles. The expected costs of stalling and speculating are expressed as:
\[
E[C_\mathrm{stall}] = C_\mathrm{stall}, \qquad E[C_\mathrm{spec}] = p_\mathrm{miss} \cdot C_\mathrm{miss},
\]
where \(p_\mathrm{miss}\) is the probability of a misspeculation, i.e., the probability that the agent at the PE stage has a strong dependency on an unresolved agent.
4.5.3 Probability of Misspeculation.
To estimate the probability of a strong data dependency, \(p_\mathrm{miss}\), some simplifications must be made. It is assumed that the most beneficial object \(o_{pick}\) for an agent \(a_k\) is an independent random variable drawn from a uniform distribution. This can be expressed as follows:
\[
P(o_{pick} = o_j) = \frac{1}{m},
\]
where \(m\) is the number of objects. The probability of a strong dependency can thus be expressed as the probability of an agent picking the same object as any of the unresolved agents:
\[
p_\mathrm{miss} = 1 - \left(1 - \frac{1}{m}\right)^{n},
\]
where \(n\) is the number of unresolved agents and \(m\) is the number of objects. The actual number of unresolved agents \(n\) depends on several factors.
First, it has an upper bound defined by the number of PEs in the accelerator. The number of PEs decides the number of pipeline stages between reading the prices in the PE stage and committing a bid in the AssignmentEngine. In an architecture with 8 PEs, there are 7 pipeline stages between PE and the commit operation.
Second, it also depends on how many pipeline stages each agent spans. In a scenario with 8 PEs and agents with 64 valid object rewards, the agent will span 8 pipeline stages. This means that when a new agent arrives at the PE stage, there will be at most one unresolved agent that would span all the 7 pipeline stages between the PE stage and the commit stage. Consider another scenario, also with 8 PEs, but this time with agents with at most 8 valid object rewards. This implies that each agent only spans a single pipeline stage. In this scenario, when a new agent arrives at the PE stage, there might be as many as 7 unresolved agents.
Last, it also depends on how many unassigned agents are left. When the agent of the previous scenario arrives at the PE stage, the number of unresolved agents could be anywhere from 0 to 7. The actual number depends on how full the pipeline is, which in turn depends on how many unassigned agents are left.
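Under the uniform-and-independent pick assumption, the collision probability with \(n\) unresolved agents and \(m\) objects can be written as \(1 - (1 - 1/m)^n\). A small sketch illustrating how quickly it vanishes for larger problems:

```python
def p_miss(n_unresolved, m_objects):
    """Probability that the agent at the PE stage picks the same object as
    at least one of the n unresolved agents (uniform, independent picks)."""
    return 1.0 - (1.0 - 1.0 / m_objects) ** n_unresolved

# Misspeculation is rare for realistically large sparse problems:
print(p_miss(7, 1000))  # ~0.007: speculation is almost always correct
print(p_miss(7, 16))    # ~0.36: small problems misspeculate often
```

With the worst case of 7 unresolved agents in an 8-PE pipeline, a problem with 1,000 objects misspeculates less than 1% of the time.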
Finally, using the derived probability of misspeculation \(p_\mathrm{miss}\), Equation (11) is plotted in Figure 10 for an accelerator with 8 PEs and different input problem sizes. The grey horizontal line represents the cost of stalling, which is constant and independent of the number of unresolved agents and the number of potential pipeline bubbles. The expected cost of speculating depends on both \(n_\mathrm{unresolved}\) and the potential number of bubbles \(n_\mathrm{bubbles}\). Thus, Figure 10 shows the range of expected costs of speculating. The maximum expected cost occurs in the case of an empty pipeline behind the agent in the PE stage. The minimum cost arises when the pipeline is full and there are enough unassigned agents such that \(n_\mathrm{bubbles}=0\).
Clearly, for bigger problems the probability of misspeculation is very small, and thus so is the expected cost of speculation. Intuitively, an increasing number of unresolved agents leads to an increased chance of misspeculation and thus an increased expected cost. Given the assumption of a uniform probability distribution, a static speculation scheme is sufficient. In static speculation, as opposed to dynamic speculation, the pipeline always proceeds in the PE stage, regardless of the number of unresolved agents or the number of expected pipeline bubbles. Such a static speculation scheme is implemented by SaS Auction.