Approximate NoC and Memory Controller Architectures For GPGPU Accelerators
Abstract—High interconnect bandwidth is crucial for achieving better performance in many-core GPGPU architectures that execute
highly data parallel applications. The parallel warps of threads running on shader cores generate a high volume of read requests to the
main memory due to the limited size of data caches at the shader cores. This leads to a scenario with rapid arrival of an even larger
volume of reply data from the DRAM, which creates a bottleneck at memory controllers (MCs) that send reply packets back to the
requesting cores over the network-on-chip (NoC). Coping with such high volumes of data requires intelligent memory scheduling and
innovative NoC architectures. To mitigate memory bottlenecks in GPGPUs, we first propose a novel approximate memory controller
architecture (AMC) that reduces the DRAM latency by opportunistically exploiting row buffer locality and bank level parallelism in memory
request scheduling, and leverages approximability of the reply data from DRAM, to reduce the number of reply packets injected into the
NoC. To further realize high throughput and low energy communication in GPGPUs, we propose a low power, approximate NoC
architecture (Dapper) that increases the utilization of the available network bandwidth by using single cycle overlay circuits for the reply
traffic between MCs and shader cores. Experimental results show that Dapper and AMC together increase NoC throughput by up to 21 percent, reduce NoC latency by up to 45.5 percent, and reduce the energy consumed by the NoC and MC by up to 38.3 percent, with minimal impact on output accuracy, compared to state-of-the-art approximate NoC/MC architectures.
1 INTRODUCTION
FOR today's high-performance computing workloads, general purpose graphics processing units (GPGPUs) have become highly popular due to their support for massive thread and data level parallelism. GPGPUs typically have many computational units within their streaming multiprocessors (SMs) [1] (also known as shader cores) that execute thousands of threads simultaneously. Libraries such as OpenCL [2] and CUDA [3] have enabled programmers to efficiently parallelize a program and reap the best performance out of the available computational resources on a GPGPU. However, parallel applications often generate large volumes of memory requests to memory controllers, resulting in a rapid influx of reply data from DRAM to be transmitted from MCs to SMs, which creates memory bottlenecks. Traditional MCs use first-ready first-come-first-serve (FR-FCFS) scheduling, which is not designed to handle requests that may have high row buffer locality (RBL) and bank level parallelism (BLP) simultaneously. Also, traditional mesh-based NoC architectures are not designed to handle the high volumes of reply traffic between MCs and cores. To support the high traffic rates of GPGPUs, NoC channel widths would need to be increased multifold compared to conventional NoCs, but this leads to a significant increase in GPGPU power dissipation. Fig. 1 gives a breakdown of the power consumed by various components of a GPGPU when executing data parallel CUDA applications from the CUDA SDK sample code suite. For memory intensive applications that generate high volumes of memory requests, the NoC and MC together dissipate up to 30 percent of the overall chip power. Thus, there is a need for innovative NoC and MC architectures that can support the high volumes of data generated and consumed in GPGPUs, while dissipating low power (and energy), and without sacrificing application performance.

Recent works [4], [5] have demonstrated the impact of a new paradigm called approximate computing that trades off computation accuracy for savings in energy consumption. Many emerging applications in the domains of machine learning, image processing, and pattern recognition are today exploring approximate computing techniques to save energy and also improve application performance, while tolerating a small range of output errors. One of the main goals of our work is to exploit data approximation intelligently to increase the throughput of data movement between MCs and shader cores (SMs), to speed up application execution and also minimize the energy consumed during application execution on GPGPU platforms.

In a typical many-core GPGPU platform, the NoC traffic consists of load/store (LD/ST) data, with LD replies forming a majority of the traffic that causes MC bottlenecks [6]. Typically, the traffic in a NoC is skewed, with memory requests that follow a many-to-few pattern from cores to MCs and replies that follow a few-to-many pattern from MCs to cores. With a high volume of LD reply data at MCs, a bottleneck is created that leads to high wait times for the thread blocks executing on shader cores, and hence a slowdown in overall performance.
In [14], [15], the authors reduce the amount of data stored in the caches, and increase the cache hit rate, by associating the tags of multiple approximately similar blocks with the same data line. In [17], the authors use approximate computing with spatial and temporal locality to increase the benefit of instruction reuse in GPGPUs. None of these techniques can be applied for resolving the problems arising in MCs and the NoC due to the heavy influx of memory requests that cause MC bottlenecks in GPGPUs.

In [18], the authors have proposed a NoC-centric approach to minimize the latency caused by congestion in many-core CPUs by introducing approximate compression and decompression of packets at network interfaces, to reduce the number of flits entering the NoC. This work utilizes a dictionary-based compression/decompression technique that consumes high energy and leads to performance overheads when the network traffic is high. Moreover, this work ignores the challenges with MC scheduling to minimize memory access latency in GPGPUs. In contrast to these works, in this article we provide a holistic solution to the memory bottleneck problem in GPGPUs, with a multi-pronged approach that uses approximate computing design principles with smart resource adaptation at both the MC and NoC architecture levels.

3 BACKGROUND AND MOTIVATION

3.1 Baseline Configuration
We consider a heterogeneous accelerator-based system with an x86 CPU and a grid of GPGPU shader cores that have private L1 caches (instruction and data) to support data parallel multithreaded application execution. A shader core consists of parallel integrated pipelines with a common instruction fetch unit that executes a single instruction on multiple data (SIMD) simultaneously. Each integrated pipeline has an integer arithmetic logic unit and a floating-point unit. A shader core also has several load/store units that fetch data from a private L1 cache or from the main memory. A GPGPU-based accelerator has an L2 cache that is shared across the shader cores. Each shader core also has a texture cache and a scratchpad memory. All shader cores and MCs are connected to an on-chip interconnection network.

Typically, there is little to no direct communication between the shader cores on the chip, as there is no data shared between shader cores that are executing thread blocks. Hence, the L2 is used to cache the contents of the private memory of each shader core. The communication between CPU and GPU cores takes place through main memory. The shader cores send read/write requests (via LD/ST instructions) to MCs over the NoC. A memory reply takes several cycles based on the location and availability of data in the L2 or DRAM. The baseline NoC architecture between the shader cores and the MCs has a channel width of 128 bits, twice the size of the 64-bit NoC channels in traditional CPU-based platforms, and consists of 4-stage routers (stage 1: buffer write; stage 2: route computation; stage 3: virtual-channel/switch allocation; stage 4: switch/link traversal). There are 5 virtual channels (VCs) per input port and 4 flit buffers for each VC. Flits are routed along the XY path from source to destination.

In this work we focus on optimizing the MC as well as the network between shader cores and MCs. Henceforth, the term cores implies shader cores for the remainder of this article.

Fig. 2. (a) Example image showing similar data values in pixels; (b) RGB values of the marked locations as stored in texture memory.

3.2 Data Value Approximability
Several types of data parallel applications are typically executed on GPGPU platforms. Many of them belong to the domains of image processing, signal processing, pattern recognition, and machine learning. There also exist other scientific and financial applications that use GPGPUs and operate on large input data sets. The data used for executing image and signal processing applications is in many cases highly approximable. For example, as shown in Fig. 2a, the areas in boxes contain pixels that are very similar to their adjacent pixels. The RGB values of two pixels from each box are shown in Fig. 2b. These values (for each box) are quite similar, and it is not energy efficient to save and transmit cache lines containing similar RGB values separately from DRAM to the cores. Instead, if cache lines with the same (or similar) data can be identified at the MCs, we can avoid their repeated transmissions to the cores, to save energy. Dapper and AMC are designed to realize this idea.

The next logical question one may ask is: how approximable are typical applications that run on GPGPUs? Fig. 3a shows the percentage of approximable data in different parallel CUDA applications, while still generating acceptable results. Applications such as Discrete Cosine Transformation, Convolution Texture, and Raytrace can have up to 78 percent of approximable input data that can be exploited to mitigate the memory bottleneck issue and save NoC energy. By making use of the programming paradigm proposed in [19], it is possible to specify approximable variables using the @approx keyword, as highlighted in green in Fig. 3b. Such approximable variables have a distinct set of instructions for load and store that are used when compiling the C++ code to machine code. These instructions can be used by cores and network interfaces to identify approximable data and accept inexact values for them from main memory.

Fig. 3. (a) Normalized percentage of approximable data transmitted from main memory to cores in different CUDA applications; (b) Example of marking approximable variables using EnerJ [19].
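To make the similarity test of this subsection concrete, the following is a minimal C++ sketch of a relative-difference check over cache-line words. It is our own illustration under stated assumptions (the helper names and the per-word granularity are ours, not the authors' hardware design):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // True if 'candidate' is within a relative error_threshold of 'reference',
    // i.e., the two values could plausibly be served by one coalesced reply.
    bool approximately_equal(float reference, float candidate, float error_threshold) {
        if (reference == 0.0f) return candidate == 0.0f;  // avoid divide-by-zero
        return std::fabs((reference - candidate) / reference) < error_threshold;
    }

    // A whole cache line is treated as similar only if every word is within the
    // threshold of the corresponding word in the reference line.
    bool lines_similar(const std::vector<float>& a, const std::vector<float>& b,
                       float error_threshold) {
        if (a.size() != b.size()) return false;
        for (std::size_t i = 0; i < a.size(); ++i)
            if (!approximately_equal(a[i], b[i], error_threshold)) return false;
        return true;
    }

With an error_threshold of 0.10, for example, the near-identical RGB values of the boxed pixels in Fig. 2b would compare as similar, so a single reply could serve both locations.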
3.3 Memory Scheduling in GPGPUs
DRAM is organized into a hierarchy of modules as follows: each MC has a dedicated channel to access DRAM DIMMs that each consist of two ranks. Each rank typically has 8 or 16 DRAM chips depending on the data bus width (8 or 16
bytes). Each chip has up to 32 banks that activate rows of data to be accessed, which are then buffered in each bank. An MC sends requests to access specific rows, and columns within those rows, across banks to read/write data. Once accessed, a row is either kept open (to enable fast accesses for future requests to the same row) or closed (written back to its respective bank before a new request arrives). Memory accesses that hit rows that are already buffered are said to have row buffer locality (RBL). The performance of DRAM is enhanced by scheduling requests intelligently to increase the utilization of rows by maximizing RBL. If subsequent accesses are not in the same row, they have to wait until the next row is buffered.

To minimize access latency, DRAMs today also support bank level parallelism, where accesses to multiple banks are scheduled in parallel. Fig. 4a shows the average RBL of memory requests waiting in the MC input queue at any instance, and Fig. 4b shows the average BLP in memory requests waiting in the MC input queue, for different CUDA applications.

Fig. 4. (a) Row buffer locality in memory requests and (b) Bank level parallelism for memory requests across CUDA applications.
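To illustrate how a memory scheduler can harvest RBL while still exposing BLP, the following simplified C++ sketch services the oldest row-buffer hit first and otherwise falls back to the oldest request, which opens a new row (and hence a new bank-level opportunity). This is a generic FR-FCFS-style model of the behavior described above, not the AMC's actual scheduling logic (Algorithm 1), which is not reproduced in this excerpt:

    #include <cstdint>
    #include <deque>
    #include <optional>
    #include <unordered_map>

    struct Request {
        std::uint32_t bank;
        std::uint32_t row;
    };

    class FrFcfsLikeScheduler {
    public:
        // Pick the next request to issue from the MC input queue.
        std::optional<Request> pick_next(std::deque<Request>& queue) {
            if (queue.empty()) return std::nullopt;
            // First preference: oldest request hitting an already-open row (RBL).
            for (auto it = queue.begin(); it != queue.end(); ++it) {
                auto open = open_row_.find(it->bank);
                if (open != open_row_.end() && open->second == it->row) {
                    Request hit = *it;
                    queue.erase(it);
                    return hit;
                }
            }
            // Otherwise: oldest request overall. Opening its row in a different
            // bank lets accesses proceed in parallel across banks (BLP).
            Request oldest = queue.front();
            queue.pop_front();
            open_row_[oldest.bank] = oldest.row;  // this row is now buffered
            return oldest;
        }
    private:
        // Row currently latched in each bank's row buffer.
        std::unordered_map<std::uint32_t, std::uint32_t> open_row_;
    };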
3.4 Overlay Circuits for Low Latency Traversal
As discussed earlier, in GPGPUs, read reply data is the main source of MC bottlenecks. Minimizing read reply latency is crucial for application performance. Also, read reply data comprises most of the traffic from MCs to cores, in a few-to-many traffic pattern. To ensure that MCs do not get clogged by a heavy influx of reply data, it is intuitive to design a NoC with all-to-all connections. However, power and wire routability constraints in modern chips limit such architectures. Hence, researchers have come up with smart solutions in [6], [7], where they propose intelligent NoC routing schemes in tandem with MC placement to create conflict-free paths in NoCs for packets to traverse from MCs to cores. However, the complex router design in such architectures adds to NoC power overheads.

In our work, we make use of the few-to-many traffic pattern of the reply packets and utilize an overlay circuit topology that forms dedicated circuits for each MC. An overlay circuit connects an MC to all the cores, forming return paths for read reply packets. Fig. 5 shows how the circuits are established from two MCs (represented as colored circles) to cores. These overlay circuits are established from each MC to all the cores, on a 2D mesh-based NoC topology. An overlay circuit formed between an MC and the cores stays for a fixed time window, during which it transmits the reply packets of that MC, before switching to the next overlay circuit (for another MC). On overlay circuits (shown with red and green colored arrows in the figure), each MC transmits flits of packets waiting in its queues that reach their destination cores in 3 cycles (or less) using asynchronous links and repeaters that bypass one or more NoC routers. Flits traverse in the X and Y directions in one cycle each, stopping only at the turns. Hence, flits traversing over overlay circuits do not need switch arbitration and route computation at every hop. This leads to a low energy NoC that establishes high-throughput, low-latency paths between MCs and cores.

Fig. 5. A 4 x 4 NoC showing overlay circuits for two MCs' read reply paths, in a few-to-many traffic scenario.
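The 3-cycle bound can be read directly off this description: one cycle for the X run, one for the Y run, and one extra latching cycle only when the flit turns. A tiny sketch of that accounting follows; this is our interpretation of the text, not the authors' RTL:

    struct Coord { int x; int y; };

    // Cycles for a flit on an overlay circuit: the X segment and the Y segment
    // each take one cycle over the asynchronous, repeated links, and the flit
    // is latched for one additional cycle only at a turn router.
    int overlay_traversal_cycles(Coord src, Coord dst) {
        bool x_run = (src.x != dst.x);
        bool y_run = (src.y != dst.y);
        bool turn  = x_run && y_run;  // one stop at the turn
        return (x_run ? 1 : 0) + (y_run ? 1 : 0) + (turn ? 1 : 0);  // <= 3
    }

Under this reading, an MC at (0, 2) reaches a core at (3, 0) in 3 cycles, while a core in the MC's own row or column is reached in a single cycle.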
Algorithm 2 identifies reply packets that are coalesced if their data values are within an error_threshold. The list of destination addresses, along with the original address, is appended to the first reply (line 18). The approximated data and the list of addresses are then sent to the NI (line 20) to generate a packet that is broadcast to the list of destination nodes on the overlay circuits, as explained in the following sections. The cleared entries in the output queue are then released for incoming data from DRAM. The data types considered for approximation include both integer and floating-point types. For floating-point data, only the mantissa is approximated, for faster computation.

Algorithm 2. Data Approximator
Inputs: error_threshold, check_depth
1: while Output_queue.size() > 0 do
2:   i ← 1
3:   dest_list ← Ø
4:   reply_data ← Output_queue.front()
5:   if reply_data.approximable == 0 then
6:     send_to_NI(reply_data)
7:   else
8:     while i < check_depth do
9:       nxt_data ← Output_queue.get(i)
10:      if nxt_data.approximable == 1 then  // coalesce the data
11:        if (reply_data − nxt_data)/reply_data < error_threshold then
12:          dest_list.add(nxt_data.dest)
13:          Output_queue.erase(i)
14:        end if
15:      end if
16:      i++
17:    end while
18:    reply_data.add_dest(dest_list)
19:  end if
20:  send_to_NI(reply_data)
21:  Output_queue.pop()
22: end while
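A software rendering of Algorithm 2 is sketched below, with our own container and field names; the real logic operates on the AMC output queue in hardware. Note that the sketch forwards each head reply to the NI exactly once, which is how we read the intent of lines 5-6 and 20 of the listing:

    #include <cmath>
    #include <deque>
    #include <vector>

    struct Reply {
        bool approximable;             // set from the @approx load/store instructions
        float data;                    // representative word of the reply
        int dest;                      // requesting core
        std::vector<int> extra_dests;  // cores served by this coalesced reply
    };

    void send_to_NI(const Reply&) { /* hand the (possibly coalesced) reply to the NI */ }

    // Scan up to check_depth entries behind the head of the output queue; any
    // approximable reply whose value lies within error_threshold of the head is
    // removed, and its destination is appended to the head reply, which is later
    // broadcast to all of its destinations over the overlay circuit.
    void approximate_and_coalesce(std::deque<Reply>& output_queue,
                                  float error_threshold, int check_depth) {
        while (!output_queue.empty()) {
            Reply head = output_queue.front();
            if (head.approximable && head.data != 0.0f) {
                int i = 1;
                while (i < check_depth && i < static_cast<int>(output_queue.size())) {
                    const Reply& nxt = output_queue[i];
                    if (nxt.approximable &&
                        std::fabs((head.data - nxt.data) / head.data) < error_threshold) {
                        head.extra_dests.push_back(nxt.dest);
                        output_queue.erase(output_queue.begin() + i);  // entry freed for DRAM data
                    } else {
                        ++i;
                    }
                }
            }
            send_to_NI(head);  // exact replies are forwarded unmodified
            output_queue.pop_front();
        }
    }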
4.1.3 Overheads
The power and area overheads involved in approximate data-aware scheduling in the request plane, and in data approximation in the reply plane, are minimal compared to the power consumed by the computing shader cores. Algorithm 1 needs n additional counters to keep track of the credits, where n = total number of input queues, an additional comparator, and a queue mux. The credit manager operation takes place in 1 cycle, which is pipelined with the other stages of the AMC to reduce the overall latency. Algorithm 2 takes check_depth cycles to execute, and requires logic for division, a comparator, and registers to store the parameters check_depth and error_threshold. The NI connected to an AMC is often fully occupied at run-time, resulting in a wait time for data at AMCs in most cases before they are sent to the NI. This wait time helps to mask the overhead involved in data approximation. A packet is only sent to the NI after it passes the data approximation stage. We consider all of these performance and power overheads when modeling the AMC for our experiments (Section 5). Note also that the parameters used in the AMC, such as check_depth and error_threshold, can be updated by GPGPU firmware according to the needs of the application.

4.2 Data Aware Approximate NoC (Dapper)
The request plane of Dapper is a conventional 2D mesh-based NoC with XY routing, as mentioned in Section 3.1. In the reply plane, Dapper establishes dedicated overlay circuits from AMCs to cores. Each AMC's overlay circuit is a predefined mapping of input and output ports in the reply plane NoC routers, over which flits traverse from the source (AMC) to the destinations (cores) in 3 cycles or less. An overlay circuit assigned to an AMC lasts for a time duration determined at run-time. To establish overlay circuits, the reply plane NoC is equipped with (i) a global overlay controller that decides the time duration for which an overlay circuit is established, and (ii) modified routers called bypass routers, through which flits traverse in the X or Y axes in a single cycle, stopping only at a turn.

4.2.1 Global Overlay Controller (GOC)
The primary function of the global overlay controller (GOC) is to determine and assign time durations for overlay circuits, for each AMC. The execution time is divided into epochs, and each epoch is further divided into time windows, which are computed at the beginning of every epoch by the GOC. A time window for an AMC is determined at run-time based on the number of reply packets waiting at the AMC output queues, and the reply arrival rate at the AMC from the previous epoch. Each AMC sends the stats collected during an epoch to the GOC at the end of that epoch, using the overlay circuits. The GOC uses this received information to compute a weight function as shown in equation (1):

ω(m) = α · A(m) + γ · B(m),   (1)

where A(m) is the reply arrival rate and B(m) is the average queue occupancy at the m-th AMC in the previous epoch, and α and γ are the coefficients of the weight function. The GOC compares the ω's of the AMCs and computes time window durations T1, T2, T3, ..., Tm for the next epoch as:

Ti = K · ω(i) / [ω(1) + ω(2) + ω(3) + ... + ω(m)],   (2)

where Ti is the time window of the ith AMC overlay and ω(i) is its weight function. The ratio of the weight functions is multiplied by a constant K, which is equal to the periodicity of the time windows in an epoch. The time windows repeat periodically for E/K iterations in an epoch, where E is the epoch interval duration and K is the periodicity of the time window set. By having the time windows repeat and the overlay circuits switch multiple times in an epoch, AMCs send flits in multiple bursts across an epoch. The time window durations are broadcast by the GOC at the beginning of an epoch, saved by the reply plane routers in dedicated registers, and then used for the setup and tear down of circuits. At the end of each epoch, the arrival rate A(m) and buffer occupancy B(m) of each AMC are sent to the GOC using the corresponding overlay circuits. This requires 2 additional cycles at the end of each epoch (which is 10,000 cycles in duration).
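Equations (1) and (2) amount to a proportional-share division of the K-cycle window set. The short C++ sketch below reproduces that computation; the struct layout and function name are our own scaffolding, while α, γ, K, A(m), and B(m) are as defined above:

    #include <cstddef>
    #include <vector>

    struct AmcStats {
        double arrival_rate;         // A(m), measured over the previous epoch
        double avg_queue_occupancy;  // B(m), measured over the previous epoch
    };

    // Computes the per-AMC time windows T_i for the next epoch.
    std::vector<long> compute_time_windows(const std::vector<AmcStats>& stats,
                                           double alpha, double gamma, long K) {
        std::vector<double> w(stats.size());
        double sum = 0.0;
        for (std::size_t m = 0; m < stats.size(); ++m) {
            w[m] = alpha * stats[m].arrival_rate
                 + gamma * stats[m].avg_queue_occupancy;  // equation (1)
            sum += w[m];
        }
        std::vector<long> T(stats.size());
        for (std::size_t i = 0; i < stats.size(); ++i)
            T[i] = static_cast<long>(sum > 0.0
                       ? K * w[i] / sum                              // equation (2)
                       : K / static_cast<double>(stats.size()));    // idle: equal shares
        return T;
    }

With the settings reported in Section 5 (α = 0.6, γ = 0.4, K = 1000, and an epoch of E = 10,000 cycles), the computed window set repeats E/K = 10 times per epoch.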
4.2.2 Bypass Router Architecture
The reply plane NoC is made up of bypass routers that support flits passing through them without stopping at each hop.
Flits entering a bypass router are held in input latches, and are steered to the north/south/local output ports based on the Y coordinates of the current router at which the flit is latched and of the destination router. Once the hinge connections between latches and output ports are established, the flits traverse in the Y direction (shown via the red line in Fig. 8) and reach the destination router, where they are sent to the local output port. When the tail flit passes the bypass router, the hinge connections are reset to tear down the connections between latches and output ports.
Algorithm 4. Selection Unit Operation
Inputs: {T1 ... Tk}, {MC1 ... MCk}
1: c ← 0; i ← 1
2: for each cycle do
3:   if c == Ti then
4:     i ← (i + 1) mod k; c ← 0  // switch to the next MC's overlay
5:     if local.(X,Y) == MCi.(X,Y) then  // scenario 1
6:       L_in → E_out, W_out, La(L)
7:     else if local.(Y) == MCi.(Y) then  // scenario 2
8:       E_in → W_out, La(E) and W_in → E_out, La(W)
9:     else if local.(Y) > MCi.(Y) then  // scenario 3
10:      N_in → S_out, La(N)
11:    else if local.(Y) < MCi.(Y) then  // scenario 4
12:      S_in → N_out, La(S)
13:    end if
14:  else
15:    c++
16:  end if
17: end for
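For concreteness, the C++ fragment below restates the per-cycle behavior of Algorithm 4. The enum encoding of the four hinge scenarios, the class wrapper, and the explicit counter reset are our own choices, not the authors' RTL:

    #include <vector>

    struct XY { int x; int y; };

    enum class Hinge {
        InjectFromLocal,  // scenario 1: L_in -> E_out, W_out (this is MCi's router)
        PassEastWest,     // scenario 2: E_in -> W_out, W_in -> E_out (same row as MCi)
        NorthToSouth,     // scenario 3: N_in -> S_out (router Y exceeds MCi's Y)
        SouthToNorth      // scenario 4: S_in -> N_out (router Y below MCi's Y)
    };

    struct SelectionUnit {
        int i = 0;   // index of the MC that currently owns the overlay
        long c = 0;  // cycles elapsed in the current time window
        Hinge hinges = Hinge::PassEastWest;

        // Called once per cycle with the window set {T_0..T_{k-1}} from the GOC.
        void tick(const std::vector<long>& T, XY local, const std::vector<XY>& mc) {
            int k = static_cast<int>(mc.size());
            if (c == T[i]) {
                i = (i + 1) % k;  // switch to the next MC's overlay circuit
                c = 0;            // restart the window counter
                configure_hinges(local, mc[i]);
            } else {
                ++c;
            }
        }

        void configure_hinges(XY local, XY owner) {
            if (local.x == owner.x && local.y == owner.y)
                hinges = Hinge::InjectFromLocal;  // scenario 1
            else if (local.y == owner.y)
                hinges = Hinge::PassEastWest;     // scenario 2
            else if (local.y > owner.y)
                hinges = Hinge::NorthToSouth;     // scenario 3
            else
                hinges = Hinge::SouthToNorth;     // scenario 4
        }
    };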
Flow control: Since Dapper has lower latency than traditional NoCs, it fills up the receiving cores' queues much more rapidly than a traditional, 4-stage, hop-by-hop router. Hence, it is crucial that a robust flow control mechanism is integrated to ensure that the receiving queues of cores do not fill up quickly and start creating backpressure. Unlike the wormhole switching scheme, where the intermediate routers at each hop save flits for re-transmission, in Dapper the flits are only saved at the receiving core. As a result, if there is any data loss on the asynchronous links, it may go undetected at the receiving core. Hence, there is also a need for a mechanism to ensure lossless transmission of packets traversing on overlay circuits.

To meet these goals, we use an acknowledgement-based flow control mechanism where the receiving cores send ACK signals back to the AMC when a packet is completely received. We establish an ACK circuit using dedicated links of 7-bit width to send the ACK signal back to the AMCs. Bypass routers establish an ACK circuit using ACK links for each overlay circuit. These ACK circuits are not shared with data packets. A core does not send an ACK signal when its input queues are already full or when there is data loss in the packet. To facilitate fast transmission of the ACK signal from the cores to an AMC, each bypass router has an ACK in and an ACK out port (Fig. 8), which are connected using asynchronous links, a switch, and a selection unit similar to that of the fast overlay circuit, to relay the ACK message back to the AMC from the cores in 1-3 cycles. The AMC, upon receiving the ACK signals from the destination cores, releases its output queue for the read reply data coming from DRAM. If the AMC does not receive an ACK signal, it services the next waiting packet in the output queue out-of-order. Based on the request type (read/write), the addresses of the missed request and the subsequent request are compared before the next packet is selected for transmission into the NoC, to avoid read-after-write (RAW) or write-after-read (WAR) errors. Further, to avoid starvation of long-waiting packets, the output queues in the AMC are designed as cyclic buffers so that all the waiting packets are processed in a round-robin manner. Thus, this flow control mechanism ensures that packets at the head of the queue at the AMC do not block subsequent packets that can use the available overlay circuit. To facilitate data loss detection, we assume the presence of a simple XOR-based parity-bit error detection mechanism.
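The AMC side of this ACK protocol can be summarized in the short sketch below. It is a simplified software model under our own assumptions (the Entry layout and method names are invented for illustration); the hazard check and the round-robin scan follow the description above:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Entry {
        std::uint64_t addr;  // memory address served by this reply
        bool is_write;       // request type, used for RAW/WAR checks
        bool acked = false;  // set when all destination cores have ACKed
    };

    class AmcOutputQueue {
    public:
        void on_ack(std::size_t idx) { entries_[idx].acked = true; }  // frees the slot

        // Pick the next packet to transmit. The cyclic (round-robin) scan keeps
        // head-of-line packets from starving those behind them; a packet whose
        // ACK has not yet arrived is retained, and later entries are serviced
        // out-of-order unless they conflict with an older pending access.
        int next_to_send() {
            const std::size_t n = entries_.size();
            for (std::size_t k = 0; k < n; ++k) {
                std::size_t idx = (rr_ + k) % n;
                if (!entries_[idx].acked && !hazard(idx)) {
                    rr_ = (idx + 1) % n;
                    return static_cast<int>(idx);
                }
            }
            return -1;  // nothing eligible this cycle
        }

    private:
        // Address comparison before out-of-order selection: skip a packet if an
        // older un-ACKed packet touches the same address with a conflicting
        // type, which would risk a read-after-write / write-after-read error.
        bool hazard(std::size_t idx) const {
            for (std::size_t j = 0; j < idx; ++j)
                if (!entries_[j].acked && entries_[j].addr == entries_[idx].addr &&
                    entries_[j].is_write != entries_[idx].is_write)
                    return true;
            return false;
        }

        std::vector<Entry> entries_;
        std::size_t rr_ = 0;
    };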
4.2.3 Overheads
The overheads involved with implementing the overlay controller, selection unit, and Y-compare logic are minimal. The overlay controller uses a counter and a register to store the time window values received from the GOC, while the selection unit uses a counter, a cyclic register, and five 3-bit comparator circuits. The bypass router also has an additional Y-comparator, and logic to establish the hinge connections. All of these components together take up a very small fraction of the power and area of a router. The bypass routers also do not have input buffers to save incoming flits, or VCA, SA, and route computation logic. Hence, a bypass router consumes up to 50 percent less power compared to a traditional router. However, we increase the output buffer capacity of the AMC by 50 percent, as it takes at least 6 cycles for the complete read-reply transaction. All of these performance and power overheads are considered in our experimental analysis, presented next.
5 EXPERIMENTS

5.1 Experimental Setup
We target a baseline 16-core GPGPU-based accelerator to test the performance, energy consumption, network latency, and output error of Dapper + AMC compared to the state-of-the-art. We also test the scalability of our proposed architectures on a 64-core GPGPU. Table 1 lists the platform configurations. We used GPGPU-Sim [21] to collect detailed application traces and simulated the network and memory traffic on a customized Noxim NoC simulator [22] that integrates our Dapper + AMC architecture model. We obtained traces for 9 CUDA-based applications [1]. Table 2 gives an overview of each application and the approximable regions of their working data sets.

As mentioned earlier, we use the programming paradigm proposed in EnerJ [19] to specify the variables in these applications that are potentially approximable. We set the epoch duration in Dapper + AMC to 10,000 cycles. We performed an experimental sensitivity analysis for the parameter values (not included for brevity) and accordingly set the values of the α and γ coefficients of the weight function from equation (1) to 0.6 and 0.4, and K = 1000 in equation (2), as they give the best NoC throughput for the applications we considered, in both 16 and 64 core platforms. We have also set the Threshold (line 4) in Algorithm 1 equal to half of the check_depth parameter value, because having a threshold larger than check_depth/2 leads to poor approximability.
TABLE 1. GPGPU-Sim Parameters

TABLE 3. Configuration of Comparison Works
Fig. 11. NoC latency analysis of Dapper + AMC with baseline NoC, baseline NoC with SMS memory scheduling [29], Approx-NoC [16], and MACRO NoC [31] for (a) 16-core and (b) 64-core GPGPU accelerators, across CUDA applications.

Fig. 12. Comparison of application execution times for baseline NoC, baseline NoC with SMS memory scheduling [29], Approx-NoC [16], MACRO NoC [31], and Dapper + AMC for (a) 16-core and (b) 64-core GPGPU accelerators, across CUDA applications.
cycles at both the sending and receiving nodes of a NoC for all packets. Dapper + AMC outperforms MACRO by around 7 percent. These benefits are due to the joint optimization of request scheduling and reply packet coalescing in the AMC, which is absent in MACRO. For memory-intensive and highly approximable workloads such as ConvTex, DCT, and HIST, the throughput improvement of Dapper + AMC reaches up to 40 percent compared to the baseline. For minimally approximable benchmarks like MergeSort, the throughput benefits of Dapper + AMC are diminished; however, it achieves the same throughput as the baseline. Fig. 10b shows that Dapper + AMC is highly scalable, with an average throughput improvement of up to 15.7 percent compared to the baseline and up to 15.34 percent compared to base + SMS. The improvements are slightly diminished for Dapper + AMC in the 64-core platform compared to the 16-core platform due to the higher number of cores sending more requests to the MC, which increases the DRAM latency for memory requests and worsens the MC bottleneck issue. Unlike Dapper + AMC, Approx-NoC loses up to 50 percent of its throughput in this configuration. Dapper + AMC has up to 5.35 percent better throughput than MACRO. This shows the importance of approximate data-aware request scheduling and data coalescing with the AMC.

5.4 NoC Latency Analysis
Fig. 11a shows the average NoC latency observed from source to destination in a 16-core accelerator. Dapper + AMC is much faster than baseline and base + SMS, by up to 45.5 and 46 percent, respectively. This is one of the key contributors to the increase in throughput. The AMC's approximate data-aware scheduling scheme is responsible for Dapper + AMC's 2.5 percent improvement compared to MACRO, due to the increased approximable and coalesced read reply data that reduces the MC bottleneck and NoC congestion. Also, in Dapper + AMC, the ACK-based flow-control mechanism together with the cyclic output buffer ensures that MCs utilize their share of overlay circuits efficiently. Approx-NoC has a higher latency compared to the baseline due to its slower encoding and decoding steps at the NIs.

Fig. 11b shows the comparison of NoC latency in 64-core accelerators. The baseline NoC latency in 64-core platforms is up to 3× higher than in the 16-core platforms because of the increased NoC traffic due to the higher number of cores in the platform. Dapper + AMC shows up to 78.7 percent improvement compared to the baseline, and 85 percent improvement compared to base + SMS. The NoC latency in base + SMS is higher than baseline due to the waiting time (at the incoming buffers of the MC) needed to create batches of requests according to the SMS scheduler. This impact is higher in a 64-core platform than in a 16-core platform. In Approx-NoC, the latency degradation is reduced from 50 percent to around 22 percent compared to the baseline. As the ratio of MCs to cores decreases from 16 cores to 64 cores, the overhead of encoding/decoding is hidden in the packet traversal latency, allowing the approach to recover some of its lost latency. Dapper + AMC performs better than MACRO by 4.8 percent in the 64-core platform, which is better than its 3 percent improvement in the 16-core platform, due to the higher ratio of approximable packets in Dapper + AMC in a 64-core platform.

5.5 Application Execution Time Analysis
Fig. 12a shows the application execution times observed in 16-core GPGPU accelerators for the various architectures. The throughput and latency improvements observed in the previous sections translate into shorter application execution times for Dapper + AMC compared to baseline and base + SMS. In 16-core platforms, Dapper + AMC achieves around 9.5 percent improvement compared to both baseline
and base + SMS. For applications such as DXTC and MergeSort, the improvements in application execution time are not proportional to the NoC throughput and latency. These applications are also compute intensive, where they have enough kernels to execute while waiting for data. The faster NoC latency can reduce the number of kernels concurrently scheduled on the cores, which could potentially increase the execution time in these applications. In these applications, Approx-NoC performs similar to Dapper + AMC. The same trends are observed in the 64-core platform, as shown in Fig. 12b. On average, Dapper + AMC shows 5.4 percent improvement compared to baseline, and 4.5 percent improvement compared to base + SMS. In the MergeSort application, base + SMS gives 35 percent improvement compared to baseline in a 64-core platform, due to the batch-based memory scheduling that is optimized for highly data parallel applications. However, most of the contemporary applications that are highly approximable perform better with Dapper + AMC.

5.6 Energy Consumption Analysis
Fig. 13a shows the energy consumption of Dapper + AMC compared to the prior works across different benchmarks in a 16-core platform. On average, Dapper + AMC consumes up to 38.3 percent less energy compared to the baseline and 38.1 percent less energy compared to the base + SMS architectures. MACRO and Dapper + AMC consume lower energy due to their low power router architectures in the reply plane. Even though Dapper + AMC has energy overhead involved in the AMC architecture, AMCs are fewer in number compared to NoC routers and NIs. Also, the overall reduction in the number of packets injected from the AMC into the NoC reduces the total energy consumption of Dapper + AMC compared to the baseline. Approx-NoC, on the other hand, consumes higher energy due to the higher power dissipated in its routers, which need additional logic and content addressable storage (CAS) based registers for saving the most used values for compression and decompression. In Dapper + AMC, energy savings are observed for all the applications, including those that have a lower to medium ratio of approximable data, such as MergeSort, FWT, and Blackscholes, due to the low power NoC used in the reply plane and faster execution times. However, the energy savings are higher when applications have a higher ratio of approximable packets, such as in ConvTex, DXTC, and DCT.

Fig. 13b shows the energy consumption of the different architectures for a 64-core platform. In the 64-core platform, the energy savings follow the same trend as in the 16-core platform. On average, Dapper + AMC consumes up to 30 percent lower energy compared to the baseline, and 27.5 percent lower energy compared to base + SMS. However, Dapper + AMC consumes slightly higher energy while scheduling requests and coalescing reply packets than MACRO, by around 1.5 percent. But this is not a major drawback given the improvements in NoC throughput and latency provided by the addition of the AMC in Dapper + AMC.

5.7 Output Error Percentage Analysis
Obviously, it is important to analyze the error in applications that are subjected to approximation. Fig. 14a shows the comparison of output error percent for Approx-NoC at 10 percent error_threshold and Dapper + AMC at 10 and 15 percent error_threshold. Note that the baseline, base + SMS, and MACRO NoC are not shown, as they have no output error. The error percentage is computed as 100 × error from equation (4). Approx-NoC shows a high percentage of error in its output due to a flaw (right shift division) in its error_threshold computation, which is used for identifying the
approximable data before compression. This results in a very high error percentage when data values are incorrectly marked for approximation. Dapper + AMC, on the other hand, incurs 1 to 4 percent output error in the application. On average, Dapper + AMC gives around 2.3 percent error in the application output in the 16-core platform. The same trend is shown in Fig. 14b for a 64-core platform. Although it seems intuitive that a 64-core platform may coalesce more packets than a 16-core platform, the ratio of approximable packets is dependent on the application's input and not on the platform size. Hence, the output error percent has no discernible change as we move from a 16-core to a 64-core platform. Thus, Dapper + AMC represents a promising NoC-and-MC-centric solution to improve performance and save energy consumption in GPGPUs for applications that have a potential for being approximated (i.e., applications that can tolerate some output error).

As an illustrative example, Fig. 15 shows the DCT application output under no error (when using the baseline NoC), and when using Dapper + AMC in a 16-core platform. It can be observed that the output for the configuration of Dapper + AMC with 10 percent error_threshold is virtually indistinguishable from the no-error case, while saving almost 38 percent energy (Fig. 13a) compared to the baseline NoC. This example highlights the exciting potential of Dapper + AMC to save energy in GPGPUs while increasing overall performance.

Fig. 15. DCT output: (a) original (no error), (b) with 10 percent error_threshold, and (c) with 15 percent error_threshold in Dapper + AMC.

6 CONCLUSION
In this paper, we propose a novel data-aware approximate NoC, Dapper, and an approximate memory controller (AMC), for GPGPU accelerators. Our Dapper + AMC architecture uses intelligent memory scheduling in the MC to increase throughput, and also identifies approximable data waiting in the output buffers of the MC and coalesces it to reduce the number of packets injected into the NoC. Also, we advocate for a fast overlay circuit based NoC architecture in the reply plane of the NoC, for the reply packets to reach their destinations in 3 cycles or less. Experimental results show that, on average, Dapper + AMC increases the NoC throughput by 21 percent in 16-core platforms and up to 7 percent in 64-core platforms, while consuming up to 38 percent less energy compared to the baseline NoC, with up to 9.5 percent lower application execution time and around 2.3 percent error in the application output compared to the baseline. Thus, for the class of emerging applications that have the ability to tolerate a small amount of error in their outputs, Dapper + AMC can provide significant savings at the NoC and MC level. These savings are orthogonal to (and can be combined with) further approximation strategies.

ACKNOWLEDGMENTS
This research is supported by a grant from NSF (CCF-1813370).

REFERENCES
[1] NVIDIA GPUs, Jun. 4, 2012. [Online]. Available: https://developer.nvidia.com/cuda-gpus
[2] E. S. John, D. Gohara, and G. Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems," Comput. Sci. Eng., vol. 12, no. 3, pp. 66-73, 2010.
[3] NVIDIA CUDA, Jun. 4, 2012. [Online]. Available: https://developer.nvidia.com/about-cuda
[4] K. Palem and A. Lingamneni, "Ten years of building broken chips: The physics and engineering of inexact computing," ACM Trans. Embedded Comput. Syst., vol. 12, no. 2s, 2013, Art. no. 87.
[5] S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Approximate computing and the quest for computing efficiency," in Proc. 52nd ACM/EDAC/IEEE Des. Autom. Conf., 2015, Art. no. 120.
[6] H. Kim, J. Kim, W. Seo, Y. Cho, and S. Ryu, "Providing cost-effective on-chip network bandwidth in GPGPUs," in Proc. IEEE 30th Int. Conf. Comput. Des., 2012, pp. 407-412.
[7] H. Jang, J. Kim, P. Gratz, K. H. Yum, and E. J. Kim, "Bandwidth-efficient on-chip interconnect designs for GPGPUs," in Proc. IEEE/ACM Des. Autom. Conf., 2015, Art. no. 9.
[8] Y. Yu, W. Xiao, X. He, H. Guo, Y. Wang, and X. Chen, "A stall-aware warp scheduling for dynamically optimizing thread-level parallelism in GPGPUs," in Proc. ACM Int. Conf. Supercomputing, 2015, pp. 15-24.
[9] S. Das, J. R. Doppa, P. P. Pande, and K. Chakrabarty, "Design-space exploration and optimization of an energy-efficient and reliable 3-D small-world network-on-chip," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 36, no. 5, pp. 719-732, May 2017.
[10] T. Pimpalkhute and S. Pasricha, "An application-aware heterogeneous prioritization framework for NoC based chip multiprocessors," in Proc. 15th Int. Symp. Quality Electron. Des., 2014, pp. 76-83.
[11] Y. Zou and S. Pasricha, "Reliability-aware and energy-efficient synthesis of NoC based MPSoCs," in Proc. Int. Symp. Quality Electron. Des., 2013, pp. 643-650.
[12] V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy, "IMPACT: Imprecise adders for low-power approximate computing," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Des., 2011, pp. 409-414.
[13] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, "A low latency generic accuracy configurable adder," in Proc. 52nd ACM/EDAC/IEEE Des. Autom. Conf., 2015, pp. 1-6.
[14] J. S. Miguel, J. Albericio, A. Moshovos, and N. E. Jerger, "Doppelgänger: A cache for approximate computing," in Proc. 48th Annu. IEEE/ACM Int. Symp. Microarchit., 2015, pp. 1-6.
[15] J. S. Miguel, J. Albericio, N. E. Jerger, and A. Jaleel, "The bunker cache for spatio-value approximation," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchit., 2016, pp. 1-12.
[16] R. Boyapati, J. Huang, P. Majumder, K. H. Yum, and E. J. Kim, "APPROX-NoC: A data approximation framework for network-on-chip architectures," in Proc. ACM/IEEE 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 666-677.
[17] A. Rahimi, L. Benini, and R. K. Gupta, "CIRCA-GPUs: Increasing instruction reuse through inexact computing in GP-GPUs," IEEE Des. Test, vol. 33, no. 6, pp. 85-92, Dec. 2016.
[18] C. H. O. Chen, S. Park, T. Krishna, S. Subramanian, A. P. Chandrakasan, and L. S. Peh, "SMART: A single-cycle reconfigurable NoC for SoC applications," in Proc. IEEE Des. Autom. Test Europe Conf. Exhib., 2013, pp. 338-343.
[19] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, "EnerJ: Approximate data types for safe and general low-power computation," ACM SIGPLAN Notices, vol. 46, no. 6, pp. 164-174, 2011.
[20] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in Proc. 45th Annu. IEEE/ACM Int. Symp. Microarchit., 2012, pp. 449-460.
[21] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., 2009, pp. 163-174.
[22] V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti, "Noxim: An open, extensible and cycle-accurate network on chip simulator," in Proc. IEEE 26th Int. Conf. Appl.-Specific Syst. Architectures Processors, 2015, pp. 162-163.
[23] C. Sun et al., "DSENT - A tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling," in Proc. IEEE/ACM 6th Int. Symp. Netw.-on-Chip, 2012, pp. 201-210.
[24] J. Leng et al., "GPUWattch: Enabling energy optimizations in GPGPUs," ACM SIGARCH Comput. Architecture News, vol. 41, no. 3, pp. 487-498, 2013.
[25] V. Y. Raparti and S. Pasricha, "Memory-aware circuit overlay NoCs for latency optimized GPGPU architectures," in Proc. 17th Int. Symp. Quality Electron. Des., 2016, pp. 63-68.
[26] V. Y. Raparti, N. Kapadia, and S. Pasricha, "ARTEMIS: An aging-aware runtime application mapping framework for 3D NoC-based chip multiprocessors," IEEE Trans. Multi-Scale Comput. Syst., vol. 3, no. 2, pp. 72-85, Apr.-Jun. 2017.
[27] V. Y. Raparti and S. Pasricha, "PARM: Power supply noise aware resource management for NoC based multicore systems in the dark silicon era," in Proc. 55th ACM/ESDA/IEEE Des. Autom. Conf., Jun. 2018, pp. 1-6.
[28] N. Kapadia and S. Pasricha, "VERVE: A framework for variation-aware energy efficient synthesis of NoC-based MPSoCs with voltage islands," in Proc. Int. Symp. Quality Electron. Des., 2013, pp. 603-610.
[29] R. Ausavarungnirun, K. K. W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu, "Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems," ACM SIGARCH Comput. Archit. News, vol. 40, no. 3, pp. 416-427, Jun. 2012.
[30] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, "Improving memory bank-level parallelism in the presence of prefetching," in Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchit., 2009, pp. 327-336.
[31] V. Y. Raparti and S. Pasricha, "RAPID: Memory-aware NoC for latency optimized GPGPU architectures," IEEE Trans. Multi-Scale Comput. Syst., vol. 4, no. 4, pp. 874-887, Oct.-Dec. 2018.
[32] V. Y. Raparti and S. Pasricha, "DAPPER: Data aware approximate NoC for GPGPU architectures," in Proc. 12th IEEE/ACM Int. Symp. Netw.-on-Chip, 2018, Art. no. 7.
[33] K. Duraisamy, Y. Xue, P. Bogdan, and P. P. Pande, "Multicast-aware high-performance wireless network-on-chip architectures," IEEE Trans. Very Large Scale Integr. Syst., vol. 25, no. 3, pp. 1126-1139, Mar. 2016.
[34] Y. Xue and P. Bogdan, "User cooperation network coding approach for NoC performance improvement," in Proc. 9th Int. Symp. Netw.-on-Chip, 2017, Art. no. 17.
[35] Y. Xiao, Y. Xue, S. Nazarian, and P. Bogdan, "A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2017, pp. 217-224.
[36] Y. Xiao, S. Nazarian, and P. Bogdan, "Self-optimizing and self-programming computing systems: A combined compiler, complex networks, and machine learning approach," IEEE Trans. Very Large Scale Integr. Syst., vol. 27, no. 6, pp. 1416-1427, Jun. 2019.
[37] B. K. Joardar, R. G. Kim, J. R. Doppa, and P. P. Pande, "Design and optimization of heterogeneous manycore systems enabled by emerging interconnect technologies: Promises and challenges," in Proc. Des. Autom. Test Europe Conf. Exhib., 2019, pp. 138-143.
[38] J. Lee, S. Li, H. Kim, and S. Yalamanchili, "Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture," J. Parallel Distrib. Comput., vol. 73, no. 12, pp. 1525-1538.
[39] Z. Xu, X. Zhao, Z. Wang, and C. Yang, "Application-aware NoC management in GPUs multitasking," J. Supercomputing, vol. 75, no. 8, pp. 4710-4730, 2019.
[40] X. Cheng, Y. Zhao, H. Zhao, and Y. Xie, "Packet pump: Overcoming network bottleneck in on-chip interconnects for GPGPUs," in Proc. 55th ACM/ESDA/IEEE Des. Autom. Conf., 2018, pp. 1-6.

Venkata Yaswanth Raparti (S'15) received the bachelor's degree in electrical and electronics engineering from the Birla Institute of Technology and Science, Pilani, India, in 2011. He is currently working toward the PhD degree in the Electrical and Computer Engineering Department, Colorado State University, Fort Collins. From 2011 to 2013, he worked as a software engineer at the Samsung Research Institute in Bangalore, India. His research interests include reliability-aware task scheduling and mapping methodologies for 2D and 3D CMPs in the dark silicon era, and network-on-chip design for GPGPUs. He is a student member of the IEEE.

Sudeep Pasricha (M'02, SM'13) received the PhD degree in computer science from the University of California, Irvine, in 2008. He is currently a professor of electrical and computer engineering with Colorado State University, Fort Collins. His research interests include energy-efficient and fault-tolerant design for network and memory architectures in multicore computing. He is on the editorial board of various IEEE/ACM journals, such as Transactions on Computer-Aided Design of Integrated Circuits and Systems, Transactions on Embedded Computing Systems, Design and Technology, etc. He is the recipient of seven best paper awards, the 2019 George T. Abell Outstanding Research Faculty Award, the 2018 IEEE-CS/TCVLSI Mid-Career Research Achievement Award, and the 2015 IEEE TCSC Award for Excellence for a mid-career researcher. He is a senior member of the IEEE.