Deep NN - Theory, Tutorial and Survey
Abstract—Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications, including …

… representation of an input space. This is different from earlier approaches that use hand-crafted features or rules designed by …
[Figure (not shown): deep learning situated within the hierarchy of artificial intelligence, machine learning, brain-inspired computing (including spiking neural networks), and neural networks.]

[Fig. 3 (not shown): (a) neurons and synapses, with input activations x_i travelling along axons, across synapses with weights w_i, into the neuron's dendrites; (b) the weighted sum computed for each layer, y_j = f( Σ_i w_i x_i + b ), where b is a bias and f is a non-linear function.]
Fig. 3. Simple neural network example and terminology (Figure adopted from [7]).

… the basic program does not change as it learns to perform its given tasks. In the specific case of DNNs, this learning involves determining the value of the weights (and bias) in the network …

[Backpropagation figure (not shown): input neurons X1–X3 (e.g., image pixels) connected through synapses (weights W11–W34) to output neurons Y1–Y4, a.k.a. activations; backpropagation computes the gradients of the loss, ∂L/∂x_i and ∂L/∂w_ij. Subcaptions: (a) compute the gradient of the loss relative to the filter inputs; (b) compute the gradient of the loss relative to the weights. Caption fragment: … relative to the weights from the filter inputs (i.e., the forward activations) and the gradients of the loss relative to the filter outputs; (2) compute the gradient of the loss relative to the filter inputs from the filter weights and the gradients of the loss relative to the filter outputs.]
[Fig. 7 (not shown): top-5 error rates in the ImageNet Challenge dropping from roughly 25% in 2010 to below human level by 2015, with AlexNet, Clarifai, OverFeat, VGG, GoogLeNet and ResNet marking a large error rate reduction due to deep CNNs.]
Fig. 7. Results from the ImageNet Challenge [14].

… improvements.

In conjunction with the trend toward deep learning approaches for the ImageNet Challenge, there has been a corresponding increase in the number of entrants using GPUs: from 2012, when only 4 entrants used GPUs, to 2014, when almost all the entrants (110) were using them. This reflects the almost complete switch from traditional computer vision approaches to deep learning-based approaches for the competition.

In 2015, the ImageNet winning entry, ResNet [15], exceeded human-level accuracy with a top-5 error rate⁴ below 5%. Since then, the error rate has dropped below 3%, and more focus is now being placed on more challenging components of the competition, such as object detection and localization. These successes are clearly a contributing factor to the wide range of applications to which DNNs are being applied.

E. Applications of DNN

Many applications can benefit from DNNs, ranging from multimedia to the medical space. In this section, we provide examples of areas where DNNs are currently making an impact and highlight emerging areas where DNNs may make an impact in the future.

• Image and Video: Video is arguably the biggest of the big data. It accounts for over 70% of today's Internet traffic [16]. For instance, over 800 million hours of video is collected daily worldwide for video surveillance [17]. Computer vision is necessary to extract meaningful information from video. DNNs have significantly improved the accuracy of many computer vision tasks such as image classification [14], object localization and detection [18], image segmentation [19], and action recognition [20].

• … Atari [31] as well as Go [6], where an exhaustive search of all possibilities is not feasible due to the unimaginably huge number of possible moves.

• Robotics: DNNs have been successful in the domain of robotic tasks such as grasping with a robotic arm [32], motion planning for ground robots [33], visual navigation [4, 34], control to stabilize a quadcopter [35] and driving strategies for autonomous vehicles [36].

DNNs are already widely used in multimedia applications today (e.g., computer vision, speech recognition). Looking forward, we expect that DNNs will likely play an increasingly important role in the medical and robotics fields, as discussed above, as well as finance (e.g., for trading, energy forecasting, and risk assessment), infrastructure (e.g., structural safety and traffic control), weather forecasting and event detection [37]. The myriad application domains pose new challenges to the efficient processing of DNNs; the solutions then have to be adaptive and scalable in order to handle the new and varied forms of DNNs that these applications may employ.

F. Embedded versus Cloud

The various applications and aspects of DNN processing (i.e., training versus inference) have different computational needs. Specifically, training often requires a large dataset⁵ and significant computational resources for multiple weight-update iterations. In many cases, training a DNN model still takes several hours to multiple days and thus is typically performed in the cloud. Inference, on the other hand, can happen either in the cloud or at the edge (e.g., IoT or mobile).

In many applications, it is desirable to have the DNN inference processing near the sensor. For instance, in computer vision applications such as measuring wait times in stores or predicting traffic patterns, it would be desirable to extract meaningful information from the video right at the image sensor rather than in the cloud, to reduce the communication cost. For other applications, such as autonomous vehicles, drone navigation and robotics, local processing is desired since the latency and security risks of relying on the cloud are too high. However, video involves a large amount of data, which is computationally complex to process; thus, low-cost hardware to analyze video is challenging yet critical to enabling …

⁴ The top-5 error rate is measured based on whether the correct answer appears in one of the top 5 categories selected by the algorithm.
⁵ One of the major drawbacks of DNNs is their need for large datasets to prevent over-fitting during training.
[Fig. 9 (not shown): dimensionality of convolutions; (b) high-dimensional convolutions in CNNs, with input fmaps of size H×W and C channels, filters of size R×S, output fmaps of size E×F, M filters and a batch of N fmaps.]
Fig. 9. Dimensionality of convolutions.

[Fig. 11 (not shown). Traditional non-linear activation functions: sigmoid, y = 1/(1+e^-x), and hyperbolic tangent, y = (e^x − e^-x)/(e^x + e^-x). Modern non-linear activation functions: Rectified Linear Unit (ReLU), y = max(0, x); Leaky ReLU, y = max(αx, x), with α a small constant (e.g., 0.1); and Exponential LU (ELU), y = x for x ≥ 0 and y = α(e^x − 1) for x < 0.]
Fig. 11. Various forms of non-linear activation functions (Figure adopted from Caffe Tutorial [46]).
[Figure residue (not shown): modern deep CNNs have on the order of 5–1000 CONV layers followed by 1–3 FC layers.]

[Fig. 12 (not shown): an example of 2×2 pooling with a stride of 2, comparing max pooling and average pooling over the same feature map.]
Fig. 12. Various forms of pooling (Figure adopted from Caffe Tutorial [46]).

… the network to be robust and invariant to small shifts and distortions. Pooling combines, or pools, a set of values in its receptive field into a smaller number of values. It can be configured based on the size of its receptive field (e.g., 2×2) and the pooling operation (e.g., max or average), as shown in Fig. 12. Typically, pooling occurs on non-overlapping blocks (i.e., the stride is equal to the size of the pooling). Usually a stride of greater than one is used such that there is a reduction in the dimension of the representation (i.e., feature map).
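A minimal sketch of such non-overlapping pooling on a single 2-D feature map (function name hypothetical; pass op=np.mean for average pooling):

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, op=np.max):
    # Non-overlapping pooling when stride == size, as in Fig. 12.
    h_out = (fmap.shape[0] - size) // stride + 1
    w_out = (fmap.shape[1] - size) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # Reduce each receptive field to a single value.
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = op(window)
    return out
```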
3) Normalization: Controlling the input distribution across layers can help to significantly speed up training and improve accuracy. Accordingly, the distribution of the layer input activations (with mean µ and standard deviation σ) is normalized such that it has a zero mean and a unit standard deviation. In batch normalization (BN), the normalized value is further scaled and shifted, as shown in Eq. (2), where the parameters (γ, β) are learned from training [47], and ε is a small constant to avoid numerical problems. Prior to this, local response normalization (LRN) [3] was used, which was inspired by lateral inhibition in neurobiology, where excited neurons (i.e., high-value activations) should subdue their neighbors (i.e., cause low-value activations); however, BN is now considered standard practice in the design of CNNs, while LRN is mostly deprecated. Note that while LRN is usually performed after the non-linear function, BN is mostly performed between the CONV or FC layer and the non-linear function.

y = (x − µ) / √(σ² + ε) · γ + β        (2)
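A minimal sketch of Eq. (2) in NumPy, computing the batch statistics per feature as done during training (at inference, running statistics collected during training would be used instead):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize to zero mean and unit variance
    # (Eq. (2)), then apply the learned scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```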
B. Popular DNN Models

Many DNN models have been developed over the past two decades. Each of these models has a different 'network architecture' in terms of number of layers, layer types, layer shapes (i.e., filter size, number of channels and filters), and connections between layers. Understanding these variations and trends is important for incorporating the right flexibility in any efficient DNN engine.

In this section, we will give an overview of various popular DNNs such as LeNet [48] as well as those that competed in and/or won the ImageNet Challenge [14], as shown in Fig. 7, most of whose models with pre-trained weights are publicly available for download; the DNN models are summarized in Table II. Two results for top-5 error are reported. In the first row, the accuracy is boosted by using multiple crops from the image and an ensemble of multiple trained models (i.e., the DNN needs to be run several times); these results were used to compete in the ImageNet Challenge. The second row reports the accuracy if only a single crop was used (i.e., the DNN is run only once), which is more consistent with what would likely be deployed in real-time and/or energy-constrained applications.

LeNet [11] was one of the first CNN approaches, introduced in 1989. It was designed for the task of digit classification in grayscale images of size 28×28. The most well known version, LeNet-5, contains two CONV layers and two FC layers [48]. Each CONV layer uses filters of size 5×5 (1 channel per filter), with 6 filters in the first layer and 16 filters in the second layer. Average pooling of 2×2 is used after each convolution and a sigmoid is used for the non-linearity. In total, LeNet requires 60k weights and 341k multiply-and-accumulates (MACs) per image. LeNet led to CNNs' first commercial success, as it was deployed in ATMs to recognize digits for check deposits.

AlexNet [3] was the first CNN to win the ImageNet Challenge, in 2012. It consists of five CONV layers and three FC layers. Within each CONV layer, there are 96 to 384 filters and the filter size ranges from 3×3 to 11×11, with 3 to 256 channels each. In the first layer, the 3 channels of the filter correspond to the red, green and blue components of the input image. A ReLU non-linearity is used in each layer. Max pooling of 3×3 is applied to the outputs of layers 1, 2 and 5. To reduce computation, a stride of 4 is used at the first layer of the network. AlexNet introduced the use of LRN in layers 1 and 2 before the max pooling, though LRN is no longer popular in later CNN models. One important factor that differentiates AlexNet from LeNet is that the number of weights is much larger and the shapes vary from layer to layer. To reduce the amount of weights and computation in the second CONV layer, the 96 output channels of the first layer are split into two groups of 48 input channels for the second layer, such that the filters in the second layer only have 48 channels. Similarly, the weights in the fourth and fifth layers are also split into two groups. In total, AlexNet requires 61M weights and 724M MACs to process one 227×227 input image.

Overfeat [49] has a very similar architecture to AlexNet, with five CONV layers and three FC layers. The main differences are that the number of filters is increased for layers 3 (384 to 512), 4 (384 to 1024), and 5 (256 to 1024), layer 2 is not split into two groups, the first fully connected layer only has 3072 channels rather than 4096, and the input size is 231×231 rather than 227×227. As a result, the number of weights grows to 146M and the number of MACs grows to 2.8G per image. Overfeat has two different models: fast (described here) and accurate. The accurate model used in the ImageNet Challenge gives a 0.65% lower top-5 error rate than the fast model at the cost of 1.9× more MACs.

VGG-16 [50] goes deeper to 16 layers, consisting of 13 CONV layers and 3 FC layers. In order to balance out the cost of going deeper, larger filters (e.g., 5×5) are built from multiple smaller filters (e.g., 3×3), which have fewer weights, to achieve the same receptive fields, as shown in Fig. 13(a). As a result, all CONV layers have the same filter size of 3×3. In total, VGG-16 requires 138M weights and 15.5G MACs to process one 224×224 input image. VGG has two different models: VGG-16 (described here) and VGG-19. VGG-19 gives a 0.1% lower top-5 error rate than VGG-16 at the cost of 1.27× more MACs.
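The weight and MAC counts quoted for these models follow directly from the layer shapes. A sketch of the per-layer arithmetic (hypothetical helper; the example plugs in the shape of AlexNet's first CONV layer, 96 filters of 11×11×3 producing a 55×55 output):

```python
def conv_layer_cost(E, F, R, S, C, M):
    # Each of the M filters holds R*S*C weights.
    weights = M * R * S * C
    # Each of the M*E*F output activations needs R*S*C MACs.
    macs = E * F * M * R * S * C
    return weights, macs

print(conv_layer_cost(55, 55, 11, 11, 3, 96))  # ~35k weights, ~105M MACs
```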
[Fig. 13 (not shown): (a) constructing a 5×5 support by applying two smaller filters sequentially; (b) constructing a 5×5 support from 1×5 and 5×1 filters, used in GoogleNet/Inception v3 and v4.]
Fig. 13. Decomposing larger filters into smaller filters.

[Fig. 14 (not shown): an Inception module applying parallel 1×1 (C=64), 3×3 (C=128), 5×5 (C=32) and 1×1 (C=32) CONV branches to a C=256 input and concatenating the output feature maps.]
Fig. 14. Inception module from GoogleNet [51] with example channel lengths. Note that each CONV layer is followed by a ReLU (not drawn).
… that are more discriminative and also provides more levels of hierarchy in the learned representation [15, 50, 51, 55]. The number of filter shapes continues to vary across layers, thus flexibility is still important. Furthermore, most of the computation has been placed on CONV layers rather than FC layers. In addition, the number of weights in the FC layers is reduced, and in most recent networks (since GoogLeNet) the CONV layers also dominate in terms of weights. Thus, the focus of hardware implementations should be on addressing the efficiency of the CONV layers, which in many domains are increasingly important.

IV. DNN DEVELOPMENT RESOURCES

One of the key factors that has enabled the rapid development of DNNs is the set of development resources that have been made available by the research community and industry. These resources are also key to the development of DNN accelerators, by providing characterizations of the workloads and facilitating the exploration of trade-offs in model complexity and accuracy. This section will describe these resources such that those who are interested in this field can quickly get started.

A. Frameworks

For ease of DNN development and to enable sharing of trained networks, several deep learning frameworks have been developed from various sources. These open-source libraries contain software libraries for DNNs. Caffe was made available in 2014 from UC Berkeley [46]. It supports C, C++, Python and MATLAB. Tensorflow was released by Google in 2015, and supports C++ and Python; it also supports multiple CPUs and GPUs and has more flexibility than Caffe, with the computation expressed as dataflow graphs to manage the tensors (multidimensional arrays). Another popular framework is Torch, which was developed by Facebook and NYU and supports C, C++ and Lua. There are several other frameworks such as Theano, MXNet and CNTK, which are described in [60]. There are also higher-level libraries that can run on top of the aforementioned frameworks to provide a more universal experience and faster development. One example of such libraries is Keras, which is written in Python and supports Tensorflow, CNTK and Theano.
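As a quick illustration of this higher-level style, below is a minimal sketch of a small LeNet-like CNN in Keras; the layer names follow the Keras 2 API, and the shapes are illustrative rather than taken from any model in this article:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Two CONV layers with pooling, followed by FC layers, for 28x28 inputs.
model = Sequential([
    Conv2D(6, (5, 5), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(16, (5, 5), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(120, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')
```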
The existence of such frameworks is not only a convenient aid for DNN researchers and application designers; they are also invaluable for engineering high performance or more efficient DNN computation engines. In particular, because the frameworks make heavy use of a set of primitive operations, such as the processing of a CONV layer, they can incorporate the use of optimized software or hardware accelerators. This acceleration is transparent to the user of the framework. Thus, for example, most frameworks can use Nvidia's cuDNN library for rapid execution on Nvidia GPUs. Similarly, transparent incorporation of dedicated hardware accelerators can be achieved, as was done with the Eyeriss chip [61].

Finally, these frameworks are a valuable source of workloads for hardware researchers. They can be used to drive experimental designs for different workloads, for profiling different workloads and for exploring hardware-software trade-offs.

B. Models

Pretrained DNN models can be downloaded from various websites [56–59] for the various different frameworks. It should be noted that even for the same DNN (e.g., AlexNet), the accuracy of these models can vary by around 1% to 2% depending on how the model was trained, and thus the results do not always exactly match the original publication.

C. Popular Datasets for Classification

It is important to factor in the difficulty of the task when comparing different DNN models. For instance, the task of classifying handwritten digits from the MNIST dataset [62] is much simpler than classifying an object into one of 1000 classes, as is required for the ImageNet dataset [14] (Fig. 16). It is expected that the size of the DNNs (i.e., number of weights) and the number of MACs will be larger for the more difficult task than for the simpler task, and thus these DNNs will require more energy and have lower throughput. For instance, LeNet-5 [48] is designed for digit classification, while AlexNet [3], VGG-16 [50], GoogLeNet [51], and ResNet [15] are designed for the 1000-class image classification.

There are many AI tasks that come with publicly available datasets in order to evaluate the accuracy of a given DNN. Public datasets are important for comparing the accuracy of different approaches. The simplest and most common task is image classification, which involves being given an entire image and selecting 1 of N classes that the image most likely belongs to. There is no localization or detection.

MNIST is a widely used dataset for digit classification that was introduced in 1998 [62]. It consists of 28×28 pixel grayscale images of handwritten digits. There are 10 classes (for 10 digits), 60,000 training images and 10,000 test images. LeNet-5 was able to achieve an accuracy of 99.05% when MNIST was first introduced. Since then, the accuracy has increased to 99.79% using regularization of neural networks with dropconnect [63]. Thus, MNIST is now considered a fairly easy dataset.

CIFAR is a dataset that consists of 32×32 pixel colored images of various objects, which was released in 2009 [64]. CIFAR is a subset of the 80 million Tiny Images dataset [65]. CIFAR-10 is composed of 10 mutually exclusive classes. There are 50,000 training images (5000 per class) and 10,000 test images (1000 per class). A two-layer convolutional deep belief network was able to achieve 64.84% accuracy on CIFAR-10 when it was first introduced [66]. Since then, the accuracy has increased to 96.53% using fractional max pooling [67].

ImageNet is a large-scale image dataset that was first introduced in 2010; the dataset stabilized in 2012 [14]. It contains 256×256-pixel color images with 1000 classes. The classes are defined using WordNet as a backbone to handle ambiguous word meanings and to combine synonyms into the same object category. In other words, there is a hierarchy for the ImageNet categories. The 1000 classes were selected such that there is no overlap in the ImageNet hierarchy. The ImageNet dataset contains many fine-grained categories, including 120 different breeds of dogs. There are 1.3M training images (732 to 1300 per class), 100,000 testing images (100 per class) and 50,000 validation images (50 per class).
[Fig. 16 (not shown): sample images from MNIST and ImageNet.]
Fig. 16. MNIST (10 classes, 60k training, 10k testing) [62] vs. ImageNet (1000 classes, 1.3M training, 100k testing) [14] datasets.

The accuracy of the ImageNet Challenge is reported using two metrics: Top-5 and Top-1 error. Top-5 error means that if any of the top five scoring categories is the correct category, it is counted as a correct classification. Top-1 requires that the top scoring category be correct. In 2012, the winner of the ImageNet Challenge (AlexNet) was able to achieve an accuracy of 83.6% for the top-5 (substantially better than the 73.8% of the second-place entry that year, which did not use DNNs); it achieved 61.9% on the top-1 of the validation set. In 2017, the highest accuracy was 97.7% for the top-5.
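A sketch of how the top-k error is computed, assuming a matrix of class scores and integer ground-truth labels (helper name hypothetical):

```python
import numpy as np

def topk_error(scores, labels, k=5):
    # scores: (N, num_classes); labels: (N,). A sample counts as correct
    # if its label is among the k highest-scoring categories.
    topk = np.argsort(-scores, axis=1)[:, :k]
    correct = (topk == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()
```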
In summary of the various image classification datasets, it is clear that MNIST is a fairly easy dataset, while ImageNet is a challenging one with a wider coverage of classes. Thus, in terms of evaluating the accuracy of a given DNN, it is important to consider the dataset upon which the accuracy is measured.

D. Datasets for Other Tasks

Since state-of-the-art DNNs perform better than human-level accuracy on image classification tasks, the ImageNet Challenge has started to focus on more difficult tasks such as single-object localization and object detection. For single-object localization, the target object must be localized and classified (out of 1000 classes). The DNN outputs the top five categories and top five bounding box locations. There is no penalty for identifying an object that is in the image but not included in the ground truth. For object detection, all objects in the image must be localized and classified (out of 200 classes). The bounding box for all objects in these categories must be labeled. Objects that are not labeled are penalized, as are duplicated detections.

Beyond ImageNet, there are also other popular image datasets for computer vision tasks. For object detection, there is the PASCAL VOC (2005-2012) dataset, which contains 11k images representing 20 classes (27k object instances, 7k of which have detailed segmentation) [68]. For object detection, segmentation and recognition in context, there is the MS COCO dataset with 2.5M labeled instances in 328k images (91 object categories) [69]; compared to ImageNet, COCO has fewer categories but more instances per category, which is useful for precise 2-D localization. COCO also has more labeled instances per image to potentially help with contextual information.

Most recently, even larger-scale datasets have been made available. For instance, Google has an Open Images dataset with over 9M images [70], spanning 6000 categories. There is also a YouTube dataset with 8M videos (0.5M hours of video) covering 4800 classes [71]. Google also released an audio dataset comprised of 632 audio event classes and a collection of 2M human-labeled 10-second sound clips [72]. These large datasets will be ever more important as DNNs become deeper with more weight parameters to train.

Undoubtedly, both larger datasets and datasets for new domains will serve as important resources for profiling and exploring the efficiency of future DNN engines.

V. HARDWARE FOR DNN PROCESSING

Due to the popularity of DNNs, many recent hardware platforms have special features that target DNN processing.
For instance, the Intel Knights Landing CPU features special vector instructions for deep learning; the Nvidia PASCAL GP100 GPU features 16-bit floating point (FP16) arithmetic support to perform two FP16 operations on a single precision core for faster deep learning computation. Systems have also been built specifically for DNN processing, such as the Nvidia DGX-1 and Facebook's Big Basin custom DNN server [73]. DNN inference has also been demonstrated on various embedded System-on-Chips (SoC) such as Nvidia Tegra and Samsung Exynos, as well as FPGAs. Accordingly, it is important to have a good understanding of how the processing is being performed on these platforms, and how application-specific accelerators can be designed for DNNs for further improvement in throughput and energy efficiency.

[Fig. 17 (not shown): temporal architectures (SIMD/SIMT), with a shared memory hierarchy, register file and centralized control over an array of ALUs, versus spatial architectures (dataflow processing), in which ALUs pass data directly to one another.]
Fig. 17. Highly-parallel compute paradigms.
The fundamental component of both the CONV and FC layers is the multiply-and-accumulate (MAC) operation, which can be easily parallelized. In order to achieve high performance, highly parallel compute paradigms are very commonly used, including both temporal and spatial architectures, as shown in Fig. 17. The temporal architectures appear mostly in CPUs or GPUs, and employ a variety of techniques to improve parallelism, such as vectors (SIMD) or parallel threads (SIMT). Such temporal architectures use centralized control for a large number of ALUs. These ALUs can only fetch data from the memory hierarchy and cannot communicate directly with each other. In contrast, spatial architectures use dataflow processing, i.e., the ALUs form a processing chain so that they can pass data from one to another directly. Sometimes each ALU can have its own control logic and local memory, called a scratchpad or register file. We refer to an ALU with its own local memory as a processing engine (PE). Spatial architectures are commonly used for DNNs in ASIC and FPGA-based designs. In this section, we will discuss the different design strategies for efficient processing on these different platforms, without any impact on accuracy (i.e., all approaches in this section produce bit-wise identical results); specifically,

• For temporal architectures such as CPUs and GPUs, we will discuss how computational transforms on the kernel can reduce the number of multiplications to increase throughput.

• For spatial architectures used in accelerators, we will discuss how dataflows can increase data reuse from low-cost memories in the memory hierarchy to reduce energy consumption.
A. Accelerate Kernel Computation on CPU and GPU Platforms

CPUs and GPUs use parallelization techniques such as SIMD or SIMT to perform the MACs in parallel. All the ALUs share the same control and memory (register file). On these platforms, both the FC and CONV layers are often mapped to a matrix multiplication (i.e., the kernel computation). Fig. 18 shows how a matrix multiplication is used for the FC layer. The height of the filter matrix is the number of filters and the width is the number of weights per filter (input channels (C) × width (W) × height (H), since R = W and S = H in the FC layer); the height of the input feature map matrix is the number of activations per input feature map (C × W × H), and the width is the number of input feature maps (one in Fig. 18(a) and N in Fig. 18(b)); finally, the height of the output feature map matrix is the number of channels in the output feature maps (M), and the width is the number of output feature maps (N), where each output feature map of the FC layer has the dimension of 1×1×number of output channels (M).

[Fig. 18 (not shown): (a) matrix-vector multiplication is used when computing a single output feature map from a single input feature map; (b) matrix multiplication is used when computing N output feature maps from N input feature maps.]
Fig. 18. Mapping to matrix multiplication for fully connected layers.

The CONV layer in a DNN can also be mapped to a matrix multiplication using a relaxed form of the Toeplitz matrix, as shown in Fig. 19. The downside of using matrix multiplication for the CONV layers is that there is redundant data in the input feature map matrix, as highlighted in Fig. 19(a). This can lead to either inefficiency in storage or a complex memory access pattern.

There are software libraries designed for CPUs (e.g., OpenBLAS, Intel MKL, etc.) and GPUs (e.g., cuBLAS, cuDNN, etc.) that optimize for matrix multiplications. The matrix multiplication is tiled to the storage hierarchy of these platforms, which are on the order of a few megabytes at the higher levels.
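A minimal NumPy sketch of this mapping for a CONV layer, assuming stride 1 and no padding; the cols matrix is the relaxed Toeplitz form and materializes the redundant input data noted above (helper name hypothetical):

```python
import numpy as np

def conv2d_as_matmul(fmap, filters):
    # fmap: (C, H, W) input; filters: (M, C, R, S).
    M, C, R, S = filters.shape
    _, H, W = fmap.shape
    E, F = H - R + 1, W - S + 1
    # Unroll each receptive field into one column (im2col); overlapping
    # windows duplicate input values, hence the storage inefficiency.
    cols = np.empty((C * R * S, E * F))
    for i in range(E):
        for j in range(F):
            cols[:, i * F + j] = fmap[:, i:i+R, j:j+S].ravel()
    # One matrix multiplication produces all M output feature maps.
    out = filters.reshape(M, -1) @ cols
    return out.reshape(M, E, F)
```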
[Fig. 19 (not shown), (a) mapping convolution to a Toeplitz matrix (with redundant data): the filter is flattened into a row vector (e.g., [1 2 3 4]) and multiplied with a matrix of overlapping input patches (e.g., rows [1 2 4 5], [2 3 5 6], [4 5 7 8], [5 6 8 9]).]

[Fig. 20 (not shown): FFT-based acceleration computes the convolution as FFT(W) × FFT(I) = FFT(O).]
Fig. 20. FFT to accelerate DNN.
[Figure residue (not shown): normalized energy cost per access, with an ALU operation as the 1× reference and a 0.5–1.0 kB RF access on the order of 1×.]

… decides what data gets read into which level of the memory hierarchy and when they get processed. Since there is no randomness in the processing of DNNs, it is possible to design a fixed dataflow that can adapt to the DNN shapes and sizes and optimize for the best energy efficiency. The optimized dataflow minimizes access from the more energy-consuming levels of the memory hierarchy. Large memories that can store a significant amount of data consume more energy than smaller memories. For instance, DRAM can store gigabytes of data, but consumes two orders of magnitude higher energy per access than a small on-chip memory of a few kilobytes. Thus, every time a piece of data is moved from an expensive level to a lower-cost level in terms of energy, we want to reuse that piece of data as much as possible to minimize subsequent accesses to the expensive levels. The challenge, however, is that the storage capacity of these low-cost memories is limited. Thus we need to explore different dataflows that maximize reuse under these constraints.

For DNNs, we investigate dataflows that exploit three forms of input data reuse (convolutional, feature map and filter), as shown in Fig. 23. For convolutional reuse, the same input feature map activations and filter weights are used within a given channel, just in different combinations for different weighted sums. For feature map reuse, multiple filters are applied to the same feature map, so the input feature map activations are used multiple times across filters. Finally, for filter reuse, when multiple input feature maps are processed at once (referred to as a batch), the same filter weights are used multiple times across input feature maps.

If we can harness the three types of data reuse by storing the data in the local memory hierarchy and accessing them multiple times without going back to the DRAM, it can save a significant number of DRAM accesses. For example, in AlexNet, the number of DRAM reads can be reduced by up to 500× in the CONV layers. The local memory can also be used for partial sum accumulation, so the partial sums do not have to reach DRAM. In the best case, if all data reuse and accumulation can be achieved by the local memory hierarchy, the 3000M DRAM accesses in AlexNet can be reduced to only 61M.
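To make the three forms of reuse concrete, the following naive CONV loop nest (a sketch, not the actual dataflow of any accelerator) marks where each form arises; the shape names follow Fig. 9(b):

```python
import numpy as np

def conv_loop_nest(fmap, filt):
    # fmap: (N, C, H, W); filt: (M, C, R, S). Stride 1, no padding.
    N, C, H, W = fmap.shape
    M, _, R, S = filt.shape
    E, F = H - R + 1, W - S + 1
    out = np.zeros((N, M, E, F))
    for n in range(N):              # filter reuse: same weights across the batch
        for m in range(M):          # fmap reuse: same activations across M filters
            for c in range(C):
                for e in range(E):      # convolutional reuse: activations and
                    for f in range(F):  # weights reused across sliding windows
                        for r in range(R):
                            for s in range(S):
                                out[n, m, e, f] += (fmap[n, c, e + r, f + s]
                                                    * filt[m, c, r, s])
    return out
```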
The operation of DNN accelerators is analogous to that of general-purpose processors, as illustrated in Fig. 24 [81]. In conventional computer systems, the compiler translates the program into machine-readable binary codes for execution given the hardware architecture (e.g., x86 or ARM); in the processing of DNNs, the mapper translates the DNN shape and size into a hardware-compatible computation mapping for execution given the dataflow. While the compiler usually optimizes for performance, the mapper optimizes for energy efficiency.

[Fig. 24 (not shown): the DNN shape and size (the program) and the dataflow (the architecture) are translated by a mapper (the compiler), together with implementation details (the µArch), into a mapping (the binary) that the DNN accelerator (the processor) executes on input data.]
Fig. 24. An analogy between the operation of DNN accelerators (texts in black) and that of general-purpose processors (texts in red). Figure adopted from [81].

The following taxonomy (Fig. 25) can be used to classify the DNN dataflows in recent works [82–93] based on their data handling characteristics [80]:

1) Weight stationary (WS): The weight stationary dataflow is designed to minimize the energy consumption of reading weights by maximizing the accesses of weights from the register file (RF) at the PE (Fig. 25(a)). Each weight is read from DRAM into the RF of each PE and stays stationary for further accesses. The processing runs as many MACs that use the same weight as possible while the weight is present in the RF; it maximizes convolutional and filter reuse of weights. The inputs and partial sums must move through the spatial array and global buffer. The input fmap activations are broadcast to all PEs and then the partial sums are spatially accumulated across the PE array.

One example of previous work that implements a weight stationary dataflow is nn-X, or neuFlow [85], which uses eight 2-D convolution engines for processing a 10×10 filter. There are a total of 100 MAC units, i.e., PEs, per engine, with each PE having a weight that stays stationary for processing. The
input fmap activations are broadcast to all MAC units and the partial sums are accumulated across the MAC units. In order to accumulate the partial sums correctly, additional delay storage elements are required, which are counted into the required size of local storage. Other weight stationary examples are found in [82–84, 86, 87].

[Fig. 25 (not shown): dataflow taxonomy, with (a) weight stationary, where weights W0–W7 stay pinned in the PEs while activations and partial sums move through the array and global buffer; (b) output stationary; and (c) no local reuse, where the global buffer holds weights, activations and partial sums.]
Fig. 25. Dataflows for DNNs [80].
2) Output stationary (OS): The output stationary dataflow is designed to minimize the energy consumption of reading and writing the partial sums (Fig. 25(b)). It keeps the accumulation of partial sums for the same output activation value local in the RF. In order to keep the accumulation of partial sums stationary in the RF, one common implementation is to stream the input activations across the PE array and broadcast the weight to all PEs in the array.

One example that implements the output stationary dataflow is ShiDianNao [89], where each PE handles the processing for each output activation value by fetching the corresponding input activations from neighboring PEs. The PE array implements dedicated networks to pass data horizontally and vertically. Each PE also has data delay registers to keep data around for the required number of cycles. At the system level, the global buffer streams the input activations and broadcasts the weights into the PE array. The partial sums are accumulated inside each PE and then get streamed out back to the global buffer. Other examples of output stationary are found in [88, 90].

There are multiple possible variants of output stationary, as shown in Fig. 26, since the output activations that get processed at the same time can come from different dimensions. For example, the variant OSA targets the processing of CONV layers, and therefore focuses on the processing of output activations from the same channel at a time in order to maximize data reuse opportunities. The variant OSC targets the processing of FC layers, and focuses on generating output activations from all different channels, since each channel only has one output activation. The variant OSB is something in between OSA and OSC. Examples of variants OSA, OSB, and OSC are [89], [88], and [90], respectively.

3) No local reuse (NLR): While small register files are efficient in terms of energy (pJ/bit), they are inefficient in terms of area (µm²/bit). In order to maximize the storage capacity and minimize the off-chip memory bandwidth, no local storage is allocated to the PE; instead, all of that area is allocated to the global buffer to increase its capacity (Fig. 25(c)). The no local reuse dataflow differs from the previous dataflows in that nothing stays stationary inside the PE array. As a result, there will be increased traffic on the spatial array and to the global buffer for all data types. Specifically, it has to multicast the activations, single-cast the filter weights, and then spatially accumulate the partial sums across the PE array.

In an example of the no local reuse dataflow from UCLA [91], the filter weights and input activations are read from the global buffer, processed by the MAC units with custom adder trees that can complete the accumulation in a single cycle, and the resulting partial sums or output activations are then put back into the global buffer. Another example is DianNao [92], which also reads input activations and filter weights from the buffer, and processes them through the MAC units with custom adder trees. However, DianNao implements specialized registers to keep the partial sums in the PE array, which helps to further reduce the energy consumption of accessing partial sums. Another example of the no local reuse dataflow is found in [93].

4) Row stationary (RS): A row stationary dataflow is proposed in [80], which aims to maximize the reuse and accumulation at the RF level for all types of data (weights, pixels, partial sums) for the overall energy efficiency. This differs from the WS or OS dataflows, which optimize only for weights or partial sums, respectively.

The row stationary dataflow assigns the processing of a 1-D row convolution to each PE, as shown in Fig. 27. It keeps the row of filter weights stationary inside the RF of the PE and then streams the input activations into the PE. The PE does the MACs for each sliding window at a time, using just one memory space for the accumulation of partial sums. Since there are overlaps of input activations between different sliding windows, the input activations can be kept in the RF and reused. By going through all the sliding windows in the row, it completes the 1-D convolution and maximizes the data reuse and local accumulation of data in this row.
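A sketch of the 1-D row convolution assigned to one PE (function name hypothetical); the filter row would stay resident in the RF while the activations stream through overlapping windows:

```python
def pe_1d_row_conv(act_row, filt_row):
    # One accumulator slot per sliding window collects the partial sums;
    # overlapping activations between windows are reused from the RF.
    S = len(filt_row)
    psums = [0] * (len(act_row) - S + 1)
    for window in range(len(psums)):
        for s in range(S):
            psums[window] += act_row[window + s] * filt_row[s]
    return psums

# e.g., pe_1d_row_conv([1, 2, 3, 4, 5], [1, 0, -1]) -> [-2, -2, -2]
```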
[Figs. 27–29 (not shown): each PE computes a 1-D row convolution (e.g., Row 1 * Row 1, Row 2 * Row 2, ...); columns of PEs are aggregated to form 2-D convolutions; and multiple fmaps, filters, and channels are interleaved or concatenated onto the same PEs.]

With each PE processing a 1-D convolution, multiple PEs can be aggregated to complete the 2-D convolution, as shown in Fig. 28. For example, to generate the first row of output activations with a filter having three rows, three 1-D convolutions are required. Therefore, we can use three PEs in a column, each running one of the three 1-D convolutions. The partial sums are further accumulated vertically across the three PEs to generate the first output row. To generate the second row of output, we use another column of PEs, where the three rows of input activations are shifted down by one row, and use the same rows of filters to perform the three 1-D convolutions. Additional columns of PEs are added until all rows of the output are completed (i.e., the number of PE columns equals the number of output rows).

This 2-D array of PEs enables other forms of reuse to reduce accesses to the more expensive global buffer. For example, each filter row is reused across multiple PEs horizontally. Each row of input activations is reused across multiple PEs diagonally. And each row of partial sums is further accumulated across the PEs vertically. Therefore, 2-D convolutional data reuse and accumulation are maximized inside the 2-D PE array.

To address the high-dimensional convolution of the CONV layer (i.e., multiple fmaps, filters, and channels), multiple rows can be mapped onto the same PE, as shown in Fig. 29. The 2-D convolution is mapped to a set of PEs, and the additional dimensions are handled by interleaving or concatenating the additional data. For filter reuse within the PE, different rows of fmaps are concatenated and run through the same PE as a 1-D convolution. For input fmap reuse within the PE, different filter rows are interleaved and run through the same PE as a 1-D convolution. Finally, to increase local partial sum accumulation within the PE, filter rows and fmap rows from different channels are interleaved and run through the same PE as a 1-D convolution. The partial sums from different channels then naturally get accumulated inside the PE.

The number of filters, channels, and fmaps that can be processed at the same time is programmable, and there exists an optimal mapping for the best energy efficiency, which depends on the shape configuration of the DNN as well as the hardware resources provided, e.g., the number of PEs and the size of the memory in the hierarchy. Since all of the variables are known before runtime, it is possible to build a compiler (i.e., mapper) to perform this optimization off-line to configure the hardware for different mappings of the RS dataflow for different DNNs, as shown in Fig. 30.

[Fig. 30 (not shown): an optimization compiler takes the CNN shape configuration (# of channels, # of filters, etc.) and the hardware resources to produce the mapping.]
Fig. 30. Mapping optimization takes in hardware and DNN shape constraints to determine the optimal energy dataflow [80].

One example that implements the row stationary dataflow is Eyeriss [94]. It consists of a 14×12 PE array, a 108KB global buffer, and ReLU and fmap compression units, as shown in Fig. 31. The chip communicates with the off-chip DRAM using a 64-bit bidirectional data bus to fetch data into the global buffer. The global buffer then streams the data into the PE array for processing.

In order to support the RS dataflow, two problems need to be solved in the hardware design. First, how can the fixed-size PE array accommodate different layer shapes? Second, although the data will be passed in a very specific pattern, it still changes with different shape configurations. How can the fixed design …
[Fig. 31 (not shown): Eyeriss block diagram, with top-level control and a configuration scan chain, a 12×14 PE array with per-PE filter/ifmap/psum spads and MAC control, a 108KB global buffer, RLC decoder/encoder and ReLU units, and a 64-bit off-chip DRAM interface.]
Fig. 31. Eyeriss DNN accelerator [94].

… needs of each dataflow under the same area constraint. For example, since the no local reuse dataflow does not require any RF in the PE, it is allocated a much larger global buffer. The simulation uses the layer configurations from AlexNet with a batch size of 16. The simulation also takes into account the fact that accessing different levels of the memory hierarchy requires different energy cost.
Fig. 33 compares the chip and DRAM energy consumption of each dataflow for the CONV layers of AlexNet with a batch size of 16. The WS and OS dataflows have the lowest energy consumption for accessing weights and partial sums, respectively. However, the RS dataflow has the lowest total energy consumption, since it optimizes for the overall energy efficiency instead of only for a certain data type.

[Figure (not shown): replication (e.g., AlexNet layers 3–5, fmap width 13) and folding (e.g., AlexNet layer 2, fmap width 27) used to map different layer shapes onto the fixed PE array.]

[Fig. 33 (not shown): normalized energy/MAC for the WS, OSA, OSB, OSC, NLR and RS dataflows, with (b) broken down by data type into pixels, weights and psums.]
Fig. 33. Comparison of energy efficiency between different dataflows in the CONV layers of AlexNet with a batch size of 16 [3]: (a) breakdown in terms of storage levels and ALU, (b) breakdown in terms of data types. OSA, OSB and OSC are three variants of the OS dataflow that are commonly seen in different implementations [80].
[Fig. 34 (not shown): normalized energy/MAC for the WS, OSA, OSB, OSC, NLR and RS dataflows, broken down into pixels, weights and psums.]
Fig. 34. Comparison of energy efficiency between different dataflows in the FC layers of AlexNet with a batch size of 16 [80].
VI. NEAR-DATA PROCESSING

The previous section highlighted that data movement dominates energy consumption. While spatial architectures distribute the on-chip memory such that it is closer to the computation (e.g., into the PE), there have also been efforts to bring the off-chip high-density memory closer to the computation, or to integrate the computation into the memory itself; the latter is often referred to as processing-in-memory or logic-in-memory. In embedded systems, there have also been efforts to bring the computation into the sensor, where the data is first collected.

In this section, we will discuss how moving compute and data closer to reduce data movement (i.e., near-data processing) can be achieved using mixed-signal circuit design and advanced memory technologies.

Many of these works use analog processing, which has the drawback of increased sensitivity to circuit and device non-idealities. Consequently, the computation is often performed at reduced precision, which can be accounted for during the training of the DNNs using the techniques discussed in Section VII. Another factor to take into consideration is that DNNs are often trained in the digital domain; thus, for analog processing, there is an additional overhead cost for analog-to-digital conversion (ADC) and digital-to-analog conversion (DAC).

A. DRAM

Advanced memory technology can reduce the access energy for high-density memories such as DRAMs. For instance, embedded DRAM (eDRAM) brings high-density memory on-chip to avoid the high energy cost of switching off-chip capacitance [97]; eDRAM is 2.85× higher density than SRAM and 321× more energy efficient than DRAM (DDR3) [93]. eDRAM also offers higher bandwidth and lower latency compared to DRAM. In DNN processing, eDRAM can be used to store tens of megabytes of weights and activations on-chip to avoid off-chip access, as demonstrated in DaDianNao [93]. The downside of eDRAM is that it has lower density than off-chip DRAM and can increase the cost of the chip.

Rather than integrating DRAM into the chip itself, the DRAM can also be stacked on top of the chip using through-silicon vias (TSV). This technology is often referred to as 3-D memory, and has been commercialized in the form of the Hybrid Memory Cube (HMC) [98] and High Bandwidth Memory (HBM) [99]. 3-D memory delivers an order of magnitude higher bandwidth and reduces access energy by up to 5× relative to existing 2-D DRAMs, as TSVs have lower capacitance than typical off-chip interconnects. Recent works have explored the use of HMC for efficient DNN processing in a variety of ways. For instance, Neurocube [100] integrates SIMD processors into the logic die of the HMC to bring the memory and computation …
[Figure residue (not shown): ideal transfer curve, and the standard deviation of ΔVBL (V) from Monte Carlo simulation.]

… G2 is embedded within memory, which reduces data movement, and increased density since memory and computation can be …
[Fig. 42 (not shown): the CNN shape configuration (# of channels, # of filters, etc.) and the hardware energy costs of each MAC and memory access feed an optimization of the number of accesses at each memory level, producing the data energy E_data.]
Fig. 42. Energy estimation methodology from [142], which estimates the energy based on data movement from different levels of the memory hierarchy, number of MACs, and data sparsity.

[Fig. 43 (not shown): (a) scatter of top-5 accuracy (77–93%) versus normalized energy consumption (5E+08 to 5E+10) for AlexNet, SqueezeNet, GoogLeNet, VGG-16 and ResNet-50; (b) the same plot comparing the original DNNs against magnitude-based and energy-aware pruning, with a 1.74× gap marked for AlexNet.]
Fig. 43. Energy values estimated with methodology in [142]: (a) energy versus accuracy trade-off of popular DNN models; (b) impact of energy-aware pruning.

[Fig. 44 (not shown): sparse matrix-vector multiplication using (a) the compressed sparse row (CSR) format and (b) the compressed sparse column (CSC) format.]
Fig. 44. Sparse matrix-vector multiplications using different storage formats (Figure from [144]).
… can then be used to prune weights based on energy to reduce the overall energy across all layers by 3.7× for AlexNet, which is 1.74× more efficient than magnitude-based approaches [141], as shown in Fig. 43(b). As mentioned previously, it is well known that AlexNet is over-parameterized. The energy-aware pruning can also be applied to GoogLeNet, which is already a small DNN model, for a 1.6× energy reduction.

Recent works have examined how to efficiently support the processing of sparse weights in hardware. One area of interest is how to best store the sparse weights after pruning. Similar to compressing the sparse activations discussed in Section VII-B1, the sparse weights can be compressed to reduce memory access bandwidth by 20 to 30% [118].

When DNN processing is performed as a matrix-vector multiplication, as shown in Fig. 18(a), one challenge is to determine how to store the sparse weight matrix in a compressed format. The compression can be applied either in row or column order. A compressed sparse row (CSR) format, as shown in Fig. 44(a), is often used to perform sparse matrix-vector multiplication. However, the input vector needs to be read in multiple times even though only a subset of it is used, since each row of the matrix is sparse. Alternatively, a compressed sparse column (CSC) format, as shown in Fig. 44(b), can be used, where the output is updated several times, and only one element of the input vector is read at a time [144]. The CSC format will provide an overall lower memory bandwidth than CSR if the output is smaller than the input, or, in the case of DNNs, if the number of filters is not significantly larger than the number of weights in the filter (C × R × S from Fig. 9(b)). Since this is often true, CSC can be an effective format for sparse DNN processing.
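A sketch of a CSC-based sparse matrix-vector product, assuming the standard (values, row indices, column pointers) arrays; note how work is skipped whenever an input element is zero:

```python
import numpy as np

def csc_matvec(values, row_idx, col_ptr, x, m):
    # y = W @ x, with W (m x n) stored in CSC form. Each input element
    # x[j] is read exactly once, and the output is updated several times.
    y = np.zeros(m)
    for j, xj in enumerate(x):
        if xj == 0:
            continue  # exploit input sparsity: skip the whole column
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * xj
    return y
```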
Custom hardware has been explored to efficiently support pruned DNN models. Many works aim to perform the processing without decompressing the weights or activations. EIE [145] performs the sparse matrix-vector multiplication specifically for the fully connected layers. It stores the weights in a CSC format along with the start location of each column, which needs to be stored since the compressed weights have variable length. When the input is not zero, the compressed weight column is read and the output is updated. To handle the sparsity, additional logic is used to keep track of the location of the output that should be updated. SCNN [146] supports processing of convolutional
layers in a compressed format. It uses an input stationary dataflow to deliver the compressed weights and activations to a multiplier array, followed by a scatter network to add the scattered partial sums.

Recent works have also explored the use of structured pruning to avoid the need for custom hardware [147, 148]. Rather than pruning individual weights (also referred to as fine-grained pruning), structured pruning involves pruning groups of weights (also referred to as coarse-grained pruning). The benefits of structured pruning are (1) the resulting weights can better align with the data-parallel architecture (e.g., SIMD) found in existing general-purpose hardware, which results in more efficient processing [149]; and (2) it amortizes the overhead cost required to signal the location of the non-zero weights across a group of weights, which improves compression and thus reduces storage cost. These groups of weights can include a pair of neighboring weights, an entire row or column of a filter, an entire channel of a filter, or the entire filter itself; using larger groups tends to result in a higher loss in accuracy [150].
3) Compact Network Architectures: The number of weights and operations can also be reduced by improving the network architecture itself. The trend is to replace a large filter with a series of smaller filters, which have fewer weights in total; when the filters are applied sequentially, they achieve the same overall effective receptive field (i.e., the region of the input image the filter uses to compute an output). This approach can be applied during the network architecture design (before training) or by decomposing the filters of a trained network (after training). The latter avoids the hassle of training networks from scratch, but is less flexible than the former. For example, existing methods can only decompose a filter in a trained network into a series of filters without non-linearity between them.
a) Before Training: In recent DNN models, filters with a smaller width and height are used more frequently because concatenating several of them can emulate a larger filter, as shown in Fig. 13. For example, one 5×5 convolution can be replaced with two 3×3 convolutions. Alternatively, one N×N convolution can be decomposed into two 1-D convolutions, one 1×N and one N×1 convolution [53]; this basically imposes a restriction that the 2-D filter must be separable, which is a common constraint in image processing [151]. Similarly, a 3-D convolution can be replaced by a set of 2-D convolutions (i.e., each applied to only one of the input channels) followed by 1×1 3-D convolutions, as demonstrated in Xception [152] and MobileNets [153]. The order of the 2-D convolutions and 1×1 3-D convolutions can be switched.
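A quick arithmetic sketch of the weight savings from these decompositions (function and counts illustrative; per output channel and ignoring biases):

```python
def decomposition_savings(N, C):
    # Weights for one N x N x C filter versus the decompositions above.
    full = N * N * C
    two_3x3 = 2 * 3 * 3 * C   # only matches a 5x5 receptive field (N == 5)
    separable_1d = 2 * N * C  # 1xN followed by Nx1
    return full, two_3x3, separable_1d

# e.g., for a 5x5 filter with C = 64: 1600 vs. 1152 vs. 640 weights.
print(decomposition_savings(5, 64))
```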
1×1 convolutional layers can also be used to reduce the number of channels in the output feature map for a given layer, which reduces the number of filter channels, and thus the computation cost, for the filters in the next layer, as demonstrated in [15, 51, 52]; this is often referred to as a 'bottleneck', as discussed in Section III-B. For this purpose, the number of 1×1 filters has to be less than the number of channels in the 1×1 filter. For example, 32 filters of 1×1×64 can transform an input with 64 channels to an output of 32 channels and reduce the number of filter channels in the next layer to 32. SqueezeNet uses many 1×1 filters to aggressively reduce the number of weights [154]. It proposes a fire module that first 'squeezes' the network with 1×1 convolution filters and then expands it with multiple 1×1 and 3×3 convolution filters. It achieves an overall 50× reduction in the number of weights compared to AlexNet, while maintaining the same accuracy. It should be noted, however, that reducing the number of weights does not necessarily reduce energy; for instance, SqueezeNet consumes more energy than AlexNet, as shown in Fig. 43(a).

b) After Training: Tensor decomposition can be used to decompose filters in a trained network without impacting the accuracy. It treats the weights in a layer as a 4-D tensor and breaks it into a combination of smaller tensors (i.e., several layers). Low-rank approximation can then be applied to further increase the compression rate at the cost of accuracy degradation, which can be restored by fine-tuning the weights.

This approach is demonstrated using Canonical Polyadic (CP) decomposition, a high-order extension of singular value decomposition that can be solved by various methods, such as a greedy algorithm [155] or a non-linear least-squares method [156]. Combining CP-decomposition with low-rank approximation achieves a 4.5× speed-up on CPUs [156]. However, CP-decomposition cannot be computed in a numerically stable way when the dimension of the tensor, which represents the weights, is larger than two [156]. To alleviate this problem, Tucker decomposition is adopted instead in [157].

4) Knowledge Distillation: Using a deep network or averaging the predictions of different models (i.e., an ensemble) gives better accuracy than using a single shallower network. However, the computational complexity is also higher. To get the best of both worlds, knowledge distillation transfers the knowledge learned by the complex model (teacher) to the simpler model (student). The student network can therefore achieve an accuracy that would be unachievable if it were directly trained with the same dataset [158, 159]. For example, [160] shows how using knowledge distillation can improve the speech recognition accuracy of a student net by 2%, which is similar to the accuracy of a teacher net that is composed of an ensemble of 10 networks.

Fig. 45 shows the simplest knowledge distillation method [158]. The softmax layer is commonly used as the output layer in image classification networks to generate the class probabilities from the class scores¹²; it squashes the class scores into values between 0 and 1 that sum up to 1. For this knowledge distillation method, soft targets (values between 0 and 1) such as the class scores of the teacher DNN (or an ensemble of teacher DNNs) are used instead of the hard targets (values of either 0 or 1) such as the labels in the dataset. The objective is to minimize the squared difference between the soft targets and the class scores of the student DNN. Class scores are used as the soft targets instead of the class probabilities because small values in the class scores contain important information that may be eliminated by the softmax. Alternatively, class probabilities after the softmax layer can be used as soft targets if the softmax is configured to generate softer class probabilities, where the smaller values retain more information [160].

¹² Also commonly referred to as logits.
Finally, the intermediate representations of the teacher DNN can also be incorporated as extra hints to train the student DNN [161].

[Fig. 45 (not shown): the class scores of a simple (student) DNN are trained to match the combined class scores of complex teacher DNNs A and B, each followed by a softmax that produces class probabilities.]
Fig. 45. Knowledge distillation matches the class scores of a small DNN to an ensemble of large DNNs.
simultaneously can be a challenge.
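A minimal sketch of the two target choices described above, written by us in PyTorch rather than taken from [158, 160]: the first variant minimizes the squared difference between teacher and student class scores (logits), and the second matches softened class probabilities using a temperature T, where T > 1 yields the 'softer' softmax mentioned above.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=1.0, match_scores=True):
    if match_scores:
        # variant 1: squared difference between the soft targets (teacher
        # class scores) and the class scores of the student DNN
        return F.mse_loss(student_logits, teacher_logits)
    # variant 2: match softened class probabilities; dividing by T > 1
    # preserves more of the information in the small class scores
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)

# example: a batch of 8 samples over 10 classes
s, t = torch.randn(8, 10), torch.randn(8, 10)
print(distillation_loss(s, t))
print(distillation_loss(s, t, T=4.0, match_scores=False))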
VIII. BENCHMARKING METRICS FOR DNN EVALUATION AND COMPARISON

As we have seen in this article, there has been a significant amount of research on the efficient processing of DNNs. We should consider several key metrics to compare the various strengths and weaknesses of different designs and proposed techniques. These metrics should cover important attributes such as accuracy/robustness, power/energy consumption, throughput/latency, and cost. Reporting all of these metrics is important in order to provide a complete picture of the trade-offs made by a proposed design or technique. We have prepared a website to collect these metrics from various publications [162].
In terms of accuracy and robustness, it is important that the accuracy be reported on widely-accepted datasets, as discussed in Section IV. The difficulty of the dataset and/or task should be considered when measuring the accuracy. For instance, the MNIST dataset for digit recognition is significantly easier than the ImageNet dataset; as a result, a DNN that performs well on MNIST may not necessarily perform well on ImageNet. Thus it is important that the same dataset and task are used when comparing the accuracy of different DNN models. Currently, ImageNet is preferred since it presents a challenge for DNNs, whereas MNIST can also be addressed with simple non-DNN techniques. To demonstrate primarily hardware innovations, it would be desirable to report results for widely-used DNN models (e.g., AlexNet, GoogLeNet) whose accuracy and robustness have been well studied and tested.
Energy and power are important when processing DNNs at the edge in embedded devices with limited battery capacity (e.g., smart phones, smart sensors, UAVs, and wearables), or in the cloud in data centers with stringent power ceilings due to cooling costs. Edge processing is preferred over the cloud for certain applications due to latency, privacy, or communication bandwidth limitations. When evaluating power and energy consumption, it is important to account for all aspects of the system, including the chip and external memory accesses.
High throughput is necessary to deliver real-time performance for interactive applications such as navigation and robotics. For data analytics, high throughput means that more data can be analyzed in a given amount of time. As the amount of visual data grows exponentially, high-throughput big data analytics becomes important, particularly if an action needs to be taken based on the analysis (e.g., security or terrorist prevention; medical diagnosis).

Low latency is necessary for real-time interactive applications. Latency measures the time between when a pixel arrives at the system and when the result is generated. Latency is measured in seconds, while throughput is measured in operations/second. High throughput is often obtained by batching multiple images/frames together for processing; this results in a latency of multiple frames (e.g., at 30 frames per second, a batch of 100 frames results in a delay of over 3 seconds). This delay is not acceptable for real-time applications such as high-speed navigation, where it would reduce the time available for course correction. Thus, achieving low latency and high throughput simultaneously can be a challenge.
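The batching trade-off can be made explicit with a small calculation (an illustrative sketch; the 30 fps rate and 100-frame batch are just the example above):

def batch_delay_s(batch_size, frame_rate_fps):
    # frames arrive at frame_rate_fps, so the last frame of a batch arrives
    # batch_size / frame_rate_fps seconds after the first; this is a lower
    # bound on the latency of the first result, before any compute time
    return batch_size / frame_rate_fps

print(batch_delay_s(batch_size=100, frame_rate_fps=30))  # ~3.3 s before any result
print(batch_delay_s(batch_size=1, frame_rate_fps=30))    # ~0.03 s, but lower throughput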
Hardware cost is in large part dictated by the amount of on-chip storage and the number of cores. Typical embedded processors have limited on-chip storage, on the order of a few hundred kilobytes. Since there is a trade-off between the amount of on-chip memory and the external memory bandwidth, both metrics should be reported. Similarly, there is a correlation between the number of cores and the throughput. In addition, while many cores can be built on a chip, the number of cores that can actually be used at a given time should be reported; it is often unrealistic to assume peak utilization and performance due to limitations of mapping and memory bandwidth. Accordingly, the power and throughput should be reported for running actual DNNs, as opposed to only reporting theoretical limits.
A. Metrics for DNN Models

To evaluate the properties of a given DNN model, we should consider the following metrics (a sketch for computing the last two appears after this list):
• The accuracy of the model in terms of the top-5 error on datasets such as ImageNet. The type of data augmentation used (e.g., multiple crops, ensemble models) should also be reported.
• The network architecture of the model, including the number of layers, filter sizes, number of filters, and number of channels.
• The number of weights, which impacts the storage requirement of the model. If possible, the number of non-zero weights should be reported, since this reflects the theoretical minimum storage requirement.
• The number of MACs that need to be performed, as it is somewhat indicative of the number of operations and the potential throughput of the given DNN. If possible, the number of non-zero MACs should also be reported, since this reflects the theoretical minimum compute requirement.
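A rough way to extract the weight and MAC counts (and the non-zero weight count) from a model is sketched below. This is our own PyTorch illustration, it only handles convolutional and fully-connected layers, and its output will not exactly match Table IV, which uses the original AlexNet variant and reports non-zero MACs measured on real input data.

import torch
import torch.nn as nn

def model_metrics(model, x):
    macs = []

    def hook(m, inp, out):
        if isinstance(m, nn.Conv2d):
            # MACs = output elements x (filter height x width x channels per group)
            k = m.kernel_size[0] * m.kernel_size[1] * (m.in_channels // m.groups)
            macs.append(out.numel() * k)
        elif isinstance(m, nn.Linear):
            macs.append(out.shape[0] * m.in_features * m.out_features)

    handles = [m.register_forward_hook(hook) for m in model.modules()]
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()

    weights = sum(p.numel() for p in model.parameters())
    nz_weights = sum(int((p != 0).sum()) for p in model.parameters())
    return {"weights": weights, "NZ weights": nz_weights, "MACs": sum(macs)}

# example: AlexNet from torchvision on a single 224x224 image
from torchvision.models import alexnet
print(model_metrics(alexnet(), torch.randn(1, 3, 224, 224)))

Note that non-zero MACs additionally depend on the activations, which is precisely why the text below proposes fixing the input data to a standard validation set.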
Table IV shows how these metrics are reported for various well-known DNNs. The accuracy is reported for the case where only a single crop for a single model is used for classification, so that the number of weights and MACs in the table are consistent.13 Note that accounting for the number of non-zero (NZ) operations significantly reduces the number of MACs and weights. Since the number of NZ MACs depends on the input data, we propose using the publicly available 50,000 validation images from ImageNet for the computation. Finally, there are various methods to reduce the weights in a DNN (e.g., network pruning in Section VII-B2). Table IV shows another example of these DNN model metrics, by comparing sparse DNNs pruned using [142] to dense DNNs.

13 Data augmentation is often used to increase accuracy. This includes using multiple crops of an image to account for misalignment; in addition, an ensemble of multiple models can be used, where each model has different weights due to different training settings, such as using different initializations or datasets, or even different network architectures. If multiple crops and models are used, then the number of MACs and weights required would increase.

TABLE IV
METRICS FOR POPULAR DNN MODELS. SPARSITY IS ACCOUNTED FOR BY REPORTING NON-ZERO (NZ) WEIGHTS AND MACS.

                                        AlexNet             GoogLeNet v1
  Metrics                          dense     sparse      dense     sparse
  Top-5 error                       19.6       20.4       11.7       12.7
  CONV layers
    Number of CONV Layers              5          5         57         57
    Depth (in CONV layers)             5          5         21         21
    Filter Sizes                        3,5,11                1,3,5,7
    Number of Channels                  3-256                 3-832
    Number of Filters                   96-384                16-384
    Stride                              1,4                   1,2
    NZ Weights                      2.3M       351k       6.0M       1.5M
    NZ MACs                         395M      56.4M       806M       220M
  FC layers
    Number of FC Layers                3          3          1          1
    Filter Sizes                        1,6                   1
    Number of Channels                  256-4096              1024
    Number of Filters                   1000-4096             1000
    NZ Weights                     58.6M       5.4M         1M       870k
    NZ MACs                        14.5M       1.9M       635k       663k
  Total NZ Weights                   61M       5.7M         7M       2.4M
  Total NZ MACs                     410M      58.3M       806M       221M
B. Metrics for DNN Hardware

To measure the efficiency of the DNN hardware, we should consider the following additional metrics (a small reporting sketch follows this list):
• The power and energy consumption of the design should be reported for various DNN models; the DNN model specifications should be provided, including which layers and bit precisions are supported by the hardware during measurement. In addition, the amount of off-chip accesses (e.g., DRAM accesses) should be included, since it accounts for a significant portion of the system power; it can be reported in terms of the total amount of data that is read and written off-chip per inference.
• The latency and throughput should be reported in terms of the batch size and the actual run time for various DNN models, which accounts for mapping and memory bandwidth effects. This provides a more useful and informative metric than peak throughput.
• The cost of the chip depends on the area efficiency, which accounts for the size and type of memory (e.g., registers or SRAM) and the amount of control logic. It should be reported in terms of the core area in square millimeters per multiplier, along with the process technology.
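A small sketch (ours, with hypothetical field names) of how these per-model specifications might be derived from raw measurements; the numbers passed in are illustrative only, not measurements from any real chip:

def report(model_name, avg_power_w, batch_size, batch_runtime_s, offchip_bytes):
    # derive the per-inference quantities the metrics above ask for
    return {
        "model": model_name,
        "throughput (inf/s)": batch_size / batch_runtime_s,
        "latency (s, full batch)": batch_runtime_s,
        "energy (mJ/inference)": 1e3 * avg_power_w * batch_runtime_s / batch_size,
        "off-chip traffic (MB/inference)": offchip_bytes / batch_size / 1e6,
    }

print(report("AlexNet", avg_power_w=0.28, batch_size=4,
             batch_runtime_s=0.115, offchip_bytes=15.4e6))

Reporting batch size alongside throughput matters because, as noted earlier, a large batch inflates throughput while also inflating latency.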
In terms of cost, different platforms will have different implementation-specific metrics. For instance, for an FPGA, the specific device should be reported, along with the utilization of resources such as DSPs, BRAM, LUTs, and FFs; performance density, such as GOPs/slice, can also be reported.

Each processor should report various specifications for each metric, as shown in Table V, using the Eyeriss chip as an example. It is important that all metrics and specifications are accounted for in order to fairly evaluate all the design trade-offs. For instance, without the accuracy given for a specific dataset and task, one could run a simple DNN and easily claim low power, high throughput, and low cost; however, the processor might not be usable for a meaningful task. Alternatively, without reporting the off-chip bandwidth, one could build a processor with only multipliers and easily claim low cost, high throughput, high accuracy, and low chip power; however, when evaluating system power, the off-chip memory access would be substantial. Finally, the test setup should also be reported, including whether the results are measured or obtained from simulation14 and how many images were tested.

14 If obtained from simulation, it should be clarified whether it is from synthesis or post place-and-route, and what library corner was used.

In summary, the evaluation process for deciding whether a DNN system is a viable solution for a given application might go as follows: (1) the accuracy determines if it can perform the given task; (2) the latency and throughput determine if it can run fast enough and in real time; (3) the energy and power consumption will primarily dictate the form factor of the device where the processing can operate; and (4) the cost, which is primarily dictated by the chip area, determines how much one would pay for this solution.
IX. SUMMARY

The use of deep neural networks (DNNs) has seen explosive growth in the past few years. They are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics, and often deliver better-than-human accuracy. However, while DNNs can deliver this outstanding accuracy, it comes at the cost of high computational complexity. Consequently, techniques that enable the efficient processing of deep neural networks, improving energy efficiency and throughput without sacrificing accuracy, with cost-effective hardware, are critical to expanding the deployment of DNNs in both existing and new domains.

Creating a system for efficient DNN processing should begin with understanding the current and future applications and the specific computations required, both now and for the potential evolution of those computations. This article surveys a number of the current applications, focusing on computer vision applications, the associated algorithms, and the data being used to drive the algorithms. These applications, algorithms, and input data are experiencing rapid change, so extrapolating these trends to determine the degree of flexibility desired to handle next-generation computations becomes an important ingredient of any design project.
During the design-space exploration process, it is critical to understand and balance the important system metrics. For DNN computation these include the accuracy, energy, throughput, and hardware cost. Evaluating these metrics is, of course, key, so this article surveys the important components of a DNN workload. Specifically, a DNN workload has two major components. First, the workload is the form of each DNN network, including the 'shape' of each layer and the interconnections between layers; these can vary both within and between applications. Second, the workload consists of the specific data input to the DNN; this data will vary with the input set used for training or the data input during operation for inference.

This article also surveys a number of avenues that prior work has taken to optimize DNN processing. Since data movement dominates energy consumption, a primary focus of some recent research has been to reduce data movement while maintaining accuracy, throughput, and cost. This means selecting architectures with favorable memory hierarchies, like a spatial array, and developing dataflows that increase data reuse at the low-cost levels of the memory hierarchy. We have included a taxonomy of dataflows and an analysis of their characteristics. Other work is presented that aims to save space and energy by changing the representation of data values in the DNN. Still other work saves energy, and sometimes increases throughput, by exploiting the sparsity of weights and/or activations.

The DNN domain also affords an excellent opportunity for joint hardware/software co-design. For example, various efforts have noted that efficiency can be improved by increasing sparsity (increasing the number of zero values) or by optimizing the representation of data, either by reducing the precision of values or by using more complex mappings of the stored value to the actual value used for computation. However, to avoid losing accuracy, it is often useful to modify the network or fine-tune the network's weights to accommodate these changes. Thus, this article both reviews a variety of these techniques and discusses the frameworks that are available for describing, running, and training networks.

Finally, DNNs afford the opportunity to use mixed-signal circuit design and advanced technologies to improve efficiency. These include using memristors for analog computation and 3-D stacked memory. Advanced technologies can also facilitate moving computation closer to the source by embedding computation near or within the sensor and the memories. Of course, all of these techniques should also be considered in combination, while being careful to understand their interactions and looking for opportunities for joint hardware/algorithm co-optimization.

In conclusion, although much work has been done, deep neural networks remain an important area of research with many promising applications and opportunities for innovation at various levels of hardware design.

ACKNOWLEDGMENTS

Funding provided by DARPA YFA, MIT CICS, and gifts from Nvidia and Intel. The authors thank the anonymous reviewers as well as James Noraky, Mehul Tikekar, and Zhengdong Zhang for providing valuable feedback on this paper.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[2] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams et al., “Recent advances in deep learning for speech research at Microsoft,” in ICASSP, 2013.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012.
[4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “DeepDriving: Learning affordance for direct perception in autonomous driving,” in ICCV, 2015.
[5] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[7] F.-F. Li, A. Karpathy, and J. Johnson, “Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition,” http://cs231n.stanford.edu/.
[8] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
[9] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch et al., “Convolutional networks for fast, energy-efficient neuromorphic computing,” Proceedings of the National Academy of Sciences, 2016.
[10] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,” in ICLR, 2014.
[11] Y. LeCun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard, “Handwritten digit recognition: applications of neural network chips and automatic learning,” IEEE Commun. Mag., vol. 27, no. 11, pp. 41–46, Nov. 1989.
[12] B. Widrow and M. E. Hoff, “Adaptive switching circuits,” in 1960 IRE WESCON Convention Record, 1960.
[13] B. Widrow, “Thinking about thinking: the discovery of the LMS algorithm,” IEEE Signal Process. Mag., 2005.
[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2016.
[16] “Complete Visual Networking Index (VNI) Forecast,” Cisco, June 2016.
[17] J. Woodhouse, “Big, big, big data: higher and higher resolution video surveillance,” technology.ihs.com, January 2016.
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in CVPR, 2014.
[19] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” in CVPR, 2015.
[20] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
[21] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
[22] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.
[23] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR abs/1609.03499, 2016.
[24] H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. Yuen, Y. Hua, S. Gueroussov, H. S. Najafabadi, T. R. Hughes et al., “The human splicing code reveals new insights into the genetic determinants of disease,” Science, vol. 347, no. 6218, p. 1254806, 2015.
[25] J. Zhou and O. G. Troyanskaya, “Predicting effects of noncoding variants with deep learning-based sequence model,” Nature Methods, vol. 12, no. 10, pp. 931–934, 2015.
[26] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning,” Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.
[27] H. Zeng, M. D. Edwards, G. Liu, and D. K. Gifford, “Convolutional neural network architectures for predicting DNA–protein binding,” Bioinformatics, vol. 32, no. 12, pp. i121–i127, 2016.
[28] M. Jermyn, J. Desroches, J. Mercier, M.-A. Tremblay, K. St-Arnaud, M.-C. Guiot, K. Petrecca, and F. Leblond, “Neural networks improve brain cancer detection with Raman spectroscopy in the presence of operating room light artifacts,” Journal of Biomedical Optics, vol. 21, no. 9, pp. 094002–094002, 2016.
[29] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deep learning for identifying metastatic breast cancer,” arXiv preprint arXiv:1606.05718, 2016.
[30] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[31] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” in NIPS Deep Learning Workshop, 2013.
[32] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016.
[33] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, “From Perception to Decision: A Data-driven Approach to End-to-end Motion Planning for Autonomous Ground Robots,” in ICRA, 2017.
[34] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and planning for visual navigation,” in CVPR, 2017.
[35] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, “Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search,” in ICRA, 2016.
[36] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” in NIPS Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
[37] N. Hemsoth, “The Next Wave of Deep Learning Applications,” Next Platform, September 2016.
[38] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[39] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for LVCSR,” in ICASSP, 2013.
[40] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in ICML, 2010.
[41] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, 2013.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in ICCV, 2015.
[43] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” in ICLR, 2016.
[44] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, “Improving deep neural network acoustic models using generalized maxout networks,” in ICASSP, 2014.
[45] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, and A. Courville, “Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks,” in Interspeech, 2016.
[46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia, 2014.
[47] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
[48] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[49] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks,” in ICLR, 2014.
[50] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in ICLR, 2015.
[51] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper With Convolutions,” in CVPR, 2015.
[52] M. Lin, Q. Chen, and S. Yan, “Network in Network,” in ICLR, 2014.
[53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in CVPR, 2016.
[54] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,” in AAAI, 2017.
[55] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose, and M. Richardson, “Do Deep Convolutional Nets Really Need to be Deep and Convolutional?” in ICLR, 2017.
[56] “Caffe LeNet MNIST,” http://caffe.berkeleyvision.org/gathered/examples/mnist.html.
[57] “Caffe Model Zoo,” http://caffe.berkeleyvision.org/model_zoo.html.
[58] “Matconvnet Pretrained Models,” http://www.vlfeat.org/matconvnet/pretrained/.
[59] “TensorFlow-Slim image classification library,” https://github.com/tensorflow/models/tree/master/slim.
[60] “Deep Learning Frameworks,” https://developer.nvidia.com/deep-learning-frameworks.
[61] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE J. Solid-State Circuits, vol. 51, no. 1, 2017.
[62] Y. LeCun, C. Cortes, and C. J. Burges, “THE MNIST DATABASE of handwritten digits,” http://yann.lecun.com/exdb/mnist/.
[63] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in ICML, 2013.
[64] A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset,” https://www.cs.toronto.edu/~kriz/cifar.html.
[65] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, 2008.
[66] A. Krizhevsky and G. Hinton, “Convolutional deep belief networks on CIFAR-10,” Unpublished manuscript, vol. 40, 2010.
[67] B. Graham, “Fractional max-pooling,” arXiv preprint arXiv:1412.6071, 2014.
[68] “Pascal VOC data sets,” http://host.robots.ox.ac.uk/pascal/VOC/.
[69] “Microsoft Common Objects in Context (COCO) dataset,” http://mscoco.org/.
[70] “Google Open Images,” https://github.com/openimages/dataset.
[71] “YouTube-8M,” https://research.google.com/youtube8m/.
[72] “AudioSet,” https://research.google.com/audioset/index.html.
[73] S. Condon, “Facebook unveils Big Basin, new server geared for deep learning,” ZDNet, March 2017.
[74] C. Dubout and F. Fleuret, “Exact acceleration of linear object detectors,” in ECCV, 2012.
[75] J. Cong and B. Xiao, “Minimizing computation in convolutional neural networks,” in ICANN, 2014.
[76] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in CVPR, 2016.
[77] “Intel Math Kernel Library,” https://software.intel.com/en-us/mkl.
[78] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient Primitives for Deep Learning,” arXiv preprint arXiv:1410.0759, 2014.
[79] M. Horowitz, “Computing’s energy problem (and what we can do about it),” in ISSCC, 2014.
[80] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in ISCA, 2016.
[81] ——, “Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators,” IEEE Micro’s Top Picks from the Computer Architecture Conferences, vol. 37, no. 3, May–June 2017.
[82] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, “A Massively Parallel Coprocessor for Convolutional Neural Networks,” in ASAP, 2009.
[83] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, “Towards an embedded biologically-inspired machine vision processor,” in FPT, 2010.
[84] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A Dynamically Configurable Coprocessor for Convolutional Neural Networks,” in ISCA, 2010.
[85] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks,” in CVPR Workshop, 2014.
[86] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, “A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications,” in ISSCC, 2015.
[87] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A Convolutional Network Accelerator,” in GLVLSI, 2015.
[88] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep Learning with Limited Numerical Precision,” in ICML, 2015.
[89] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in ISCA, 2015.
[90] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, “Memory-centric accelerator design for Convolutional Neural Networks,” in ICCD, 2013.
[91] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks,” in FPGA, 2015.
[92] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning,” in ASPLOS, 2014.
[93] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A Machine-Learning Supercomputer,” in MICRO, 2014.
[94] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in ISSCC, 2016.
[95] V. Sze, M. Budagavi, and G. J. Sullivan, “High Efficiency Video Coding (HEVC): Algorithms and Architectures,” in Integrated Circuit and Systems. Springer, 2014, pp. 1–375.
[96] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in MICRO, 2016.
[97] D. Keitel-Schulz and N. Wehn, “Embedded DRAM development: Technology, physical design, and application issues,” IEEE Des. Test. Comput., vol. 18, no. 3, pp. 7–15, 2001.
[98] J. Jeddeloh and B. Keeth, “Hybrid memory cube new DRAM architecture increases density and performance,” in Symp. on VLSI, 2012.
[99] JEDEC Standard, “High bandwidth memory (HBM) DRAM,” JESD235, 2013.
[100] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” in ISCA, 2016.
[101] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” in ASPLOS, 2017.
[102] J. Zhang, Z. Wang, and N. Verma, “A machine-learning classifier implemented in a standard 6T SRAM array,” in Symp. on VLSI, 2016.
[103] Z. Wang, R. Schapire, and N. Verma, “Error-adaptive classifier boosting (EACB): Exploiting data-driven training for highly fault-tolerant hardware,” in ICASSP, 2014.
[104] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” in ISCA, 2016.
[105] L. Chua, “Memristor-the missing circuit element,” IEEE Trans. Circuit Theory, vol. 18, no. 5, pp. 507–519, 1971.
[106] L. Wilson, “International technology roadmap for semiconductors (ITRS),” Semiconductor Industry Association, 2013.
[107] D. Lu, “Tutorial on Emerging Memory Devices,” 2016.
[108] S. B. Eryilmaz, S. Joshi, E. Neftci, W. Wan, G. Cauwenberghs, and H.-S. P. Wong, “Neuromorphic architectures with electronic synapses,” in ISQED, 2016.
[109] P. Chi, S. Li, Z. Qi, P. Gu, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A Novel Processing-In-Memory Architecture for Neural Network Computation in ReRAM-based Main Memory,” in ISCA, 2016.
[110] M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. Adam, K. K. Likharev, and D. B. Strukov, “Training and operation of an integrated neuromorphic network based on metal-oxide memristors,” Nature, vol. 521, no. 7550, pp. 61–64, 2015.
[111] J. Zhang, Z. Wang, and N. Verma, “A matrix-multiplying ADC implementing a machine-learning classifier directly with data conversion,” in ISSCC, 2015.
[112] E. H. Lee and S. S. Wong, “A 2.5 GHz 7.7 TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40nm,” in ISSCC, 2016.
[113] R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong, “RedEye: analog ConvNet image sensor architecture for continuous mobile vision,” in ISCA, 2016.
[114] A. Wang, S. Sivaramakrishnan, and A. Molnar, “A 180nm CMOS image sensor with on-chip optoelectronic image compression,” in CICC, 2012.
[115] H. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivaramakrishnan, A. Veeraraghavan, and A. Molnar, “ASP Vision: Optically Computing the First Layer of Convolutional Neural Networks using Angle Sensitive Pixels,” in CVPR, 2016.
[116] A. Suleiman and V. Sze, “Energy-efficient HOG-based object detection at 1080HD 60 fps with multi-scale support,” in SiPS, 2014.
[117] E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, “LogNet: Energy-Efficient Neural Networks Using Logarithmic Computations,” in ICASSP, 2017.
[118] S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” in ICLR, 2016.
[119] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” arXiv preprint arXiv:1609.07061, 2016.
[120] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
[121] Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, “Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA,” in FPL, 2016.
[122] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented Approximation of Convolutional Neural Networks,” in ICLR, 2016.
[123] S. Higginbotham, “Google Takes Unconventional Route with Homegrown Machine Learning Chips,” Next Platform, May 2016.
[124] T. P. Morgan, “Nvidia Pushes Deep Learning Inference With New Pascal GPUs,” Next Platform, September 2016.
[125] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in MICRO, 2016.
[126] B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” in Symp. on VLSI, 2016.
[127] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in NIPS, 2015.
[128] M. Courbariaux and Y. Bengio, “BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
[129] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” in ECCV, 2016.
[130] Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with low precision by half-wave gaussian quantization,” in CVPR, 2017.
[131] F. Li and B. Liu, “Ternary weight networks,” in NIPS Workshop on Efficient Methods for Deep Neural Networks, 2016.
[132] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained Ternary Quantization,” in ICLR, 2017.
[133] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights,” in ISVLSI, 2016.
[134] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, T. Kuroda, and M. Motomura, “BRein Memory: A 13-Layer 4.2 K Neuron/0.8 M Synapse Binary/Ternary Reconfigurable In-Memory Deep Neural Network Accelerator in 65nm CMOS,” in Symp. on VLSI, 2017.
[135] D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional Neural Networks using Logarithmic Data Representation,” arXiv preprint arXiv:1603.01025, 2016.
[136] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights,” in ICLR, 2017.
[137] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing Neural Networks with the Hashing Trick,” in ICML, 2015.
[138] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: ineffectual-neuron-free deep neural network computing,” in ISCA, 2016.
[139] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in ISCA, 2016.
[140] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal Brain Damage,” in NIPS, 1990.
[141] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in NIPS, 2015.
[142] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning,” in CVPR, 2017.
[143] “DNN Energy Estimation,” http://eyeriss.mit.edu/energy.html.
[144] R. Dorrance, F. Ren, and D. Marković, “A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs,” in ISFPGA, 2014.
[145] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz,