Deep NN - Theory, Tutorial and Survey
Abstract—Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications, including …

… representation of an input space. This is different from earlier approaches that use hand-crafted features or rules designed by …
[Figure (not shown): deep learning situated within the hierarchy of artificial intelligence, machine learning, brain-inspired computing (including spiking neural networks), and neural networks.]

[Fig. 3 (not shown): (a) neurons and synapses, with input activations x_i travelling along axons, across synapses with weights w_i, into the neuron's dendrites; (b) the weighted sum computed for each layer, y_j = f( Σ_i w_i x_i + b ), where b is a bias and f is a non-linear function.]
Fig. 3. Simple neural network example and terminology (Figure adopted from [7]).

… the basic program does not change as it learns to perform its given tasks. In the specific case of DNNs, this learning involves determining the value of the weights (and bias) in the network …

[Backpropagation figure (not shown): input neurons X1–X3 (e.g., image pixels) connected through synapses (weights W11–W34) to output neurons Y1–Y4, a.k.a. activations; backpropagation computes the gradients of the loss, ∂L/∂x_i and ∂L/∂w_ij. Subcaptions: (a) compute the gradient of the loss relative to the filter inputs; (b) compute the gradient of the loss relative to the weights. Caption fragment: … relative to the weights from the filter inputs (i.e., the forward activations) and the gradients of the loss relative to the filter outputs; (2) compute the gradient of the loss relative to the filter inputs from the filter weights and the gradients of the loss relative to the filter outputs.]
[Fig. 7 (not shown): top-5 error rates in the ImageNet Challenge dropping from roughly 25% in 2010 to below human level by 2015, with AlexNet, Clarifai, OverFeat, VGG, GoogLeNet and ResNet marking a large error rate reduction due to deep CNNs.]
Fig. 7. Results from the ImageNet Challenge [14].

… improvements.

In conjunction with the trend toward deep learning approaches for the ImageNet Challenge, there has been a corresponding increase in the number of entrants using GPUs: from 2012, when only 4 entrants used GPUs, to 2014, when almost all the entrants (110) were using them. This reflects the almost complete switch from traditional computer vision approaches to deep learning-based approaches for the competition.

In 2015, the ImageNet winning entry, ResNet [15], exceeded human-level accuracy with a top-5 error rate⁴ below 5%. Since then, the error rate has dropped below 3%, and more focus is now being placed on more challenging components of the competition, such as object detection and localization. These successes are clearly a contributing factor to the wide range of applications to which DNNs are being applied.

E. Applications of DNN

Many applications can benefit from DNNs, ranging from multimedia to the medical space. In this section, we provide examples of areas where DNNs are currently making an impact and highlight emerging areas where DNNs may make an impact in the future.

• Image and Video: Video is arguably the biggest of the big data. It accounts for over 70% of today's Internet traffic [16]. For instance, over 800 million hours of video is collected daily worldwide for video surveillance [17]. Computer vision is necessary to extract meaningful information from video. DNNs have significantly improved the accuracy of many computer vision tasks such as image classification [14], object localization and detection [18], image segmentation [19], and action recognition [20].

• … Atari [31] as well as Go [6], where an exhaustive search of all possibilities is not feasible due to the unimaginably huge number of possible moves.

• Robotics: DNNs have been successful in the domain of robotic tasks such as grasping with a robotic arm [32], motion planning for ground robots [33], visual navigation [4, 34], control to stabilize a quadcopter [35] and driving strategies for autonomous vehicles [36].

DNNs are already widely used in multimedia applications today (e.g., computer vision, speech recognition). Looking forward, we expect that DNNs will likely play an increasingly important role in the medical and robotics fields, as discussed above, as well as finance (e.g., for trading, energy forecasting, and risk assessment), infrastructure (e.g., structural safety and traffic control), weather forecasting and event detection [37]. The myriad application domains pose new challenges to the efficient processing of DNNs; the solutions then have to be adaptive and scalable in order to handle the new and varied forms of DNNs that these applications may employ.

F. Embedded versus Cloud

The various applications and aspects of DNN processing (i.e., training versus inference) have different computational needs. Specifically, training often requires a large dataset⁵ and significant computational resources for multiple weight-update iterations. In many cases, training a DNN model still takes several hours to multiple days and thus is typically performed in the cloud. Inference, on the other hand, can happen either in the cloud or at the edge (e.g., IoT or mobile).

In many applications, it is desirable to have the DNN inference processing near the sensor. For instance, in computer vision applications such as measuring wait times in stores or predicting traffic patterns, it would be desirable to extract meaningful information from the video right at the image sensor rather than in the cloud, to reduce the communication cost. For other applications, such as autonomous vehicles, drone navigation and robotics, local processing is desired since the latency and security risks of relying on the cloud are too high. However, video involves a large amount of data, which is computationally complex to process; thus, low-cost hardware to analyze video is challenging yet critical to enabling …

⁴ The top-5 error rate is measured based on whether the correct answer appears in one of the top 5 categories selected by the algorithm.
⁵ One of the major drawbacks of DNNs is their need for large datasets to prevent over-fitting during training.
[Fig. 9 (not shown): dimensionality of convolutions; (b) high-dimensional convolutions in CNNs, with input fmaps of size H×W and C channels, filters of size R×S, output fmaps of size E×F, M filters and a batch of N fmaps.]
Fig. 9. Dimensionality of convolutions.

[Fig. 11 (not shown). Traditional non-linear activation functions: sigmoid, y = 1/(1+e^-x), and hyperbolic tangent, y = (e^x − e^-x)/(e^x + e^-x). Modern non-linear activation functions: Rectified Linear Unit (ReLU), y = max(0, x); Leaky ReLU, y = max(αx, x), with α a small constant (e.g., 0.1); and Exponential LU (ELU), y = x for x ≥ 0 and y = α(e^x − 1) for x < 0.]
Fig. 11. Various forms of non-linear activation functions (Figure adopted from Caffe Tutorial [46]).
[Figure residue (not shown): modern deep CNNs have on the order of 5–1000 CONV layers followed by 1–3 FC layers.]

[Fig. 12 (not shown): an example of 2×2 pooling with a stride of 2, comparing max pooling and average pooling over the same feature map.]
Fig. 12. Various forms of pooling (Figure adopted from Caffe Tutorial [46]).

… the network to be robust and invariant to small shifts and distortions. Pooling combines, or pools, a set of values in its receptive field into a smaller number of values. It can be configured based on the size of its receptive field (e.g., 2×2) and the pooling operation (e.g., max or average), as shown in Fig. 12. Typically, pooling occurs on non-overlapping blocks (i.e., the stride is equal to the size of the pooling). Usually a stride of greater than one is used such that there is a reduction in the dimension of the representation (i.e., feature map).
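A minimal sketch of such non-overlapping pooling on a single 2-D feature map (function name hypothetical; pass op=np.mean for average pooling):

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, op=np.max):
    # Non-overlapping pooling when stride == size, as in Fig. 12.
    h_out = (fmap.shape[0] - size) // stride + 1
    w_out = (fmap.shape[1] - size) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # Reduce each receptive field to a single value.
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = op(window)
    return out
```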
3) Normalization: Controlling the input distribution across layers can help to significantly speed up training and improve accuracy. Accordingly, the distribution of the layer input activations (with mean µ and standard deviation σ) is normalized such that it has a zero mean and a unit standard deviation. In batch normalization (BN), the normalized value is further scaled and shifted, as shown in Eq. (2), where the parameters (γ, β) are learned from training [47], and ε is a small constant to avoid numerical problems. Prior to this, local response normalization (LRN) [3] was used, which was inspired by lateral inhibition in neurobiology, where excited neurons (i.e., high-value activations) should subdue their neighbors (i.e., cause low-value activations); however, BN is now considered standard practice in the design of CNNs, while LRN is mostly deprecated. Note that while LRN is usually performed after the non-linear function, BN is mostly performed between the CONV or FC layer and the non-linear function.

y = (x − µ) / √(σ² + ε) · γ + β        (2)
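A minimal sketch of Eq. (2) in NumPy, computing the batch statistics per feature as done during training (at inference, running statistics collected during training would be used instead):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize to zero mean and unit variance
    # (Eq. (2)), then apply the learned scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```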
B. Popular DNN Models

Many DNN models have been developed over the past two decades. Each of these models has a different 'network architecture' in terms of number of layers, layer types, layer shapes (i.e., filter size, number of channels and filters), and connections between layers. Understanding these variations and trends is important for incorporating the right flexibility in any efficient DNN engine.

In this section, we will give an overview of various popular DNNs such as LeNet [48] as well as those that competed in and/or won the ImageNet Challenge [14], as shown in Fig. 7, most of whose models with pre-trained weights are publicly available for download; the DNN models are summarized in Table II. Two results for top-5 error are reported. In the first row, the accuracy is boosted by using multiple crops from the image and an ensemble of multiple trained models (i.e., the DNN needs to be run several times); these results were used to compete in the ImageNet Challenge. The second row reports the accuracy if only a single crop was used (i.e., the DNN is run only once), which is more consistent with what would likely be deployed in real-time and/or energy-constrained applications.

LeNet [11] was one of the first CNN approaches, introduced in 1989. It was designed for the task of digit classification in grayscale images of size 28×28. The most well known version, LeNet-5, contains two CONV layers and two FC layers [48]. Each CONV layer uses filters of size 5×5 (1 channel per filter), with 6 filters in the first layer and 16 filters in the second layer. Average pooling of 2×2 is used after each convolution and a sigmoid is used for the non-linearity. In total, LeNet requires 60k weights and 341k multiply-and-accumulates (MACs) per image. LeNet led to CNNs' first commercial success, as it was deployed in ATMs to recognize digits for check deposits.

AlexNet [3] was the first CNN to win the ImageNet Challenge, in 2012. It consists of five CONV layers and three FC layers. Within each CONV layer, there are 96 to 384 filters and the filter size ranges from 3×3 to 11×11, with 3 to 256 channels each. In the first layer, the 3 channels of the filter correspond to the red, green and blue components of the input image. A ReLU non-linearity is used in each layer. Max pooling of 3×3 is applied to the outputs of layers 1, 2 and 5. To reduce computation, a stride of 4 is used at the first layer of the network. AlexNet introduced the use of LRN in layers 1 and 2 before the max pooling, though LRN is no longer popular in later CNN models. One important factor that differentiates AlexNet from LeNet is that the number of weights is much larger and the shapes vary from layer to layer. To reduce the amount of weights and computation in the second CONV layer, the 96 output channels of the first layer are split into two groups of 48 input channels for the second layer, such that the filters in the second layer only have 48 channels. Similarly, the weights in the fourth and fifth layers are also split into two groups. In total, AlexNet requires 61M weights and 724M MACs to process one 227×227 input image.

Overfeat [49] has a very similar architecture to AlexNet, with five CONV layers and three FC layers. The main differences are that the number of filters is increased for layers 3 (384 to 512), 4 (384 to 1024), and 5 (256 to 1024), layer 2 is not split into two groups, the first fully connected layer only has 3072 channels rather than 4096, and the input size is 231×231 rather than 227×227. As a result, the number of weights grows to 146M and the number of MACs grows to 2.8G per image. Overfeat has two different models: fast (described here) and accurate. The accurate model used in the ImageNet Challenge gives a 0.65% lower top-5 error rate than the fast model at the cost of 1.9× more MACs.

VGG-16 [50] goes deeper to 16 layers, consisting of 13 CONV layers and 3 FC layers. In order to balance out the cost of going deeper, larger filters (e.g., 5×5) are built from multiple smaller filters (e.g., 3×3), which have fewer weights, to achieve the same receptive fields, as shown in Fig. 13(a). As a result, all CONV layers have the same filter size of 3×3. In total, VGG-16 requires 138M weights and 15.5G MACs to process one 224×224 input image. VGG has two different models: VGG-16 (described here) and VGG-19. VGG-19 gives a 0.1% lower top-5 error rate than VGG-16 at the cost of 1.27× more MACs.
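The weight and MAC counts quoted for these models follow directly from the layer shapes. A sketch of the per-layer arithmetic (hypothetical helper; the example plugs in the shape of AlexNet's first CONV layer, 96 filters of 11×11×3 producing a 55×55 output):

```python
def conv_layer_cost(E, F, R, S, C, M):
    # Each of the M filters holds R*S*C weights.
    weights = M * R * S * C
    # Each of the M*E*F output activations needs R*S*C MACs.
    macs = E * F * M * R * S * C
    return weights, macs

print(conv_layer_cost(55, 55, 11, 11, 3, 96))  # ~35k weights, ~105M MACs
```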
[Fig. 13 (not shown): (a) constructing a 5×5 support by applying two smaller filters sequentially; (b) constructing a 5×5 support from 1×5 and 5×1 filters, used in GoogleNet/Inception v3 and v4.]
Fig. 13. Decomposing larger filters into smaller filters.

[Fig. 14 (not shown): an Inception module applying parallel 1×1 (C=64), 3×3 (C=128), 5×5 (C=32) and 1×1 (C=32) CONV branches to a C=256 input and concatenating the output feature maps.]
Fig. 14. Inception module from GoogleNet [51] with example channel lengths. Note that each CONV layer is followed by a ReLU (not drawn).
… that are more discriminative and also provides more levels of hierarchy in the learned representation [15, 50, 51, 55]. The number of filter shapes continues to vary across layers, thus flexibility is still important. Furthermore, most of the computation has been placed on CONV layers rather than FC layers. In addition, the number of weights in the FC layers is reduced, and in most recent networks (since GoogLeNet) the CONV layers also dominate in terms of weights. Thus, the focus of hardware implementations should be on addressing the efficiency of the CONV layers, which in many domains are increasingly important.

IV. DNN DEVELOPMENT RESOURCES

One of the key factors that has enabled the rapid development of DNNs is the set of development resources that have been made available by the research community and industry. These resources are also key to the development of DNN accelerators, by providing characterizations of the workloads and facilitating the exploration of trade-offs in model complexity and accuracy. This section will describe these resources such that those who are interested in this field can quickly get started.

A. Frameworks

For ease of DNN development and to enable sharing of trained networks, several deep learning frameworks have been developed from various sources. These open-source libraries contain software libraries for DNNs. Caffe was made available in 2014 from UC Berkeley [46]. It supports C, C++, Python and MATLAB. Tensorflow was released by Google in 2015, and supports C++ and Python; it also supports multiple CPUs and GPUs and has more flexibility than Caffe, with the computation expressed as dataflow graphs to manage the tensors (multidimensional arrays). Another popular framework is Torch, which was developed by Facebook and NYU and supports C, C++ and Lua. There are several other frameworks such as Theano, MXNet and CNTK, which are described in [60]. There are also higher-level libraries that can run on top of the aforementioned frameworks to provide a more universal experience and faster development. One example of such libraries is Keras, which is written in Python and supports Tensorflow, CNTK and Theano.
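As a quick illustration of this higher-level style, below is a minimal sketch of a small LeNet-like CNN in Keras; the layer names follow the Keras 2 API, and the shapes are illustrative rather than taken from any model in this article:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Two CONV layers with pooling, followed by FC layers, for 28x28 inputs.
model = Sequential([
    Conv2D(6, (5, 5), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(16, (5, 5), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(120, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')
```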
The existence of such frameworks is not only a convenient aid for DNN researchers and application designers; they are also invaluable for engineering high performance or more efficient DNN computation engines. In particular, because the frameworks make heavy use of a set of primitive operations, such as the processing of a CONV layer, they can incorporate the use of optimized software or hardware accelerators. This acceleration is transparent to the user of the framework. Thus, for example, most frameworks can use Nvidia's cuDNN library for rapid execution on Nvidia GPUs. Similarly, transparent incorporation of dedicated hardware accelerators can be achieved, as was done with the Eyeriss chip [61].

Finally, these frameworks are a valuable source of workloads for hardware researchers. They can be used to drive experimental designs for different workloads, for profiling different workloads and for exploring hardware-software trade-offs.

B. Models

Pretrained DNN models can be downloaded from various websites [56–59] for the various different frameworks. It should be noted that even for the same DNN (e.g., AlexNet), the accuracy of these models can vary by around 1% to 2% depending on how the model was trained, and thus the results do not always exactly match the original publication.

C. Popular Datasets for Classification

It is important to factor in the difficulty of the task when comparing different DNN models. For instance, the task of classifying handwritten digits from the MNIST dataset [62] is much simpler than classifying an object into one of 1000 classes, as is required for the ImageNet dataset [14] (Fig. 16). It is expected that the size of the DNNs (i.e., number of weights) and the number of MACs will be larger for the more difficult task than for the simpler task, and thus these DNNs will require more energy and have lower throughput. For instance, LeNet-5 [48] is designed for digit classification, while AlexNet [3], VGG-16 [50], GoogLeNet [51], and ResNet [15] are designed for the 1000-class image classification.

There are many AI tasks that come with publicly available datasets in order to evaluate the accuracy of a given DNN. Public datasets are important for comparing the accuracy of different approaches. The simplest and most common task is image classification, which involves being given an entire image and selecting 1 of N classes that the image most likely belongs to. There is no localization or detection.

MNIST is a widely used dataset for digit classification that was introduced in 1998 [62]. It consists of 28×28 pixel grayscale images of handwritten digits. There are 10 classes (for 10 digits), 60,000 training images and 10,000 test images. LeNet-5 was able to achieve an accuracy of 99.05% when MNIST was first introduced. Since then, the accuracy has increased to 99.79% using regularization of neural networks with dropconnect [63]. Thus, MNIST is now considered a fairly easy dataset.

CIFAR is a dataset that consists of 32×32 pixel colored images of various objects, which was released in 2009 [64]. CIFAR is a subset of the 80 million Tiny Images dataset [65]. CIFAR-10 is composed of 10 mutually exclusive classes. There are 50,000 training images (5000 per class) and 10,000 test images (1000 per class). A two-layer convolutional deep belief network was able to achieve 64.84% accuracy on CIFAR-10 when it was first introduced [66]. Since then, the accuracy has increased to 96.53% using fractional max pooling [67].

ImageNet is a large-scale image dataset that was first introduced in 2010; the dataset stabilized in 2012 [14]. It contains 256×256-pixel color images with 1000 classes. The classes are defined using WordNet as a backbone to handle ambiguous word meanings and to combine synonyms into the same object category. In other words, there is a hierarchy for the ImageNet categories. The 1000 classes were selected such that there is no overlap in the ImageNet hierarchy. The ImageNet dataset contains many fine-grained categories, including 120 different breeds of dogs. There are 1.3M training images (732 to 1300 per class), 100,000 testing images (100 per class) and 50,000 validation images (50 per class).
[Fig. 16 (not shown): sample images from MNIST and ImageNet.]
Fig. 16. MNIST (10 classes, 60k training, 10k testing) [62] vs. ImageNet (1000 classes, 1.3M training, 100k testing) [14] datasets.

The accuracy of the ImageNet Challenge is reported using two metrics: Top-5 and Top-1 error. Top-5 error means that if any of the top five scoring categories is the correct category, it is counted as a correct classification. Top-1 requires that the top scoring category be correct. In 2012, the winner of the ImageNet Challenge (AlexNet) was able to achieve an accuracy of 83.6% for the top-5 (substantially better than the 73.8% of the second-place entry that year, which did not use DNNs); it achieved 61.9% on the top-1 of the validation set. In 2017, the highest accuracy was 97.7% for the top-5.
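A sketch of how the top-k error is computed, assuming a matrix of class scores and integer ground-truth labels (helper name hypothetical):

```python
import numpy as np

def topk_error(scores, labels, k=5):
    # scores: (N, num_classes); labels: (N,). A sample counts as correct
    # if its label is among the k highest-scoring categories.
    topk = np.argsort(-scores, axis=1)[:, :k]
    correct = (topk == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()
```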
In summary of the various image classification datasets, it is clear that MNIST is a fairly easy dataset, while ImageNet is a challenging one with a wider coverage of classes. Thus, in terms of evaluating the accuracy of a given DNN, it is important to consider the dataset upon which the accuracy is measured.

D. Datasets for Other Tasks

Since state-of-the-art DNNs perform better than human-level accuracy on image classification tasks, the ImageNet Challenge has started to focus on more difficult tasks such as single-object localization and object detection. For single-object localization, the target object must be localized and classified (out of 1000 classes). The DNN outputs the top five categories and top five bounding box locations. There is no penalty for identifying an object that is in the image but not included in the ground truth. For object detection, all objects in the image must be localized and classified (out of 200 classes). The bounding box for all objects in these categories must be labeled. Objects that are not labeled are penalized, as are duplicated detections.

Beyond ImageNet, there are also other popular image datasets for computer vision tasks. For object detection, there is the PASCAL VOC (2005-2012) dataset, which contains 11k images representing 20 classes (27k object instances, 7k of which have detailed segmentation) [68]. For object detection, segmentation and recognition in context, there is the MS COCO dataset with 2.5M labeled instances in 328k images (91 object categories) [69]; compared to ImageNet, COCO has fewer categories but more instances per category, which is useful for precise 2-D localization. COCO also has more labeled instances per image to potentially help with contextual information.

Most recently, even larger-scale datasets have been made available. For instance, Google has an Open Images dataset with over 9M images [70], spanning 6000 categories. There is also a YouTube dataset with 8M videos (0.5M hours of video) covering 4800 classes [71]. Google also released an audio dataset comprised of 632 audio event classes and a collection of 2M human-labeled 10-second sound clips [72]. These large datasets will be ever more important as DNNs become deeper with more weight parameters to train.

Undoubtedly, both larger datasets and datasets for new domains will serve as important resources for profiling and exploring the efficiency of future DNN engines.

V. HARDWARE FOR DNN PROCESSING

Due to the popularity of DNNs, many recent hardware platforms have special features that target DNN processing.
For instance, the Intel Knights Landing CPU features special vector instructions for deep learning; the Nvidia PASCAL GP100 GPU features 16-bit floating point (FP16) arithmetic support to perform two FP16 operations on a single precision core for faster deep learning computation. Systems have also been built specifically for DNN processing, such as the Nvidia DGX-1 and Facebook's Big Basin custom DNN server [73]. DNN inference has also been demonstrated on various embedded System-on-Chips (SoC) such as Nvidia Tegra and Samsung Exynos, as well as FPGAs. Accordingly, it is important to have a good understanding of how the processing is being performed on these platforms, and how application-specific accelerators can be designed for DNNs for further improvement in throughput and energy efficiency.

[Fig. 17 (not shown): temporal architectures (SIMD/SIMT), with a shared memory hierarchy, register file and centralized control over an array of ALUs, versus spatial architectures (dataflow processing), in which ALUs pass data directly to one another.]
Fig. 17. Highly-parallel compute paradigms.
The fundamental component of both the CONV and FC layers is the multiply-and-accumulate (MAC) operation, which can be easily parallelized. In order to achieve high performance, highly parallel compute paradigms are very commonly used, including both temporal and spatial architectures, as shown in Fig. 17. The temporal architectures appear mostly in CPUs or GPUs, and employ a variety of techniques to improve parallelism, such as vectors (SIMD) or parallel threads (SIMT). Such temporal architectures use centralized control for a large number of ALUs. These ALUs can only fetch data from the memory hierarchy and cannot communicate directly with each other. In contrast, spatial architectures use dataflow processing, i.e., the ALUs form a processing chain so that they can pass data from one to another directly. Sometimes each ALU can have its own control logic and local memory, called a scratchpad or register file. We refer to an ALU with its own local memory as a processing engine (PE). Spatial architectures are commonly used for DNNs in ASIC and FPGA-based designs. In this section, we will discuss the different design strategies for efficient processing on these different platforms, without any impact on accuracy (i.e., all approaches in this section produce bit-wise identical results); specifically,

• For temporal architectures such as CPUs and GPUs, we will discuss how computational transforms on the kernel can reduce the number of multiplications to increase throughput.

• For spatial architectures used in accelerators, we will discuss how dataflows can increase data reuse from low-cost memories in the memory hierarchy to reduce energy consumption.
A. Accelerate Kernel Computation on CPU and GPU Platforms

CPUs and GPUs use parallelization techniques such as SIMD or SIMT to perform the MACs in parallel. All the ALUs share the same control and memory (register file). On these platforms, both the FC and CONV layers are often mapped to a matrix multiplication (i.e., the kernel computation). Fig. 18 shows how a matrix multiplication is used for the FC layer. The height of the filter matrix is the number of filters and the width is the number of weights per filter (input channels (C) × width (W) × height (H), since R = W and S = H in the FC layer); the height of the input feature map matrix is the number of activations per input feature map (C × W × H), and the width is the number of input feature maps (one in Fig. 18(a) and N in Fig. 18(b)); finally, the height of the output feature map matrix is the number of channels in the output feature maps (M), and the width is the number of output feature maps (N), where each output feature map of the FC layer has the dimension of 1×1×number of output channels (M).

[Fig. 18 (not shown): (a) matrix-vector multiplication is used when computing a single output feature map from a single input feature map; (b) matrix multiplication is used when computing N output feature maps from N input feature maps.]
Fig. 18. Mapping to matrix multiplication for fully connected layers.

The CONV layer in a DNN can also be mapped to a matrix multiplication using a relaxed form of the Toeplitz matrix, as shown in Fig. 19. The downside of using matrix multiplication for the CONV layers is that there is redundant data in the input feature map matrix, as highlighted in Fig. 19(a). This can lead to either inefficiency in storage or a complex memory access pattern.

There are software libraries designed for CPUs (e.g., OpenBLAS, Intel MKL, etc.) and GPUs (e.g., cuBLAS, cuDNN, etc.) that optimize for matrix multiplications. The matrix multiplication is tiled to the storage hierarchy of these platforms, which are on the order of a few megabytes at the higher levels.
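A minimal NumPy sketch of this mapping for a CONV layer, assuming stride 1 and no padding; the cols matrix is the relaxed Toeplitz form and materializes the redundant input data noted above (helper name hypothetical):

```python
import numpy as np

def conv2d_as_matmul(fmap, filters):
    # fmap: (C, H, W) input; filters: (M, C, R, S).
    M, C, R, S = filters.shape
    _, H, W = fmap.shape
    E, F = H - R + 1, W - S + 1
    # Unroll each receptive field into one column (im2col); overlapping
    # windows duplicate input values, hence the storage inefficiency.
    cols = np.empty((C * R * S, E * F))
    for i in range(E):
        for j in range(F):
            cols[:, i * F + j] = fmap[:, i:i+R, j:j+S].ravel()
    # One matrix multiplication produces all M output feature maps.
    out = filters.reshape(M, -1) @ cols
    return out.reshape(M, E, F)
```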
[Fig. 19 (not shown), (a) mapping convolution to a Toeplitz matrix (with redundant data): the filter is flattened into a row vector (e.g., [1 2 3 4]) and multiplied with a matrix of overlapping input patches (e.g., rows [1 2 4 5], [2 3 5 6], [4 5 7 8], [5 6 8 9]).]

[Fig. 20 (not shown): FFT-based acceleration computes the convolution as FFT(W) × FFT(I) = FFT(O).]
Fig. 20. FFT to accelerate DNN.
[Figure residue (not shown): normalized energy cost per access, with an ALU operation as the 1× reference and a 0.5–1.0 kB RF access on the order of 1×.]

… decides what data gets read into which level of the memory hierarchy and when they get processed. Since there is no randomness in the processing of DNNs, it is possible to design a fixed dataflow that can adapt to the DNN shapes and sizes and optimize for the best energy efficiency. The optimized dataflow minimizes access from the more energy-consuming levels of the memory hierarchy. Large memories that can store a significant amount of data consume more energy than smaller memories. For instance, DRAM can store gigabytes of data, but consumes two orders of magnitude higher energy per access than a small on-chip memory of a few kilobytes. Thus, every time a piece of data is moved from an expensive level to a lower-cost level in terms of energy, we want to reuse that piece of data as much as possible to minimize subsequent accesses to the expensive levels. The challenge, however, is that the storage capacity of these low-cost memories is limited. Thus we need to explore different dataflows that maximize reuse under these constraints.

For DNNs, we investigate dataflows that exploit three forms of input data reuse (convolutional, feature map and filter), as shown in Fig. 23. For convolutional reuse, the same input feature map activations and filter weights are used within a given channel, just in different combinations for different weighted sums. For feature map reuse, multiple filters are applied to the same feature map, so the input feature map activations are used multiple times across filters. Finally, for filter reuse, when multiple input feature maps are processed at once (referred to as a batch), the same filter weights are used multiple times across input feature maps.

If we can harness the three types of data reuse by storing the data in the local memory hierarchy and accessing them multiple times without going back to the DRAM, it can save a significant number of DRAM accesses. For example, in AlexNet, the number of DRAM reads can be reduced by up to 500× in the CONV layers. The local memory can also be used for partial sum accumulation, so the partial sums do not have to reach DRAM. In the best case, if all data reuse and accumulation can be achieved by the local memory hierarchy, the 3000M DRAM accesses in AlexNet can be reduced to only 61M.
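To make the three forms of reuse concrete, the following naive CONV loop nest (a sketch, not the actual dataflow of any accelerator) marks where each form arises; the shape names follow Fig. 9(b):

```python
import numpy as np

def conv_loop_nest(fmap, filt):
    # fmap: (N, C, H, W); filt: (M, C, R, S). Stride 1, no padding.
    N, C, H, W = fmap.shape
    M, _, R, S = filt.shape
    E, F = H - R + 1, W - S + 1
    out = np.zeros((N, M, E, F))
    for n in range(N):              # filter reuse: same weights across the batch
        for m in range(M):          # fmap reuse: same activations across M filters
            for c in range(C):
                for e in range(E):      # convolutional reuse: activations and
                    for f in range(F):  # weights reused across sliding windows
                        for r in range(R):
                            for s in range(S):
                                out[n, m, e, f] += (fmap[n, c, e + r, f + s]
                                                    * filt[m, c, r, s])
    return out
```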
The operation of DNN accelerators is analogous to that of general-purpose processors, as illustrated in Fig. 24 [81]. In conventional computer systems, the compiler translates the program into machine-readable binary codes for execution given the hardware architecture (e.g., x86 or ARM); in the processing of DNNs, the mapper translates the DNN shape and size into a hardware-compatible computation mapping for execution given the dataflow. While the compiler usually optimizes for performance, the mapper optimizes for energy efficiency.

[Fig. 24 (not shown): the DNN shape and size (the program) and the dataflow (the architecture) are translated by a mapper (the compiler), together with implementation details (the µArch), into a mapping (the binary) that the DNN accelerator (the processor) executes on input data.]
Fig. 24. An analogy between the operation of DNN accelerators (texts in black) and that of general-purpose processors (texts in red). Figure adopted from [81].

The following taxonomy (Fig. 25) can be used to classify the DNN dataflows in recent works [82–93] based on their data handling characteristics [80]:

1) Weight stationary (WS): The weight stationary dataflow is designed to minimize the energy consumption of reading weights by maximizing the accesses of weights from the register file (RF) at the PE (Fig. 25(a)). Each weight is read from DRAM into the RF of each PE and stays stationary for further accesses. The processing runs as many MACs that use the same weight as possible while the weight is present in the RF; it maximizes convolutional and filter reuse of weights. The inputs and partial sums must move through the spatial array and global buffer. The input fmap activations are broadcast to all PEs and then the partial sums are spatially accumulated across the PE array.

One example of previous work that implements a weight stationary dataflow is nn-X, or neuFlow [85], which uses eight 2-D convolution engines for processing a 10×10 filter. There are a total of 100 MAC units, i.e., PEs, per engine, with each PE having a weight that stays stationary for processing. The
input fmap activations are broadcast to all MAC units and the partial sums are accumulated across the MAC units. In order to accumulate the partial sums correctly, additional delay storage elements are required, which are counted into the required size of local storage. Other weight stationary examples are found in [82–84, 86, 87].

[Fig. 25 (not shown): dataflow taxonomy, with (a) weight stationary, where weights W0–W7 stay pinned in the PEs while activations and partial sums move through the array and global buffer; (b) output stationary; and (c) no local reuse, where the global buffer holds weights, activations and partial sums.]
Fig. 25. Dataflows for DNNs [80].
2) Output stationary (OS): The output stationary dataflow is designed to minimize the energy consumption of reading and writing the partial sums (Fig. 25(b)). It keeps the accumulation of partial sums for the same output activation value local in the RF. In order to keep the accumulation of partial sums stationary in the RF, one common implementation is to stream the input activations across the PE array and broadcast the weight to all PEs in the array.

One example that implements the output stationary dataflow is ShiDianNao [89], where each PE handles the processing for each output activation value by fetching the corresponding input activations from neighboring PEs. The PE array implements dedicated networks to pass data horizontally and vertically. Each PE also has data delay registers to keep data around for the required number of cycles. At the system level, the global buffer streams the input activations and broadcasts the weights into the PE array. The partial sums are accumulated inside each PE and then get streamed out back to the global buffer. Other examples of output stationary are found in [88, 90].

There are multiple possible variants of output stationary, as shown in Fig. 26, since the output activations that get processed at the same time can come from different dimensions. For example, the variant OSA targets the processing of CONV layers, and therefore focuses on the processing of output activations from the same channel at a time in order to maximize data reuse opportunities. The variant OSC targets the processing of FC layers, and focuses on generating output activations from all different channels, since each channel only has one output activation. The variant OSB is something in between OSA and OSC. Examples of variants OSA, OSB, and OSC are [89], [88], and [90], respectively.

3) No local reuse (NLR): While small register files are efficient in terms of energy (pJ/bit), they are inefficient in terms of area (µm²/bit). In order to maximize the storage capacity and minimize the off-chip memory bandwidth, no local storage is allocated to the PE; instead, all of that area is allocated to the global buffer to increase its capacity (Fig. 25(c)). The no local reuse dataflow differs from the previous dataflows in that nothing stays stationary inside the PE array. As a result, there will be increased traffic on the spatial array and to the global buffer for all data types. Specifically, it has to multicast the activations, single-cast the filter weights, and then spatially accumulate the partial sums across the PE array.

In an example of the no local reuse dataflow from UCLA [91], the filter weights and input activations are read from the global buffer, processed by the MAC units with custom adder trees that can complete the accumulation in a single cycle, and the resulting partial sums or output activations are then put back into the global buffer. Another example is DianNao [92], which also reads input activations and filter weights from the buffer, and processes them through the MAC units with custom adder trees. However, DianNao implements specialized registers to keep the partial sums in the PE array, which helps to further reduce the energy consumption of accessing partial sums. Another example of the no local reuse dataflow is found in [93].

4) Row stationary (RS): A row stationary dataflow is proposed in [80], which aims to maximize the reuse and accumulation at the RF level for all types of data (weights, pixels, partial sums) for the overall energy efficiency. This differs from the WS or OS dataflows, which optimize only for weights or partial sums, respectively.

The row stationary dataflow assigns the processing of a 1-D row convolution to each PE, as shown in Fig. 27. It keeps the row of filter weights stationary inside the RF of the PE and then streams the input activations into the PE. The PE does the MACs for each sliding window at a time, using just one memory space for the accumulation of partial sums. Since there are overlaps of input activations between different sliding windows, the input activations can be kept in the RF and reused. By going through all the sliding windows in the row, it completes the 1-D convolution and maximizes the data reuse and local accumulation of data in this row.
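A sketch of the 1-D row convolution assigned to one PE (function name hypothetical); the filter row would stay resident in the RF while the activations stream through overlapping windows:

```python
def pe_1d_row_conv(act_row, filt_row):
    # One accumulator slot per sliding window collects the partial sums;
    # overlapping activations between windows are reused from the RF.
    S = len(filt_row)
    psums = [0] * (len(act_row) - S + 1)
    for window in range(len(psums)):
        for s in range(S):
            psums[window] += act_row[window + s] * filt_row[s]
    return psums

# e.g., pe_1d_row_conv([1, 2, 3, 4, 5], [1, 0, -1]) -> [-2, -2, -2]
```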
[Figs. 27–29 (not shown): each PE computes a 1-D row convolution (e.g., Row 1 * Row 1, Row 2 * Row 2, ...); columns of PEs are aggregated to form 2-D convolutions; and multiple fmaps, filters, and channels are interleaved or concatenated onto the same PEs.]

With each PE processing a 1-D convolution, multiple PEs can be aggregated to complete the 2-D convolution, as shown in Fig. 28. For example, to generate the first row of output activations with a filter having three rows, three 1-D convolutions are required. Therefore, we can use three PEs in a column, each running one of the three 1-D convolutions. The partial sums are further accumulated vertically across the three PEs to generate the first output row. To generate the second row of output, we use another column of PEs, where the three rows of input activations are shifted down by one row, and use the same rows of filters to perform the three 1-D convolutions. Additional columns of PEs are added until all rows of the output are completed (i.e., the number of PE columns equals the number of output rows).

This 2-D array of PEs enables other forms of reuse to reduce accesses to the more expensive global buffer. For example, each filter row is reused across multiple PEs horizontally. Each row of input activations is reused across multiple PEs diagonally. And each row of partial sums is further accumulated across the PEs vertically. Therefore, 2-D convolutional data reuse and accumulation are maximized inside the 2-D PE array.

To address the high-dimensional convolution of the CONV layer (i.e., multiple fmaps, filters, and channels), multiple rows can be mapped onto the same PE, as shown in Fig. 29. The 2-D convolution is mapped to a set of PEs, and the additional dimensions are handled by interleaving or concatenating the additional data. For filter reuse within the PE, different rows of fmaps are concatenated and run through the same PE as a 1-D convolution. For input fmap reuse within the PE, different filter rows are interleaved and run through the same PE as a 1-D convolution. Finally, to increase local partial sum accumulation within the PE, filter rows and fmap rows from different channels are interleaved and run through the same PE as a 1-D convolution. The partial sums from different channels then naturally get accumulated inside the PE.

The number of filters, channels, and fmaps that can be processed at the same time is programmable, and there exists an optimal mapping for the best energy efficiency, which depends on the shape configuration of the DNN as well as the hardware resources provided, e.g., the number of PEs and the size of the memory in the hierarchy. Since all of the variables are known before runtime, it is possible to build a compiler (i.e., mapper) to perform this optimization off-line to configure the hardware for different mappings of the RS dataflow for different DNNs, as shown in Fig. 30.

[Fig. 30 (not shown): an optimization compiler takes the CNN shape configuration (# of channels, # of filters, etc.) and the hardware resources to produce the mapping.]
Fig. 30. Mapping optimization takes in hardware and DNN shape constraints to determine the optimal energy dataflow [80].

One example that implements the row stationary dataflow is Eyeriss [94]. It consists of a 14×12 PE array, a 108KB global buffer, and ReLU and fmap compression units, as shown in Fig. 31. The chip communicates with the off-chip DRAM using a 64-bit bidirectional data bus to fetch data into the global buffer. The global buffer then streams the data into the PE array for processing.

In order to support the RS dataflow, two problems need to be solved in the hardware design. First, how can the fixed-size PE array accommodate different layer shapes? Second, although the data will be passed in a very specific pattern, it still changes with different shape configurations. How can the fixed design …
[Fig. 31 (not shown): Eyeriss block diagram, with top-level control and a configuration scan chain, a 12×14 PE array with per-PE filter/ifmap/psum spads and MAC control, a 108KB global buffer, RLC decoder/encoder and ReLU units, and a 64-bit off-chip DRAM interface.]
Fig. 31. Eyeriss DNN accelerator [94].

… needs of each dataflow under the same area constraint. For example, since the no local reuse dataflow does not require any RF in the PE, it is allocated a much larger global buffer. The simulation uses the layer configurations from AlexNet with a batch size of 16. The simulation also takes into account the fact that accessing different levels of the memory hierarchy requires different energy cost.
Fig. 33 compares the chip and DRAM energy consumption of each dataflow for the CONV layers of AlexNet with a batch size of 16. The WS and OS dataflows have the lowest energy consumption for accessing weights and partial sums, respectively. However, the RS dataflow has the lowest total energy consumption, since it optimizes for the overall energy efficiency instead of only for a certain data type.

[Figure (not shown): replication (e.g., AlexNet layers 3–5, fmap width 13) and folding (e.g., AlexNet layer 2, fmap width 27) used to map different layer shapes onto the fixed PE array.]

[Fig. 33 (not shown): normalized energy/MAC for the WS, OSA, OSB, OSC, NLR and RS dataflows, with (b) broken down by data type into pixels, weights and psums.]
Fig. 33. Comparison of energy efficiency between different dataflows in the CONV layers of AlexNet with a batch size of 16 [3]: (a) breakdown in terms of storage levels and ALU, (b) breakdown in terms of data types. OSA, OSB and OSC are three variants of the OS dataflow that are commonly seen in different implementations [80].
[Fig. 34 (not shown): normalized energy/MAC for the WS, OSA, OSB, OSC, NLR and RS dataflows, broken down into pixels, weights and psums.]
Fig. 34. Comparison of energy efficiency between different dataflows in the FC layers of AlexNet with a batch size of 16 [80].
VI. NEAR-DATA PROCESSING

The previous section highlighted that data movement dominates energy consumption. While spatial architectures distribute the on-chip memory such that it is closer to the computation (e.g., into the PE), there have also been efforts to bring the off-chip high-density memory closer to the computation, or to integrate the computation into the memory itself; the latter is often referred to as processing-in-memory or logic-in-memory. In embedded systems, there have also been efforts to bring the computation into the sensor, where the data is first collected.

In this section, we will discuss how moving compute and data closer to reduce data movement (i.e., near-data processing) can be achieved using mixed-signal circuit design and advanced memory technologies.

Many of these works use analog processing, which has the drawback of increased sensitivity to circuit and device non-idealities. Consequently, the computation is often performed at reduced precision, which can be accounted for during the training of the DNNs using the techniques discussed in Section VII. Another factor to take into consideration is that DNNs are often trained in the digital domain; thus, for analog processing, there is an additional overhead cost for analog-to-digital conversion (ADC) and digital-to-analog conversion (DAC).

A. DRAM

Advanced memory technology can reduce the access energy for high-density memories such as DRAMs. For instance, embedded DRAM (eDRAM) brings high-density memory on-chip to avoid the high energy cost of switching off-chip capacitance [97]; eDRAM is 2.85× higher density than SRAM and 321× more energy efficient than DRAM (DDR3) [93]. eDRAM also offers higher bandwidth and lower latency compared to DRAM. In DNN processing, eDRAM can be used to store tens of megabytes of weights and activations on-chip to avoid off-chip access, as demonstrated in DaDianNao [93]. The downside of eDRAM is that it has lower density than off-chip DRAM and can increase the cost of the chip.

Rather than integrating DRAM into the chip itself, the DRAM can also be stacked on top of the chip using through-silicon vias (TSV). This technology is often referred to as 3-D memory, and has been commercialized in the form of the Hybrid Memory Cube (HMC) [98] and High Bandwidth Memory (HBM) [99]. 3-D memory delivers an order of magnitude higher bandwidth and reduces access energy by up to 5× relative to existing 2-D DRAMs, as TSVs have lower capacitance than typical off-chip interconnects. Recent works have explored the use of HMC for efficient DNN processing in a variety of ways. For instance, Neurocube [100] integrates SIMD processors into the logic die of the HMC to bring the memory and computation …
[Figure residue (not shown): ideal transfer curve, and the standard deviation of ΔVBL (V) from Monte Carlo simulation.]

… G2 is embedded within memory, which reduces data movement, and increased density since memory and computation can be …
[Fig. 42 (not shown): the CNN shape configuration (# of channels, # of filters, etc.) and the hardware energy costs of each MAC and memory access feed an optimization of the number of accesses at each memory level, producing the data energy E_data.]
Fig. 42. Energy estimation methodology from [142], which estimates the energy based on data movement from different levels of the memory hierarchy, number of MACs, and data sparsity.

[Fig. 43 (not shown): (a) scatter of top-5 accuracy (77–93%) versus normalized energy consumption (5E+08 to 5E+10) for AlexNet, SqueezeNet, GoogLeNet, VGG-16 and ResNet-50; (b) the same plot comparing the original DNNs against magnitude-based and energy-aware pruning, with a 1.74× gap marked for AlexNet.]
Fig. 43. Energy values estimated with methodology in [142]: (a) energy versus accuracy trade-off of popular DNN models; (b) impact of energy-aware pruning.

[Fig. 44 (not shown): sparse matrix-vector multiplication using (a) the compressed sparse row (CSR) format and (b) the compressed sparse column (CSC) format.]
Fig. 44. Sparse matrix-vector multiplications using different storage formats (Figure from [144]).
… can then be used to prune weights based on energy to reduce the overall energy across all layers by 3.7× for AlexNet, which is 1.74× more efficient than magnitude-based approaches [141], as shown in Fig. 43(b). As mentioned previously, it is well known that AlexNet is over-parameterized. The energy-aware pruning can also be applied to GoogLeNet, which is already a small DNN model, for a 1.6× energy reduction.

Recent works have examined how to efficiently support the processing of sparse weights in hardware. One area of interest is how to best store the sparse weights after pruning. Similar to compressing the sparse activations discussed in Section VII-B1, the sparse weights can be compressed to reduce memory access bandwidth by 20 to 30% [118].

When DNN processing is performed as a matrix-vector multiplication, as shown in Fig. 18(a), one challenge is to determine how to store the sparse weight matrix in a compressed format. The compression can be applied either in row or column order. A compressed sparse row (CSR) format, as shown in Fig. 44(a), is often used to perform sparse matrix-vector multiplication. However, the input vector needs to be read in multiple times even though only a subset of it is used, since each row of the matrix is sparse. Alternatively, a compressed sparse column (CSC) format, as shown in Fig. 44(b), can be used, where the output is updated several times, and only one element of the input vector is read at a time [144]. The CSC format will provide an overall lower memory bandwidth than CSR if the output is smaller than the input, or, in the case of DNNs, if the number of filters is not significantly larger than the number of weights in the filter (C × R × S from Fig. 9(b)). Since this is often true, CSC can be an effective format for sparse DNN processing.
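A sketch of a CSC-based sparse matrix-vector product, assuming the standard (values, row indices, column pointers) arrays; note how work is skipped whenever an input element is zero:

```python
import numpy as np

def csc_matvec(values, row_idx, col_ptr, x, m):
    # y = W @ x, with W (m x n) stored in CSC form. Each input element
    # x[j] is read exactly once, and the output is updated several times.
    y = np.zeros(m)
    for j, xj in enumerate(x):
        if xj == 0:
            continue  # exploit input sparsity: skip the whole column
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * xj
    return y
```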
Custom hardware has been explored to efficiently support pruned DNN models. Many works aim to perform the processing without decompressing the weights or activations. EIE [145] performs the sparse matrix-vector multiplication specifically for the fully connected layers. It stores the weights in a CSC format along with the start location of each column, which needs to be stored since the compressed weights have variable length. When the input is not zero, the compressed weight column is read and the output is updated. To handle the sparsity, additional logic is used to keep track of the location of the output that should be updated. SCNN [146] supports processing of convolutional
layers in a compressed format. It uses an input stationary dataflow to deliver the compressed weights and activations to a multiplier array, followed by a scatter network to add the scattered partial sums.

Recent works have also explored the use of structured pruning to avoid the need for custom hardware [147, 148]. Rather than pruning individual weights (also referred to as fine-grained pruning), structured pruning involves pruning groups of weights (also referred to as coarse-grained pruning). The benefits of structured pruning are (1) the resulting weights can better align with the data-parallel architecture (e.g., SIMD) found in existing general-purpose hardware, which results in more efficient processing [149]; and (2) it amortizes the overhead cost required to signal the location of the non-zero weights across a group of weights, which improves compression and thus reduces storage cost. These groups of weights can include a pair of neighboring weights, an entire row or column of a filter, an entire channel of a filter, or the entire filter itself; using larger groups tends to result in a higher loss in accuracy [150].
3) Compact Network Architectures: The number of weights and operations can also be reduced by improving the network architecture itself. The trend is to replace a large filter with a series of smaller filters, which have fewer weights in total; when the filters are applied sequentially, they achieve the same overall effective receptive field (i.e., the region of the input image the filter uses to compute an output). This approach can be applied during the network architecture design (before training) or by decomposing the filters of a trained network (after training). The latter avoids the hassle of training networks from scratch, but is less flexible than the former. For example, existing methods can only decompose a filter in a trained network into a series of filters without non-linearity between them.
a) Before Training: In recent DNN models, filters with a smaller width and height are used more frequently because concatenating several of them can emulate a larger filter, as shown in Fig. 13. For example, one 5×5 convolution can be replaced with two 3×3 convolutions. Alternatively, one N×N convolution can be decomposed into two 1-D convolutions, one 1×N and one N×1 convolution [53]; this basically imposes a restriction that the 2-D filter must be separable, which is a common constraint in image processing [151]. Similarly, a 3-D convolution can be replaced by a set of 2-D convolutions (i.e., each applied to only one of the input channels) followed by 1×1 3-D convolutions, as demonstrated in Xception [152] and MobileNets [153]. The order of the 2-D convolutions and 1×1 3-D convolutions can be switched.
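A quick arithmetic sketch of the weight savings from these decompositions (function and counts illustrative; per output channel and ignoring biases):

```python
def decomposition_savings(N, C):
    # Weights for one N x N x C filter versus the decompositions above.
    full = N * N * C
    two_3x3 = 2 * 3 * 3 * C   # only matches a 5x5 receptive field (N == 5)
    separable_1d = 2 * N * C  # 1xN followed by Nx1
    return full, two_3x3, separable_1d

# e.g., for a 5x5 filter with C = 64: 1600 vs. 1152 vs. 640 weights.
print(decomposition_savings(5, 64))
```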
1×1 convolutional layers can also be used to reduce the number of channels in the output feature map for a given layer, which reduces the number of filter channels, and thus the computation cost, for the filters in the next layer, as demonstrated in [15, 51, 52]; this is often referred to as a 'bottleneck', as discussed in Section III-B. For this purpose, the number of 1×1 filters has to be less than the number of channels in the 1×1 filter. For example, 32 filters of 1×1×64 can transform an input with 64 channels to an output of 32 channels and reduce the number of filter channels in the next layer to 32. SqueezeNet uses many 1×1 filters to aggressively reduce the number of weights [154]. It proposes a fire module that first 'squeezes' the network with 1×1 convolution filters and then expands it with multiple 1×1 and 3×3 convolution filters. It achieves an overall 50× reduction in the number of weights compared to AlexNet, while maintaining the same accuracy. It should be noted, however, that reducing the number of weights does not necessarily reduce energy; for instance, SqueezeNet consumes more energy than AlexNet, as shown in Fig. 43(a).

b) After Training: Tensor decomposition can be used to decompose filters in a trained network without impacting the accuracy. It treats the weights in a layer as a 4-D tensor and breaks it into a combination of smaller tensors (i.e., several layers). Low-rank approximation can then be applied to further increase the compression rate at the cost of accuracy degradation, which can be restored by fine-tuning the weights.

This approach is demonstrated using Canonical Polyadic (CP) decomposition, a high-order extension of singular value decomposition that can be solved by various methods, such as a greedy algorithm [155] or a non-linear least-squares method [156]. Combining CP-decomposition with low-rank approximation achieves a 4.5× speed-up on CPUs [156]. However, CP-decomposition cannot be computed in a numerically stable way when the dimension of the tensor, which represents the weights, is larger than two [156]. To alleviate this problem, Tucker decomposition is adopted instead in [157].

4) Knowledge Distillation: Using a deep network or averaging the predictions of different models (i.e., an ensemble) gives better accuracy than using a single shallower network. However, the computational complexity is also higher. To get the best of both worlds, knowledge distillation transfers the knowledge learned by the complex model (teacher) to the simpler model (student). The student network can therefore achieve an accuracy that would be unachievable if it were directly trained with the same dataset [158, 159]. For example, [160] shows how using knowledge distillation can improve the speech recognition accuracy of a student net by 2%, which is similar to the accuracy of a teacher net that is composed of an ensemble of 10 networks.

Fig. 45 shows the simplest knowledge distillation method [158]. The softmax layer is commonly used as the output layer in image classification networks to generate the class probabilities from the class scores¹²; it squashes the class scores into values between 0 and 1 that sum up to 1. For this knowledge distillation method, soft targets (values between 0 and 1) such as the class scores of the teacher DNN (or an ensemble of teacher DNNs) are used instead of the hard targets (values of either 0 or 1) such as the labels in the dataset. The objective is to minimize the squared difference between the soft targets and the class scores of the student DNN. Class scores are used as the soft targets instead of the class probabilities because small values in the class scores contain important information that may be eliminated by the softmax. Alternatively, class probabilities after the softmax layer can be used as soft targets if the softmax is configured to generate softer class probabilities, where the smaller values retain more information [160].

¹² Also commonly referred to as logits.
Finally, the intermediate representations of the teacher DNN can also be incorporated as extra hints to train the student DNN [161].

[Fig. 45 (not shown): the class scores of a simple (student) DNN are trained to match the combined class scores of complex teacher DNNs A and B, each followed by a softmax that produces class probabilities.]
Fig. 45. Knowledge distillation matches the class scores of a small DNN to an ensemble of large DNNs.
simultaneously can be a challenge.
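A minimal sketch of the two target choices described above, written by us in PyTorch rather than taken from [158, 160]: the first variant minimizes the squared difference between teacher and student class scores (logits), and the second matches softened class probabilities using a temperature T, where T > 1 yields the 'softer' softmax mentioned above.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=1.0, match_scores=True):
    if match_scores:
        # variant 1: squared difference between the soft targets (teacher
        # class scores) and the class scores of the student DNN
        return F.mse_loss(student_logits, teacher_logits)
    # variant 2: match softened class probabilities; dividing by T > 1
    # preserves more of the information in the small class scores
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)

# example: a batch of 8 samples over 10 classes
s, t = torch.randn(8, 10), torch.randn(8, 10)
print(distillation_loss(s, t))
print(distillation_loss(s, t, T=4.0, match_scores=False))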
VIII. BENCHMARKING METRICS FOR DNN EVALUATION AND COMPARISON

As we have seen in this article, there has been a significant amount of research on the efficient processing of DNNs. We should consider several key metrics to compare the various strengths and weaknesses of different designs and proposed techniques. These metrics should cover important attributes such as accuracy/robustness, power/energy consumption, throughput/latency, and cost. Reporting all of these metrics is important in order to provide a complete picture of the trade-offs made by a proposed design or technique. We have prepared a website to collect these metrics from various publications [162].
In terms of accuracy and robustness, it is important that the accuracy be reported on widely-accepted datasets, as discussed in Section IV. The difficulty of the dataset and/or task should be considered when measuring the accuracy. For instance, the MNIST dataset for digit recognition is significantly easier than the ImageNet dataset; as a result, a DNN that performs well on MNIST may not necessarily perform well on ImageNet. Thus it is important that the same dataset and task are used when comparing the accuracy of different DNN models. Currently, ImageNet is preferred since it presents a challenge for DNNs, whereas MNIST can also be addressed with simple non-DNN techniques. To demonstrate primarily hardware innovations, it would be desirable to report results for widely-used DNN models (e.g., AlexNet, GoogLeNet) whose accuracy and robustness have been well studied and tested.
Energy and power are important when processing DNNs at the edge in embedded devices with limited battery capacity (e.g., smart phones, smart sensors, UAVs, and wearables), or in the cloud in data centers with stringent power ceilings due to cooling costs. Edge processing is preferred over the cloud for certain applications due to latency, privacy, or communication bandwidth limitations. When evaluating power and energy consumption, it is important to account for all aspects of the system, including the chip and external memory accesses.
High throughput is necessary to deliver real-time performance for interactive applications such as navigation and robotics. For data analytics, high throughput means that more data can be analyzed in a given amount of time. As the amount of visual data grows exponentially, high-throughput big data analytics becomes important, particularly if an action needs to be taken based on the analysis (e.g., security or terrorist prevention; medical diagnosis).

Low latency is necessary for real-time interactive applications. Latency measures the time between when a pixel arrives at the system and when the result is generated. Latency is measured in seconds, while throughput is measured in operations/second. High throughput is often obtained by batching multiple images/frames together for processing; this results in a latency of multiple frames (e.g., at 30 frames per second, a batch of 100 frames results in a delay of over 3 seconds). This delay is not acceptable for real-time applications such as high-speed navigation, where it would reduce the time available for course correction. Thus, achieving low latency and high throughput simultaneously can be a challenge.
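The batching trade-off can be made explicit with a small calculation (an illustrative sketch; the 30 fps rate and 100-frame batch are just the example above):

def batch_delay_s(batch_size, frame_rate_fps):
    # frames arrive at frame_rate_fps, so the last frame of a batch arrives
    # batch_size / frame_rate_fps seconds after the first; this is a lower
    # bound on the latency of the first result, before any compute time
    return batch_size / frame_rate_fps

print(batch_delay_s(batch_size=100, frame_rate_fps=30))  # ~3.3 s before any result
print(batch_delay_s(batch_size=1, frame_rate_fps=30))    # ~0.03 s, but lower throughput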
Hardware cost is in large part dictated by the amount of on-chip storage and the number of cores. Typical embedded processors have limited on-chip storage, on the order of a few hundred kilobytes. Since there is a trade-off between the amount of on-chip memory and the external memory bandwidth, both metrics should be reported. Similarly, there is a correlation between the number of cores and the throughput. In addition, while many cores can be built on a chip, the number of cores that can actually be used at a given time should be reported; it is often unrealistic to assume peak utilization and performance due to limitations of mapping and memory bandwidth. Accordingly, the power and throughput should be reported for running actual DNNs, as opposed to only reporting theoretical limits.
A. Metrics for DNN Models

To evaluate the properties of a given DNN model, we should consider the following metrics (a sketch for computing the last two appears after this list):
• The accuracy of the model in terms of the top-5 error on datasets such as ImageNet. The type of data augmentation used (e.g., multiple crops, ensemble models) should also be reported.
• The network architecture of the model, including the number of layers, filter sizes, number of filters, and number of channels.
• The number of weights, which impacts the storage requirement of the model. If possible, the number of non-zero weights should be reported, since this reflects the theoretical minimum storage requirement.
• The number of MACs that need to be performed, as it is somewhat indicative of the number of operations and the potential throughput of the given DNN. If possible, the number of non-zero MACs should also be reported, since this reflects the theoretical minimum compute requirement.
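A rough way to extract the weight and MAC counts (and the non-zero weight count) from a model is sketched below. This is our own PyTorch illustration, it only handles convolutional and fully-connected layers, and its output will not exactly match Table IV, which uses the original AlexNet variant and reports non-zero MACs measured on real input data.

import torch
import torch.nn as nn

def model_metrics(model, x):
    macs = []

    def hook(m, inp, out):
        if isinstance(m, nn.Conv2d):
            # MACs = output elements x (filter height x width x channels per group)
            k = m.kernel_size[0] * m.kernel_size[1] * (m.in_channels // m.groups)
            macs.append(out.numel() * k)
        elif isinstance(m, nn.Linear):
            macs.append(out.shape[0] * m.in_features * m.out_features)

    handles = [m.register_forward_hook(hook) for m in model.modules()]
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()

    weights = sum(p.numel() for p in model.parameters())
    nz_weights = sum(int((p != 0).sum()) for p in model.parameters())
    return {"weights": weights, "NZ weights": nz_weights, "MACs": sum(macs)}

# example: AlexNet from torchvision on a single 224x224 image
from torchvision.models import alexnet
print(model_metrics(alexnet(), torch.randn(1, 3, 224, 224)))

Note that non-zero MACs additionally depend on the activations, which is precisely why the text below proposes fixing the input data to a standard validation set.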
Table IV shows how these metrics are reported for various well-known DNNs. The accuracy is reported for the case where only a single crop for a single model is used for classification, so that the number of weights and MACs in the table are consistent.13 Note that accounting for the number of non-zero (NZ) operations significantly reduces the number of MACs and weights. Since the number of NZ MACs depends on the input data, we propose using the publicly available 50,000 validation images from ImageNet for the computation. Finally, there are various methods to reduce the weights in a DNN (e.g., network pruning in Section VII-B2). Table IV shows another example of these DNN model metrics, by comparing sparse DNNs pruned using [142] to dense DNNs.

13 Data augmentation is often used to increase accuracy. This includes using multiple crops of an image to account for misalignment; in addition, an ensemble of multiple models can be used, where each model has different weights due to different training settings, such as using different initializations or datasets, or even different network architectures. If multiple crops and models are used, then the number of MACs and weights required would increase.

TABLE IV
METRICS FOR POPULAR DNN MODELS. SPARSITY IS ACCOUNTED FOR BY REPORTING NON-ZERO (NZ) WEIGHTS AND MACS.

                                        AlexNet             GoogLeNet v1
  Metrics                          dense     sparse      dense     sparse
  Top-5 error                       19.6       20.4       11.7       12.7
  CONV layers
    Number of CONV Layers              5          5         57         57
    Depth (in CONV layers)             5          5         21         21
    Filter Sizes                        3,5,11                1,3,5,7
    Number of Channels                  3-256                 3-832
    Number of Filters                   96-384                16-384
    Stride                              1,4                   1,2
    NZ Weights                      2.3M       351k       6.0M       1.5M
    NZ MACs                         395M      56.4M       806M       220M
  FC layers
    Number of FC Layers                3          3          1          1
    Filter Sizes                        1,6                   1
    Number of Channels                  256-4096              1024
    Number of Filters                   1000-4096             1000
    NZ Weights                     58.6M       5.4M         1M       870k
    NZ MACs                        14.5M       1.9M       635k       663k
  Total NZ Weights                   61M       5.7M         7M       2.4M
  Total NZ MACs                     410M      58.3M       806M       221M
B. Metrics for DNN Hardware

To measure the efficiency of the DNN hardware, we should consider the following additional metrics (a small reporting sketch follows this list):
• The power and energy consumption of the design should be reported for various DNN models; the DNN model specifications should be provided, including which layers and bit precisions are supported by the hardware during measurement. In addition, the amount of off-chip accesses (e.g., DRAM accesses) should be included, since it accounts for a significant portion of the system power; it can be reported in terms of the total amount of data that is read and written off-chip per inference.
• The latency and throughput should be reported in terms of the batch size and the actual run time for various DNN models, which accounts for mapping and memory bandwidth effects. This provides a more useful and informative metric than peak throughput.
• The cost of the chip depends on the area efficiency, which accounts for the size and type of memory (e.g., registers or SRAM) and the amount of control logic. It should be reported in terms of the core area in square millimeters per multiplier, along with the process technology.
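A small sketch (ours, with hypothetical field names) of how these per-model specifications might be derived from raw measurements; the numbers passed in are illustrative only, not measurements from any real chip:

def report(model_name, avg_power_w, batch_size, batch_runtime_s, offchip_bytes):
    # derive the per-inference quantities the metrics above ask for
    return {
        "model": model_name,
        "throughput (inf/s)": batch_size / batch_runtime_s,
        "latency (s, full batch)": batch_runtime_s,
        "energy (mJ/inference)": 1e3 * avg_power_w * batch_runtime_s / batch_size,
        "off-chip traffic (MB/inference)": offchip_bytes / batch_size / 1e6,
    }

print(report("AlexNet", avg_power_w=0.28, batch_size=4,
             batch_runtime_s=0.115, offchip_bytes=15.4e6))

Reporting batch size alongside throughput matters because, as noted earlier, a large batch inflates throughput while also inflating latency.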
In terms of cost, different platforms will have different implementation-specific metrics. For instance, for an FPGA, the specific device should be reported, along with the utilization of resources such as DSPs, BRAM, LUTs, and FFs; performance density, such as GOPs/slice, can also be reported.

Each processor should report various specifications for each metric, as shown in Table V, using the Eyeriss chip as an example. It is important that all metrics and specifications are accounted for in order to fairly evaluate all the design trade-offs. For instance, without the accuracy given for a specific dataset and task, one could run a simple DNN and easily claim low power, high throughput, and low cost; however, the processor might not be usable for a meaningful task. Alternatively, without reporting the off-chip bandwidth, one could build a processor with only multipliers and easily claim low cost, high throughput, high accuracy, and low chip power; however, when evaluating system power, the off-chip memory access would be substantial. Finally, the test setup should also be reported, including whether the results are measured or obtained from simulation14 and how many images were tested.

14 If obtained from simulation, it should be clarified whether it is from synthesis or post place-and-route, and what library corner was used.

In summary, the evaluation process for deciding whether a DNN system is a viable solution for a given application might go as follows: (1) the accuracy determines if it can perform the given task; (2) the latency and throughput determine if it can run fast enough and in real time; (3) the energy and power consumption will primarily dictate the form factor of the device where the processing can operate; and (4) the cost, which is primarily dictated by the chip area, determines how much one would pay for this solution.
IX. SUMMARY

The use of deep neural networks (DNNs) has seen explosive growth in the past few years. They are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics, and often deliver better-than-human accuracy. However, while DNNs can deliver this outstanding accuracy, it comes at the cost of high computational complexity. Consequently, techniques that enable the efficient processing of deep neural networks, improving energy efficiency and throughput without sacrificing accuracy, with cost-effective hardware, are critical to expanding the deployment of DNNs in both existing and new domains.

Creating a system for efficient DNN processing should begin with understanding the current and future applications and the specific computations required, both now and for the potential evolution of those computations. This article surveys a number of the current applications, focusing on computer vision applications, the associated algorithms, and the data being used to drive the algorithms. These applications, algorithms, and input data are experiencing rapid change, so extrapolating these trends to determine the degree of flexibility desired to handle next-generation computations becomes an important ingredient of any design project.
During the design-space exploration process, it is critical to understand and balance the important system metrics. For DNN computation these include the accuracy, energy, throughput, and hardware cost. Evaluating these metrics is, of course, key, so this article surveys the important components of a DNN workload. Specifically, a DNN workload has two major components. First, the workload is the form of each DNN network, including the 'shape' of each layer and the interconnections between layers; these can vary both within and between applications. Second, the workload consists of the specific data input to the DNN; this data will vary with the input set used for training or the data input during operation for inference.

This article also surveys a number of avenues that prior work has taken to optimize DNN processing. Since data movement dominates energy consumption, a primary focus of some recent research has been to reduce data movement while maintaining accuracy, throughput, and cost. This means selecting architectures with favorable memory hierarchies, like a spatial array, and developing dataflows that increase data reuse at the low-cost levels of the memory hierarchy. We have included a taxonomy of dataflows and an analysis of their characteristics. Other work is presented that aims to save space and energy by changing the representation of data values in the DNN. Still other work saves energy, and sometimes increases throughput, by exploiting the sparsity of weights and/or activations.

The DNN domain also affords an excellent opportunity for joint hardware/software co-design. For example, various efforts have noted that efficiency can be improved by increasing sparsity (increasing the number of zero values) or by optimizing the representation of data, either by reducing the precision of values or by using more complex mappings of the stored value to the actual value used for computation. However, to avoid losing accuracy, it is often useful to modify the network or fine-tune the network's weights to accommodate these changes. Thus, this article both reviews a variety of these techniques and discusses the frameworks that are available for describing, running, and training networks.

Finally, DNNs afford the opportunity to use mixed-signal circuit design and advanced technologies to improve efficiency. These include using memristors for analog computation and 3-D stacked memory. Advanced technologies can also facilitate moving computation closer to the source by embedding computation near or within the sensor and the memories. Of course, all of these techniques should also be considered in combination, while being careful to understand their interactions and looking for opportunities for joint hardware/algorithm co-optimization.

In conclusion, although much work has been done, deep neural networks remain an important area of research with many promising applications and opportunities for innovation at various levels of hardware design.

ACKNOWLEDGMENTS

Funding provided by DARPA YFA, MIT CICS, and gifts from Nvidia and Intel. The authors thank the anonymous reviewers as well as James Noraky, Mehul Tikekar, and Zhengdong Zhang for providing valuable feedback on this paper.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[2] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams et al., “Recent advances in deep learning for speech research at Microsoft,” in ICASSP, 2013.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012.
[4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “DeepDriving: Learning affordance for direct perception in autonomous driving,” in ICCV, 2015.
[5] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[7] F.-F. Li, A. Karpathy, and J. Johnson, “Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition,” http://cs231n.stanford.edu/.
[8] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
[9] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch et al., “Convolutional networks for fast, energy-efficient neuromorphic computing,” Proceedings of the National Academy of Sciences, 2016.
[10] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,” in ICLR, 2014.
[11] Y. LeCun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard, “Handwritten digit recognition: applications of neural network chips and automatic learning,” IEEE Commun. Mag., vol. 27, no. 11, pp. 41–46, Nov. 1989.
[12] B. Widrow and M. E. Hoff, “Adaptive switching circuits,” in 1960 IRE WESCON Convention Record, 1960.
[13] B. Widrow, “Thinking about thinking: the discovery of the LMS algorithm,” IEEE Signal Process. Mag., 2005.
[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2016.
[16] “Complete Visual Networking Index (VNI) Forecast,” Cisco, June 2016.
[17] J. Woodhouse, “Big, big, big data: higher and higher resolution video surveillance,” technology.ihs.com, January 2016.
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in CVPR, 2014.
[19] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” in CVPR, 2015.
[20] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
[21] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
[22] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.
[23] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR abs/1609.03499, 2016.
[24] H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. Yuen, Y. Hua, S. Gueroussov, H. S. Najafabadi, T. R. Hughes et al., “The human splicing code reveals new insights into the genetic determinants of disease,” Science, vol. 347, no. 6218, p. 1254806, 2015.
[25] J. Zhou and O. G. Troyanskaya, “Predicting effects of noncoding variants with deep learning-based sequence model,” Nature Methods, vol. 12, no. 10, pp. 931–934, 2015.
[26] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning,” Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.
[27] H. Zeng, M. D. Edwards, G. Liu, and D. K. Gifford, “Convolutional neural network architectures for predicting DNA–protein binding,” Bioinformatics, vol. 32, no. 12, pp. i121–i127, 2016.
[28] M. Jermyn, J. Desroches, J. Mercier, M.-A. Tremblay, K. St-Arnaud, M.-C. Guiot, K. Petrecca, and F. Leblond, “Neural networks improve brain cancer detection with Raman spectroscopy in the presence of operating room light artifacts,” Journal of Biomedical Optics, vol. 21, no. 9, pp. 094002–094002, 2016.
[29] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deep learning for identifying metastatic breast cancer,” arXiv preprint arXiv:1606.05718, 2016.
[30] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[31] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” in NIPS Deep Learning Workshop, 2013.
[32] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016.
[33] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, “From Perception to Decision: A Data-driven Approach to End-to-end Motion Planning for Autonomous Ground Robots,” in ICRA, 2017.
[34] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and planning for visual navigation,” in CVPR, 2017.
[35] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, “Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search,” in ICRA, 2016.
[36] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” in NIPS Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
[37] N. Hemsoth, “The Next Wave of Deep Learning Applications,” Next Platform, September 2016.
[38] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[39] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for LVCSR,” in ICASSP, 2013.
[40] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in ICML, 2010.
[41] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, 2013.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in ICCV, 2015.
[43] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” in ICLR, 2016.
[44] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, “Improving deep neural network acoustic models using generalized maxout networks,” in ICASSP, 2014.
[45] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, and A. Courville, “Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks,” in Interspeech, 2016.
[46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia, 2014.
[47] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
[48] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[49] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks,” in ICLR, 2014.
[50] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in ICLR, 2015.
[51] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper With Convolutions,” in CVPR, 2015.
[52] M. Lin, Q. Chen, and S. Yan, “Network in Network,” in ICLR, 2014.
[53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in CVPR, 2016.
[54] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,” in AAAI, 2017.
[55] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose, and M. Richardson, “Do Deep Convolutional Nets Really Need to be Deep and Convolutional?” in ICLR, 2017.
[56] “Caffe LeNet MNIST,” http://caffe.berkeleyvision.org/gathered/examples/mnist.html.
[57] “Caffe Model Zoo,” http://caffe.berkeleyvision.org/model_zoo.html.
[58] “Matconvnet Pretrained Models,” http://www.vlfeat.org/matconvnet/pretrained/.
[59] “TensorFlow-Slim image classification library,” https://github.com/tensorflow/models/tree/master/slim.
[60] “Deep Learning Frameworks,” https://developer.nvidia.com/deep-learning-frameworks.
[61] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE J. Solid-State Circuits, vol. 51, no. 1, 2017.
[62] Y. LeCun, C. Cortes, and C. J. Burges, “THE MNIST DATABASE of handwritten digits,” http://yann.lecun.com/exdb/mnist/.
[63] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in ICML, 2013.
[64] A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset,” https://www.cs.toronto.edu/~kriz/cifar.html.
[65] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, 2008.
[66] A. Krizhevsky and G. Hinton, “Convolutional deep belief networks on CIFAR-10,” Unpublished manuscript, vol. 40, 2010.
[67] B. Graham, “Fractional max-pooling,” arXiv preprint arXiv:1412.6071, 2014.
[68] “Pascal VOC data sets,” http://host.robots.ox.ac.uk/pascal/VOC/.
[69] “Microsoft Common Objects in Context (COCO) dataset,” http://mscoco.org/.
[70] “Google Open Images,” https://github.com/openimages/dataset.
[71] “YouTube-8M,” https://research.google.com/youtube8m/.
[72] “AudioSet,” https://research.google.com/audioset/index.html.
[73] S. Condon, “Facebook unveils Big Basin, new server geared for deep learning,” ZDNet, March 2017.
[74] C. Dubout and F. Fleuret, “Exact acceleration of linear object detectors,” in ECCV, 2012.
[75] J. Cong and B. Xiao, “Minimizing computation in convolutional neural networks,” in ICANN, 2014.
[76] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in CVPR, 2016.
[77] “Intel Math Kernel Library,” https://software.intel.com/en-us/mkl.
[78] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient Primitives for Deep Learning,” arXiv preprint arXiv:1410.0759, 2014.
[79] M. Horowitz, “Computing’s energy problem (and what we can do about it),” in ISSCC, 2014.
[80] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in ISCA, 2016.
[81] ——, “Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators,” IEEE Micro’s Top Picks from the Computer Architecture Conferences, vol. 37, no. 3, May–June 2017.
[82] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, “A Massively Parallel Coprocessor for Convolutional Neural Networks,” in ASAP, 2009.
[83] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, “Towards an embedded biologically-inspired machine vision processor,” in FPT, 2010.
[84] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A Dynamically Configurable Coprocessor for Convolutional Neural Networks,” in ISCA, 2010.
[85] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks,” in CVPR Workshop, 2014.
[86] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, “A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications,” in ISSCC, 2015.
[87] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A Convolutional Network Accelerator,” in GLVLSI, 2015.
[88] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep Learning with Limited Numerical Precision,” in ICML, 2015.
[89] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in ISCA, 2015.
[90] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, “Memory-centric accelerator design for Convolutional Neural Networks,” in ICCD, 2013.
[91] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks,” in FPGA, 2015.
[92] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning,” in ASPLOS, 2014.
[93] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A Machine-Learning Supercomputer,” in MICRO, 2014.
[94] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in ISSCC, 2016.
[95] V. Sze, M. Budagavi, and G. J. Sullivan, “High Efficiency Video Coding (HEVC): Algorithms and Architectures,” in Integrated Circuit and Systems. Springer, 2014, pp. 1–375.
[96] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in MICRO, 2016.
[97] D. Keitel-Schulz and N. Wehn, “Embedded DRAM development: Technology, physical design, and application issues,” IEEE Des. Test. Comput., vol. 18, no. 3, pp. 7–15, 2001.
[98] J. Jeddeloh and B. Keeth, “Hybrid memory cube new DRAM architecture increases density and performance,” in Symp. on VLSI, 2012.
[99] JEDEC Standard, “High bandwidth memory (HBM) DRAM,” JESD235, 2013.
[100] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” in ISCA, 2016.
[101] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” in ASPLOS, 2017.
[102] J. Zhang, Z. Wang, and N. Verma, “A machine-learning classifier implemented in a standard 6T SRAM array,” in Symp. on VLSI, 2016.
[103] Z. Wang, R. Schapire, and N. Verma, “Error-adaptive classifier boosting (EACB): Exploiting data-driven training for highly fault-tolerant hardware,” in ICASSP, 2014.
[104] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” in ISCA, 2016.
[105] L. Chua, “Memristor-the missing circuit element,” IEEE Trans. Circuit Theory, vol. 18, no. 5, pp. 507–519, 1971.
[106] L. Wilson, “International technology roadmap for semiconductors (ITRS),” Semiconductor Industry Association, 2013.
[107] D. Lu, “Tutorial on Emerging Memory Devices,” 2016.
[108] S. B. Eryilmaz, S. Joshi, E. Neftci, W. Wan, G. Cauwenberghs, and H.-S. P. Wong, “Neuromorphic architectures with electronic synapses,” in ISQED, 2016.
[109] P. Chi, S. Li, Z. Qi, P. Gu, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A Novel Processing-In-Memory Architecture for Neural Network Computation in ReRAM-based Main Memory,” in ISCA, 2016.
[110] M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. Adam, K. K. Likharev, and D. B. Strukov, “Training and operation of an integrated neuromorphic network based on metal-oxide memristors,” Nature, vol. 521, no. 7550, pp. 61–64, 2015.
[111] J. Zhang, Z. Wang, and N. Verma, “A matrix-multiplying ADC implementing a machine-learning classifier directly with data conversion,” in ISSCC, 2015.
[112] E. H. Lee and S. S. Wong, “A 2.5 GHz 7.7 TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40nm,” in ISSCC, 2016.
[113] R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong, “RedEye: analog ConvNet image sensor architecture for continuous mobile vision,” in ISCA, 2016.
[114] A. Wang, S. Sivaramakrishnan, and A. Molnar, “A 180nm CMOS image sensor with on-chip optoelectronic image compression,” in CICC, 2012.
[115] H. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivaramakrishnan, A. Veeraraghavan, and A. Molnar, “ASP Vision: Optically Computing the First Layer of Convolutional Neural Networks using Angle Sensitive Pixels,” in CVPR, 2016.
[116] A. Suleiman and V. Sze, “Energy-efficient HOG-based object detection at 1080HD 60 fps with multi-scale support,” in SiPS, 2014.
[117] E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, “LogNet: Energy-Efficient Neural Networks Using Logarithmic Computations,” in ICASSP, 2017.
[118] S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” in ICLR, 2016.
[119] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” arXiv preprint arXiv:1609.07061, 2016.
[120] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
[121] Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, “Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA,” in FPL, 2016.
[122] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented Approximation of Convolutional Neural Networks,” in ICLR, 2016.
[123] S. Higginbotham, “Google Takes Unconventional Route with Homegrown Machine Learning Chips,” Next Platform, May 2016.
[124] T. P. Morgan, “Nvidia Pushes Deep Learning Inference With New Pascal GPUs,” Next Platform, September 2016.
[125] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in MICRO, 2016.
[126] B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” in Symp. on VLSI, 2016.
[127] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in NIPS, 2015.
[128] M. Courbariaux and Y. Bengio, “BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
[129] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” in ECCV, 2016.
[130] Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with low precision by half-wave gaussian quantization,” in CVPR, 2017.
[131] F. Li and B. Liu, “Ternary weight networks,” in NIPS Workshop on Efficient Methods for Deep Neural Networks, 2016.
[132] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained Ternary Quantization,” in ICLR, 2017.
[133] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights,” in ISVLSI, 2016.
[134] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, T. Kuroda, and M. Motomura, “BRein Memory: A 13-Layer 4.2 K Neuron/0.8 M Synapse Binary/Ternary Reconfigurable In-Memory Deep Neural Network Accelerator in 65nm CMOS,” in Symp. on VLSI, 2017.
[135] D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional Neural Networks using Logarithmic Data Representation,” arXiv preprint arXiv:1603.01025, 2016.
[136] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights,” in ICLR, 2017.
[137] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing Neural Networks with the Hashing Trick,” in ICML, 2015.
[138] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: ineffectual-neuron-free deep neural network computing,” in ISCA, 2016.
[139] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in ISCA, 2016.
[140] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal Brain Damage,” in NIPS, 1990.
[141] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in NIPS, 2015.
[142] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning,” in CVPR, 2017.
[143] “DNN Energy Estimation,” http://eyeriss.mit.edu/energy.html.
[144] R. Dorrance, F. Ren, and D. Marković, “A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs,” in ISFPGA, 2014.
[145] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz,