Accepted for publication in IEEE Access

Digital Object Identifier 10.1109/ACCESS.2020.2982416

Recurrent Neural Networks: An Embedded Computing Perspective
NESMA M. REZK (1), MADHURA PURNAPRAJNA (2), TOMAS NORDSTRÖM (3), AND ZAIN UL-ABDIN (1)
(1) School of Information Technology, Halmstad University, Sweden (e-mail: {nesma.rezk, zain-ul-abdin}@hh.se)
(2) Amrita Vishwa Vidyapeetham, Bangalore, India (e-mail: p_madhura@blr.amrita.edu)
(3) Umeå University (e-mail: tomas.nordstrom@umu.se)
Corresponding author: Nesma M. Rezk (e-mail: nesma.rezk@hh.se).
This research is performed in the NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability)
Indo-Swedish project, funded by VINNOVA Strategic Innovation grant and the Department of Science and Technology
(INT/SWD/VINN/p-10/2015), Government of India.

ABSTRACT Recurrent Neural Networks (RNNs) are a class of machine learning algorithms used for
applications with time-series and sequential data. Recently, there has been a strong interest in executing
RNNs on embedded devices. However, difficulties have arisen because RNNs require high computational
capability and a large memory space. In this paper, we review existing implementations of RNN models on
embedded platforms and discuss the methods adopted to overcome the limitations of embedded systems.
We will define the objectives of mapping RNN algorithms on embedded platforms and the challenges facing
their realization. Then, we explain the components of RNN models from an implementation perspective.
We also discuss the optimizations applied to RNNs to run efficiently on embedded platforms. Finally, we
compare the defined objectives with the implementations and highlight some open research questions and
aspects currently not addressed for embedded RNNs.
Overall, applying algorithmic optimizations to RNN models and decreasing the memory access overhead
is vital to obtain high efficiency. To further increase the implementation efficiency, we point out the more
promising optimizations that could be applied in future research. Additionally, this article observes that high
performance has been targeted by many implementations, while flexibility has, as yet, been attempted less
often. Thus, the article provides some guidelines for RNN hardware designers to support flexibility in a
better manner.

INDEX TERMS Compression, Flexibility, Efficiency, Embedded computing, Long Short Term Memory
(LSTM), Quantization, Recurrent Neural Networks (RNNs)

I. INTRODUCTION
Recurrent Neural Networks (RNNs) are a class of Neural Networks (NNs) dealing with applications that have sequential data inputs or outputs. RNNs capture the temporal relationship between input/output sequences by introducing feedback to FeedForward (FF) neural networks. Thus, many applications with sequential data, such as speech recognition [1], language translation [2], and human activity recognition [3], can benefit from RNNs.

In contrast to cloud computing, edge computing can guarantee better response time and enhance security for the running application. Augmenting edge devices with RNNs grants them the intelligence to process and respond to sequential problems. Realization on embedded platforms in edge devices imposes certain optimizations on RNN applications. Embedded platforms are time-constrained systems that suffer from limited memory and power resources. To run RNN applications efficiently on embedded platforms, RNN applications need to overcome these restrictions.

A. SCOPE OF THE ARTICLE
In this article, we study RNN models and specifically focus on RNN optimizations and implementations on embedded platforms. The article compares recent implementations of RNN models on embedded systems found in the literature. For a research paper to be included in the comparison, it should satisfy the following conditions:
• It discusses the implementation of an RNN model or the recurrent layer of an RNN model.
• The target platform is an embedded platform such as an FPGA, ASIC, etc.


FIGURE 1: Structure of the survey article. RNN models should run on an embedded platform in an edge device. Section
II discusses the objectives of such an implementation and the challenges facing it. Section III describes the RNN models in
detail. There follows a discussion of how algorithmic optimizations (Section IV-A) may be applied to RNN models and how
platform-specific optimizations (Section IV-B) are applied to embedded platforms. The resulting implementations are discussed
in Section V and compared to the objectives in Section VI.

To provide a complete study, the survey also addresses the methods used for optimizing the RNN models and realizing them on embedded systems.

This survey distinguishes itself from related works because no existing article includes the components of the RNN models together with optimizations and implementations in a single analysis, as may be seen in Table 1. The other surveys focus on one or two aspects compared to those covered in this article. Some articles study RNNs from an algorithmic point of view [4], [5], while another group of survey articles looks at hardware implementations. For example, one survey on the efficient processing of neural networks [6] studied CNNs, CNN optimizations, and CNN implementations, while another CNN survey [7] studied CNN mappings on FPGAs. Some articles specialized in algorithmic optimizations such as compression [8]. All algorithmic optimizations for both CNNs and RNNs were surveyed in one article that also discussed their implementations [9]. However, the main scope of that article was optimizations, and so RNN models and their components were not studied. Furthermore, the RNN implementations included were limited to speech recognition applications on the TIDIGITS dataset.

B. CONTRIBUTIONS
This survey article provides the following:
• A detailed comparison of RNN models' components from a computer architecture perspective that addresses computational and memory requirements.
• A study of the optimizations applied to RNNs to execute them on embedded platforms.
• An application-independent comparison of recent implementations of RNNs on embedded platforms.
• Identification of possible opportunities for future research.

C. SURVEY STRUCTURE
This survey article is organized as shown in Figure 1. Section II defines the objectives of realizing RNN models on embedded platforms and the challenges faced in achieving them. We then define a general model for RNN applications and discuss different variations of the recurrent layers in RNN models in Section III. However, it is difficult to run RNN models in their original form efficiently on embedded platforms. Therefore, researchers have applied optimizations to both the RNN model and the target platform. The optimizations applied to the RNN model are called algorithmic optimizations and are discussed in Section IV-A; the optimizations applied to the hardware platform are called platform-specific optimizations and are discussed in Section IV-B. Then, in Section V, we present an analysis of the hardware implementations of RNNs suggested in the literature. The implementations are compared against the applied optimizations and their achieved performance. In Section VI, we compare the implementations analyzed in Section V with the objectives defined in Section II, to identify the gap between them and propose research opportunities to fill this gap. Finally, in Section VII, we summarize our survey.

TABLE 1: Comparison with related survey articles.

| Article                           | [4] & [5] | [8] | [9]                     | [6] | [7]        | This article |
| Analysis of RNN model components  | ✓         | ✗   | ✗                       | ✗   | ✗          | ✓            |
| Analysis of CNN model components  | ✗         | ✗   | ✗                       | ✓   | ✗          | ✗            |
| Algorithmic optimizations         | ✗         | ✓   | ✓                       | ✓   | ✓          | ✓            |
| Platform-specific optimizations   | ✗         | ✗   | ✗                       | ✓   | ✓          | ✓            |
| RNN implementations               | ✗         | ✗   | Speech recognition only | ✗   | ✗          | ✓            |
| CNN implementations               | ✗         | ✗   | ✓                       | ✓   | FPGAs only | ✗            |

II. OBJECTIVES AND CHALLENGES
Implementation efficiency is the primary objective in implementing RNN applications on embedded systems. Implementation efficiency requires the implementation to have high throughput, low energy consumption, and to meet real-time requirements. A secondary objective for the implementation would be flexibility. Flexibility requires the implementation to support variations in the RNN model, to allow for online training, and to meet different application requirements. In meeting these objectives, there exist challenges in mapping these applications onto embedded systems, such as the large number of computations to be performed within the limited available memory. These objectives and challenges are discussed in detail below.

A. OBJECTIVES OF REALIZING RNNS ON EMBEDDED PLATFORMS
To realize RNN models on embedded platforms, we define some objectives that will influence the solution. These objectives are divided into implementation efficiency objectives and flexibility objectives.

1) Implementation Efficiency
Since we target embedded platforms, we consider the online execution of the application. To satisfy the implementation efficiency objective, the implementation should have high throughput, low energy consumption, and meet the real-time requirements of the application. The real-time requirements of the application pose additional demands on the throughput, energy consumption, and accuracy of the implementation. Accuracy indicates how correct the model is in performing recognition, classification, translation, etc.
• High throughput: Throughput is a measure of performance. It measures the number of processed input/output samples per second. Application-level inputs and outputs are diverse. For image processing applications, the input can be frames and the throughput can be the number of consumed frames per second, which may also depend on the frame size. For speech/text applications, it can be the number of predicted words per second. Thus, for different sizes and types of inputs and outputs, throughput can have different units and the throughput value may be interpreted in various ways. To compare different applications, we use the number of operations per second as a measure of throughput.
• Low energy consumption: For an implementation to be considered efficient, its energy consumption should meet the embedded platform's energy constraints. To compare the energy consumption of different implementations, we use the number of operations per second per watt as a measure of energy efficiency.
• Real-time requirements: In real-time implementations, a response cannot be delayed beyond a predefined deadline, and energy consumption cannot exceed a predefined limit. The deadline is defined by the application and is affected by the frequency of sensor inputs and the system response time. Normally, the RNN execution should meet the predefined deadline.

2) Flexibility
The flexibility of the solution in this context is the ability of the solution to run different models under different constraints without being restricted to one model or one configuration. For an implementation to be flexible, we define the following requirements that should be satisfied:
• Supporting variations in the RNN layer: The recurrent layers of RNN models can vary in the type of the layer (different types of recurrent layer are discussed in Section III-B), the number of hidden cells, and the number of recurrent layers.
• Supporting other NN layers: RNN models have other types of NN layers as well. A solution that supports more NN layers is considered a complete solution for RNN models, not just a flexible solution. Convolution layers, fully connected layers, and pooling layers might be required in an RNN model.
• Supporting algorithmic optimization variations: Different algorithmic optimizations are applied to RNN models to implement them efficiently on embedded systems (Section IV). Supporting at least one algorithmic optimization in the hardware solution is in many cases mandatory for a feasible execution of RNN models on an embedded system. Combinations of optimizations lead to higher efficiency and flexibility, as this gives the algorithmic designer more choices while optimizing the model for embedded execution.
• Online training: Training is a process that sets parameter values within the neural network. In embedded platforms, training is performed offline, and only inference is run on the platform at run-time. For real-life problems, it is often not enough to run only inference on the embedded platform – some level of training is required at run-time as well. Online training allows the neural network to adapt to new data that was not encountered within the training data, and to adapt to changes in the environment. For example, online training is required for object recognition in autonomous cars to achieve lifelong learning, by continuously receiving new training data from fleets of robots and updating the model parameters [10]. Another example is in automated visual monitoring systems that continuously receive new labeled data [11].
• Meeting the requirements of different application domains: One aspect of flexibility is to support the requirements of different application domains. This makes the implementation attractive because the solution can support a wider range of applications. However, different application domains can have different performance criteria. Some application domains, such as autonomous vehicles [12], might require very high throughput with moderate power consumption, while others, such as mobile applications [13], [14], require extremely low power consumption but have less stringent constraints on throughput.

B. CHALLENGES IN MAPPING RNNS ON EMBEDDED PLATFORMS
We shall now take a look at the challenges faced by hardware solutions to meet the objectives discussed above.

1) Computation challenges
The main computation bottleneck in RNNs is the matrix to vector multiplications. The LSTM layer (explained in detail in Section III-B) has four computation blocks, each of which has one matrix to vector multiplication. For example, if the size of the vector is 1280 and the size of the matrices is 1280 × 1024, each matrix to vector multiplication requires 1280 × 1024 MAC (Multiply And Accumulate) operations. The total number of MAC operations in the LSTM would be 4 × 1280 × 1024 = 5.24 mega-MAC, which is approximately equivalent to 10.5 MOP. The high number of computations negatively affects both the throughput of the implementation and its energy consumption.
One other problem in RNNs is their recurrent structure. In RNNs, the output is fed back as an input in such a way that each time-step computation needs to wait for the previous time-step computation to complete. This temporal dependency makes it difficult to parallelize the implementation over time-steps.

2) Memory challenges
The memory required for the matrix to vector multiplications can be very large. The size and the access time of these matrices become a memory bottleneck. The previous example of the LSTM layer requires four matrices, each of size 1280 × 1024. Considering 32-bit floating-point weights, the size of the required memory for the weights would be 4 bytes × 4 × 1280 × 1024 ≈ 21 MB. Also, the high number of memory accesses affects the throughput and energy consumption of the implementation [15].
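To make the arithmetic of this example concrete, the short Python sketch below recomputes the MAC count and weight memory for the layer sizes used above (a 1280-element combined input vector and 1280 × 1024 weight matrices). The helper name is ours and not taken from any of the surveyed implementations.

```python
# Minimal sketch: cost of one LSTM time-step for the example in Section II-B.

def lstm_layer_cost(vec_size: int, hidden_size: int, weight_bits: int = 32):
    """Return (MACs, operations, weight bytes) for one LSTM time-step."""
    macs_per_gate = vec_size * hidden_size          # one matrix-vector product
    total_macs = 4 * macs_per_gate                  # four gate/state blocks
    total_ops = 2 * total_macs                      # each MAC = multiply + add
    weight_bytes = 4 * macs_per_gate * weight_bits // 8
    return total_macs, total_ops, weight_bytes

macs, ops, mem = lstm_layer_cost(vec_size=1280, hidden_size=1024)
print(f"{macs / 1e6:.2f} M MACs, {ops / 1e6:.1f} MOP, {mem / 1e6:.1f} MB of weights")
# -> 5.24 M MACs, 10.5 MOP, 21.0 MB of weights
```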

3) Accuracy challenges
To overcome the previous two issues (computation and memory challenges), optimizations can be applied to RNN models, as discussed in Section IV. These optimizations may affect accuracy. The acceptable decrease in accuracy varies with the application domain. For instance, in aircraft anomaly detection, the accepted range of data fluctuation is only 5% [16].

III. RECURRENT NEURAL NETWORKS
The intelligence of humans, as well as most animals, depends on having a memory of the past. This can be short-term, as when combining sounds to make words, and long-term, for example where the word "she" can refer back to "Anne" mentioned hundreds of words earlier. This is exactly what RNNs provide in neural networks. They add feedback that enables using the outputs of the previous time-step while processing the current time-step input, aiming to add memory cells that function similarly to human long-term and short-term memories.

RNNs add recurrent layers to the NN (Neural Network) model. Figure 2 presents a generic model for RNNs that consists of three sets of layers (input, recurrent, and output). Input layers take the sensor output and convert it into a vector that conveys the features of the input. These are followed by the recurrent layers, which provide feedback. In most recent recurrent layer models, memory cells exist as well. Subsequently, the model completes similarly to most NN models with Fully Connected (FC) layers and an output layer that can be a softmax layer. FC layers and the output layer are grouped into the set of output layers in Figure 2. In this section, we discuss the input layers, the different types of recurrent layer, the output layers, RNN modes of operation, deep RNNs, and RNN applications and their corresponding datasets.

FIGURE 2: Generic model of RNNs with diverse recurrent layers.
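The feedback loop sketched in Figure 2 can be illustrated with a minimal (hypothetical) vanilla recurrent layer in NumPy: the hidden state produced at one time-step is reused when processing the next. This is only a sketch of the generic model, not code from any of the surveyed implementations.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """Minimal recurrent layer: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b)."""
    h = np.zeros(W_h.shape[0])                     # initial hidden state
    outputs = []
    for x_t in inputs:                             # iterate over time-steps
        h = np.tanh(W_x @ x_t + W_h @ h + b)       # feedback: previous h is reused
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
m, n, T = 16, 8, 5                                 # input size, hidden cells, time-steps
xs = rng.standard_normal((T, m))
hs = simple_rnn(xs, rng.standard_normal((n, m)),
                rng.standard_normal((n, n)), np.zeros(n))
print(hs.shape)                                    # (5, 8): one hidden vector per time-step
```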

A. INPUT LAYERS (FEATURE EXTRACTORS) AND CORRESPONDING APPLICATIONS AND DATASETS
Input layers are needed by many implementations to prepare the sensor output for processing (these may also be called feature extraction layers). Often, the raw sensor data, e.g., the audio samples or video frames, are in a form that is unsuitable for direct processing in the recurrent layer. Also, the RNN performance (in learning rate and accuracy) can be significantly improved if suitable features are extracted in the input layer.

As sensor types (and numbers) change with the application, RNN models show a large variation with application types as well. Thus, it is important to study which applications an RNN model is used for and their corresponding datasets. Datasets are used by researchers to demonstrate success in applying their methods and modifications. Datasets differ in the size of the data samples, the values of the data samples, and the total size of the dataset. The success of NN models is measured by accuracy. Accuracy indicates how correct the model is when carrying out recognition, classification, translation, etc.

In this section, we discuss examples from three application domains where input layer pre-processing is used: audio, video, and text. In Table 2, we summarize these application domains and their corresponding datasets. For different datasets, different metrics are used to assess accuracy.

1) Audio inputs
Audio feature extractors translate sound signals into feature vectors. In speech processing, we often want to extract the frequency content from the audio signal (in a similar way to the human ear) [17]. There are many ways to do this, for example, by using the short-time Fourier transform (STFT), mel frequency cepstral coefficients (MFCC), and linear predictive coding (LPC) coefficients [18] (a small spectrogram sketch is given after Table 2).
Applications: Speech recognition
Speech recognition applications receive audio as input, understand it, and translate it into words. Speech recognition can be used for phonetic recognition, voice search, conversational speech recognition, and speech-to-text processing [19].

2) Video inputs
When the input is a video signal, that is, a sequence of images or frames, it is natural to use a convolutional neural network (CNN) as an input layer. CNN layers then extract image features from each video frame and feed the resulting feature vector to the recurrent layer. This use of a CNN as an input layer before a recurrent layer has been employed for many applications with video inputs, such as activity recognition, image description [3], [20], or video description [21].
The use of a CNN as an input layer can also be found for audio signals [22]. In this case, a short segment of audio samples is transformed into a frequency-domain vector using, for example, STFT or MFCC. By combining a number of these segments into a spectrogram, we can show information about the source's frequency and amplitude against time. This visual representation is then fed into a CNN as an image. The CNN then extracts speech or audio features suitable for the recurrent layer.
Applications: Image/video applications
Image/video applications cover any application that takes images as input, for example, image captioning, activity recognition, and video description.

3) Text inputs
When the input is in the form of text, we often want to represent words as vectors, and word embedding is one common way to do this [23]. The word embedding layer extracts the features of each word in relation to the rest of the vocabulary. The output of the word embedding is a vector. For two words with similar contexts, the distance between their two vectors is short, while it is large for two words that have different contexts.
Following word embedding in an input layer, deeper text analysis or natural language processing is performed in the recurrent layers.
Applications:
• Text generation: RNN models can be used for language-related applications such as text generation. RNN models can predict the next words in a phrase, using the previous words as inputs.
• Sentiment analysis: Sentiment analysis is the task of understanding the underlying opinion expressed by words [24], [25]. Since the input words comprise a sequence, RNN methods are well-suited to performing sentiment analysis.

TABLE 2: RNN input layer types and their corresponding application domains and datasets.

| Input type  | Applications             | Dataset                                                                                          | Accuracy metric                                                     |
| Audio input | Speech recognition       | TIDIGITS [26], AN4 [27], TIMIT [28], Wall Street Journal (WSJ) [29], LibriSpeech ASR corpus [30] | Word Error Rate (WER) and Phone Error Rate (PER) (lower is better)  |
| Video input | Image/video applications | COCO [31]                                                                                        | BLEU (higher is better)                                             |
|             |                          | Moving MNIST [32]                                                                                | Cross-entropy loss (lower is better)                                |
|             |                          | comma.ai driving dataset [33]                                                                    | RMS prediction error (lower is better)                              |
| Text input  | Text generation          | Penn Treebank (PTB) [34], wikitext [35], Text8 [36]                                              | Perplexity per word (PPW) (lower is better)                         |
|             |                          | WMT'14 [37]                                                                                      | Bilingual Evaluation Understudy (BLEU) (higher is better)           |
|             | Sentiment analysis       | IMDB [38]                                                                                        | Testing accuracy (higher is better)                                 |
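To illustrate the kind of input-layer pre-processing listed for audio in Table 2 and described in Section III-A1, the sketch below computes a simple magnitude spectrogram with NumPy (framing, Hann window, FFT). It is a minimal stand-in for the STFT/MFCC front-ends cited above; the frame parameters are arbitrary choices, not values from the cited works.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Return a (frames x frequency-bins) magnitude spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))     # short-time Fourier magnitudes

# One second of a synthetic 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
features = spectrogram(np.sin(2 * np.pi * 440 * t))
print(features.shape)                              # one feature vector per 10 ms frame: (98, 201)
```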

B. RECURRENT LAYERS
In this section, we cover the various types of recurrent layers. For each layer, we discuss the structure of the layer and the gate equations. The most popular recurrent layer is the Long Short Term Memory (LSTM) [39]. Changes have been proposed to the LSTM to enhance algorithmic efficiency or improve computational complexity. Enhancing algorithmic efficiency means improving the accuracy achieved by the RNN model, which includes LSTM with peepholes and ConvLSTM, as discussed in Sections III-B2 and III-B3. Improving computational complexity means reducing the number of computations and the amount of memory required by an LSTM to run efficiently on a hardware platform. Techniques include LSTM with projection, GRU, and QRNN/SRU, which are discussed in Sections III-B4, III-B5, and III-B6, respectively. These changes can be applied to the gate equations, the interconnections, or even the number of gates. Finally, we compare all the different layers against the number of operations and the number of parameters in Table 3.

1) LSTM
First, we explain the LSTM (Long Short Term Memory) layer. Looking at the LSTM as a black box, the input to the LSTM is a vector combination of the input vector x_t and the previous time-step output vector h_{t−1}, where the output vector at time t is denoted as h_t. Looking at the structure of an LSTM, it has a memory cell state C_t and three gates. These gates control what is to be forgotten and what is to be updated by the memory state (forget and input gates). They also control the part of the memory state that will be used as an output (output gate). Our description of the LSTM unit is based on its relationship with hardware implementations. Thus, in Figure 3a, we show the LSTM as four blocks instead of three gates, because the LSTM is composed of four similar computation blocks.
The computation block is the matrix to vector multiplication of the combination of x_t and h_{t−1} with one of the weight matrices {W_f, W_i, W_c, W_o}. This is considered the dominant computational task in LSTMs. Each block is composed of a matrix to vector multiplication, followed by the addition of a bias vector {b_f, b_i, b_c, b_o}, and then the application of a nonlinear function. Each block might have element-wise multiplication operations as well. The nonlinear functions used in the LSTM are the tanh and sigmoid functions. The four computation blocks are as follows:
• Forget gate: The role of the forget gate is to decide which information should be forgotten. The forget gate output f_t is calculated as
f_t = σ(W_f [h_{t−1}, x_t] + b_f),   (1)
where x_t is the input vector, h_{t−1} is the hidden state output vector, W_f is the weight matrix, b_f is the bias vector, and σ is the sigmoid function.
• Input gate: The role of the input gate is to decide which information is to be renewed. The input gate output i_t is computed similarly to the forget gate output as
i_t = σ(W_i [h_{t−1}, x_t] + b_i),   (2)
using the weight matrix W_i and the bias vector b_i.
• State computation: The role of this computation is to compute the new memory state C_t of the LSTM cell. First, it computes the possible values for the new state,
C̃_t = tanh(W_C [h_{t−1}, x_t] + b_C),   (3)
where x_t is the input vector, h_{t−1} is the hidden state output vector, W_C is the weight matrix, and b_C is the bias vector. Then, the new state vector C_t is calculated by the addition of the previous state vector C_{t−1} element-wise multiplied with the forget gate output vector f_t and the new state candidate vector C̃_t element-wise multiplied with the input gate output vector i_t as
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t,   (4)
where ⊙ is used to denote element-wise multiplication.
• Output gate: The role of the output gate is to compute the LSTM output. First, the output gate vector o_t is computed as
o_t = σ(W_o [h_{t−1}, x_t] + b_o),   (5)
where x_t is the input vector, h_{t−1} is the hidden state output vector, W_o is the weight matrix, b_o is the bias vector, and σ is the sigmoid function. Then, the hidden state output h_t is computed by applying the element-wise multiplication of the output gate vector o_t (which holds the decision of which part of the state is the output) to the tanh of the state vector C_t as
h_t = o_t ⊙ tanh(C_t).   (6)
The number of computations and parameters for the LSTM is shown in Table 3. Matrix to vector multiplications dominate the number of computations and parameters. For each matrix to vector multiplication, the input vector x_t of size m and the hidden state output vector h_{t−1} of size n are multiplied with a weight matrix of size (m + n) × n. That requires n(m + n) MAC operations, which is equivalent to nm + n² multiplications and nm + n² additions. The number of parameters in the weight matrices is nm + n² as well. Since this computation is repeated four times within the LSTM computation, these numbers are multiplied by four in the total number of operations and parameters for an LSTM. For the models in the studied papers, n is larger than m. Thus, n has a dominating effect on the computational complexity of the LSTM.
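The four computation blocks of Eqs. (1)-(6) can be written down directly. The NumPy sketch below is a minimal single time-step LSTM cell for illustration only (no peepholes, no framework-specific details); the function and dictionary keys are our own naming.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time-step following Eqs. (1)-(6).

    W maps gate name -> weight matrix of shape (n, m + n);
    b maps gate name -> bias vector of shape (n,).
    """
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])             # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])             # input gate,  Eq. (2)
    C_cand = np.tanh(W["c"] @ z + b["c"])          # candidate state, Eq. (3)
    C_t = f_t * C_prev + i_t * C_cand              # new state, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])             # output gate, Eq. (5)
    h_t = o_t * np.tanh(C_t)                       # hidden output, Eq. (6)
    return h_t, C_t

m, n = 4, 3                                        # input size, hidden cells
rng = np.random.default_rng(1)
W = {g: rng.standard_normal((n, m + n)) for g in "fico"}
b = {g: np.zeros(n) for g in "fico"}
h, C = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n), W, b)
print(h.shape, C.shape)                            # (3,) (3,)
```

The four matrix-vector products in this sketch are exactly the four computation blocks counted in Table 3.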

FIGURE 3: Different variations of an RNN layer: (a) Long Short Term Memory (LSTM); (b) LSTM with peepholes; (c) LSTM with projection layer; (d) Gated Recurrent Unit (GRU); (e) Quasi-RNN (QRNN); (f) Simple Recurrent Unit (SRU).


2) LSTM with peepholes
Peephole connections were added to LSTMs to make them able to count and measure the time between events [40]. As seen in Figure 3b, the output from the state computation is used as input for the three gates. The LSTM gate equations are changed to
f_t = σ(W_f [h_{t−1}, x_t, C_{t−1}] + b_f),   (7)
i_t = σ(W_i [h_{t−1}, x_t, C_{t−1}] + b_i),   (8)
and
o_t = σ(W_o [h_{t−1}, x_t, C_t] + b_o),   (9)
where x_t is the input vector, h_{t−1} is the hidden state output vector, C_{t−1} is the state vector at time t − 1, W_f, W_i, W_o are the weight matrices, and b_f, b_i, and b_o are the bias vectors.
The number of operations and parameters for an LSTM with peepholes is shown in Table 3. There are two rows for an LSTM with peepholes. The first one considers the multiplication with the cell state in the three gates as a matrix to vector multiplication. The number of multiplications, additions, and weights then increases by 3 × n². However, the weight matrices multiplied with the cell state can be diagonal matrices [41]. Thus, the matrix to vector multiplication can be treated as an element-wise vector multiplication, which has become the widely used form of LSTM with peepholes. In this case, the number of multiplications, additions, and weights increases by 3n only.

3) ConvLSTM
ConvLSTM is an LSTM with all matrix to vector multiplications replaced with 2D convolutions [42]. The idea is that if the input to the LSTM is data that holds spatial relations, such as visual frames, it is better to apply 2D convolutions than matrix to vector multiplications. Convolution is capable of extracting spatial information from the data. The vectors x_t, h_t, and C_t are replaced with 3D tensors. One can think of each element in the LSTM vectors as a 2D frame in the ConvLSTM tensors. Convolution weights need less memory than the matrix to vector multiplication weights. However, using them involves more computation.
The number of operations and parameters required for a ConvLSTM is shown in Table 3. The calculated numbers are for a ConvLSTM without peepholes. If peepholes are added, the number of multiplications, additions, and weights increases by 3n. Since the main change from an LSTM is the replacement of the matrix to vector multiplications with convolutions, the change in the number of operations and parameters appears in the nm + n² factor of the multiplication, addition, and weight equations. The number of multiplications and additions (MACs) in the convolutions of the input vector x_t and the hidden state output vector h_{t−1} is r·c·n·m·k_i² + r·c·n²·k_s², where r is the number of rows and c is the number of columns in the frames, n is the number of frames in the input x_t, m is the number of frames in the output h_t (or the number of hidden cells), k_i is the size of the filter used with x_t, and k_s is the size of the filter used with h_{t−1}. The number of weights is the size of the filters used for the convolutions.

4) LSTM with projection layer
The LSTM is changed by adding one extra step after the last gate [43]. This step is called a projection layer. The output of the projection layer is the output of the LSTM and the feedback input to the LSTM in the next time-step, as shown in Figure 3c. Simply put, a projection layer is like an FC layer. The purpose of this layer is to allow an increase in the number of hidden cells while controlling the total number of parameters. This is done by using a projection layer that has a number of units p less than the number of hidden cells. The dominating factor in the number of computations and the number of weights becomes 4pn instead of 4n², where n is the number of hidden cells and p is the size of the projection layer. Since p < n, n can increase with a smaller effect on the size of the model and the number of computations.
In Table 3, we show the number of operations and parameters required for an LSTM with a projection layer. In the original paper proposing the projection layer, the authors considered the output layer of the RNN as a part of the LSTM [43]. The output layer was an FC layer that changes the size of the output vector to o, where o is the output size. Thus, there is an extra po term in the number of multiplications, additions, and weights. We put the extra terms between curly brackets to show that they are optional. The projection layer can be applied to an LSTM with peepholes as well. In Table 3, we also show the number of operations and parameters for an LSTM with peepholes and a projection layer.

5) GRU
The Gated Recurrent Unit (GRU) was proposed in 2014 [44]. The main purpose was to make the recurrent layer able to capture dependencies at different time scales in an adaptive manner [45]. Moreover, the fact that the GRU has only two gates (three computational blocks) instead of three (four computational blocks), as in the LSTM, makes it more computationally efficient and more promising for high-performance hardware implementations. The three computational blocks are as follows:
• Reset gate: The reset gate is used to decide whether to use the previously computed output or treat the input as the first symbol in a sequence. The reset gate output vector r_t is computed as
r_t = σ(W_r [h_{t−1}, x_t]),   (10)
where x_t is the input vector, h_{t−1} is the hidden state output vector, W_r is the weight matrix, and σ is the sigmoid function.
• Update gate: The update gate decides how much of the output is updated. The output of the update gate z_t is computed in the same way as the reset gate output r_t, using the weight matrix W_z, as
z_t = σ(W_z [h_{t−1}, x_t]).   (11)
• Output computation: The role of this block is to compute the hidden state vector h_t. First, it computes the possible values for the hidden state vector,
h̃_t = tanh(W [r_t ⊙ h_{t−1}, x_t]),   (12)
where x_t is the input vector, h_{t−1} is the hidden state output vector, and W is the weight matrix. Then, the hidden state vector h_t is computed from the old output h_{t−1} and the new possible output h̃_t as
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t.   (13)
As with the LSTM, we visualize the GRU in Figure 3d as three blocks, not two gates, as it has three blocks of matrix to vector multiplications. In Table 3, we show the number of operations and parameters required for a GRU. The number of operations and parameters is approximately 0.75 times the number of operations and parameters in the LSTM.
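A single GRU time-step, following Eqs. (10)-(13), can be sketched in the same style as the LSTM cell above. Again, this is a minimal illustrative implementation with our own naming, not code from the cited works; biases are omitted because the gate equations above are written without them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU time-step following Eqs. (10)-(13); weights have shape (n, m + n)."""
    z_in = np.concatenate([h_prev, x_t])                         # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ z_in)                                    # reset gate,  Eq. (10)
    z_t = sigmoid(W_z @ z_in)                                    # update gate, Eq. (11)
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate,   Eq. (12)
    return (1.0 - z_t) * h_prev + z_t * h_cand                   # new output,  Eq. (13)

m, n = 4, 3
rng = np.random.default_rng(2)
h = gru_step(rng.standard_normal(m), np.zeros(n),
             rng.standard_normal((n, m + n)),
             rng.standard_normal((n, m + n)),
             rng.standard_normal((n, m + n)))
print(h.shape)                                                   # (3,)
```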


TABLE 3: Comparing LSTM and its variations. The first three columns are the number of operations per time-step; the last two are the number of parameters.

| RNN layer | Multiplications | Additions | Nonlinear | Weights | Biases |
| LSTM | 4n² + 4nm + 3n (= LSTM_mul) | 4n² + 4nm + 5n (= LSTM_add) | 5n (= LSTM_nonlin) | 4n² + 4nm (= LSTM_weights) | 4n (= LSTM_biases) |
| LSTM + peepholes | 7n² + 4nm + 3n (= LSTM_mul + 3n²) | 7n² + 4nm + 5n (= LSTM_add + 3n²) | 5n | 7n² + 4nm (= LSTM_weights + 3n²) | 4n |
| LSTM + peepholes (diagonalized) | 4n² + 4nm + 6n (= LSTM_mul + 3n) | 4n² + 4nm + 8n (= LSTM_add + 3n) | 5n | 4n² + 4nm + 3n (= LSTM_weights + 3n) | 4n |
| LSTM + projection | 4np + 4nm + 3n + np + {po} (= LSTMProj_mul) | 4np + 4nm + 5n + np + {po} (= LSTMProj_add) | 5n | 4np + 4nm + np + {po} (= LSTMProj_weights) | 4n |
| LSTM + peepholes (diagonalized) + projection | 4np + 4nm + 6n + np + {po} (= LSTMProj_mul + 3n) | 4np + 4nm + 8n + np + {po} (= LSTMProj_add + 3n) | 5n | 4np + 4nm + 3n + np + {po} (= LSTMProj_weights + 3n) | 4n |
| ConvLSTM | 4rcnm·k_i² + 4rcn²·k_s² + 3n | 4rcnm·k_i² + 4rcn²·k_s² + 5n | 5n | 4nm·k_i² + 4n²·k_s² | 4n |
| GRU | 3n² + 3nm + 3n (≈ 0.75 LSTM_mul) | 3n² + 3nm + 2n (≈ 0.75 LSTM_add) | 3n (= 0.6 LSTM_nonlin) | 3n² + 3nm (= 0.75 LSTM_weights) | – |
| QRNN | 3knm + 3n | 3knm + 2n | 3n | 3knm | – |
| SRU | 3nm + 6n | 3nm + 8n | 2n | 3nm + 2n | 2n |

In the table we use the following symbols: m is the size of the input vector x_t, n is the number of hidden cells in h_t, p is the size of the projection layer, o is the size of the output layer, r is the number of rows in a frame, c is the number of columns in a frame, k_i is the size of the 2D filter applied to x_t, k_s is the size of the 2D filter applied to h_{t−1}, and k is the size of the 1D convolution filter. The term {po} is optional, as discussed in Section III-B4.

6) QRNN and SRU
The purpose of the Quasi-RNN (QRNN) [46] and the Simple Recurrent Unit (SRU) [47] is to make the recurrent unit friendlier for computation and parallelization. The bottleneck in an LSTM/GRU is the matrix to vector multiplications. It is difficult to parallelize this part because it depends on the previous time-step output h_{t−1} and the previous time-step state C_{t−1}. In the QRNN/SRU, h_{t−1} and C_{t−1} are removed from all matrix to vector multiplications and appear only in element-wise operations.
The QRNN has two gates and a memory state. It has three heavy computational blocks. In these blocks, only the input vector x_t is used as input. It replaces the matrix to vector multiplications with 1D convolutions over the inputs along the time-step dimension. For instance, if the filter dimension is two, the convolution is applied on x_t and x_{t−1}. The three computation blocks compute the forget gate vector f_t, the candidate for the new state vector C̃_t, and the output gate vector o_t as
f_t = σ(W_f ∗ x_t),   (14)
C̃_t = tanh(W_c ∗ x_t),   (15)
and
o_t = σ(W_o ∗ x_t),   (16)
where W_f, W_c, and W_o are the convolution filter banks and "∗" denotes the convolution operation. The state vector C_t is computed as
C_t = f_t ⊙ C_{t−1} + (1 − f_t) ⊙ C̃_t,   (17)
and the hidden state vector h_t is computed using Equation 6. Figure 3e visualizes the QRNN layer. The number of operations and parameters required for a QRNN is shown in Table 3, where k is the size of the convolution filter.
The SRU has two gates and a memory state as well. The heavy computational blocks (three blocks) are matrix to vector multiplications, not convolutions. The two gates (forget and update gates) are computed using the equations
f_t = σ(W_f x_t + v_f ⊙ C_{t−1} + b_f)   (18)
and
r_t = σ(W_r x_t + v_r ⊙ C_{t−1} + b_r),   (19)
respectively. In both gate calculations, C_{t−1} is used, but only for element-wise multiplications. The parameter vectors v_f and v_r are learned together with the weight matrices and biases during training.
The third computational block is the state computation
C_t = f_t ⊙ C_{t−1} + (1 − f_t) ⊙ (W x_t),   (20)
where C_{t−1} is the old state vector and x_t is the input vector. The computation is controlled by the forget gate output vector f_t, which decides what is to be forgotten and what is to be treated as new.
Finally, the SRU output h_t is computed from the new state C_t and the input vector x_t checked by the update gate (which decides which parts of the output are taken from the new state and which parts are taken from the input) using the equation
h_t = r_t ⊙ C_t + (1 − r_t) ⊙ x_t.   (21)
Figure 3f visualizes the SRU. The output computation is performed in the same block as the update gate. It is worth observing that neither the QRNN nor the SRU uses h_{t−1} in the equations – only the old state C_{t−1} is used. The number of operations and parameters for an SRU is shown in Table 3.
In Table 3, we compare the LSTM and all of its variations against the memory requirements for the weights and the number of computations per single time-step. This comparison helps in understanding the required hardware platform for each of them. To make it easier for the reader to understand the difference between the LSTM and the other variants, we show the equations for operations and parameters in terms of LSTM operations and parameters where they are comparable.
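As a quick way of exploring Table 3, the sketch below evaluates the weight-count formulas from the table for one example layer size. The sizes chosen here are arbitrary, and the optional {po} output-layer term is omitted.

```python
def weight_counts(m, n, p=None, k=2):
    """Number of weights per layer type, using the formulas of Table 3."""
    counts = {
        "LSTM": 4 * n * n + 4 * n * m,
        # projection variant, without the optional {po} term
        "LSTM + projection": (4 * n * p + 4 * n * m + n * p) if p else None,
        "GRU": 3 * n * n + 3 * n * m,
        "QRNN": 3 * k * n * m,
        "SRU": 3 * n * m + 2 * n,
    }
    return {name: c for name, c in counts.items() if c is not None}

# Example: input size m = 512, n = 1024 hidden cells, projection size p = 256.
for name, count in weight_counts(m=512, n=1024, p=256).items():
    print(f"{name:18s} {count / 1e6:5.2f} M weights")
```

For these sizes the ordering matches the discussion above: the SRU and QRNN need the fewest weights, the GRU roughly 0.75 of the LSTM, and the projection layer brings the LSTM close to the GRU.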

C. OUTPUT LAYERS
The output layers in the RNN model are the FC layers and the output function.

1) FC (Fully Connected) Layers
The RNN model might have one or more FC layers after the recurrent layers. Non-linear functions may be applied between the FC layers as well. This layer is called fully connected because each neuron in the input is connected to each neuron of the output. Computationally, this is done by a matrix to vector multiplication using a weight matrix of size Input_size × Output_size, where Input_size is the size of the input vector and Output_size is the size of the output vector. One purpose of the FC layer in RNN models can be to change the dimension of the hidden state output vector h_t to the dimension of the RNN model output, to prepare it for the output function. In this case, the FC layer might be replaced by adding a projection layer in the recurrent layer.

2) Output function
The output function is the final step in the neural network inference. It generates the output of the neural network model. This output can be a prediction, classification, recognition, and so on. For example, in a text prediction problem, the softmax function is used as an output function. The output is a vector of probabilities that sum to one. Each probability corresponds to one word. The word with the highest probability becomes the prediction of the neural network [48].

D. PROCESSING OF DATA IN RNN MODELS
There are many ways in which the processing of data may vary in RNN models. The first is to vary the way time-steps are treated. This is influenced by the nature of the application, which may have inputs with temporal relations, outputs with temporal relations, or both. The second form of variation is related to bidirectional RNNs. We discuss below how a bidirectional RNN can process inputs both forwards and backwards in time. We also discuss what is meant by a deep RNN model.

1) RNN unfolding variations through time-steps
RNN unfolding/unrolling is performed to reveal the repetition in the recurrent layer and to show the number of time-steps required to complete a task. Unfolding the RNN illustrates the different types of RNN models one can meet.
• One to many: A one to many model generates a sequence of outputs for a single input, as shown in Figure 4a. Image captioning is one example [3]. The model takes one image as input and generates a sentence as an output. The words of the sentence compose a sequence of temporally related data. In this case, the temporal sequence is only in the output.
• Many to one: A many to one model combines a sequence of inputs to generate a single output, as shown in Figure 4b. Activity recognition [3] and sentiment analysis [49] are two examples. In activity recognition, the model takes a sequence of images as input and determines the activity taking place in the images. In sentiment analysis, the model takes a sequence of words (a sentence) as input and generates a single emotion at the end. In this case, the temporal sequence is only in the input.
• Many to many: A many to many model has a sequence in the input and a sequence in the output, as shown in Figure 4c. Language translation [2] and video description [3] are two examples. In language translation, the model has a sequence of words (a sentence) as an input and a sequence of words (a sentence) as an output. In video description applications, the model has a sequence of image frames as input and a sequence of words (a sentence) as output.
• One to one: There is no RNN model with one to one unrolling. One to one simply means that there is no temporal relation contained in the inputs or the outputs (a feedforward neural network).

FIGURE 4: Unfolding RNN models through multiple time-steps: (a) one to many RNN; (b) many to one RNN; (c) many to many RNN.

2) Bidirectional RNN
In a bidirectional RNN, the input can be fed into the recurrent layer from two directions: past to future and future to past. That requires a duplication of the recurrent layer, so that two recurrent layers work simultaneously, each processing the input in a different temporal direction. This can help the network to better understand context by obtaining data from the past and the future at the same time. This concept can be applied to different variations of recurrent layers, such as BiLSTM [50] and BiGRU [51].

E. DEEP RECURRENT NEURAL NETWORKS (DRNN)
Making a neural network a deep neural network is achieved by adding non-linear layers between the input layer and the output layer [52]. This is straightforward in feedforward NNs. However, in RNNs, there are different approaches that can be adopted. Similarly to feedforward NNs, there can be a stack of recurrent layers (stacked RNN) [41], as shown in Figure 5, where we have a stack of two recurrent layers. The output of the first layer is considered as the input for the second layer. Alternatively, the extra non-linear layers can be within the recurrent layer computations [53]. Extra non-linear layers can be embedded within the hidden layer vector h_t calculation, where the x_t and h_{t−1} vectors used to calculate h_t pass through additional non-linear layers. This model is called the deep transition RNN model. The extra non-linear layers can also be added in computing the output from the hidden state vector; this model is called the deep output RNN model. It is possible to have an RNN model that is both a deep transition and a deep output RNN model [54]. One other way to have extra non-linear functions within the recurrent layer is to have them within the gate calculations – a method called H-LSTM (Hidden LSTM).
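A stacked RNN of the kind shown in Figure 5 simply feeds the output sequence of one recurrent layer to the next. The sketch below stacks two minimal vanilla recurrent layers; the sizes are arbitrary, and vanilla cells are used here only for brevity (any of the layer types of Section III-B could take their place).

```python
import numpy as np

def rnn_layer(xs, W_x, W_h):
    """Run one vanilla recurrent layer over a whole input sequence."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        out.append(h)
    return np.stack(out)

def stacked_rnn(xs, layers):
    """Stacked RNN: layer l+1 consumes the output sequence of layer l."""
    seq = xs
    for W_x, W_h in layers:
        seq = rnn_layer(seq, W_x, W_h)
    return seq

rng = np.random.default_rng(3)
m, n1, n2, T = 16, 12, 8, 6                 # input size, hidden sizes, time-steps
layers = [(rng.standard_normal((n1, m)),  rng.standard_normal((n1, n1))),
          (rng.standard_normal((n2, n1)), rng.standard_normal((n2, n2)))]
print(stacked_rnn(rng.standard_normal((T, m)), layers).shape)   # (6, 8)
```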

FIGURE 5: Stacked RNN. The first layer output is h^1_t and the second layer output is h^2_t.

IV. OPTIMIZATIONS FOR RNNS
As with all neural network applications, RNN applications are based upon intensive operations performed on high-precision values. They therefore require high computation power, large memory bandwidth, and high energy consumption. Because of the resource constraints of embedded platforms, there is a need to decrease the computation and memory requirements of RNN applications. In this section, we present optimizations that have been applied to RNNs to realize them on embedded systems. In Section V, which follows, we discuss hardware implementations of RNNs on embedded platforms and how they relate to the optimizations presented here. Researchers have been working on two types of optimizations. The first type is related to the RNN algorithms themselves, where the RNN algorithms are modified to decrease computation and memory requirements. The modification should have no effect or only a limited effect on accuracy. The second type of optimization is related to the embedded platform, where hardware improvements are applied to increase the parallelization of the application and decrease the overhead of memory accesses. Figure 6 illustrates these two types of optimizations.

A. ALGORITHMIC OPTIMIZATIONS
In this section, we discuss the different algorithmic optimizations that may be performed on the recurrent layer of an RNN application to decrease its computation and memory requirements. We discuss how these optimizations are carried out, and how they affect accuracy. Applying optimizations directly to inference can have unacceptable effects on accuracy. Thus, training the network would be required to restore the accuracy. Optimizations may be applied during the model's main training, or after the model is trained, in which case the model is then retrained for some epochs (training cycles).
Different datasets measure accuracy using different units. For some units higher values are better, while for others lower values are better. To provide a unified measure of the change in accuracy, we calculate the percentage change in accuracy from the original value to the value after applying the optimization method as
a_Δ = (−1)^α × ((V_a − V_b) / V_b) × 100,   (22)
where a_Δ is the effect of the optimization method on accuracy as a percentage of the original accuracy value, V_b is the value of accuracy before optimization, V_a is the value of accuracy after optimization, and α is an indicator that has a value of 0 if higher accuracy values are better and 1 if lower accuracy values are better. Thus, if the baseline accuracy achieved by the original model without optimizations is 96% and the accuracy after optimization is 94%, the effect of the optimization on accuracy is −2.1%. If the accuracy after optimization is 98%, the effect of the optimization on accuracy is +2.1%. If the optimization has no effect on accuracy, then the effect on accuracy is 0%.
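The worked example above can be reproduced with a few lines of Python implementing Eq. (22); the helper name is ours.

```python
def accuracy_delta(before, after, lower_is_better=False):
    """Percentage change in accuracy per Eq. (22); alpha = 1 when lower values are better."""
    sign = -1.0 if lower_is_better else 1.0
    return sign * (after - before) / before * 100.0

print(round(accuracy_delta(96.0, 94.0), 1))                       # -2.1
print(round(accuracy_delta(96.0, 98.0), 1))                       #  2.1
print(round(accuracy_delta(10.0, 9.0, lower_is_better=True), 1))  # 10.0 (e.g. a WER that dropped)
```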

FIGURE 6: Optimizations applied to RNN applications with section numbers indicated, comparing the effect of different
algorithmic optimizations on memory and computation requirements.

As shown in Figure 6, the algorithmic optimizations are quantization, compression, deltaRNN, and non-linear function optimizations. The first three optimizations are applied to the matrix to vector multiplication operations, and the last is applied to the computation of non-linear functions. The table in Figure 6 compares quantization, compression, and deltaRNN with respect to their effect on memory requirements, number of memory accesses, number of computations, and MAC operation cost. The MAC operation cost can be decreased by decreasing the precision of the operands.

1) Quantization
Quantization is a reduction in the precision of the operands. Quantization can be applied to the network parameters only, or to the activations and inputs as well. While discussing quantization, there are three important factors to consider. First, the number of bits used for weights, biases, activations, and inputs. Second, the quantization method, which defines how to store the full-precision values in a lower number of bits. Third, whether quantization was applied with training from the outset or the model was re-trained after applying quantization. These three factors all affect accuracy. However, they are not the only factors affecting accuracy, which may also be affected by the model architecture, the dataset, and other factors. Yet, these three factors have more relevance when applying quantization to the RNN model.
In discussing quantization methods, we cover fixed-point quantization, multiple binary code quantization, and exponential quantization. We also study whether the selection of the quantized value is deterministic or stochastic. In deterministic methods, the selection is based on static thresholds. In contrast, selection in stochastic methods relies on probabilities and random numbers. Relying on random numbers is more difficult for hardware.

a: Quantized values representation
There are different methods for representing quantized values. In the following, we explain three commonly used methods.
1) Fixed-point quantization: In this quantization method, the 32-bit floating-point values are quantized into a fixed-point representation notated as Q_{m.f}, where m is the number of integer bits and f is the number of fractional bits. The total number of bits required is k. The sign bit may be included in the number of integer bits [55] or added as an extra bit on top of m and f [56]. For example, in the first case [55], Q1.1 is used to represent a 2-bit fixed-point format that has three values {−0.5, 0, 0.5}. This quantization method is also called Pow2-ternarization [57]. Usually, fixed-point quantization is deterministic, in that for each floating-point value there is one quantized fixed-point value defined by an equation (i.e., it is rule-based). Fixed-point quantization is performed by clipping the floating-point value between the minimum and maximum boundaries, and then rounding it.
2) Exponential quantization: Exponential quantization quantizes a value into an integer power of two. Exponential quantization is very beneficial for hardware, as multiplying with an exponentially quantized value is equivalent to a shift operation if the second operand is a fixed-point value, and to an addition to the exponent if the second operand is a floating-point value [55], [58]. Exponential quantization can be both deterministic and stochastic.
3) Binary and multi-bit code quantization: The lowest precision in RNNs is binary precision [59]. Each full-precision value is quantized into one of two values. The most common two values are {−1, +1}, but they can also be {0, +1}, {−0.5, 0}, {−0.5, +0.5}, or any combination of two values [55]. Binarization can be deterministic or stochastic. For deterministic binarization, a sign function can be used. For stochastic binarization, the selection thresholds depend on probabilities to compute the quantized value:
x_b = +1 with probability p = σ_h(x), and −1 with probability 1 − p,   (23)
where σ_h is the "hard sigmoid" function defined as
σ_h(x) = clip((x + 1)/2, 0, 1) = max(0, min(1, (x + 1)/2)).   (24)
Binarization has great value for hardware computation as it turns multiplication into addition and subtraction. The greatest value comes with full binarization, where both the weights and the activations have binary precision. In this case, it is possible to concatenate weights and activations into 32-bit operands and perform multiple MAC operations using XNOR and bit-count operations. Full binarization can reduce memory requirements by a factor of 32 and decrease computation time considerably [60].
Adding one more value to binary precision is called ternarization. Weights in a ternarized NN are restricted to three values. These three values can be {−1, 0, 1} [61]. Pow2-ternarization, discussed above as a form of fixed-point quantization, is an example of ternarization with three different values {−0.5, 0, 0.5}. Both deterministic and stochastic ternarization have been applied to RNNs [55].
Having four possible quantization values is called quaternarization. In quaternarization, the possible values can be {−1, −0.5, +0.5, +1} [62]. In order to retain the high computational benefit of binary weights and activations while using a higher number of bits, multiple binary codes {−1, +1} have been used for quantization [63]. For example, two-bit quantization has four possible values {{−1,−1}, {−1,1}, {1,−1}, {1,1}}.
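The representations above can be made concrete with small NumPy helpers: a deterministic fixed-point quantizer for the Q_{m.f} format and a deterministic sign-based binarizer. These are illustrative sketches only; the exact clipping and rounding conventions (for example, whether Q1.1 keeps only the three values {−0.5, 0, 0.5}) differ between the cited works, and symmetric clipping is assumed here.

```python
import numpy as np

def quantize_fixed_point(x, m_bits, f_bits):
    """Deterministic Q_{m.f} fixed-point quantization: round, then clip symmetrically."""
    step = 2.0 ** -f_bits
    hi = 2.0 ** (m_bits - 1) - step        # sign bit counted inside m (assumption)
    return np.clip(np.round(x / step) * step, -hi, hi)

def binarize(x):
    """Deterministic binarization to {-1, +1} using the sign function."""
    return np.where(x >= 0, 1.0, -1.0)

w = np.array([-0.74, -0.26, 0.01, 0.4, 1.9])
print(quantize_fixed_point(w, m_bits=1, f_bits=1))   # [-0.5 -0.5  0.   0.5  0.5]
print(binarize(w))                                   # [-1.  -1.   1.   1.   1. ]
```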

The most common method for deterministic quantization is uniform quantization. Uniform quantization may not be the best quantization method as it can change the distribution of the original data, especially for non-uniform data, which can affect accuracy. One solution is balanced quantization [64]. In balanced quantization, data is divided into groups of the same amount of data before quantization to ensure a balanced distribution of data following quantization. Other suggested solutions treat quantization as an optimization problem, and include greedy quantization, refined greedy quantization, and alternating multi-bit quantization [63], [65].

b: Training/Retraining
As mentioned earlier, there are three options to minimize accuracy loss due to quantization. The first is to apply quantization with training [66], where quantized weights are used during the forward and backward propagation only. Full precision weights are used for the parameter update step in Stochastic Gradient Descent (SGD). Copies of both the quantized and full precision weights are kept to decide at inference time which one to use [55]. In the second approach, quantization is applied to pretrained parameters and the RNN model is retrained to decrease the accuracy loss. Also, binarization of LSTM gate outputs during training has been applied by using the Gumbel-Softmax estimator [67]. The authors of one RNN implementation [56] adopted a mix of the training and retraining approaches, where only the activations were not quantized from the beginning. Activations were quantized after training and then the model was retrained for 40 epochs. The third approach is to use quantized parameters without training/retraining. This is very commonly used with 16-bit fixed-point quantization. Usually, training happens at training servers and quantization is applied at the inference platform without having the opportunity to retrain the model. It is very common as well to use 16-bit fixed-point quantization with other optimization techniques such as circulant matrices compression [68], pruning [69], and deltaRNN (discussed later in Section IV-A3) [70].

c: Effect on accuracy
In Table 4, we list studies that included experiments on the quantization of RNN models. Not all of the studies have a hardware implementation, as the purpose is to show that quantization can be performed while keeping accuracy high. In the table, we put the three factors affecting the accuracy discussed earlier (number of bits, quantization method, and training) with the addition of the type of recurrent layer (LSTM, GRU, ...) and the dataset. Then, we show the effect of quantization on accuracy computed with respect to the accuracy achieved by full precision parameters and activations using Eq. 22. For the number of bits, we use W/A, where W is the number of bits used for weights and A is the number of bits used for activations. For the RNN type, we put the recurrent layers used in the experiments. All recurrent layers are explained in Section III. We use x*y*z, where x is the number of layers, y is the type of the layers, and z is the number of hidden cells in each layer. For training, if quantization was applied with training from the beginning, we write "With training". If quantization was applied after training and the model was later retrained, we write "Retraining". Positive values for accuracy mean that quantization enhanced the accuracy, and negative values for accuracy mean that quantization caused the model to be less accurate.

Each experiment in Table 4 is applied to a different model, a different dataset, and may also have used different training methods. Thus, conclusions about accuracy from Table 4 cannot be generalized. Still, we can make some observations:
• Fixed-point quantization, exponential quantization, and mixed quantization have no negative effect on accuracy. Accuracy increased after applying these quantization methods. Quantized models can surpass baseline models in accuracy as weight quantization has a regularization effect that overcomes over-fitting [56].
• Regarding binary quantization, the negative effect on accuracy varied within small ranges in some experiments [56], [62]. Experiments showed that using more bits for activations may enhance the accuracy [56]. Using binary weights with ConvLSTM is not solely responsible for the poor accuracy obtained, as ternary and quaternary quantization resulted in poor accuracy with ConvLSTM as well [62]. However, these quantization methods were successful when applied on LSTM and GRU in the same work [62].
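As a minimal sketch of the first option above (quantization applied during training, with a full-precision master copy kept for the SGD update), the toy example below quantizes the weights uniformly in the forward pass and applies the resulting gradients to the full-precision weights (a straight-through estimator). The model, bit width, and learning rate are illustrative assumptions, not values taken from the cited works.

```python
import numpy as np

def quantize_uniform(w, bits=2):
    # Symmetric uniform quantizer used only in the forward/backward pass.
    qmax = max(2 ** (bits - 1) - 1, 1)
    scale = np.max(np.abs(w)) / qmax + 1e-12
    return np.round(w / scale).clip(-qmax, qmax) * scale

# Toy linear model trained with quantized weights: gradients computed with
# the quantized copy are applied to the full-precision master weights.
rng = np.random.default_rng(1)
x = rng.standard_normal((256, 8))
true_w = rng.standard_normal((8, 1))
y = x @ true_w

w_fp = rng.standard_normal((8, 1)) * 0.1   # full-precision master copy
lr = 0.05
for step in range(200):
    w_q = quantize_uniform(w_fp, bits=2)   # quantized copy for forward/backward
    err = x @ w_q - y
    grad = x.T @ err / len(x)
    w_fp -= lr * grad                      # update the full-precision weights

loss = float(np.mean((x @ quantize_uniform(w_fp, bits=2) - y) ** 2))
print("loss with quantized weights:", loss)
```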
TABLE 4: Effect of quantization methods on accuracy.


Method W/A RNN type Dataset Training Accuracy Paper
2/2 1*BiLSTM*128 OCR dataset With training +0.7% [56]
Fixed Point
P2T/real 4*BiLSTM*250 WSJ With training +6% [55]
Exponential EQ/real 1*GRU*200 TIDIGITS With training +1% [55]
Mixed EQ+ fixed6/8 3*BiLSTM*512 AN4 Retraining +10.7%1 [58]
B/real 1*GRU*128 IMDB With training −5.3% [62]
Binary B/real ConvLSTM Moving MNIST With training −100%2 [62]
B/1 1*BiLSTM*128 OCR dataset With training −3.7% [56]
B/4 1*BiLSTM*128 OCR dataset With training +1% [56]
B/real 1*GRU*200/400 TDIGITS With training −80.9% [55]
T/real 1*GRU*128 IMDB With training −4% [62]
T/real ConvLSTM Moving MNIST With training −50%2 [62]
Ternary T/real 1*GRU*200 TDIGITS With training −1.6% [55]
Q/real 1*GRU*128 IMDB With training −1.7% [62]
Quaternary
Q/real ConvLSTM Moving MNIST With training −75%2 [62]
3/3 1*LSTM*512 WikiText2 Retraining +1.4% [63]
Multi-Binary 2/2 1*LSTM*512 WikiText2 Retraining −6% [63]
1/4 2*LSTM*256 PTB With training −7.8% [71]
1 Accuracy is also affected by the compression scheme and nonlinear functions approximation used in this work.
2 We calculate the error at the tenth frame (third predicted frame).
In the table we have used the symbols: W/A for number of bits for weights/number of bits for activations, P2T for power
two ternarization, EQ for exponential quantization, B for binary quantization, T for ternary quantization, and Q for quaternary
quantization.

2) Compression
Compression decreases the model size by decreasing the number of parameters or connections. As the number of parameters is reduced, memory requirements and the number of computations decrease. Table 5 compares different compression methods. The compression ratio shows the ratio between the number of parameters of models before and after applying compression methods. Accuracy degradation is computed using Eq. 22.

(i) Pruning: Pruning is the process of eliminating redundancy. Computations in RNNs are mainly dense matrix operations. To improve computation time, dense matrices are transformed into sparse matrices, which affects accuracy. However, careful choice of the method used to transform a dense matrix into a sparse matrix may result in only a limited impact on accuracy while providing significant gains in computation time. Reduction in memory footprint along with computation optimization is essential for making RNNs viable. However, pruning results in two undesirable effects. The first is a loss in the regularity of memory organization due to sparsification of the dense matrix, and the second is a loss of accuracy on account of the removal of weights and nodes from the model under consideration. The transformation from a regular matrix computation to an irregular application often results in the use of additional hardware and computation time to manage data. To compensate for the loss of accuracy caused by pruning, various methods, including retraining, have been applied. The following sections describe methods of pruning and compensation techniques found in the literature. Table 5 summarizes the methods of pruning and their impact on sparsity and accuracy. Sparsity in this context refers to the number of empty entries in the matrices. In Table 5, sparsity indicates the number of entries eliminated because of the method of pruning used. Within RNNs, pruning can be classified as either magnitude pruning for weight matrix sparsification, or structure-based pruning.

Magnitude pruning: Magnitude pruning relies on eliminating all weight values below a certain threshold. In this method, the choice of threshold is crucial to minimize the negative impact on accuracy. Magnitude pruning is primarily based on identifying the correct threshold for pruning weights.
• Weight sub-groups: For weight matrix sparsification, the RNN model is trained to eliminate redundant weights and only retain weights that are necessary. There are three categories to create weight subgroups to select the pruning threshold [72]. These three categories are class-blind, class-uniform, and class-distribution. In class-blind, x% of the weights with the lowest magnitude are pruned, regardless (blind) of the class. In class-uniform, pruning of the x% lowest weights is performed uniformly in all classes. In class-distribution, weights within the standard deviation of that class are pruned.
• Hard thresholding: Hard thresholding [73], [74] identifies the correct threshold value that preserves accuracy. ESE [74] uses hard thresholding during training to learn which weights contribute to prediction accuracy.
• Gradual thresholding: This method [75] uses a set of weight masks and a monotonically increasing threshold. Each weight is multiplied with its corresponding mask. This process is iterative, and the masks are updated by setting all parameters that are lower than the threshold to zero. As a result, this technique gradually prunes weights within the training process, in contrast to hard thresholding.
• Block pruning: In block pruning [76], magnitude thresholding is applied to blocks of a matrix instead of individual weights during training. The weight with the maximum magnitude is used as a representative for the entire block. If the representative weight is below the current threshold, all the elements in the block are set to zero. As a result, block sparsification mitigates the indexing overhead, irregular memory accesses, and incompatibility with array data-paths that characterises unstructured random pruning.
• Grow and prune: Grow and prune [54] combines gradient-based growth [77] and magnitude-based pruning [74] of connections. The training starts with a randomly initialized seed architecture. Next, in the growth phase, new connections, neurons, and feature maps are added based on the average gradient over the entire training set. Once the required accuracy has been reached, redundant connections and neurons are eliminated based on magnitude pruning.

Structure pruning: Modifying the structure of the network by eliminating nodes or connections is termed structure pruning. Connections that may be important are learned in the training phase or pruned using probability-based techniques.

• Network sparsification: Pruning through network sparsification [78] introduces sparsity for the connections at every neuron output, such that each output has the same number of inputs. Furthermore, an optimization strategy is formulated that replaces non-zero elements in each row with the highest absolute value. This step avoids any retraining, which may be compute-intensive and difficult in privacy-critical applications. However, the impact of this method of pruning on accuracy has not been directly measured. Design space exploration over different levels of sparsity measures the quality of the output and gives an indication of the relationship between the level of approximation and the application-level accuracy.
• Drop-out: DeepIoT [79] compresses neural network structures into smaller dense matrices by finding the minimum number of non-redundant hidden elements without affecting the performance of the network. For LSTM networks, Bernoulli random probabilities are used for dropping out hidden dimensions used within the LSTM blocks.

Retaining accuracy levels: Pruning alongside training and retraining has been employed to retain the accuracy levels of the pruned models. Retraining works on the pruned weights and/or pruned model until convergence to a specified level of accuracy is achieved. Pruning has shown a regularization effect on the retraining phase [72]. The regularization effect might be the reason for outperforming baseline model accuracy. Another benefit of pruning, which might also be a reason for outperforming the baseline accuracy, is that pruning allows the finding of a better local minimum. Pruning increases the loss function immediately, which results in further gradient descent.

Handling irregularity in pruned matrices: Pruning to maximize sparsity results in a loss in regularity (or structure) of memory organization due to sparsification of the original dense matrix. Pruning techniques that are architecture agnostic mainly result in unstructured, irregular sparse matrices. Methods such as load-balancing-aware pruning [74] and block pruning (explained earlier within magnitude pruning) [76] have been applied to minimize these effects. Load-balancing-aware pruning [74] works towards ensuring the same sparsity ratio among all the pruned sub-matrices, thereby achieving an even distribution of non-zero weights. These techniques introduce regularity in the sparse matrix to improve performance and avoid index tracking.

(ii) Structured matrices
Circulant matrices: A circulant matrix is a matrix in which each column (row) is a cyclic shift of the preceding column (row) [58]. It is considered a special case of Toeplitz-like matrices. The weight matrices are reorganized into circulant matrices. The redundancy of values in the matrices reduces the space complexity of the weight matrices. For large matrices, circulant matrices can use nearly 4× less memory space. The back-propagation algorithm is modified to allow training of the weights in the form of circulant matrices.
Block-circulant matrices: Instead of transforming the weight matrix into a single circulant matrix, it is transformed into a set of circulant sub-matrices [68], [80]. Figure 7 shows a weight matrix that has 32 parameters. The block size of the circulant sub-matrices is 4. The weight matrix is transformed into two circulant sub-matrices with 8 parameters (4 parameters each). The compression ratio is 4×, where 4 is the block size. Thus, having larger block sizes will result in a higher reduction in model size. However, a high compression ratio may degrade the prediction accuracy. In addition, the Fast Fourier Transform (FFT) algorithm can be used to speed up the computations. Consequently, the computational complexity decreases by a factor of O(k/log k) (a code sketch appears at the end of this subsection).

FIGURE 7: Regular weight matrix transformed into block-circulant sub-matrices of block size 4 [68].

(iii) Tensor decomposition: Tensors are multidimensional arrays. A vector is a tensor of rank one, a 2-D matrix is a tensor of rank two, and so on. Tensors can be decomposed into lower-rank tensors, and tensor operations can be approximated using these decompositions in order to decrease the number of parameters in the NN model. Canonical polyadic (CP) decomposition, Tucker decomposition, and tensor train decomposition are some of the techniques used to apply tensor decomposition [81]. Tensor decomposition techniques can be applied to the FC layers [82], convolution layers [83], and recurrent layers [81]. In Table 5, we show an example of applying tensor decomposition to a GRU layer using the CP technique. In another example, Adam's algorithm has been used as an optimizer for the training process [84]. Tensor decomposition techniques can achieve a high compression ratio compared to other compression methods.
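To make the block-circulant scheme concrete, here is a small self-contained sketch (our own, assuming a simple row-major block layout): each block is stored only by its defining vector, and the block-level products are computed with FFTs, which is where the O(k log k) per-block cost comes from.

```python
import numpy as np

def circulant_matvec_fft(c, x):
    # y = C x, where C is the circulant matrix defined by its first column c.
    # Circular convolution via FFT costs O(k log k) instead of O(k^2).
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(first_cols, x, block):
    # first_cols[i][j] holds the defining vector of the (i, j) circulant block,
    # so a (rows*block) x (cols*block) matrix is stored with rows*cols*block values.
    rows, cols = len(first_cols), len(first_cols[0])
    y = np.zeros(rows * block)
    for i in range(rows):
        acc = np.zeros(block)
        for j in range(cols):
            acc += circulant_matvec_fft(first_cols[i][j], x[j * block:(j + 1) * block])
        y[i * block:(i + 1) * block] = acc
    return y

# Check against an explicit dense construction.
rng = np.random.default_rng(0)
block, rows, cols = 4, 2, 2
first_cols = [[rng.standard_normal(block) for _ in range(cols)] for _ in range(rows)]
x = rng.standard_normal(cols * block)

def circulant(c):
    return np.column_stack([np.roll(c, k) for k in range(len(c))])

dense = np.block([[circulant(first_cols[i][j]) for j in range(cols)] for i in range(rows)])
assert np.allclose(dense @ x, block_circulant_matvec(first_cols, x, block))
print(block_circulant_matvec(first_cols, x, block))
```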


TABLE 5: Effect of compression methods on accuracy.
Method | Technique | RNN Type | Dataset | Compression ratio (Sparsity for pruning) | Training | Accuracy | Paper
Magnitude pruning | Weight subgroups | 4*LSTM*1024 + 4*LSTM*1024 | WMT'14 | 5× (80%) - 10× (90%) | Retraining | +2.1% - −1.7% | [72]
Magnitude pruning | Hard thresholding | 2*LSTM*512 | TIMIT | 1.1× (10%) - 1.3× (24%) | None | 0% | [73]
Magnitude pruning | Gradual pruning | 2*LSTM*1500 | PTB | 20× (90%) | With training | −2.3% | [85]
Magnitude pruning | Block pruning | 7*BiLSTM*2560 | Speech data 2 | 12.5× (92%) | With training | −12% | [75]
Magnitude pruning | Grow&Prune | 1*H-LSTM*512 1 | COCO | 8× (87.5%) - 19× (95%) | With training | 0% - −2.2% | [54]
Structured pruning | Network sparsification | 2*LSTM*512 | COCO | 2× (50%) | None | 0% | [78]
Structured pruning | Drop-out | 5*BiLSTM*512 | LibriSpeech ASR corpus | 10× (90%) | None | 0% | [79]
Structured matrices | Circulant | 3*BiLSTM*512 | AN4 | nearly 4× | With training | +10.7% 3 | [58]
Structured matrices | Block-circulant | 2*LSTM*1024 | TIMIT | 15.9× | With training | −5.5% | [68]
Tensor decomp. | CP | 1*GRU*512 | Nottingham | 101× - 481× | With training | −1% - −5% | [81]
Knowledge distillation | Plain | 4*LSTM*1000 | WMT'14 | 3× | With training | −1% | [86]
Knowledge distillation | +Pruning | 4*LSTM*1000 | WMT'14 | 26× | With training + Retraining | −5.1% | [86]
1 H-LSTM is hidden LSTM. Non-linear layers are added in gate computations (explained in Section III).
2 Dataset name is not mentioned in the paper.
3 Accuracy is also affected by quantization (Table 4) and the nonlinear function approximation used in this work.

TABLE 6: Effect of DeltaRNN method on accuracy


RNN model Dataset Training Accuracy Speedup paper
1*GRU*512 TIDIGITs With training −1.6% 5.7× [70]
CNN+ 1*GRU*512 Open-driving With training 0% 100× [87]
(iv) Knowledge distillation: Knowledge distillation is a method that replaces a large model with a smaller model that should behave like the large model. Starting from a large model (teacher) with trained parameters and a dataset, the small model (student) is trained to behave like the large model [86]. In addition to knowledge distillation, pruning can be applied to the resulting model to increase the compression ratio, as shown in Table 5.

3) DeltaRNN
Delta Recurrent Neural Networks (DeltaRNN) [87] make use of the temporal relation between input sequences. For two consecutive input vectors xt and xt−1, the difference between corresponding values in the two vectors may be zero or close to zero. The same holds for the hidden state output vector. The idea is to skip computations for input/hidden state values that, when compared to the input/hidden state values of the previous time step, have a difference that is less than a predefined threshold called delta (Θ). The improvement comes from decreasing the number of computations and the number of memory accesses required by the recurrent unit. However, memory requirements will not decrease, because we still need to store all the weights as we cannot predict which computations will be skipped.
The value of the delta threshold affects both accuracy and speedup. In Table 6, we summarize the effect of DeltaRNN on accuracy for two different datasets. On some occasions, it was required to train the RNN using a delta algorithm before inference to obtain better accuracy at inference time. Furthermore, the speedup gained by the delta algorithm at one delta value is not static. It depends on the relation between the input sequences. The highest speedup could be reached using video frames (open-driving dataset) as input data, as seen in Table 6. However, the time-consuming CNN before the recurrent layer negated the speedup gained by DeltaRNN. Thus, the 100× speedup in GRU execution dropped down to a non-significant speedup for the model as a whole. On the other hand, CNN-Delta [88] applied a similar delta algorithm to CNNs. Applying delta algorithms to both recurrent layers and CNN layers might prove beneficial.

4) Non-linear function approximation
Non-linear functions are the second most used operations in the RNN after matrix to vector multiplications, as may be seen in Table 3. The non-linear functions used in the recurrent layers are tanh and sigmoid. Both functions require floating-point division and exponential operations, which are expensive in terms of hardware resources. In order to have an efficient implementation for an RNN, non-linear function approximations are implemented in hardware. The approximation should satisfy a balance between high accuracy and low hardware cost. In what follows, we present the approximations used in the implementations under study.
Look-up tables (LUTs): Replacement of the non-linear function computation with look-up tables is the fastest method [89]. The input range is divided into segments with constant output values. However, to achieve high accuracy, large LUTs are required, and that consumes a large area of silicon, which is not practical. Several methods have been proposed to decrease the LUT size while preserving high accuracy.
Piecewise linear approximation: This approximation method divides the non-linear function curve into a number of line segments. Any line segment can be represented by only two values: the slope and the bias. Thus, for each segment, only two values are stored in the LUTs. The choice of the number of segments affects both accuracy and the size of the LUTs. Thus, the number of segments must be chosen wisely to keep the accuracy high while keeping the LUTs as small as possible. The computational complexity of the non-linear function changes to a single comparison, multiplication, and addition, which may be implemented using shifts and additions. Compared to using look-up tables, piecewise linear approximation requires fewer LUTs and more computations.
Hard tanh / Hard sigmoid: Hard tanh and hard sigmoid are two examples of piecewise linear approximation with three segments. The first segment is saturation to zero or −1 (zero in the case of sigmoid and −1 in the case of tanh), the last segment is saturation to one, and the middle segment is a line segment that joins the two horizontal lines.
There is a variant of piecewise linear approximation called piecewise non-linear approximation. The line segments are replaced by non-linear segments, and the use of multipliers cannot be avoided as it can in the linear version. This makes the linear approximation preferable in hardware design.
RALUT: One other method to reduce the size of the LUTs is to use RALUTs (Range Addressable Look-Up Tables) [90]. In RALUTs, each group of inputs is mapped to a single output.
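A minimal sketch of the hard sigmoid / hard tanh approximations described above follows; the segment boundaries use the commonly adopted definitions and are an assumption as far as the surveyed hardware designs are concerned. The generic helper shows how each segment is represented by just a slope and a bias.

```python
import numpy as np

def hard_sigmoid(x):
    # Three segments: 0 for x <= -2.5, 1 for x >= 2.5, a line in between.
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def hard_tanh(x):
    # Saturate to -1 and +1 with a unit-slope segment in between.
    return np.clip(x, -1.0, 1.0)

def piecewise_linear(x, breakpoints, slopes, biases):
    # Generic piecewise linear approximation: each segment stores only
    # a (slope, bias) pair, as in the LUT-based schemes described above.
    idx = np.searchsorted(breakpoints, x)
    return slopes[idx] * x + biases[idx]

x = np.linspace(-4, 4, 9)
breaks = np.array([-2.5, 2.5])              # segment boundaries
slopes = np.array([0.0, 0.2, 0.0])          # one pair per segment
biases = np.array([0.0, 0.5, 1.0])
assert np.allclose(piecewise_linear(x, breaks, slopes, biases), hard_sigmoid(x))
print(hard_sigmoid(x))
print(hard_tanh(x))
```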
B. PLATFORM SPECIFIC OPTIMIZATIONS
In this section, we discuss the optimizations performed at the hardware level to run an RNN model efficiently. These optimizations may be related to computation or memory. For computation-related optimizations, techniques are applied to speed up the computations and obtain higher throughput. For memory-related optimizations, techniques are applied to carry out memory usage and accesses with reduced memory overhead.

1) Compute-specific
The bottleneck in RNN computations is the matrix to vector multiplications. It is difficult to fully parallelize matrix to vector multiplications over time-steps as the RNN model includes a feedback part. Each time-step computation waits for the preceding time-step computations to complete so it can use the hidden state output as an input for the new time-step computation.
• Loop unrolling: Loop unrolling is a parallelization technique that creates multiple instances of the looped operations to gain speedup at the expense of resources. There are two kinds of loop unrolling used in RNN implementations. The first is inner-loop unrolling, where the inner loop of the matrix to vector multiplication is unrolled [21], [91]. The second kind is unrolling over time-steps. The RNN needs to run for multiple time-steps for each task to be completed. The computation of the recurrent unit can be unrolled over time-steps [92]. However, this cannot be fully parallelized, as discussed earlier. Only computations that rely on inputs can be parallelized, while computations relying on hidden state outputs are performed in sequence. One solution can be to use QRNN or SRU, as discussed in Section III-B. In QRNN and SRU, the matrix to vector multiplications do not operate on the hidden state output and thus can be fully parallelized over unrolled time steps [93].
• Systolic arrays: 2D systolic arrays are a good candidate for matrix to vector multiplication [94], [95] and convolution units [15]. Systolic arrays are efficient as multiplication operands move locally between neighboring PEs (processing elements) [96]. Thus, systolic arrays require less area, less energy, and less control logic. Well-designed systolic arrays can guarantee that PEs remain busy to maximize throughput.
• Pipelining: Pipelining is an implementation technique that can increase throughput. Pipelining has been used in RNN implementations in various ways. Coarse-grained pipelining (CGPipe) is used to tailor the LSTM/variants data dependency [68], [80]. The LSTM computation is performed in three stages, with double buffers in between. The first stage is for the weight matrix multiplications with the input and hidden cell vectors, the second stage is for non matrix to vector operations, and the third stage is for projection layer computations. Fine-grained pipelining (FGPipe) can be used to schedule the operations within the CGPipe stages. The design of the pipelining scheduler is a critical task due to the data dependency in LSTM/variants [74]. Some operations need to be performed sequentially, while some operations can be done concurrently. Having sparse weight matrices (due to applying pruning) increases the complexity of the scheduler design.
• Tiling: Tiling consists of dividing one matrix to vector multiplication into multiple matrix to vector multiplications. Usually, tiling is used when a hardware solution has built-in support for matrix to vector multiplication of a specific size in one clock cycle. When the input vector or the weight matrix size is larger than the size of the vector or the matrix supported by the hardware, tiling is used to divide the matrix to vector multiplication so that it can be performed on the hardware in multiple cycles [58], [91]. Thus, tiling can be combined with inner-loop unrolling or systolic arrays. Figure 8 shows a vector that is broken into three vectors and a matrix that is broken into nine matrices. Thus, one matrix to vector multiplication is broken into nine matrix to vector multiplications. Each vector is multiplied with the matrices having a similar color. The output vector is built from three vectors, where the three partial output vectors are accumulated to form one vector in the output. This computation requires nine cycles to be completed, assuming that new weights can be loaded into the hardware multiplication unit within the cycle time (a code sketch of this tiling scheme is given at the end of this subsection).

FIGURE 8: Tiling: converting one matrix to vector multiplication into nine matrix to vector multiplications.

• Hardware sharing: In the GRU recurrent layer, the execution of rt and h̃t has to be in sequence, as the h̃t computation depends on rt, as shown in Eq. 12. Thus, the computation of rt and h̃t is the critical path in the GRU computation, while zt can be computed in parallel as it is independent of h̃t and rt. The same hardware can be shared for computing rt and zt to save hardware resources [97].
• Load balancing: In the case of sparse weight matrices (resulting from pruning), load balancing techniques might be needed during parallelization of the matrix to vector multiplication over processing elements [73], [74].
• Analog computing: Analog computing is a good candidate for neural network accelerators [98]. Analog neural networks [99] and analog CNNs [100] have been studied recently. Interestingly, RNN implementations using analog computing have started to attract attention from researchers [98], [101]. Analog computing brings significant benefits, especially for the critical matrix-vector computation, by making it both faster and more energy-efficient. This is true for the non-linear functions that are normally calculated between the NN layers as well. Analog computing also allows for more efficient communication, as a wire can represent many values instead of only a binary value. The performance of an analog computer will, however, critically depend on the digital-to-analog and analog-to-digital converters, for both speed and energy consumption.
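The following sketch (our own, assuming a hypothetical hardware unit that multiplies one 4×4 tile by a 4-element vector per call) illustrates the tiling idea from Figure 8: a 12×12 weight matrix and a 12-element vector are processed as nine tile-level products whose partial results are accumulated per output tile.

```python
import numpy as np

TILE = 4  # size of the matrix-to-vector unit assumed to exist in hardware

def tiled_matvec(W, x, tile=TILE):
    # Break one (M x N) matrix-to-vector product into (M/tile)*(N/tile)
    # tile-sized products, accumulating partial results per output tile.
    M, N = W.shape
    y = np.zeros(M)
    for i in range(0, M, tile):           # output (row) tiles
        for j in range(0, N, tile):       # input (column) tiles
            # One "hardware call": a tile x tile matrix times a tile vector.
            y[i:i + tile] += W[i:i + tile, j:j + tile] @ x[j:j + tile]
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((12, 12))
x = rng.standard_normal(12)
assert np.allclose(tiled_matvec(W, x), W @ x)
print("nine tile-level products for a 12x12 matrix with 4x4 tiles")
```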
2) Memory considerations
For the processing of an RNN algorithm, memory is needed to store weight matrices, biases, inputs, and activations, where the weight matrices have the highest memory requirement. The first decision related to memory is the location of weight storage. If all the weights are stored in off-chip memory, accessing the weights comprises the largest cost with respect to both latency and energy [91], [102].
On-chip memory: After applying the algorithmic optimizations introduced in Section IV-A, the memory requirements of the RNN layer are reduced, which increases the possibility of storing the weights in on-chip memory. However, this results in a restriction on the largest model size that can run on the embedded platform. On-chip memory has been used for storing the weights by many implementations [56], [58], [68], [70], [103].
Hybrid memory: Storing all the weights in on-chip memory restricts the size of the model executed on the embedded solution. Storing some of the weights in on-chip memory with the remainder in off-chip memory might provide a solution [69].
In addition to maximizing the use of on-chip memory, some researchers use techniques to reduce the number and the cost of memory accesses.
• Multi time-step parallelization: The fact that QRNN and SRU remove the hidden state output from the matrix to vector multiplications can be leveraged to allow multi time-step parallelization [93]. Multi time-step parallelization is performed by converting multiple matrix to vector multiplications into fewer matrix to matrix multiplications. This method decreases the number of memory accesses by reusing the weights for computations involving multiple time-steps (a code sketch at the end of this subsection illustrates the idea).
• Reordering weights: Reordering weights so they occupy memory in the same order as the computation helps decrease the memory access time [91]. Reordering the parameters in memory is carried out in a way that ensures memory accesses will be sequential.
• Compute/load overlap: In order to compute matrix to vector multiplications, weights need to be accessed and loaded from memory and then used for computations. The total time is the sum of the access time and the computation time. To decrease this time, memory accesses and computations can be overlapped. This overlap can be achieved by fetching the weights for the next time-step while performing the computation of the current time-step. The overlap requires extra buffers for storing the weights of the next time-step while using the weights of the current time-step [74].
• Doubling memory fetching: In this method, twice the number of weights required for computation are fetched [104]. Half of the weights are consumed in the current time step t computations and the rest are buffered for the following time step t + 1. Doubling memory fetching can reduce the memory bandwidth by half.
Domain-wall memory (DWM): DWM is a new technology for non-volatile memories proposed by Parkin et al. from IBM in 2008 [105]. DWM technology is based on magnetic spin [106]–[109]. Information is stored by setting the spin orientation of magnetic domains in a nanoscopic permalloy wire. Multiple magnetic domains can occupy one wire, which is called race-tracking. Race-tracking allows the representation of up to 64 bits. DWM density is hoped to improve on SRAM by 30× and DRAM by 10× [110]. Using DWM in an RNN accelerator can achieve better performance and lower energy consumption [106].
Processing In Memory (PIM): PIM gets rid of the data fetching problem by allowing computation to take place in memory, eliminating memory access overhead. In such an architecture, a memory bank is divided into three sub-array segments: memory sub-arrays, buffer sub-arrays, and processing sub-arrays, which are used as conventional memory, data buffers, and processing units, respectively. ReRAM-based PIM arrays are one approach used to accelerate CNNs [111]–[113] and RNNs [114]. ReRAM that supports XNOR and bit-counting operations will only be sufficient for an RNN implementation if binary or multi-bit code (Section IV-A1) quantization has been applied [71]. Memristor crossbar arrays have successfully been used as an analog dot product engine to accelerate both CNNs [115] and RNNs [101].
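As a sketch of the multi time-step parallelization mentioned above (our own illustration; the layer sizes are arbitrary), the per-time-step products W·x_t of a QRNN/SRU-style projection, which does not depend on h_{t−1}, are fused into a single matrix–matrix product so the weights are fetched once and reused across all time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps = 64, 16
W = rng.standard_normal((hidden, hidden))      # weights are loaded once
X = rng.standard_normal((hidden, steps))       # one column per time step

# One matrix-vector product per time step: W is (re)read T times.
per_step = np.column_stack([W @ X[:, t] for t in range(steps)])

# Multi time-step version: a single matrix-matrix product over all steps,
# possible because the projection does not use h_{t-1} (QRNN/SRU case).
fused = W @ X

assert np.allclose(per_step, fused)
print("weights reused across", steps, "time steps in one matmul")
```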
TABLE 7: Detailed information about papers under study


Index Authors Name Affiliation Year
F2 [80] Li et al. E-RNN Syracuse University, Northeastern University, 2019
Florida International University, Carnegie Mellon University,
University of Southern California, SUNY University
F4 [68] Wang et al. C-LSTM Peking University, Syracuse University, 2018
City University of New York
F1 [56] Rybalkin et al. FINN-L University of Kaiserslautern, 2018
Xilinix Research Lab
F6 [74] Han et al. ESE Stanford University, DeePhi Tech, 2017
Tsinghua University, NVIDIA
F3 [70] Gao et al. DeltaRNN University of Zurich & ETH Zurich 2018
F5 [92] Rybalkin et al. - University of Kaiserslautern, 2017
German Research Center for Artificial Intelligence
F7 [21] Zhang et al. University of Illinois, Inspirit IoT Inc, 2017
Tsinghua University, Beihang University
F8 [103] Lee et al. - Seoul National University 2016
F9 [16] Sun et al. - Shanghai Jiao Tong University, 2018
Chinese Academy of Sciences,
University of Cambridge, Imperial College
F10 [91] Guan et al. - Peking University, University of California 2017
PKU/UCLA Joint Research Institute in Science and Engi-
neering
F11 [78] Rizakis et al. - Imperial College London 2018
F12 [69] Chang et al. DeepRnn Purdue University 2017
A1 [101] Ankit et al. PUMA Purdue University, Hewlett Packard Enterprise, 2019
University of Illinois at Urbana-Champaign
A2 [58] Wang et al. - Nanjing University 2017
A3 [98] Zhao et al. - Louisiana State University 2019
A8 [114] Long et al. - Georgia Institute of Technology, Atlanta 2018
A4 [97] Chen et al. Ocean Fudan University, Zhejiang University, 2017
University of Washington
A5 [73] Park et al. - Pohang University of Science and Technology 2018
A6 [104] Giraldo et al. Laika KU Leuven 2018
A7 [71] Yin et al. - Arizona State University 2018
A9 [95] Kwon et al. MAERI Goergia Institute of Technology 2018
A10 [94] Sharma et al. Bit Fusion Goergia Institute of Technology, Arm Inc. 2018
University of California (San Diego)
C1 [93] Sung et al. - Seoul National University 2018
C2 [93]
C3 [116] Cao et al. MobiRNN Stony Brook University 2017

V. RNN IMPLEMENTATIONS ON HARDWARE
In the previous section, we discussed the optimizations applied to decrease the computation and memory requirements of RNN models. In this section, we study recent implementations of RNN applications on embedded platforms. The implementations are divided into FPGA, ASIC, and other implementations. We analyze these implementations and study the effects of the applied optimizations. However, the effect of each optimization is not shown separately. Instead, the outcomes of applying the mix of optimizations are discussed with respect to the objectives presented in Section II. First, with regard to efficiency, the implementations are compared in terms of throughput, energy consumption, and meeting real-time requirements. Then, for flexibility, we discuss implementations that support variations in the models, online training, or different application domains.
Table 7 shows the details of the implementations studied here. Authors' names are shown, along with the name of the architecture (if named), the affiliation, and the year of publication. Table 8 and Table 9 present the implementations under study. Table 8 shows implementations performed on FPGAs, while Table 9 shows implementations performed on other platforms. Each implementation has an index. The index starts with "F" for FPGA implementations, "A" for ASIC implementations, and "C" for other implementations. For each implementation, the tables show the platform, the RNN model, the applied optimizations, and the runtime performance.
In most cases, only the recurrent layers of the RNN model are shown, as most of the papers provided the implementation for these layers only. The recurrent layers are written in the format x*y*z, where x is the number of recurrent layers, y is the type of recurrent layers (e.g., LSTM, GRU), and z is the number of hidden cells in each layer. If the model has different modules (e.g., two different LSTM models or LSTM + CNN), we give the number of executed time-steps of the RNN model. Both algorithmic and platform optimizations are shown in the tables. All the optimizations found in the tables are explained above in Section IV using the same keywords as in the tables. For quantized models, "Quantization X" is written in the optimizations column, where X is the number of bits used to store the weights. The effective throughput and the energy efficiency given in the tables are discussed in detail in the sub-sections below.

A. IMPLEMENTATION EFFICIENCY
To study the efficiency of the implementations under study, we focus on three aspects: throughput, energy consumption, and meeting real-time requirements.

1) Effective Throughput
To compare the throughput of different implementations, we use the number of operations per second (OP/s) as a measure. Some of the papers surveyed did not directly state the throughput. For these papers, we have tried to deduce the throughput from other information given. One other aspect to consider is that compression optimization decreases the number of operations in the model before running it. Consequently, the number of operations per second is not a fair indicator of the implementation efficiency. In this case, the throughput is calculated using the number of operations in the dense RNN model, not the compressed model. We call this an effective throughput. Below, we list the methods used to deduce the throughput values for the different papers.
• Case q1: The effective throughput is given in the paper.
• Case q2: The number of operations in the dense model and the computation time are given. By dividing the number of operations n_op by time, we get the effective throughput Q_eff, as shown in Eq. 25. In some papers, the number of operations and the computation time t_comp were given for multiple time steps (multiple inputs), which would require running the LSTM n_steps times.

Q_eff = (n_op × n_steps) / t_comp    (25)
• Case q3: The implemented RNN model information is provided in the paper. Thus, we calculate the number of operations from the model information and then divide it by the computation time to get the throughput, as in case q2. To compute the number of operations, the operations in the matrix to vector multiplications are counted, as they have the dominant effect on the performance. If the paper does not give enough information to calculate the number of operations, the number of operations can be approximately calculated by multiplying the number of parameters by two.
• Case q4: The energy efficiency is given in terms of OP/s/watt and the power consumption is given in watts. By multiplying the two values, the throughput is calculated.
• Case q5: The effective throughput could not be computed.
For a fair comparison between the ASIC implementations, we have applied scaling to 65 nm technology at 1.1 V using the general scaling equations in Rabaey [117] and scaling estimate equations for 45 nm and smaller technologies [118]. If the voltage value is not mentioned in the paper, we assume the standard voltage for the implementation technology. For example, since A7 was implemented on 65 nm, we assume the voltage value to be 1.1 V.
To analyze Table 8 and Table 9 and understand the effect of different optimizations on throughput, the table entries are ordered in descending order, starting with the highest-throughput implementation. There exist two optimization groups that appear more frequently among the high-throughput implementations. The first group is related to decreasing memory access time. Memory access time is decreased either by using on-chip memory for all weights or by overlapping the computation time and the weight loading time.
The second group is related to algorithmic optimizations. The algorithmic optimizations present in all high-throughput implementations are compression (pruning, block-circulant matrices, etc.), deltaRNN, and low-precision quantization. Non-linear function approximations and 16-bit quantization are not within the groups of high-effect optimizations. Quantization with 16 bits is present in many implementations that do not aim for lower precision, and it does not have a great effect on computation cost. Thus, it is not a differentiating factor. Non-linear function approximations do not contribute to the most used operations (matrix to vector multiplications).
Finally, the throughput values are plotted against the implementations in Figure 9. The scaled effective throughput values for the ASIC implementations are used. Implementations that have memory access optimizations and/or algorithmic optimizations are highlighted by putting them inside a square and/or circle. It can be observed from Figure 9 that all of the implementations with high throughput applied some algorithmic optimization and a memory access optimization. For example, F1 [56] applied low-precision quantization and placed all the weights in the on-chip memory. F2 [80], F3 [70], F4 [68], and A2 [58] all applied both on-chip memory optimization and algorithmic optimizations. In F6 [74], the architecture had a scheduler that overlapped computation with memory accesses. All the weights required for computation are fetched before the computation starts. Thus, they managed to eliminate the off-chip memory access overhead by having an efficient compute/load overlap.
Both F2 and F4 applied block-circulant matrices optimization. In addition, A2 applied circulant matrices optimization. This indicates that restructuring weight matrices into circulant matrices and sub-matrices is one of the most fruitful optimizations. The reason could be that circulant matrices optimization does not cause the irregularity of weight matrices seen in pruning [68]. Additionally, circulant/block-circulant matrices can be accompanied by low-precision quantization without a harsh effect on accuracy, as in A2 (6-bit) and F2 (12-bit). It is observed in Table 8 that the F2 and F4 optimizations are almost identical, but their performance is different. F2 and F4 have differences in the hardware architecture, and F2 applied lower precision than F4, but the most important reason is that F2 used a better approach in training the compressed RNN model. F2 was able to reach the same accuracy level reached by F4 with block size 8 while using block size 16. Thus, the RNN model size in F2 is approximately 2× less than that of F4.
Nevertheless, it is noticed that the only compute optimization in F2 and F4 is pipelining. In these two implementations, pipelining served two roles: the first is coarse-grained pipelining between LSTM stages, and the second is fine-grained pipelining within each stage. It is worth noting that F1 is based on the same architecture as F5. F1 achieved higher throughput than F6 by applying a higher frequency and using lower precision. Assuming linear frequency scaling, the ratio between the two implementations' throughput is close to the ratio between the precisions used for storing the weights by the two implementations.
The lack of algorithmic optimizations in A1 [101] was compensated by the use of analog crossbar-based matrix to vector multiplication units. Analog crossbar units allowed low-latency matrix to vector multiplications. Implementations that used analog computing are marked with an "A" sign in Figures 9 and 10. Comparing A1 to A3, both were using analog crossbars. A1 surpassed A3 by applying PIM (Processing In Memory), which removes memory access overhead. Therefore, in Figures 9 and 10, we consider PIM as a memory access optimization.
One implementation that stands out is A6 [104], which has a very low throughput for an ASIC implementation while applying on-chip and algorithmic optimizations. This particular implementation was meant to meet a latency deadline of 16 ms while consuming low power – at the micro-watt level. Thus, high throughput was not the objective from the beginning. Implementations that defined real-time requirements are marked by an "RT" sign in Figures 9 and 10. Another implementation that rewards close inspection is F8. Despite applying the two mentioned optimizations, it could not reach as high performance as expected. The conclusion here is that applying memory access optimization and algorithmic optimization is necessary but not sufficient for high performance.

TABLE 8: Comparison of RNNs implementations on FPGAs.
Index Platform Model Algorithmic Platform Qef f 1 Eef f 2
Optimizations Optimizations GOP/s GOP/s/watt
F1 [56] Zync XCZU7EV @266 MHz 1*BiLSTM*128 Quantization 1 On-chip,Pipelining < q1 > 3435 < e4 > -
Inner-loop-unrolling
Unrolling-timesteps, Tiling
F2 [80] Alpha Data ADM-7V3 2*LSTM*1024 Block-circulant 16 On-chip < q3 > 2485 < e1 > 99.4
@200MHz Piecewise approx. Pipelining
Quantization 12
F3 [70] Zync 7100 @125 MHz 1*GRU*256 DeltaRNN, RALUT On-chip < q1 > 1198.3 < e2 >164
Quantization 16 Pipelining
F4 [68] Alpha Data ADM-7V3 2*LSTM*1024 Block-circulant 8 On-chip < q3 > 1167.3 < e1 > 53
@200MHz Piecewise approx. Pipelining
Quantization 16
F5 [92] Zynq- 7000 XC7Z045 @142 1*BiLSTM*100 Quantization 5 On-chip, Pipelining < q1 > 308 < e3 > 44
MHz Inner-loop-unrolling
F6 [74] XCKU060 @200 MHz 2*LSTM*1024 Pruning, Quantization 12 Compute/load overlap < q3 > 78.6 3 < e2 > 1.9
Pipelining, Load balancing
Inner-loop-unrolling
F7 [21] Virtex-7 VC709 @100 MHz AlexNet + Quantization 16 Inner-loop-unrolling < q2 > 36.254 < e4 > -
15steps:1*LSTM*256 Reordering weights
F8 XC7Z045 @ 100 MHz 100steps:3*LSTM*256 Quantization 6 On-chip < q3 > 305 < e2 > 5.4
[103] 3840steps:2*LSTM*256 Inner-loop-unrolling
F9 [16] VC707 @150 MHz 1*LSTM*10 + FC Hard Sigmoid Tiling,Inner-loop-unrolling < q1 > 13.5 < e4 > -
Pipelining
F10 VC707 @150 MHz 3*LSTM*250 Piecewise approx. Tiling, Inner-loop-unrolling < q1 > 7.3 < e4 > -
[91] Pipelining, Reordering weights
Compute/load overlap
F11 Zynq ZC706 @100 MHz 2 * LSTM*512 Pruning Tiling, Inner-loop-unrolling < q3 > 1.55 < e4 > -
[78]
F12 Zynq-7000 XC7Z045 2*LSTM*128 Quantization 16 Hybrid memory < q4 > 0.2 < e1 > 0.11
[69] @142MHz Piecewise approx. Compute/load overlap
1 The cases q1-q4 are explained in Section V-A1.
2 The cases e1-e4 are explained in Section V-A2.
3 The effective throughput in the paper was computed on a bit basis. For a fair comparison, we recalculated the number of operations on an operand basis.
4 The throughput is for running CNN and LSTM combined together.
5 The number of time steps the model should run per second to reach real-time behavior is given. We computed the number of operations in the model and multiplied by the number
of time steps in one second, then multiplied by the speedup gained over the real-time threshold to get the implementation throughput.
TABLE 9: Comparison of RNNs implementations on ASIC and other platforms.


Category Index Platform Model Algorithmic Platform Qef f 1GOP/s Eef f 2 GOP/s/watt
Optimizations Optimizations (original/scaled)3 (original/scaled)3
A1 [101] CMOS 32nm @1GHz LSTM 8 Quantization 16 Memristor PIM < q1 >52300/16000 < e1 >837/250
Analog computing, Pipelining
A2 [58] TSMC 90nm 1*LSTM*512 Quantization 6 On-chip < q1 > 2460/3406 < e2 > 2436/2787
@600MHz &1v Circulant matrices Tiling
ASIC Piecewise approx. Inner-loop-unrolling
A3 [98] CMOS 180nm 1*LSTM*16 Quantization 4 Analog computing < q4 >473.3/1211 < e1 >950/7044
On-chip
A4 [97] CMOS 65nm GRU7 Quantization 16 On-chip < q1 > 311.6 < e1 > 2000/2380
@400 MHz Piecewise approx. Hardware sharing, Pipelining
A5 [73] CMOS 65nm 2*LSTM*512 Pruning Load balancing < q3 > 295 < e3 > 122.9
@200MHz Inner-loop-unrolling
A6 CMOS 65nm 2*LSTM*32 Quantization 4 On-chip < q2 > 0.002 4 < e2 > 469.3/128
[104] @239 KHz Piecewise approx. Doubling memory fetching
A7 [71] CMOS 65nm 1*LSTM*256 Quantization 16 ReRAM PIM < q5 > - < e1 >27000

Analog computing
C1 [93] ARMv8 1*SRU*1024 SRU Unrolling-timesteps < q3 > 22.3 < e4 > -
Others @ 2GHz
C2 [93] Intel Core i7 1*SRU*1024 SRU Unrolling-timesteps < q3 >19.2 < e4 > -
@ 3.2GHz
C3 Adreno 330 GPU 2*LSTM*32 - RenderScript5 < q3 > 0.0011 < e4 > -
[116] @ 450 MHz
1 The cases q1-q4 are explained in Section V-A1.
2 The cases e1-e4 are explained in Section V-A2.
3 Scaled to 65 nm at 1.1 volt using general scaling [117] and scaling estimates for 45 nm and smaller technologies [118].
4 The throughput is not high as the purpose was to reach very low power consumption while performing inference within 16ms.
5 RenderScript is a mobile-specific parallelization framework [119].
6 Quantization used 1 bit for weights and 4 bits for activations.
7 A4 proposed a GRU core without providing specific model details.
8 A1 did not specify which model achieved the provided throughput and energy efficiency.
In addition, Figure 9 shows that most of the ASIC implementations did not exceed the FPGA implementations in terms of throughput. We think the reason is that the ASIC implementations under study did not use the latest ASIC technologies, as shown in Table 9.
For the low-throughput implementations, Figure 9 shows that some implementations did not apply either of the two optimizations (memory access and algorithmic), such as F9 [16], which had a strict accuracy constraint bounding the use of algorithmic optimizations, and C3 [116]. In addition, some implementations applied only one of the two optimizations, including F11 [78] and F12 [69].

2) Energy efficiency
To compare the implementations from the energy consumption perspective, we use the number of operations per second per watt as a measure. The last columns in Table 8 and Table 9 show the energy efficiency. Energy efficiency is calculated based on the dense model, not the sparse model, as for effective throughput. However, it was not possible to obtain values for energy efficiency for all implementations. In some cases, the power consumption was not mentioned in the paper, while in others, the consumed power was not provided in a precise manner. For instance, the power of the whole FPGA board may be provided, which does not indicate how much power is used by the implementation with respect to the peripherals [21], [91].
Here, we list the cases used for computing the energy efficiency in Table 8 and Table 9. The case number appears in triangular brackets, <>, before the numeric value.
• Case e1: The energy efficiency E_eff is given in the paper.
• Case e2: The power consumption is given in the paper. To compute the energy efficiency E_eff, the effective throughput Q_eff (OP/s) is divided by the power P (watt) as

E_eff = Q_eff / P.

• Case e3: Energy and computation time are provided. First, we divide energy by time to get power. Then, we divide the effective throughput Q_eff by the power to get the energy efficiency, as in case e2.
• Case e4: The energy efficiency could not be computed.
Figure 10 is a plot of the energy efficiency found or deduced for the implementations under study against the implementation index. Implementations are sorted in the plot according to energy efficiency, and the scaled values for the ASIC implementations are used. Again, to show the effect of optimizations, we chose the two most effective optimizations from Table 8 and Table 9 to include in the figure. They are the same as in Figure 9: memory access optimization and algorithmic optimizations. Comparing the effective throughput and energy efficiency of FPGA and ASIC implementations, it is observed that FPGA and ASIC have close values for effective throughput, while ASIC implementations are more energy efficient. The credit should go to ASIC technology.
It can be observed that the highest energy efficiency was achieved by A7 [120] and A3 [98]. Both implementations used analog crossbar-based matrix to vector multiplications. A7 managed to save memory access time by computing in memory. The quantization method used was multi-bit code quantization (1 bit for weights and 4 bits for activations). Multi-bit code quantization enables replacing the MAC operation with XNOR and bit-counting operations, as discussed in Section IV-A1. It was sufficient to use an XNOR-RRAM based architecture to implement the RNN.
Both A1 (applying PIM and analog computing) and A6 (applying memory and algorithmic optimizations) were less energy efficient than expected. They were both less energy efficient than A4 (applying only memory optimization). A1 had a quite high clock frequency of 1 GHz. This high frequency helped the implementation to achieve high throughput. However, we suspect that this high frequency is the main reason for the energy efficiency degradation compared to the other implementations. A6 had the least power consumption among all implementations (≤ 5 µW). The low throughput of A6 affected the overall energy efficiency value.

3) Meeting real-time requirements
In some of the implementations, real-time requirements for throughput and power have been determined. For example, in F8 [103], the speech recognition system had two RNN models: one model for acoustic modeling and the other for character-level language modeling. The real-time requirement was to run the first model 100 times per second and the second model 3,840 times per second. While in A6 [104], an LSTM accelerator for an always-on Keyword Spotting System (KWS), the real-time response demanded that a new input vector should be consumed every 16 ms and the power consumption should not exceed 10 µW.

B. FLEXIBILITY
Flexibility, as defined in Section II, is the ability of the solution to support different models and configurations. The flexibility of the solution can be met by supporting variations in the model. Models can vary in the number of layers, the number of hidden units per layer, the optimizations applied on the model, and more. Flexibility can also be met by supporting online training or meeting different application domain requirements.
Flexibility is not quantitative, like throughput. Thus, we use a subjective measure for flexibility to reach a flexibility score for each implementation. Table 10 shows the flexibility aspects supported by each implementation, as discussed in the papers, and the flexibility score for each implementation. Papers that do not discuss any flexibility aspects are omitted from Table 10.
FIGURE 9: Effective throughput of different implementations and the key optimizations affecting them.

FIGURE 10: Energy efficiency of different implementations and the key optimizations used.

In A4 [97], the architecture should be able to support various models, but the number of cells and layers the architecture can support are not mentioned in the paper. Hence, we cannot deduce how the implementation could support variations in the RNN model. Also, the variations should be supported on the hardware platform and not only by the method before fabrication. In A2 [58], the design method can support two different RNN layers. However, the fabricated chip supports only one of them. Thus, we do not consider A2 [58] to meet the flexibility objective.
To understand how far flexibility is met by the implementations, Figure 11 shows the percentage of implementations supporting each flexibility aspect. Flexibility is visualized as levels. Level 0 is used to indicate no flexibility. Level 0 requires the implementation to support only one recurrent layer configuration. All papers meet the level 0 requirement, but thereafter they vary in meeting other flexibility aspects. The flexibility aspects and how they can be met are discussed below.
Supporting variations in RNN layers (level 1): Recurrent layers can vary in the type of layers, the number of cells in each layer, and the number of layers (the depth of the RNN model). One optimization that might have a side effect on the flexibility of the solution is the choice of using on-chip or off-chip memory to store the weights. Being able to store all the weights in on-chip memory is very beneficial. It leads to better performance and less energy consumption by decreasing the cost of memory accesses. However, this solution may be unfeasible for larger problems. For example, in F8 [103], the number of weights in the model and their precision are restricted by the on-chip memory size. It is not possible to run a model with an increased number of hidden cells or increased precision. A possible solution is to use an adaptable approach, where the location chosen to store the weights depends on the model size and thus a wide range of models can be supported. Another solution was adopted in F12 [69], where some of the weights are stored in internal memory and the rest are stored in off-chip memory (hybrid memory).
Supporting other NN layers (level 2): Supporting other NN layers allows the solution to run a broader range of NN applications. Also, other NN layers may exist in the RNN model, such as convolutions used as a feature extractor. Supporting such a convolution in the implementation increases the flexibility of the solution, as it can run RNN models with visual inputs and also run independent CNN applications.
FIGURE 11: Percentage of implementations meeting flexibility aspects for different flexibility levels and the definition of
flexibility levels.

Supporting other NN layers (level 2): Supporting other NN layers allows the solution to run a broader range of NN applications. Also, other NN layers may exist in the RNN model itself, such as convolutions used as a feature extractor. Supporting such a convolution in the implementation increases the flexibility of the solution, as it can run RNN models with visual inputs as well as stand-alone CNN applications.

TABLE 10: Flexibility score of implementations under study.

Index | Flexibility aspects in papers | Score
F2 [80] | Varying layer (LSTM/GRU); varying number of cells; varying block size (block-circulant matrices) | XXX
F4 [68] | Varying layer (LSTM/BiLSTM); varying number of layers; varying number of cells | XXX
F1 [56] | Varying layer (LSTM/BiLSTM); varying precision; FC supported | XXX
F5 [92] | Varying layer (LSTM/BiLSTM); FC supported | XX
F7 [21] | Convolution supported; FC supported | XX
F8 [103] | Varying number of layers; varying number of cells; input layer | XXX
F10 [91] | Varying number of layers; varying number of cells | XX
A4 [97] | Online training | X
A5 [73] | Varying number of cells; FC supported | XX
A6 [104] | Varying number of layers; varying number of cells; linear/nonlinear quantization; FC supported | XXXX
A8 [114] | Varying type of layer (LSTM/GRU); convolution supported; FC supported | XXX
A9 [95] | Varying number of cells; varying number of layers; dense/sparse; convolution supported | XXXX
A10 [94] | Varying number of cells; varying number of layers; convolution supported; varying precision | XXXX
A1 [101] | Varying number of cells; varying number of layers; varying type of layers; convolution supported; FC supported | XXXXX
C2 [93] | Varying layer (LSTM/SRU/QRNN); varying number of cells | XX
C3 [116] | Varying number of layers; varying number of cells | XX

Supporting algorithmic optimization variations (Level 3): Variations in the optimizations applied are also considered as variations in the model. For example, the variation due to applying or not applying pruning is related to the presence of sparse or dense matrices in the matrix-to-vector multiplication computations. The design in A9 [95] employed a configurable interconnection network topology to increase the flexibility of the accelerator; it supported both LSTM and CNN layers and both sparse and dense matrices. One other variation is in the precision of the weights and activations. The design in A10 [94] supported varying-precision models by allowing dynamic precision per layer for both CNN and RNN models. Similarly, the Microsoft NPU Brainwave architecture [121] supported varying precision using a narrow-precision block floating-point format [122]. To maximize the benefit of varying precision, F1 [56] applied a parameterizable parallelization scheme: when lower precision is required, LSTM units are duplicated to exploit the unused resources and gain speedup, and when higher precision is used, SIMD folding is applied to save resources for the needed high precision.
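A minimal sketch of such per-layer precision selection is shown below. The uniform symmetric scheme and the example bit-widths are our own assumptions for illustration; they are not the exact quantization method of A10 [94] or Brainwave [121].

import numpy as np

def quantize_layer(weights, n_bits):
    """Uniform symmetric quantization of one layer's weights to n_bits integers."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(weights)) / q_max       # one scale factor per layer
    q = np.clip(np.round(weights / scale), -q_max - 1, q_max).astype(np.int32)
    return q, scale                               # approximate weights as q * scale

# Each layer can be given its own bit-width, e.g. more bits where accuracy is sensitive.
layer_bits = {"embedding": 8, "lstm_0": 4, "lstm_1": 4, "fc_out": 8}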
Online training (Level 4): Incremental online training was included in A4 [97] to support retraining pre-trained networks to enhance accuracy. Changes in the hardware design were applied to support both training and inference without affecting the quality of inference. For example, three modes of data transfer were applied: the first to load new weights, the second to load input sequences, and the third to update certain weights. Extra precision was used only for training.

Meeting different application domain constraints (Level 5): None of the implementations target variations in the application domain constraints. NetAdapt is a good example of an implementation that can adapt to different metric budgets [123]; however, it only targets CNNs.


VI. DISCUSSIONS AND OPPORTUNITIES
In the previous section, we studied the implementations of RNN on embedded platforms. In Section II, we defined objectives for realizing RNN models on embedded platforms. In this section, we investigate how these objectives are being met by the implementations.

Throughput: It is clear that throughput was the main objective for most of the implementations. As seen in Figure 9, high throughput was achieved by many of them. Algorithmic and memory optimizations are present in most of these high-throughput implementations. The algorithmic optimizations applied were effective because they decrease both the computation and the memory requirements of the RNN models. For example, if 4-bit precision is used instead of 32-bit for weights storage, the memory requirement is decreased to 1/8. Multiple 4-bit weights can be concatenated during weights fetching. Thus, the number of memory accesses can decrease as well. Furthermore, the hardware required for 4-bit operations is simpler than the hardware required for 32-bit floating-point operations.
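The 4-bit packing argument can be made concrete with the short sketch below. This is our own illustration (in hardware the packing is simply the layout of the weights in memory): eight 4-bit weights share one 32-bit word, so a single memory access fetches eight weights. The values are assumed to be already quantized to the signed range [-8, 7].

import numpy as np

def pack_int4(q):
    """Pack signed 4-bit weights (length a multiple of 8) into 32-bit words."""
    nib = (np.asarray(q, dtype=np.int64) & 0xF).reshape(-1, 8)   # two's-complement nibbles
    return (nib << (np.arange(8) * 4)).sum(axis=1).astype(np.uint32)

def unpack_int4(words):
    """Recover the signed 4-bit weights from the packed 32-bit words."""
    nib = (words[:, None].astype(np.int64) >> (np.arange(8) * 4)) & 0xF
    return np.where(nib > 7, nib - 16, nib).reshape(-1)          # sign-extend each nibble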
Memory-specific optimizations are effective because they decrease or hide the overhead of accessing the large number of weights used in RNN computations. Memory access time can be decreased by storing all weights in on-chip memory. However, this can bound the validity of the solution for larger models, as on-chip memory may not be sufficient to store the weights. Overlapping the memory access time with computation, and computation in memory, are also considered to be memory optimizations.
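The overlap of weight loading and computation is usually realized with DMA transfers and ping-pong (double) buffers. The Python sketch below is only a host-level analogy of that idea; load_tile() and matvec() are hypothetical placeholders for the memory transfer and the MAC array, respectively.

from concurrent.futures import ThreadPoolExecutor

def run_layer(tiles, x, load_tile, matvec):
    """Compute a tiled matrix-vector product while prefetching the next weight tile."""
    y = 0.0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_tile, tiles[0])                # prefetch the first tile
        for i in range(len(tiles)):
            weights = pending.result()                            # wait for the fetched tile
            if i + 1 < len(tiles):
                pending = pool.submit(load_tile, tiles[i + 1])    # fetch the next tile ...
            y = y + matvec(weights, x)                            # ... while computing this one
    return y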
Energy efficiency: Applying algorithmic and memory access optimizations has a positive effect on energy efficiency as well. Algorithmic optimizations lead to a decrease in the number of computations, the complexity of computations, and the number of memory accesses, and so decrease the energy consumed by the implementation. Also, minimizing off-chip memory use by storing weights in on-chip memory is an effective way to enhance energy efficiency. Analog computing and processing-in-memory implementations showed superior energy efficiency among the ASIC implementations.

Meeting real-time requirements was not an objective for many of the implementations. In a few of them, real-time deadlines were mentioned and followed in the design of the solution.

Flexibility: In Section II-A, flexibility is defined as a secondary objective. Thus, we do not expect flexibility to be fully met by the implementations. Variations in the RNN model were partially fulfilled by many implementations; however, the number of variations covered by each implementation is quite low. Few implementations included other NN layers and variations in algorithmic optimizations. Online training was targeted by only one implementation. Embedded implementations do not usually support online training. However, on the algorithmic side, researchers are carrying out interesting work based on online or continuous training [10], [11]. None of the RNN implementations support different applications, but this has been done by the CNN solution in [123]. Following a similar method in RNNs, and in addition also supporting model variations, could lead to interesting solutions.

Opportunities for future research
Based on the survey reported on in this article, we list some opportunities for future research.

QRNN and SRU: QRNN and SRU (Section III-B6) are two alternatives to LSTM where the matrix-to-vector computations for the current time step are independent of previous time-step computations. Thus, using them in RNN models can make the parallelization more efficient and consequently lead to better performance.
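The sketch below illustrates why such layers parallelize well, using a simplified SRU-style recurrence (biases and the highway connection of [47] are omitted, so this is an illustrative variant rather than the exact published cell): all matrix multiplications depend only on the inputs and can be computed for all time steps as one large matrix product, leaving only cheap element-wise work in the sequential loop.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(X, W, Wf, Wr):
    """X: (T, d_in); W, Wf, Wr: (d_in, d_hid). Returns hidden states H: (T, d_hid)."""
    U = X @ W                                   # all heavy GEMMs are batched over time
    F = sigmoid(X @ Wf)
    R = sigmoid(X @ Wr)
    c = np.zeros(W.shape[1])
    H = np.empty_like(U)
    for t in range(X.shape[0]):                 # only element-wise work is sequential
        c = F[t] * c + (1.0 - F[t]) * U[t]
        H[t] = R[t] * np.tanh(c)
    return H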
DeltaRNN [87] and DeltaCNN [88]: We believe that applying the delta algorithm to both recurrent and convolution layers is a logical step because of the temporal relation between the input sequences. Adding a delta step to other algorithmic optimizations such as pruning and quantization would decrease memory access and computation requirements.
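The core of the delta algorithm can be sketched as follows (an illustrative reimplementation in the spirit of [87], not the authors' code): only input elements that changed by more than a threshold since they were last applied contribute to the matrix-vector product, and the remainder of the result is reused from the previous time step.

import numpy as np

class DeltaMatVec:
    def __init__(self, W, threshold=0.1):
        self.W, self.th = W, threshold
        self.x_prev = np.zeros(W.shape[1])     # last input values actually applied
        self.y = np.zeros(W.shape[0])          # running result, reused across time steps

    def step(self, x):
        delta = x - self.x_prev
        active = np.abs(delta) > self.th       # columns that must be touched this step
        self.y += self.W[:, active] @ delta[active]
        self.x_prev[active] = x[active]
        return self.y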
Block-circulant matrices: Using block-circulant matrices as an algorithmic optimization decreases the RNN size while avoiding the irregularity of computation introduced by pruning [68]. Applying circulant matrices can be accompanied by low-precision parameters and activations, with a small effect on accuracy [58]. With the addition of the delta algorithm, as mentioned earlier, RNN inference can achieve a promising throughput and energy efficiency.
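The computational benefit comes from the fact that each k-by-k circulant block is defined by a single length-k vector and can be applied with FFTs, reducing storage from k^2 to k values and the multiply count from roughly k^2 to k log k per block. The sketch below assumes the first-column convention for defining each block (papers differ on this detail):

import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix defined by first column c with vector x via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(C, x, k):
    """C[i][j] is the length-k defining vector of block (i, j) of a (p*k, q*k) matrix."""
    p, q = len(C), len(C[0])
    y = np.zeros(p * k)
    for i in range(p):
        for j in range(q):
            y[i * k:(i + 1) * k] += circulant_matvec(C[i][j], x[j * k:(j + 1) * k])
    return y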
Hybrid optimizations: It has been shown that a mix of algorithmic optimizations can be applied to an RNN model with an acceptable loss in accuracy [58]. Applying a mix of optimizations would enable the implementations to benefit from each optimization. For an RNN implementation, three classes of optimizations can be mixed and tuned. The first optimization is the delta algorithm, and the corresponding parameter is the delta threshold. The second is quantization, and the corresponding parameters are the number of bits and the quantization method. The third optimization is compression; if the applied compression technique is block-circulant matrices, the parameter is the block size. By tuning these parameters (the delta threshold, the number of bits, the quantization method, and the block size), the designer can achieve the highest performance while keeping the accuracy within an acceptable range (the range is dependent on the application).
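Such tuning can be organized as a simple search over the parameter space, as sketched below. The grid values and the evaluate() routine (which would fine-tune the model with the chosen settings and report accuracy and throughput) are placeholders of our own; the quantization method is kept fixed here for brevity.

import itertools

def tune(evaluate, accuracy_floor):
    """Return the settings with the best throughput whose accuracy stays acceptable."""
    best = None
    for delta, bits, block in itertools.product([0.0, 0.05, 0.1], [4, 8, 16], [2, 4, 8]):
        accuracy, throughput = evaluate(delta=delta, n_bits=bits, block_size=block)
        if accuracy >= accuracy_floor and (best is None or throughput > best[0]):
            best = (throughput, {"delta": delta, "n_bits": bits, "block_size": block})
    return best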
Analog computing and processing in memory: Analog computing [98] and processing in memory [71], [101] have shown promising performance, especially in energy efficiency. Analog crossbar-based matrix-to-vector multiplication units can provide low latency, and computing in memory overcomes the memory access problems.

Flexible neural networks and domain-specific architectures: Domain-specific architectures (DSAs) have been highlighted as a future opportunity in the field of computer architecture [124]. DSAs (also called accelerators or custom hardware) for neural network applications can reach high performance and energy efficiency.

Designing an architecture for neural network applications as a specific domain with known memory access patterns enhances parallelism and the use of the memory hierarchy. It is possible to use lower precision and benefit from domain-specific languages (DSLs). Google Edge TPU is an example of a DSA for neural network inference using 8-bit precision [125]. Based on the study in this article, we add that a neural network DSA needs to support flexibility. For the flexibility aspects defined earlier in Section II to be fulfilled, some features need to be supported in the underlying hardware:
• Variable bit-width operations, as in A10 [94], to support different quantization schemes.
• Some optimizations require pre/post-processing of input vectors and weights. Support for weights reordering, delta vector computation, reconstructing circulant matrices from their equivalent vectors, and other operations required by miscellaneous optimizations would be useful.
• Support for training, which would imply support for back-propagation and the ability to modify the weights.

VII. SUMMARY
Today we see a trend towards more intelligent mobile devices that are processing applications with streamed data in the form of text, voice, and video. To process these applications, RNNs are important because of their efficiency in processing sequential data. In this article, we have studied the state-of-the-art in RNN implementations from the embedded systems perspective. The article includes all the aspects required for the efficient implementation of an RNN model on embedded platforms. We study the different components of RNN models, with an emphasis on implementation rather than on algorithms. Also, we define the objectives that are required to be met by the hardware solutions for RNN applications, and the challenges that make them difficult to implement. For an RNN model to run efficiently on an embedded platform, some optimizations need to be applied. Thus, we study both algorithmic and platform-specific optimizations. Then, we analyze existing implementations of RNN models on embedded systems. Finally, we discuss how the objectives defined earlier in the article have been met and highlight possible directions for research in this field in the future.

We conclude from the analysis of the implementations that there are two optimizations that are used for most of the efficient implementations. The first is algorithmic optimizations. The second is to decrease the memory access time for weights retrieval, which can be implemented by relying on on-chip memory for storing the weights, applying an efficient overlap between weights loading and computations, or by computing in memory. However, using analog crossbar-based multipliers can achieve high performance without relying too much on algorithmic optimizations. A study of the implementations in the literature shows performance high enough for many streaming applications while also showing a lack of flexibility. Finally, we deduce some opportunities for research to fill the gap between the defined objectives and the research work under study. We highlight some hardware-efficient recurrent layers and algorithmic optimizations that can enhance implementations' efficiency. Additionally, we describe how custom embedded hardware implementations can support flexible RNN solutions.

REFERENCES
[1] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” CoRR, vol. abs/1303.5778, 2013.
[2] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” CoRR, vol. abs/1409.3215, 2014.
[3] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” CoRR, vol. abs/1411.4389, 2014.
[4] H. Salehinejad, J. Baarbe, S. Sankar, J. Barfett, E. Colak, and S. Valaee, “Recent advances in recurrent neural networks,” CoRR, vol. abs/1801.01078, 2018.
[5] Z. C. Lipton, “A critical review of recurrent neural networks for sequence learning,” CoRR, vol. abs/1506.00019, 2015.
[6] V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient processing of deep neural networks: a tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec 2017.
[7] S. I. Venieris, A. Kouris, and C. Bouganis, “Toolflows for mapping convolutional neural networks on FPGAs: a survey and future directions,” CoRR, vol. abs/1803.05900, 2018.
[8] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” CoRR, vol. abs/1710.09282, 2017.
[9] E. Wang, J. J. Davis, R. Zhao, H. Ng, X. Niu, W. Luk, P. Y. K. Cheung, and G. A. Constantinides, “Deep neural network approximation for custom hardware: Where we’ve been, where we’re going,” CoRR, vol. abs/1901.06955, 2019. [Online]. Available: http://arxiv.org/abs/1901.06955
[10] A. Teichman and S. Thrun, “Practical object recognition in autonomous driving and beyond,” in Advanced Robotics and its Social Impacts, Oct 2011, pp. 35–38.
[11] C. Käding, E. Rodner, A. Freytag, and J. Denzler, “Fine-tuning deep neural networks in continuous learning scenarios,” in Asian Conference on Computer Vision. Springer, 2016, pp. 588–605.
[12] Y. Wang, S. Liang, S. Yao, Y. Shan, S. Han, J. Peng, and H. Luo, “Reconfigurable processor for deep learning in autonomous vehicles,” 2017.
[13] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar, “DeepX: a software accelerator for low-power deep learning inference on mobile devices,” in 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), April 2016, pp. 1–12.
[14] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: a survey,” CoRR, vol. abs/1803.04311, 2018.
[15] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 367–379.
[16] Z. Sun, Y. Zhu, Y. Zheng, H. Wu, Z. Cao, P. Xiong, J. Hou, T. Huang, and Z. Que, “FPGA acceleration of LSTM based on data for test flight,” in 2018 IEEE International Conference on Smart Cloud (SmartCloud), Sept 2018, pp. 1–6.
[17] E. Sejdić, I. Djurović, and J. Jiang, “Time–frequency feature representation using energy concentration: An overview of recent advances,” Digital Signal Processing, vol. 19, no. 1, pp. 153–183, 2009.
[18] K. Gupta and D. Gupta, “An analysis on LPC, RASTA and MFCC techniques in automatic speech recognition system,” in 2016 6th International Conference-Cloud System and Big Data Engineering (Confluence), Noida, India. IEEE, 2016, pp. 493–497.
[19] L. Deng, “A tutorial survey of architectures, algorithms, and applications for deep learning,” APSIPA Transactions on Signal and Information Processing, vol. 3, 2014.
[20] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3128–3137.


[21] X. Zhang, X. Liu, A. Ramachandran, C. Zhuge, S. Tang, P. Ouyang, Z. Cheng, K. Rupnow, and D. Chen, “High-performance video content recognition with long-term recurrent convolutional network for FPGA,” in 2017 27th International Conference on Field Programmable Logic and Applications (FPL), Sept 2017, pp. 1–4.
[22] Y. Xu, Q. Kong, Q. Huang, W. Wang, and M. D. Plumbley, “Convolutional gated recurrent neural network incorporating spatial features for audio tagging,” in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 3461–3466.
[23] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” CoRR, vol. abs/1708.02709, 2017.
[24] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1–135, Jan. 2008.
[25] Y. Li, Q. Pan, T. Yang, S. Wang, J. Tang, and E. Cambria, “Learning word representations for sentiment analysis,” Cognitive Computation, vol. 9, no. 6, pp. 843–851, Dec 2017.
[26] R. Leonard, “A database for speaker-independent digit recognition,” in ICASSP ’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, March 1984, pp. 328–331.
[27] “AN4 dataset,” http://www.speech.cs.cmu.edu/databases/an4/, accessed on: Oct. 2018.
[28] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, and D. Pallett, “DARPA TIMIT acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, p. 27403, 01 1993.
[29] J. Garofalo, D. Graff, D. Paul, and D. Pallet, “WSJ dataset,” https://catalog.ldc.upenn.edu/LDC93S6A, Linguistic Data Consortium, Philadelphia. Accessed on: Oct. 2019.
[30] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 5206–5210.
[31] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014.
[32] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using LSTMs,” CoRR, vol. abs/1502.04681, 2015.
[33] E. Santana and G. Hotz, “Learning a driving simulator,” CoRR, vol. abs/1608.01230, 2016.
[34] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a large annotated corpus of English: the Penn treebank,” Comput. Linguist., vol. 19, no. 2, pp. 313–330, Jun. 1993.
[35] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” CoRR, vol. abs/1609.07843, 2016.
[36] M. Mahoney, “About the test data,” http://mattmahoney.net/dc/textdata, accessed on: Oct. 2018.
[37] “WMT’14 dataset,” https://nlp.stanford.edu/projects/nmt/, accessed on: Jan. 2019.
[38] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, June 2011, pp. 142–150.
[39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[40] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, July 2000, pp. 189–194 vol.3.
[41] A. Graves, “Generating sequences with recurrent neural networks,” CoRR, vol. abs/1308.0850, 2013.
[42] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo, “Convolutional LSTM network: a machine learning approach for precipitation nowcasting,” CoRR, vol. abs/1506.04214, 2015.
[43] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[44] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: encoder-decoder approaches,” CoRR, vol. abs/1409.1259, 2014.
[45] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014.
[46] J. Bradbury, S. Merity, C. Xiong, and R. Socher, “Quasi-recurrent neural networks,” CoRR, vol. abs/1611.01576, 2016.
[47] T. Lei, Y. Zhang, and Y. Artzi, “Training RNNs as fast as CNNs,” CoRR, vol. abs/1709.02755, 2017.
[48] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, Mar. 2003.
[49] A. S. Timmaraju, “Sentiment analysis on movie reviews using recursive and recurrent neural network architectures,” Semantic Scholar, 2015.
[50] J. Li and Y. Shen, “Image describing based on bidirectional LSTM and improved sequence sampling,” in 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), March 2017, pp. 735–739.
[51] V. Vukotic, C. Raymond, and G. Gravier, “A step beyond local observations with a dialog aware bidirectional GRU network for spoken language understanding,” in Interspeech, San Francisco, United States, Sep. 2016.
[52] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Jan. 2009.
[53] R. Pascanu, Ç. Gülçehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” CoRR, vol. abs/1312.6026, 2013.
[54] X. Dai, H. Yin, and N. K. Jha, “Grow and prune compact, fast, and accurate LSTMs,” CoRR, vol. abs/1805.11797, 2018.
[55] J. Ott, Z. Lin, Y. Zhang, S. Liu, and Y. Bengio, “Recurrent neural networks with limited numerical precision,” CoRR, vol. abs/1608.06902, 2016.
[56] V. Rybalkin, A. Pappalardo, M. M. Ghaffar, G. Gambardella, N. Wehn, and M. Blott, “FINN-L: library extensions and design trade-off analysis for variable precision LSTM networks on FPGAs,” CoRR, vol. abs/1807.04093, 2018.
[57] E. Stromatias, D. Neil, M. Pfeiffer, F. Galluppi, S. B. Furber, and S.-C. Liu, “Robustness of spiking deep belief networks to noise and reduced bit precision of neuro-inspired hardware platforms,” Frontiers in Neuroscience, vol. 9, p. 222, 2015.
[58] Z. Wang, J. Lin, and Z. Wang, “Accelerating recurrent neural networks: a memory-efficient approach,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2763–2775, Oct 2017.
[59] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. USA: Curran Associates Inc., 2016, pp. 4114–4122.
[60] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” CoRR, vol. abs/1603.05279, 2016.
[61] F. Li and B. Liu, “Ternary weight networks,” CoRR, vol. abs/1605.04711, 2016.
[62] M. Z. Alom, A. T. Moody, N. Maruyama, B. C. V. Essen, and T. M. Taha, “Effective quantization approaches for recurrent neural networks,” CoRR, vol. abs/1802.02615, 2018.
[63] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha, “Alternating multi-bit quantization for recurrent neural networks,” CoRR, vol. abs/1802.00150, 2018.
[64] S. Zhou, Y. Wang, H. Wen, Q. He, and Y. Zou, “Balanced quantization: an effective and efficient approach to quantized neural networks,” CoRR, vol. abs/1706.07145, 2017.
[65] Y. Guo, A. Yao, H. Zhao, and Y. Chen, “Network sketching: exploiting binary structure in deep CNNs,” CoRR, vol. abs/1706.02021, 2017.
[66] M. Courbariaux, Y. Bengio, and J. David, “BinaryConnect: training deep neural networks with binary weights during propagations,” CoRR, vol. abs/1511.00363, 2015.
[67] Z. Li, D. He, F. Tian, W. Chen, T. Qin, L. Wang, and T. Liu, “Towards binary-valued gates for robust LSTM training,” CoRR, vol. abs/1806.02988, 2018. [Online]. Available: http://arxiv.org/abs/1806.02988
[68] S. Wang, Z. Li, C. Ding, B. Yuan, Y. Wang, Q. Qiu, and Y. Liang, “C-LSTM: enabling efficient LSTM using structured compression techniques on FPGAs,” CoRR, vol. abs/1803.06305, 2018.
[69] A. X. M. Chang and E. Culurciello, “Hardware accelerators for recurrent neural networks on FPGA,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.


[70] C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, “DeltaRNN: A power-efficient recurrent neural network accelerator,” in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 21–30. [Online]. Available: https://doi.org/10.1145/3174243.3174261
[71] S. Yin, X. Sun, S. Yu, J. Seo, and C. Chakrabarti, “A parallel RRAM synaptic array architecture for energy-efficient recurrent neural networks,” in 2018 IEEE International Workshop on Signal Processing Systems (SiPS), Oct 2018, pp. 13–18.
[72] A. See, M. Luong, and C. D. Manning, “Compression of neural machine translation models via pruning,” CoRR, vol. abs/1606.09274, 2016.
[73] J. Park, J. Kung, W. Yi, and J. J. Kim, “Maximizing system performance by balancing computation loads in LSTM accelerators,” in 2018 Design, Automation Test in Europe Conference Exhibition (DATE), March 2018, pp. 7–12.
[74] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: efficient speech recognition engine with compressed LSTM on FPGA,” CoRR, vol. abs/1612.00694, 2016.
[75] S. Narang, G. F. Diamos, S. Sengupta, and E. Elsen, “Exploring sparsity in recurrent neural networks,” CoRR, vol. abs/1704.05119, 2017.
[76] S. Narang, E. Undersander, and G. F. Diamos, “Block-sparse recurrent neural networks,” CoRR, vol. abs/1711.02782, 2017. [Online]. Available: http://arxiv.org/abs/1711.02782
[77] X. Dai, H. Yin, and N. K. Jha, “NeST: a neural network synthesis tool based on a grow-and-prune paradigm,” CoRR, vol. abs/1711.02017, 2017.
[78] M. Rizakis, S. I. Venieris, A. Kouris, and C. Bouganis, “Approximate FPGA-based LSTMs under computation time constraints,” CoRR, vol. abs/1801.02190, 2018.
[79] S. Yao, Y. Zhao, A. Zhang, L. Su, and T. Abdelzaher, “DeepIoT: compressing deep neural network structures for sensing systems with a compressor-critic framework,” in Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems, ser. SenSys ’17. New York, NY, USA: ACM, 2017, pp. 4:1–4:14.
[80] Z. Li, C. Ding, S. Wang, W. Wen, Y. Zhuo, C. Liu, Q. Qiu, W. Xu, X. Lin, X. Qian, and Y. Wang, “E-RNN: design optimization for efficient recurrent neural networks in FPGAs,” CoRR, vol. abs/1812.07106, 2018. [Online]. Available: http://arxiv.org/abs/1812.07106
[81] A. Tjandra, S. Sakti, and S. Nakamura, “Tensor decomposition for compressing recurrent neural network,” 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, July 2018.
[82] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Tensorizing neural networks,” CoRR, vol. abs/1509.06569, 2015. [Online]. Available: http://arxiv.org/abs/1509.06569
[83] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned CP-decomposition,” CoRR, vol. abs/1412.6553, 2014.
[84] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
[85] M. Zhu and S. Gupta, “To prune, or not to prune: Exploring the efficacy of pruning for model compression,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, 2018. [Online]. Available: https://openreview.net/forum?id=Sy1iIDkPM
[86] Y. Kim and A. M. Rush, “Sequence-level knowledge distillation,” CoRR, vol. abs/1606.07947, 2016. [Online]. Available: http://arxiv.org/abs/1606.07947
[87] D. Neil, J. Lee, T. Delbrück, and S. Liu, “Delta networks for optimized recurrent network computation,” CoRR, vol. abs/1612.05571, 2016.
[88] L. Cavigelli, P. Degen, and L. Benini, “CBinfer: change-based inference for convolutional neural networks on video data,” CoRR, vol. abs/1704.04313, 2017.
[89] F. Piazza, A. Uncini, and M. Zenobi, “Neural networks with digital LUT activation functions,” pp. 1401–1404 vol.2, Oct 1993.
[90] R. Muscedere, V. Dimitrov, G. A. Jullien, and W. C. Miller, “Efficient techniques for binary-to-multidigit multidimensional logarithmic number system conversion using range-addressable look-up tables,” IEEE Transactions on Computers, vol. 54, no. 3, pp. 257–271, March 2005.
[91] Y. Guan, Z. Yuan, G. Sun, and J. Cong, “FPGA-based accelerator for long short-term memory recurrent neural networks,” in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2017, pp. 629–634.
[92] V. Rybalkin, N. Wehn, M. R. Yousefi, and D. Stricker, “Hardware architecture of bidirectional long short-term memory neural network for optical character recognition,” in Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2017, pp. 1394–1399.
[93] W. Sung and J. Park, “Single stream parallelization of recurrent neural networks for low power and fast inference,” CoRR, vol. abs/1803.11389, 2018.
[94] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, J. K. Kim, V. Chandra, and H. Esmaeilzadeh, “Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks,” CoRR, vol. abs/1712.01507, 2017. [Online]. Available: http://arxiv.org/abs/1712.01507
[95] H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects,” SIGPLAN Not., vol. 53, no. 2, pp. 461–475, Mar. 2018. [Online]. Available: http://doi.acm.org/10.1145/3296957.3173176
[96] A. Samajdar, Y. Zhu, P. N. Whatmough, M. Mattina, and T. Krishna, “SCALE-Sim: Systolic CNN accelerator,” CoRR, vol. abs/1811.02883, 2018. [Online]. Available: http://arxiv.org/abs/1811.02883
[97] C. Chen, H. Ding, H. Peng, H. Zhu, R. Ma, P. Zhang, X. Yan, Y. Wang, M. Wang, H. Min, and R. C. Shi, “OCEAN: an on-chip incremental-learning enhanced processor with gated recurrent neural network accelerators,” in ESSCIRC 2017 - 43rd IEEE European Solid State Circuits Conference, Sept 2017, pp. 259–262.
[98] Z. Zhao, A. Srivastava, L. Peng, and Q. Chen, “Long short-term memory network design for analog computing,” J. Emerg. Technol. Comput. Syst., vol. 15, no. 1, pp. 13:1–13:27, Jan. 2019. [Online]. Available: http://doi.acm.org/10.1145/3289393
[99] D. Maliuk and Y. Makris, “An experimentation platform for on-chip integration of analog neural networks: A pathway to trusted and robust analog/RF ICs,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 8, pp. 1721–1734, Aug 2015.
[100] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H. Yoo, “14.6 A 0.62mW ultra-low-power convolutional-neural-network face-recognition processor and a CIS integrated with always-on Haar-like face detector,” in 2017 IEEE International Solid-State Circuits Conference (ISSCC), Feb 2017, pp. 248–249.
[101] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic, “PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference,” CoRR, vol. abs/1901.10351, 2019. [Online]. Available: http://arxiv.org/abs/1901.10351
[102] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: efficient inference engine on compressed deep neural network,” CoRR, vol. abs/1602.01528, 2016.
[103] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung, “FPGA-based low-power speech recognition with recurrent neural networks,” in Signal Processing Systems (SiPS), 2016 IEEE International Workshop on. IEEE, 2016, pp. 230–235.
[104] J. S. P. Giraldo and M. Verhelst, “Laika: a 5uW programmable LSTM accelerator for always-on keyword spotting in 65nm CMOS,” in ESSCIRC 2018 - IEEE 44th European Solid State Circuits Conference, Sept 2018, pp. 166–169.
[105] S. S. Parkin, M. Hayashi, and L. Thomas, “Magnetic domain-wall racetrack memory,” Science, vol. 320, no. 5873, pp. 190–194, 2008.
[106] M. H. Samavatian, A. Bacha, L. Zhou, and R. Teodorescu, “RNNFast: An accelerator for recurrent neural networks using domain wall memory,” CoRR, vol. abs/1812.07609, 2018.
[107] Y. Wang, H. Yu, L. Ni, G. Huang, M. Yan, C. Weng, W. Yang, and J. Zhao, “An energy-efficient nonvolatile in-memory computing architecture for extreme learning machine by domain-wall nanowire devices,” IEEE Transactions on Nanotechnology, vol. 14, no. 6, pp. 998–1012, Nov 2015.
[108] H. Yu, Y. Wang, S. Chen, W. Fei, C. Weng, J. Zhao, and Z. Wei, “Energy efficient in-memory machine learning for data intensive image-processing by non-volatile domain-wall memory,” in 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2014, pp. 191–196.
[109] J. Chung, J. Park, and S. Ghosh, “Domain wall memory based convolutional neural networks for bit-width extendability and energy-efficiency,” in Proceedings of the 2016 International Symposium on Low Power Electronics and Design, ser. ISLPED ’16. New York, NY, USA: ACM, 2016. [Online]. Available: http://doi.acm.org/10.1145/2934583.2934602


[110] A. J. Annunziata, M. C. Gaidis, L. Thomas, C. W. Chien, C. C. Hung, P. Chevalier, E. J. O’Sullivan, J. P. Hummel, E. A. Joseph, Y. Zhu, T. Topuria, E. Delenia, P. M. Rice, S. S. P. Parkin, and W. J. Gallagher, “Racetrack memory cell array with integrated magnetic tunnel junction readout,” in 2011 International Electron Devices Meeting, Dec 2011, pp. 24.3.1–24.3.4.
[111] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 27–39.
[112] L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A pipelined ReRAM-based accelerator for deep learning,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 541–552.
[113] S. Yu, Z. Li, P. Chen, H. Wu, B. Gao, D. Wang, W. Wu, and H. Qian, “Binary neural network with 16 Mb RRAM macro chip for classification and online training,” in 2016 IEEE International Electron Devices Meeting (IEDM), Dec 2016, pp. 16.2.1–16.2.4.
[114] Y. Long, T. Na, and S. Mukhopadhyay, “ReRAM-based processing-in-memory architecture for recurrent neural network acceleration,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 12, pp. 2781–2794, Dec 2018.
[115] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 14–26.
[116] Q. Cao, N. Balasubramanian, and A. Balasubramanian, “MobiRNN: efficient recurrent neural network execution on mobile GPU,” CoRR, vol. abs/1706.00878, 2017.
[117] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1996.
[118] A. Stillmaker and B. Baas, “Scaling equations for the accurate prediction of CMOS device performance from 180nm to 7nm,” Integration, vol. 58, pp. 74–81, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167926017300755
[119] “Android RenderScript kernel description,” https://developer.android.com/guide/topics/renderscript/compute.html, accessed on: Jan. 2019.
[120] Y. Long, T. Na, and S. Mukhopadhyay, “ReRAM-based processing-in-memory architecture for recurrent neural network acceleration,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp. 1–14, 2018.
[121] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger, “A configurable cloud-scale DNN processor for real-time AI,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), June 2018, pp. 1–14.
[122] J. H. Wilkinson, Rounding Errors in Algebraic Processes. New York, NY, USA: Dover Publications, Inc., 1994.
[123] T. Yang, A. G. Howard, B. Chen, X. Zhang, A. Go, V. Sze, and H. Adam, “NetAdapt: platform-aware neural network adaptation for mobile applications,” CoRR, vol. abs/1804.03230, 2018.
[124] J. L. Hennessy and D. A. Patterson, “A new golden age for computer architecture,” Commun. ACM, vol. 62, no. 2, pp. 48–60, Jan. 2019. [Online]. Available: http://doi.acm.org/10.1145/3282307
[125] “Google Edge TPU,” https://cloud.google.com/edge-tpu/, accessed on: Nov. 2019.

NESMA M. REZK received her bachelor's and master's degrees in computer and systems engineering from the Faculty of Engineering, Ain Shams University, Egypt, in 2010 and 2015, respectively. From 2011 to 2016 she was a teaching and research assistant in the Faculty of Engineering, Ain Shams University. Since 2016, she has been a PhD student in the School of Information Technology at Halmstad University, Sweden. Her research interests include embedded systems, deep learning applications, and the design of domain-specific architectures.

MADHURA PURNAPRAJNA received her Bachelor's degree in Electronics and Communication Engineering from P.E.S Institute of Technology, Bengaluru, in August 1998; her Master's in Electrical and Computer Engineering from the University of Alberta, Canada, in January 2005; and her PhD in Electrical Engineering from the Heinz Nixdorf Institute, University of Paderborn, Germany, in December 2009. Madhura Purnaprajna was a post-doctoral fellow with an International Research Fellowship from the German Research Foundation (Deutsche Forschungsgemeinschaft) and an MHV Fellowship (SNSF), at the Processor Architecture Lab, EPFL, Switzerland, and the High-performance Computing Lab, IISc., Bangalore. She has served as Associate Professor at the Department of Computer Science, School of Engineering, Bengaluru, since February 2015. Her research interests are in Reconfigurable Computing and Processor Architectures.

TOMAS NORDSTRÖM received his M.S.E.E. degree in 1988, his licentiate degree in 1991, and his Ph.D. in 1995, all from Luleå University of Technology, Sweden. His PhD thesis "Highly Parallel Computers for Artificial Neural Networks" bridged the two fields of computer engineering and signal processing, between which he has been moving ever since. Between 1996 and 1999, Tomas Nordström was with Telia Research (the research branch of the major Swedish telephone operator), where he developed broadband Internet communication over twisted copper pair. He also became Telia's leading expert in speaker verification during these years. In December 1999, he joined the FTW Telecommunications Research Center Vienna, Austria, where he worked as a Key Researcher in the field of "broadband wireline access". During his years at FTW, he worked on various aspects of wireline communications such as the simulation of xDSL systems, cable characterization, RFI suppression, exploiting the common-mode signal in xDSL, and more recently, dynamic spectrum management. In 2009 he was appointed Associate Professor in computer systems engineering at Halmstad University (HH), Sweden. At HH he has returned to the area of computer architecture, and his current research interests include all aspects of energy-efficient embedded computers. In 2012 he became full Professor in Computer Engineering at HH and has built up a research group focusing on heterogeneous many-core architectures. Additionally, he has been working in the field of dependable wireless communication, studying optimal usage of communication resources, dynamic spectrum management, and IoT reference architectures. In 2019 he became full Professor in Embedded and Intelligent Systems at Umeå University, Sweden, where his research focuses on edge computing, intelligent IoT systems, and high-performance embedded computing architectures and platforms.


ZAIN-UL-ABDIN completed his PhD degree in Computer Science from Örebro University in 2011 and until recently held an appointment as Associate Professor at Halmstad University. He has played a key role in a number of research projects such as Smart Multicore Embedded Systems (SMECY), High-Performance Embedded Computing (HiPEC), and Towards Next Generation Embedded Systems (NGES). In these projects he has worked in close collaboration with the Swedish Defence Research Agency (FOI) and industrial partners such as ST-Microelectronics, Ericsson, Adapteva, and Saab. He has authored more than 42 journal and conference articles and has been awarded the best paper prize in the PhD forum of the 19th Field-Programmable Logic and Applications conference (FPL’09). He has sat on the technical program committee of several leading conferences and has served as an external reviewer for journals (IJRC, IEEE-TSP, IEEE-Micro, JSP). His main research interests are high-performance embedded computing and parallel programming models.

