Isvlsi2019 SS
Abstract: In the Machine Learning era, Deep Neural Networks (DNNs) have taken the
spotlight due to their unmatched performance in several applications, such as image
processing, computer vision, and natural language processing. However, as DNNs grow in
complexity, their associated energy consumption becomes a challenging problem. This
challenge is heightened for edge computing, where the computing devices are
resource-constrained and operate on a limited energy budget. Therefore, specialized
optimizations for deep learning have to be performed at both the software and hardware
levels. In this paper, we comprehensively survey the current trends of such optimizations
and discuss key open mid-term and long-term research challenges.

Index Terms: pre-processing, pruning, quantization, DNN, accelerator, hardware, software,
performance, energy efficiency, low power, deep learning, neural networks, edge computing,
IoT.
I. INTRODUCTION

Deep Neural Networks (DNNs) have become popular due to (1) the availability of large
datasets, (2) the accessibility of hardware resources for compute-intensive workloads
(like GPGPUs) and (3) open-source Deep Learning libraries. Nowadays, they are widely used
in several applications like image classification [59], detection [58] and segmentation
[66]. Usually, a few years pass between the invention of a novel DNN algorithm and its
successful hardware deployment, as was the case for [34] and [67], implemented in [28].
To achieve highly efficient products, optimizations at different levels of abstraction
are required.

DNNs generally perform better the larger and deeper they are, but this trend demands
continuously increasing complexity from the specialized hardware accelerators designed
for Deep Learning. Currently, there are several use-case scenarios where hardware
acceleration is beneficial for DNNs: (1) offline DNN training in data centers,
(2) inference in data centers, (3) online learning on mobile devices and (4) inference on
mobile devices. While the discussed techniques are beneficial for multiple scenarios, in
this paper we focus mostly on the 4th scenario. Moreover, besides hardware acceleration,
it is extremely important to start from an algorithm that is highly optimized at the
software level, e.g., by reducing the number of DNN inferences through pre-processing,
and by reducing the computations for each inference through pruning and quantization.
Although these kinds of optimizations are transparent when considering inference on edge
devices, they help achieve several orders of magnitude of energy improvement. If matched
with specialized hardware accelerators and optimized dataflows, these improvements will
grow further.

After discussing the main scientific questions and challenges (Section I-A), we survey
the current trends of deep learning for edge computing (Section II). We then present our
methodology (Section III) of different cross-layer optimizations, supported by case study
analyses (Section IV), before raising questions for future deep learning hardware
practitioners about security (Section V) and other open research challenges (Section VI).
An overview of our work is depicted in Figure 1.

[Figure 1 (diagram). Blocks: Input; Pre-Trained Deep Neural Network; Design-Time
Optimizations (Network Pruning, Network Quantization, Dataflow Selection for Efficient
Mapping on the DNN Accelerator); Run-Time Optimization (Division of the image into tiles,
Attention and activity-based tile selection); DNN Hardware Accelerator, which receives
only the selected tiles; Output (e.g., "Stop Sign").]
Fig. 1: Our Flow for Cross-Layer Optimizations for Deep Learning.

*Alberto Marchisio and Muhammad Abdullah Hanif have equal contributions.

A. Key Scientific Questions and Associated Challenges

Low Power & Memory Budget: Performing DNN inference on edge devices, which are typically
resource- and power-constrained, is a challenging task. For example, ResNet-50 [23]
requires more than 95MB of memory to store the weights and more than 3.8 billion
multiplications to process a single image. This amount of processing is infeasible on
edge devices that must return real-time results.

Latency: While mobile voice recognition applications like Apple Siri, Amazon Alexa and
Google Assistant perform their processing in the cloud, other critical applications
(e.g., autonomous vehicles, drones, and wearable healthcare devices) need near-sensor
processing to get a fast response from the DNN. Security and privacy concerns motivate
near-sensor processing as well. Therefore, specialized hardware accelerators are required
to efficiently perform DNN inference at the edge and meet the latency, security, and
privacy requirements.

Accuracy vs. Speed and Efficiency: High-accuracy DNNs are extremely compute- and
memory-intensive. Even though some recent trends hint towards designing DNNs with a small
memory footprint [26] [25], the most promising approach is to compress the DNN by
parameter pruning, sharing and quantization. Several dense DNN accelerators have been
proposed, but when compression optimizations like pruning are applied, sparse DNN
accelerators can achieve better efficiency.

Redundant Operations: DNNs usually contain several redundant operations, like
multiplications with zero, and correlated inputs in
streaming applications. Therefore, we do not necessarily need to process the complete set
of inputs at every stage. A challenging task is finding these redundancies and
eliminating them efficiently.

Memory: Significant cycles and energy may be required for memory data transfer to/from
the computational array, which necessitates efficient memory architectures and data
organization strategies for DNN hardware. Moreover, following the in-memory computing
trends, memristor devices allow resistive memories to be used for analog computation, at
the additional cost of ADC/DAC overhead. Is a CMOS-based design with a traditional memory
hierarchy the optimal solution for DNN processing, or should we adopt in-memory computing
for deep learning at the edge?

Security: Due to outsourcing of training and data dependencies, DNNs possess several
security vulnerabilities that can be exploited to perform security attacks, e.g.,
adversarial examples, backdoors and data poisoning, for confidence reduction (ambiguity
in classification), random or targeted misclassification and model stealing. These
security vulnerabilities raise fundamental challenges, like model privacy and secure
execution of DNNs, that must be addressed to ensure the robustness of DNN-based systems.
Traditionally, pre-processing, data encryption and watermarking are used; however, all
these defenses can be neutralized by sophisticated model stealing or black-box attacks.
Therefore, there is a dire need to develop more sophisticated and efficient defenses to
ensure model privacy and secure execution of the DNNs.

II. CURRENT TRENDS

DNN compression is an attractive solution to reduce the complexity of a given network.
The work of [14] proposed a 3-step method (pruning, quantization and encoding) to
significantly reduce the memory footprint of a given DNN. Network pruning was first used
in [10] to reduce the number of connections. Several different pruning methodologies have
been explored in the literature. Different magnitude-based pruning methods are shown in
Figure 2. Structured pruning [75] employs constraints on some DNN parameters (e.g.,
kernel, filter, channel) to maintain a certain structure. Another approach is to prune
the redundant and least significant weights, regardless of the structure of the DNN
itself [15] [45], and share the weights to reduce the dimensionality [14]. Other
compression methods, based on variational dropout [44], knowledge transfer [24] and
low-rank approximations [70], are promising as well. On the other hand, techniques that
focus on reducing the precision, like quantization [79] [71], binarization [54] and
approximate computing [4] [18], have to leverage the trade-off between accuracy and
efficiency.

Hardware Accelerators: The optimizations at the software level should be supported by
specialized hardware accelerators in a co-design fashion [47] [19]. Recent advances in
datacenter computing for deep learning [27] have inspired accelerators for edge devices.
Specialized accelerators like [5] [28] exploit the concurrency and the parallelism
available in the processing of DNNs, especially for convolutional layers, while [20] also
takes care of the fully-connected layers. These architectures, however, accelerate dense
DNNs and cannot exploit the sparsity introduced by pruning. Therefore, specialized
accelerators for sparse DNNs are required [13] [52]. Challenging aspects of these
accelerators are flexibility, reconfigurability and data reuse [35] [39] [65]. Moreover,
particular types of DNNs, like CapsuleNets [60] and GANs [11], present several
differences in the computation patterns compared to traditional DNNs. These challenges
are addressed by their specialized accelerators. For example, CapsAcc [46] adopts a data
reuse policy to efficiently process the routing-by-agreement algorithm on a systolic
array-based accelerator for CapsuleNets, and GANAX [76] proposes a unified MIMD-SIMD
design for concurrent execution of GANs.

Optimizations for Object Detection: DNNs have been used successfully in a variety of
tasks such as classification and detection. Numerous detectors have been proposed by the
deep learning community, including Faster R-CNN [58], R-FCN [7], YOLO [56] and SSD [40].
These object detectors are separated into two main categories: 1) region-based detectors,
a two-stage approach with a region proposal stage followed by a classifier, and
2) single-shot detectors, which consist of a single Convolutional Neural Network (CNN)
trained to perform object detection. Region-based detectors use a region proposal method,
such as the Selective Search algorithm, to produce regions-of-interest (RoIs) for object
detection. These RoIs are then warped into fixed-size images and fed into a CNN
one-by-one. This process is time consuming due to the large number of RoIs that can be
extracted (about 2000) and processed by a single CNN. Considering a typical 1000 × 600
image, there will be roughly 20000 potential RoIs per image, and different methods, such
as non-maximum suppression (NMS), are applied on the proposed RoIs to reduce their count
to 2000.

Single-shot detectors such as YOLO [56] have shown significant potential, especially for
resource-constrained applications, compared to region proposal approaches by trading
accuracy for real-time processing speed. To this end, single-shot detectors avoid the
multi-stage process by processing the whole image at once. The detector receives an input
image, resizes the image based on the CNN input size and then splits the input image into
a grid, where for each grid cell it generates bounding boxes and class probabilities
based on the number of objects. Thus, the whole image is processed only once, which makes
this approach faster than the region proposal based approach.

III. CROSS-LAYER OPTIMIZATIONS FOR DEEP LEARNING SYSTEMS

Combining the current trends for targeting the above-discussed scientific challenges, we
propose a methodology to apply optimizations across different software and hardware
layers. The flow of our cross-layer methodology (shown in Figure 3) can be summarized in
the following key steps:
• Software-Level Optimizations: The software-level optimizations mainly include network
pruning (Step-1 in Fig. 3) and quantization (Step-2 in Fig. 3) of the parameters. Network
pruning is usually performed iteratively, where, in each iteration, a small number of

[Figure 2 (diagram). Panels: CLASS-DISTRIBUTION (T = 1.5, σ1T = 3.52, σ2T = 2.81),
CLASS-UNIFORM (50%), CLASS-BLIND (50%).]
Fig. 2: Different magnitude-based pruning schemes.
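The three magnitude-based schemes named in Fig. 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of [45]. It assumes that a "class" is one layer's weight array, that CLASS-BLIND ranks all weights together, that CLASS-UNIFORM prunes the same fraction inside every class, and that CLASS-DISTRIBUTION uses a per-class threshold T·σ (the reading suggested by the T and σ labels in the figure). All function names and the toy weights are hypothetical.

```python
import numpy as np

def _prune_smallest(weights, fraction):
    """Zero the `fraction` of entries with the smallest magnitude."""
    k = int(fraction * weights.size)
    if k == 0:
        return weights.copy()
    # Magnitude of the k-th smallest weight; ties at the threshold are pruned too.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

def class_uniform(layers, fraction):
    """Assumed reading: prune the same fraction inside every layer."""
    return [_prune_smallest(w, fraction) for w in layers]

def class_blind(layers, fraction):
    """Assumed reading: rank all weights together, ignoring layer boundaries."""
    flat = np.concatenate([w.ravel() for w in layers])
    pruned = _prune_smallest(flat, fraction)
    out, start = [], 0
    for w in layers:
        out.append(pruned[start:start + w.size].reshape(w.shape))
        start += w.size
    return out

def class_distribution(layers, t):
    """Assumed reading: per-layer threshold t * (standard deviation of the layer)."""
    return [np.where(np.abs(w) >= t * np.std(w), w, 0.0) for w in layers]

# Toy example: one layer with small weights, one with large weights.
layers = [np.array([0.1, -0.2, 3.0, -4.0]),
          np.array([10.0, -20.0, 30.0, 40.0])]
for name, pruned in [("class-blind", class_blind(layers, 0.5)),
                     ("class-uniform", class_uniform(layers, 0.5)),
                     ("class-distribution", class_distribution(layers, 1.0))]:
    print(name, [int(np.count_nonzero(w)) for w in pruned])
```

On these toy layers, class-blind pruning at 50% removes the whole first layer, because all of its weights are small relative to the second layer, whereas class-uniform keeps half of each layer. This is the kind of per-layer imbalance the choice of scheme controls.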
[Figure 8 (diagram). Labels: Pre-trained DNN; Pruning; Quantization; Compressed DNN;
Optimal Pruning: Class_Blind; Optimal Quantization Point; Marchisio et al. [47].]

[Figure (flow diagram). Labels: Software-level Optimizations; Hardware-level
Optimizations; Pre-Trained Neural Network; Training & Validation sets; 1: Network
Pruning, Retraining; 2: Quantization, Retraining; 4: Hardware Accelerator Design;
5: Hardware Approximations; Design-Time; Run-Time; Stop Sign.]

[Figure 9 (plot). x-axis: Type of Multiplier used in DNN Inference Hardware (Accurate
Multiplier, Approximate Multiplier 1, Approximate Multiplier 2, Approximate
Multiplier 3); y-axes: Accuracy [%] and MED.]
Fig. 9: Effects of approximations in the multipliers of a DNN inference hardware on the
classification accuracy of the LeNet network when used for the CIFAR-10 dataset [16].
MED represents the Mean Error Distance of a multiplier computed using a uniform input
distribution.
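The MED metric in the Fig. 9 caption can be made concrete with a short sketch. It assumes 8-bit unsigned operands and enumerates every operand pair, which is what a uniform input distribution amounts to. The approximate multiplier used here is hypothetical: it simply drops the low-order bits of both operands. The multipliers evaluated in Fig. 9 are not specified in this text, so `truncated_multiply` is only a stand-in.

```python
import numpy as np

def truncated_multiply(a, b, dropped_bits):
    """Hypothetical approximate multiplier: clear the low bits of both operands."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (b & mask)

def mean_error_distance(approx_mul, bits=8):
    """Mean |exact - approximate| over all pairs of unsigned `bits`-bit operands."""
    values = np.arange(1 << bits, dtype=np.int64)
    a, b = np.meshgrid(values, values, indexing="ij")
    return float(np.mean(np.abs(a * b - approx_mul(a, b))))

# MED grows as more operand bits are dropped; with 0 dropped bits it is exactly 0.
for dropped in range(4):
    med = mean_error_distance(lambda a, b, d=dropped: truncated_multiply(a, b, d))
    print(f"dropped bits = {dropped}: MED = {med}")
```

Exhaustive enumeration is cheap for 8-bit operands (65536 pairs). For wider multipliers, sampling the operands uniformly at random would be the usual substitute.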
Fig. 8: Combining Pruning and Quantization as software-level optimizations.

multiplications, which require multiply-and-accumulate (MAC) units. A MAC multiplies the
weight and the activation, and updates the partial sum. Depending upon the architecture
of the accelerator and the dataflow mapping strategy (e.g., Weight Stationary, Output
Stationary, Row Stationary¹ [5]), different data reuse scenarios can be exploited, like
weight, input activation, and output activation reuse [20], for convolutional and
fully-connected layers.

¹ Weight Stationary: maximizes convolutional reuse and filter reuse. Output Stationary:
maximizes partial sum accumulation and input feature map reuse. Row Stationary: maximizes
all of these.

A DNN hardware accelerator can further benefit from the network sparsity introduced by
pruning. For this reason, specialized designs are needed for handling relative indexing
and skipping multiplications in which one of the operands is equal to zero. The load
imbalance problem can be mitigated by the utilization of queues [13]. The dataflow (e.g.,
PlanarTiled-InputStationary-CartesianProduct-sparse [52]) has to manage the coordinates
of all the nonzero weights, input and output activations. Moreover, the support of
different bit-widths can be handled by having flexible-size processing elements [65].

DNN Inference at the Edge: Accurate or Approximate? DNNs are considered to be inherently
error-resilient [38] and, therefore, can leverage approximate computing to achieve
significant efficiency gains at the cost of a minor accuracy loss that may be compensated
through re-training. The efficiency gain per unit of accuracy drop depends on the
error-resilience of the DNN, which in turn depends on the type of application and other
characteristics of the DNN [38]. Applications like image classification, which generate
only one output (i.e., the class of the image) per input sample, are considered to be
more error-resilient than applications like object detection, which produce more
sophisticated output. Various techniques based on fault/noise injection have been
proposed to evaluate the error-resilience of DNNs [55] [18]. These techniques help in
quantifying the amount of approximation that can be applied in a DNN.

Approximations can be employed both at the hardware and the software level. However, here
we mainly talk about hardware-level approximations, because the pruning and the
quantization techniques, perfect examples of software-level
optimizations/approximations, have already been discussed in Section IV-B. Hardware-level
approximations include architecture- and circuit-level simplifications. These types can
further be classified into data and functional approximations [63], where data
approximations refer to approximations in data storage [61] (i.e., memories) and
functional approximations refer to approximations in the functionality of the processing
units [17] [62] [57].

In memories, aggressive voltage scaling is one of the most prominent approaches that can
lead to significant efficiency gains. Towards this, Kim et al. [33] proposed MATIC, a
memory-adaptive training approach that enables aggressive voltage scaling of accelerator
weight memories. Most of the works towards hardware-level approximation in DNN-based
systems have been carried out in approximating computational modules of the DNN
accelerators, like adders and multipliers. A few prominent works include [73] [78] [49]
[50]. To highlight the impact of approximations in the multipliers used for DNN
inference, Fig. 9 shows how the accuracy of the LeNet network for the CIFAR-10 dataset
decreases when the approximation level of the multipliers is increased. To compensate for
the accuracy loss due to approximations, the work in [73] proposed to incorporate
approximations in the forward pass of the training process to tune the network for the
introduced approximations.

All the above-mentioned approaches result in some accuracy loss and, therefore, can
hardly be used in any safety-critical application because of their stringent accuracy
constraints. To address this, we proposed CANN [16], an approach where curable
approximations are applied in the system such that approximation errors introduced by one
module are completely cured by the subsequent module(s) while ensuring the efficiency
gains of approximate computing.

In-Memory Computing: The main operations in state-of-the-art DNNs are vector-matrix and
matrix-matrix multiplications, which are highly data intensive. Therefore, the memory
access latency and access energy can potentially become the critical bottlenecks.
In-memory computing and near-memory computing have emerged as promising paradigms for
addressing such bottlenecks. Several architectures have been proposed which make use of
ReRAM crossbars for realizing in-memory computing, i.e., performing computations where
the data is stored. PRIME [6] reported around 895x efficiency gains in the overall energy
consumption of the accelerator compared to the then state-of-the-art. PIPELAYER [68]
proposed a hardware architecture for improving the overall throughput of ReRAM crossbar
based accelerators. However, some practical issues associated with ReRAM crossbars, when
used for computations, prevent offline-trained networks from performing as expected on
such accelerators. To address these issues, a device variability-aware training
methodology has recently been proposed in [42], which trains a network while adding
stochastic noise to the parameters of the network. The noise is modeled based on the
variation characteristics of the hardware and, therefore, helps in maintaining high
accuracy even when there are significant variations in the network parameters because of
the device variations.

V. MACHINE LEARNING SECURITY

Recently, security for machine learning, especially in DNNs, has become one of the prime
challenges to ensure robustness of DNN-based systems. This is because these systems are
highly vulnerable to data poisoning, model stealing and adversarial examples [64] [21].
In this section, we present a brief overview of the recent advancements in security
attacks and corresponding defenses for DNNs (Fig. 10).
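To make the notion of an adversarial example concrete, the sketch below applies the Fast Gradient Sign Method (FGSM, one of the attacks listed in Section V-A) to a toy logistic-regression classifier rather than a DNN. The weights, the input, and the step size `eps` are made-up values chosen for illustration. For a real network, the input gradient would come from back-propagation instead of the closed form used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model (toy stand-in for a DNN)."""
    return sigmoid(np.dot(w, x) + b)

def fgsm(w, b, x, y, eps):
    """One FGSM step: move every input feature by eps in the direction that raises the loss."""
    # For binary cross-entropy with p = sigmoid(w.x + b), dLoss/dx = (p - y) * w.
    grad_x = (predict_proba(w, b, x) - y) * w
    # Keep the perturbed input inside the valid [0, 1] range.
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

# Illustrative values only (assumptions, not taken from the paper).
w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.6, 0.3, 0.5])   # clean input, true label 1
y = 1.0

x_adv = fgsm(w, b, x, y, eps=0.2)
print("clean input      :", x, "-> p(class 1) =", predict_proba(w, b, x))
print("adversarial input:", x_adv, "-> p(class 1) =", predict_proba(w, b, x_adv))
```

Each feature changes by at most `eps`, yet the predicted probability crosses 0.5 and the toy classifier changes its decision. This gradient computation requires knowledge of the model, which is why FGSM is usually described as a white-box attack.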
[Figure 10 (diagram). (a) Attacks across the training, hardware implementation and
inferencing phases: data poisoning, backdoors and neural Trojans; white-box attacks
(e.g., FGSM, JSMA, TrISec [29]); black-box attacks (score-based, gradient-based,
decision-based [32]); DNN defenses (noise filtering, quantization, Sobel filtering, e.g.,
QuSecNets [30], FAdeML [31]). (b) TrISec attack [29]: target image, DNN structure and
parameters, perceptibility analysis (CR & SSI), back-propagation algorithm to generate
attack noise, attack image. QuSecNets [30]: noise vs. quantization accuracy analysis,
quantization parameter selection, quantization integrated into the defended DNN model.
FAdeML [31]: noise filtering accuracy analysis, noise filters integrated with the DNN.]
Fig. 10: (a) An overview of the different security attacks and corresponding defense strategies for machine learning, especially DNNs. (b) An example of our
training dataset-unaware imperceptible attack (e.g., TrISec [29]) and pre-processing based defense strategies (e.g., QuSecNets [30], FAdeML [31]).
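The two pre-processing defenses named in the caption, input quantization and low-pass noise filtering, can be illustrated on a toy image. This is a sketch of the general idea only, not the QuSecNets [30] or FAdeML [31] pipelines. The number of quantization levels, the 3x3 mean filter, the random "image", and the function names are all assumptions.

```python
import numpy as np

def quantize_input(image, levels):
    """Snap pixel values in [0, 1] to `levels` evenly spaced values."""
    return np.round(image * (levels - 1)) / (levels - 1)

def mean_filter_3x3(image):
    """Low-pass filter: average each pixel with its 8 neighbours (edge-padded)."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros_like(image, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

rng = np.random.default_rng(0)
clean = quantize_input(rng.random((8, 8)), levels=4)   # toy image already on the 4-level grid
noise = rng.uniform(-0.1, 0.1, size=clean.shape)       # stand-in for small adversarial noise
noisy = np.clip(clean + noise, 0.0, 1.0)

# Quantization: noise smaller than half a quantization step (1/6 here) is removed exactly.
print("max error after quantization:", np.max(np.abs(quantize_input(noisy, 4) - clean)))

# Filtering: averaging attenuates pixel-wise noise, but it also blurs the clean image.
print("mean |noise| before filter:", np.mean(np.abs(noise)))
print("mean |noise| after filter :", np.mean(np.abs(mean_filter_3x3(noise))))
```

Both steps sit in front of the unchanged DNN, which is what makes them attractive as add-on defenses. A perturbation larger than half a quantization step, or one crafted with the filter in mind, would survive this pre-processing.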
A. Security Attacks:

Security vulnerabilities in DNNs can be exploited at different development phases of
DNN-based systems, i.e., training, hardware implementation and inferencing. During DNN
training, an attacker can either exploit the training dataset by introducing small
patterned noise, add specially crafted backdoors [12], or modify the DNN structure and
parameters (i.e., neural Trojans [41]), to train the DNN for particular noise patterns or
backdoor triggers. In these kinds of attacks, the attacker requires complete access to
either the training dataset or the DNN structure. In most cases, the specially crafted
attack noise is perceptible and may or may not contain hidden backdoor triggers.
Therefore, these attacks can be detected during DNN inference using subjective and
objective analysis [29].

On the other hand, attacks during DNN inferencing do not depend on the whole training
dataset but may require some samples from it. In these attacks, the attacker can exploit
both black-box² and white-box³ settings to generate adversarial examples for
misclassification or confidence reduction [51]. The attack noise generated by such
attacks may or may not be imperceptible, depending upon the attack algorithm. Based on
the attack algorithm and the optimization used to ensure imperceptibility, several
adversarial attacks have been proposed [77], e.g., the limited-memory
Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) attack, the Fast Gradient Sign Method (FGSM),
and the Jacobian-based Saliency Map Attack (JSMA). However, these attacks do not consider
subjective and objective analysis for ensuring imperceptibility.

² Black-Box Attack: the attacker does not have any access to the trained DNN structure or
the output probabilities.
³ White-Box Attack: the attacker has access to the DNN structure, which the attacker can
exploit but cannot modify.

To address this limitation, the training data-unaware imperceptible attack TrISec was
recently proposed [29]. This attack leverages the back-propagation algorithm to perform
targeted misclassification and incorporates subjective and objective analysis, i.e.,
correlation coefficient and structural similarity, into its attack algorithm to ensure
imperceptibility under the white-box setting. Note that all the inference-time attacks
under the white-box setting can be used in the black-box setting by combining them with
model stealing attacks [30]. Similarly, several attacks under the black-box setting have
been proposed, e.g., decision-based [32] and score-based attacks, which either use a
search algorithm or perform statistical analysis to estimate the behavior of the DNN.

B. Defenses:

Several security defense strategies have been proposed based on the different threat
models and the development phases of DNN-based systems. Backdoor and training data
poisoning attacks can be neutralized by using pruning [12] or retraining the DNN on a
subset of the dataset. An attacker can counter these defense strategies by applying
pruning or weight sensitivity analysis before adding the neural Trojans or before
training the DNN for backdoors [12]. Several defense strategies have been proposed to
counter adversarial attacks [77], e.g., DNN masking, gradient masking, adversarial
learning, generative adversarial network based defense, data augmentation, and
pre-processing of the input data (e.g., noise filtering [31], quantization [30]) to
detect the adversarial noise, remove it, or make it perceptible. For example, it has
recently been shown that even low-pass noise filtering at the input of the DNN can
neutralize adversarial examples if the attacker is unaware of it [31]. Note that all of
these defense strategies can be countered by decision-based attacks, which can estimate
the DNN behavior even in masked DNNs under the black-box setting [1]–[3], [32].

VI. OPEN RESEARCH CHALLENGES

In the following, we list some key research challenges which can potentially have a huge
impact on improving the efficiency of deep learning for edge computing.

• Hardware Software Co-Design: A common trend is to optimize the DNN for achieving high
accuracy, without caring much about the underlying hardware complexity and energy
consumption of a computing device. On the other hand, hardware designers have to
implement a-posteriori architectures to exploit the software-level optimizations.
However, hardware-aware software-level optimizations, e.g., for DNN architecture
exploration [69] or compression [43], are promising and need further efforts to succeed.
• In-Memory Computing: It seems to be a promising paradigm for developing accelerators
that can offer orders of magnitude of energy-efficiency gains compared to conventional
CPU and GPU based systems. However, the high variation characteristics associated with
ReRAM and other non-volatile memories prevent the accelerators based on them from
offering precise functionality. Towards this, the multi-level cell (MLC) ReRAM technology
has to become mature enough to offer reasonable precision while offering high data
density. Also, a significant amount of work is required to develop methods which can be
used to train networks such that they offer high accuracy even when operated on NVM-based
in-memory computing devices.
• Hardware-Aware Hyperparameter Tuning and DNN Architectural Exploration: Several
software-level optimization techniques have been proposed which highlight that sparse
DNNs, i.e., DNNs having fewer parameters, can also offer nearly the same level of output
accuracy as dense DNNs. Systematic methodologies are required which, while being aware of
the underlying hardware architecture and the system, can tune the network such that it
offers near-optimal energy and performance efficiency while maintaining the baseline
accuracy.
• Event-based Spiking Neural Networks: They have the potential to be much more
energy-efficient than digital DNNs, because power is only consumed when a spike is
firing. Such event-driven processing is promising. Therefore, companies like IBM and
Intel are investing in their respective neuromorphic chips and accelerators [48] [9].

REFERENCES

[1] A. Athalye et al. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
[2] A. Athalye et al. On the robustness of the CVPR 2018 white-box adversarial example defenses. arXiv preprint arXiv:1804.03286, 2018.
[3] W. Brendel et al. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
[4] C. Chen et al. Exploiting approximate computing for deep learning acceleration. In DATE, 2018.
[5] Y. H. Chen et al. Eyeriss: A spatial architecture for energy efficient dataflow for convolutional neural networks. In ISCA, 2016.
[6] P. Chi et al. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. ACM SIGARCH Computer Architecture News, 2016.
[7] J. Dai et al. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In NIPS, 2016.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[9] M. Davies et al. Loihi: A neuromorphic manycore processor with on-chip learning. In IEEE Micro, 2018.
[10] X. Dong et al. Learning to prune deep neural networks via layer-wise optimal brain surgeon. arXiv preprint arXiv:1705.07565, 2017.
[11] I. J. Goodfellow et al. Generative adversarial nets. In NIPS, 2014.
[12] T. Gu et al. BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. In IEEE Access, 2019.
[13] S. Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In ISCA, 2016.
[14] S. Han et al. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.
[15] S. Han et al. Learning both weights and connections for efficient neural network. In NIPS, 2015.
[16] M. A. Hanif et al. CANN: Curable Approximations for High-Performance Deep Neural Network Accelerators. In DAC, 2019.
[17] M. A. Hanif et al. QuAd: Design and analysis of quality-area optimal low-latency approximate adders. In DAC, 2017.
[18] M. A. Hanif et al. Error resilience analysis for systematically employing approximate computing in convolutional neural networks. In DATE, 2018.
[19] M. A. Hanif et al. X-DNNs: Systematic Cross-Layer Approximations for Energy-Efficient Deep Neural Networks. In Journal of Low Power Electronics, 2018.
[20] M. A. Hanif et al. MPNA: A Massively-Parallel Neural Array Accelerator with Dataflow Optimization for Convolutional Neural Networks. arXiv preprint arXiv:1810.12910, 2018.
[21] M. A. Hanif et al. Robust machine learning systems: Reliability and security for deep neural networks. In IOLTS, 2019.
[22] H. Harzallah et al. Combining efficient object localization and image classification. In ICCV, 2009.
[23] K. He et al. Deep residual learning for image recognition. In CoRR, vol. abs/1512.03385, 2015.
[24] G. E. Hinton et al. Distilling the knowledge in a neural network. In NIPS, 2015.
[25] A. G. Howard et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[26] F. N. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[27] Z. Jia et al. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
[37] C. H. Lampert et al. Efficient subwindow search: A branch and bound framework for object localization. In TPAMI, 2009.
[38] G. Li et al. Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In HPCA, 2017.
[39] J. Li et al. SmartShuttle: Optimizing off-chip memory accesses for deep learning accelerators. In DATE, 2018.
[40] W. Liu et al. SSD: Single Shot MultiBox Detector. In ECCV, 2016.
[41] Y. Liu et al. Neural trojans. In IEEE ICCD, 2017.
[42] Y. Long et al. Design of Reliable DNN Accelerator with Un-reliable ReRAM. In DATE, 2019.
[43] Q. Lou et al. AutoQB: AutoML for Network Quantization and Binarization on Mobile Devices. arXiv preprint arXiv:1902.05690v1, 2019.
[44] C. Louizos et al. Bayesian compression for deep learning. In NIPS, 2017.
[45] A. Marchisio et al. PruNet: Class-Blind Pruning Method For Deep Neural Networks. In IJCNN, 2018.
[46] A. Marchisio et al. CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse. In DATE, 2019.
[47] A. Marchisio et al. HW/SW co-design and co-optimizations for deep learning. In INTESA@ESWEEK, 2018.
[48] P. A. Merolla et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. In Science, 2014.
[49] V. Mrazek et al. Design of power-efficient approximate multipliers for approximate artificial neural networks. In ICCAD, 2016.
[50] V. Mrazek et al. AutoAx: An Automatic Design Space Exploration and Circuit Building Methodology utilizing Libraries of Approximate Components. In DAC, 2019.
[51] N. Papernot et al. Practical black-box attacks against machine learning. In ACM CCS, 2017.
[52] A. Parashar et al. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
[53] G. Plastiras et al. Efficient ConvNet-based object detection for unmanned aerial vehicles by selective tile processing. In ICDSC, 2018.
[54] M. Rastegari et al. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
[55] B. Reagen et al. Ares: A framework for quantifying the resilience of deep neural networks. In DAC, 2018.
[56] J. Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. In CVPR, 2016.
[57] S. Rehman et al. Architectural-space exploration of approximate multipliers. In ICCAD, 2016.
[58] S. Ren et al. Faster R-CNN: towards real-time object detection with region proposal networks. In CoRR, vol. abs/1506.01497, 2015.
[59] O. Russakovsky et al. Imagenet large scale visual recognition challenge. In International Journal of Computer Vision, 2015.
[60] S. Sabour et al. Dynamic routing between capsules. In NIPS, 2017.
[61] F. Sampaio et al. Approximation-aware multi-level cells STT-RAM cache architecture. In CASES, 2015.
[62] M. Shafique et al. A low latency generic accuracy configurable adder. In DAC, 2015.
[63] M. Shafique et al. Cross-layer approximate computing: From logic to architectures. In DAC, 2016.
[64] M. Shafique et al. An overview of next-generation architectures for machine learning: Roadmap, opportunities and challenges in the IoT era. In DATE, 2018.
[65] H. Sharma et al. Bit Fusion: Bit-level Dynamically Composable Architecture for Accelerating Deep Neural Network. In ISCA, 2018.
[66] E. Shelhamer et al. Fully convolutional networks for semantic segmentation. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[67] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In CoRR, vol. abs/1409.1556, 2014.
[68] L. Song et al. Pipelayer: A pipelined reram-based accelerator for deep learning. In HPCA, 2017.
[69] D. Stamoulis et al. HyperPower: Power- and memory-constrained hyper-parameter optimization for neural networks. In DATE, 2018.
[70] C. Tai et al. Convolutional neural networks with low-rank regularization. In CoRR, vol. abs/1511.06067, 2015.
[71] F. Tung and G. Mori. Clip-q: Deep network compression learning by in-parallel
[28] N. P. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. pruning-quantization. In CVPR, 2018.
In ISCA, 2017. [72] J. R. Uijlings et al. Selective Search for Object Recognition. In IJCV, 2014.
[29] F. Khalid et al. TrISec: training data-unaware imperceptible security attacks on [73] S. Venkataramani et al. AxNN: energy-efficient neuromorphic systems using
deep neural networks. In IOLTS, 2019. approximate computing. In ISLPED, 2014.
[30] F. Khalid et al. QuSecNets: Quantization-based Defense Mechanism for Securing [74] P. Viola and M. J. Jones. Robust real-time face detection. In IJCV, 2004.
Deep Neural Network against Adversarial Attacks. In IOLTS, 2019. [75] W. Wen et al. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[31] F. Khalid et al. FAdeML: understanding the impact of pre-processing noise filtering [76] A. Yazdanbakhsh et al. GANAX: A Unified SIMD-MIMD Acceleration for
on adversarial machine learning. In DATE 2019. Generative Adversarial Network. In ISCA, 2018.
[32] F. Khalid et al. RED-Attack: Resource efficient decision based attack for machine [77] X. Yuan et al. Adversarial examples: Attacks and defenses for deep learning. In
learning. arXiv preprint arXiv:1901.10258, 2019. IEEE Transactions on neural networks and learning systems, 2019.
[33] S. Kim et al. MATIC: Learning around errors for efficient low-voltage neural [78] Q. Zhang et al. ApproxANN: An approximate computing framework for artificial
network accelerators. In DATE, 2018. neural network. In DATE, 2015.
[34] A. Krizhevsky et al. Imagenet classification with deep convolutional neural [79] A. Zhou et al. Incremental Network Quantization: Towards Lossless CNNs with
networks. In NIPS, 2012. Low-precision Weights. In ICLR, 2017.
[35] H. Kwon et al. MAERI: Enabling Flexible Dataflow Mapping over DNN Accel- [80] C. Zhu et al. Trained ternary quantization. In ICLR, 2017.
erators via Reconfigurable Interconnects. In ASPLOS, 2018.
[36] C. Kyrkou et al. Dronet: Efficient convolutional neural network detector for real-
time UAV applications. In DATE, 2018.