1 Introduction
Deep neural networks (DNNs) have greatly advanced technologies such as image and video understanding, speech recognition, and natural language processing, and are now deployed in safety-critical and security-sensitive domains such as finance, healthcare, and autonomous driving. However, attackers can deceive DNN models into making false predictions by carefully crafting minor perturbations to construct adversarial examples [2, 57, 72]. Such attacks may lead to disastrous consequences in security-sensitive applications. Moreover, physical adversarial attacks have been shown to be even more destructive, making this vulnerability a serious threat in real-world scenarios [23].
To defend against these adversarial attacks, current practice uses dense/sparse DNN models, traditional machine-learning denoising methods, or a combination of both, as summarized in Table 1. For example, MagNet [34] detects adversarial examples by comparing their manifold measurements against those of clean examples. It consists of one or more detectors, which usually contain a reformer and two DNN models. SafeNet [32] combines an additional DNN model with an SVM to detect adversarial examples. Robust-ADMM [66] builds a framework that explores robust and sparse DNNs by combining adversarial training with pruning.
Currently, DNN models are usually deployed on GPUs or domain-specific DNN accelerators. GPU-based solutions provide better performance and versatility, but struggle to achieve cost and execution efficiency. More importantly, naïvely deploying the detection network introduces security issues. Deploying the two networks on two loosely coupled GPUs or accelerators can lead to information leakage [20]. Specifically, the data transmission between the two accelerators (i.e., PCIe traffic) is performed in plain text, which can leak the DNN models and allows an attacker to carry out model-extraction attacks.
Deploying target networks together with detection algorithms poses key problems for current DNN accelerator architectures. First, pioneering works that defend against adversarial attacks within DNN accelerators [9, 10, 45, 62], dubbed defensive DNN accelerators, are still in their infancy. These accelerators often rely on a fixed or specific type of adversarial-example defense, such as NASGuard [62] and the 2-in-1 accelerator [9], and thus inevitably compromise acceleration efficiency for sparse DNN defenses or other adversarial defense algorithms. Second, although existing DNN accelerators (e.g., multi-tenant accelerators) can support multi-model parallel execution and achieve better energy efficiency, they cannot meet the computational diversity required by various adversarial-example defense methods. For example, they cannot effectively support sparse or special-purpose defenses such as machine-learning denoising.
To tackle the above challenges, we propose sDNNGuard, an elastic heterogeneous DNN accelerator architecture that efficiently orchestrates the simultaneous execution of the original (target) DNN and the defense algorithm or network that detects adversarial-example attacks. The design of the sDNNGuard architecture is based on a thorough analysis of major adversarial-example defense approaches and a deep understanding of their fundamental requirements for compute resources, on-chip data sharing, and task synchronization. To satisfy these requirements, sDNNGuard adopts a tightly coupled heterogeneous architecture with a CPU core and an elastic DNN accelerator to support diverse adversarial-example defense methods. The elasticity enables the target and detection networks to run simultaneously in one accelerator, while the CPU core efficiently performs non-DNN computing, special neural-network layers, and the conversion of sparse storage formats for weights and activation values. To the best of our knowledge, no previous work provides a cost-effective DNN accelerator architecture for adaptive dense and sparse DNN adversarial-example defense methods. In summary, this article makes the following key contributions.
— A hybrid sparse-dense defensive DNN accelerator architecture is proposed for the first time, capable of accelerating not only dense DNN detection algorithms but also sparse DNN defense methods and other mixed workloads (e.g., dense-dense and sparse-dense) to fully exploit the benefits of sparsity.
— A new hardware abstraction is introduced to achieve a dynamic resource-scheduling mechanism. Partitioned hardware resources serve as the new abstraction, allowing different DNN models to be mapped to multiple logical DNN accelerators in the most cost-effective way. The compiler uses the abstraction for coarse-grained resource mapping, and the scheduler uses it for flexible resource allocation at run time.
— An elastic on-chip buffer management mechanism is proposed to fully exploit data locality—the detection network can efficiently access intermediate data of the target network without off-chip accesses—and to guarantee the access order required by the detection mechanism.
— An elastic PE computing resource management mechanism is proposed to ensure that the detection network executes faster than the target network, identifying possible attacks in time to avoid false predictions. Moreover, it maximizes the utilization of computing resources.
— An extended AI instruction set is proposed to support: (1) the synchronization of the two networks and their efficient data interaction; (2) inter-core communication mechanisms that facilitate efficient communication and task scheduling; and (3) the resource allocation of the two networks and mixed dense/sparse mode selection.
3 sDNNGuard Overview
3.1 Tightly Coupled Heterogeneous Architecture
To address the challenges discussed above, we propose sDNNGuard, a heterogeneous architecture consisting of an elastic DNN accelerator and a CPU core, which can effectively support various adversarial defense methods. With the tightly coupled design of the accelerator architecture and the CPU core, sDNNGuard is both versatile and adaptable. The overall architecture of sDNNGuard is shown in Figure 2. Specifically, the elastic on-chip buffer and elastic PE array allocate different hardware resources to the target network and the detection network. In addition, dual MAC switches with selectors choose the activation values or weights of a DNN input and support sparse configurations. The reconfigurable interconnect fabric enables flexible and dynamic connections between the buffers and PE arrays of the two networks. The CPU core mainly executes the special computing units and converts sparse storage formats for weights and activation values.
The global buffer is key to providing efficient data communication. Since the CPU and the DNN accelerator are integrated on one chip, the straightforward way for them to communicate is through shared off-chip DRAM accessed via a bus, but this leads to significant energy consumption and performance overhead. The read/write performance of the global on-chip buffer is over 80× higher than that of off-chip DRAM accesses. Thus, in sDNNGuard, a global on-chip buffer with an Address Translation Unit (ATU) tightly couples the CPU and the accelerator, enabling fast communication between them and within the accelerator.
3.2 Hardware Abstraction of Logical DNN Accelerators
To meet the hardware requirements of the two networks, this article proposes a new hardware abstraction of logical DNN accelerators that can be constructed dynamically to match the different dimensions of layers. The new hardware abstraction serves as a middle layer that decouples the physical resources from their fixed mapping. In other words, the on-chip buffer and PE array are decoupled and partitioned into multiple homogeneous on-chip buffer (GLB) slices and PE array slices.
The key components implementing the new hardware abstraction are the reconfigurable interconnect fabric, the elastic on-chip buffer, and the elastic PE array. The reconfigurable interconnect fabric adopts techniques similar to those used in GANPU [24] and Maeri [28]. A set of banks within the GLB is reconfigured as private memories for groups of PEs, as shown in Figure 4, and the compiler is responsible for remapping the address of each bank. The reconfigured memory structure thus not only minimizes computation stalls but also maximizes buffer-level parallelism. Figure 6 shows the reconfigurability of the PE array, which handles the computational requirements of different workloads. When a batch of PEs on sDNNGuard is idle, we can limit the bandwidth of each PE and save power by disabling them.
A logical DNN accelerator comprises one or more groups of GLB slices and PE array slices, and logical DNN accelerators are constructed dynamically by allocating elastic hardware resources at runtime. Multiple logical DNN accelerators can process multiple layers at the same time, with each logical DNN accelerator assigned to a different layer. The reconfigurability of sDNNGuard can thus meet the varied computational requirements of sparse or dense adversarial-example defense methods.
3.3 Task Data Communication and Sharing
Existing DNN accelerators typically use a static, one-way on-chip buffer design. For example, synaptic weight buffers (SB) and input neuron buffers (NBin) can only be used as input caches for PEs, while output neuron buffers (NBout) are only used as output caches. Unfortunately, the static one-way buffer is inefficient for supporting data communication between the target network and the detection mechanism, because the DNN accelerator must spill OFmaps to off-chip DRAM, which consumes more off-chip memory bandwidth.
To address this problem, an elastic on-chip buffer management mechanism is proposed to enable the detection mechanism to reuse shared OFmaps. It also effectively reduces data transmission and improves data communication efficiency between the target network and the detection mechanism. The elastic buffer management mechanism has the following three characteristics.
Improved local data reuse. Compared with the static one-way buffer structure, the elastic buffer management mechanism allows output buffers to be converted into input buffers through a buffer switch, avoiding data movement or copying. It enables the detection mechanism to access shared OFmaps in a timely manner with better communication efficiency, thereby reducing computation time and DRAM access latency.
Support for concurrent tasks. We design a new on-chip buffer access mode for the elastic on-chip buffer management mechanism, which effectively supports two new scenarios. When the on-chip buffer stores the entire OFmaps of the target network, the detection mechanism and the target network can access the shared OFmaps simultaneously. If not, the detection mechanism takes priority: the target network is suspended until the detection mechanism has finished reading its intermediate data.
Enforcing Read-After-Write dependence between tasks. Because the buffer is shared between the target network and the detection mechanism, we must ensure that OFmaps are read by the detection mechanism before they are updated by the target network. Empty/full buffer status registers are added to enforce these RAW dependencies. These status registers are monitored by the hardware scheduler at runtime, which switches buffer reads and writes to schedule task execution and thus ensures the correct data access order.
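As a behavioral sketch of this RAW rule (the class and method names are illustrative, not taken from the architecture itself), a full flag gates the writer and releases on the reader's consume:

```python
# Illustrative model of the empty/full status register that enforces the
# Read-After-Write dependence: the target network (producer) may only
# overwrite a shared bank after the detection network (consumer) has read it.

class SharedBank:
    def __init__(self):
        self.full = False   # 'full' register: set on write, cleared on read
        self.data = None

    def producer_write(self, ofmap):
        # Target network stalls while the previous OFmap is unread.
        if self.full:
            raise RuntimeError("stall: detection network has not read yet")
        self.data = ofmap
        self.full = True    # scheduler observes 'full' and wakes the reader

    def consumer_read(self):
        # Detection network may only read a bank marked 'full'.
        if not self.full:
            raise RuntimeError("stall: no OFmap available")
        ofmap = self.data
        self.full = False   # clearing 'full' releases the bank to the writer
        return ofmap

bank = SharedBank()
bank.producer_write([1, 2, 3])
assert bank.consumer_read() == [1, 2, 3]
assert not bank.full        # bank is free for the next layer's OFmap
```

In hardware this check is performed by the scheduler rather than by the producer itself; the sketch only captures the ordering guarantee.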
3.4 Hybrid Sparse-Dense Computation of Target Network and Detection Network
Existing sparse DNN accelerators [13, 40] aim at squeezing out zero values during DNN inference, without considering adversarial robustness. In addition, exploiting the sparsity of both weights and activation values requires a more complex hardware design, leading to significant area and energy overhead. S2TA [31] quantitatively shows that the additional buffers required to exploit both forms of sparsity increase the energy per MAC by 71% compared with a baseline dense (systolic-array) accelerator. To design an accelerator that supports both sparse and dense workloads, the most straightforward approach is to combine the carefully designed dense processing elements and sparse dataflows of previous works. But zero-skipping logic precludes many opportunities for energy-efficient data reuse due to its irregular data access patterns, which makes such designs difficult to implement with minimal overhead on dense accelerator architectures. In addition, the choice of compression format greatly affects storage size and memory accesses, and also influences the performance of the PE array. Thanks to the tightly coupled heterogeneous architecture, the CPU core can replace these complex hardware designs on top of the dense accelerator architecture, implementing the sparse control logic and the conversion of sparse storage formats for weights and activation values.
A sparse storage format is required for sparse DNN computations, facilitating efficient parallel data access and load balancing. Figure 3 shows the overhead of various compression formats at different levels of sparsity. Since each non-zero element needs an index of log2(dimension) bits, the data-transfer overhead of the coordinate (COO), compressed sparse column (CSC), and compressed sparse row (CSR) formats varies greatly with the level of sparsity. In addition, these sparse storage formats and their variants can result in low memory bandwidth utilization, because they place the data required by different PEs at distant locations in memory. Bitmap, in contrast, has the lowest memory footprint at all sparsity levels. sDNNGuard therefore uses a Bitmap format, where each bit indicates whether its corresponding element is zero or non-zero. To minimize the hardware overhead of sparse control, a natural choice is to let the CPU core operate on the bitmap metadata and calculate how many MACs are needed. Sparse buffers can also be abstracted and customized through the elastic on-chip management mechanism.
3.5 Task-Level Synchronization and Scheduling
Data coordination and communication are needed between the target and detection networks during execution. The two networks compete for the on-chip buffers and computing resources of the DNN accelerator. It is therefore necessary to ensure that the detection network obtains sufficient resources to process the data of the target network in time. In fact, different target networks have different PE utilization ratios, and even different layers in the same network differ, which provides an opportunity to dynamically allocate hardware resources across layers. When PEs are over-allocated to one layer of the target network, additional scheduling and synchronization mechanisms are required to ensure that the detection network has sufficient resources to process the data from the target network. Otherwise, the data will be lost and the detection of an adversarial example cannot be completed.
Due to the sequential and deterministic nature of the processing flow of the detection and target networks, we design a new scheduler within the DNN accelerator and introduce an extended AI instruction set that dynamically allocates PE and on-chip buffer resources, instead of a complex handshaking mechanism, to synchronize and schedule tasks. The instruction set dynamically schedules the global on-chip buffer and PE computing resources. It also works with the event queue to enable efficient task scheduling and communication between the CPU core and the DNN accelerator.
4 Implementation
4.1 Scheduler of sDNNGuard
The scheduler is responsible for receiving and parsing the extended AI instruction set and for processing the monitoring status registers and buffer status registers. It consists of an instruction fetch and decode unit, a register file (storing instructions temporarily), two issue units, event-queue communication control logic, synchronization control logic, and so on. At run time, the scheduler leverages the hardware abstraction to decide which hardware resources to allocate and when to execute. Furthermore, it handles task queue management and the event-queue communication mechanism between the CPU core and the elastic DNN accelerator (ATU; receiving, decoding, sending, etc.).
Instruction fetch and decode unit. The instruction fetch unit includes components such as an instruction cache and a program counter that tracks the address of the next instruction to be fetched. To improve fetch performance, we also employ a prefetching technique in which multiple instructions are fetched ahead of time. Fetched instructions are stored in an instruction queue or buffer before being decoded, which decouples the fetch and decode stages and helps maintain a steady flow of instructions. In the decode stage, fetched instructions are translated into control signals for the execution units and on-chip buffer control logic, such as the elastic PE array and memory accesses. The scheduler also coordinates the flow of instructions through the pipeline stages, ensuring proper synchronization and handling of hazards such as data dependencies. Overall, the fetch and decode modules work together to fetch instructions from the host CPU, decode them into control signals, and prepare them for the two issue units to execute in the pipeline stages.
Two issue units. We equip the scheduler with two issue units to increase instruction dispatch and improve pipeline utilization, further reducing instruction conflicts and races. With two issue units, the scheduler can better utilize its execution units, filling potential idle cycles that would occur if only one instruction were issued at a time. The dual issue units can also be designed to handle different types of instructions (e.g., one for computing operations and another for memory operations), which helps balance the workload across the accelerator's resources.
Task queue management. Double FIFO modules are used to receive task queue information. As tasks are expressed in an 8-bit format, the capacity and depth of each FIFO are very small; implementing the double FIFO modules is lightweight and their area overhead is negligible. The scheduler in our proposed architecture employs a first-come-first-served (FCFS) strategy for incoming tasks. The dual FIFO design coupled with FCFS ensures a balanced load distribution and minimizes task waiting times, thereby optimizing overall scheduling efficiency.
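A behavioral sketch of this dual-FIFO FCFS queue follows; the 8-bit task width comes from the text, while the drain policy between the two FIFOs and all names are our assumptions:

```python
# Illustrative model of the double FIFO task queue with an FCFS policy.
from collections import deque

class DualFifoScheduler:
    def __init__(self, depth=8):
        self.depth = depth
        self.fifos = [deque(), deque()]   # two small hardware FIFOs

    def enqueue(self, task, which):
        assert 0 <= task < 256            # tasks are expressed in 8 bits
        fifo = self.fifos[which]
        if len(fifo) < self.depth:        # tiny FIFOs: reject when full
            fifo.append(task)
            return True
        return False

    def next_task(self):
        # FCFS approximation: serve the head of the first non-empty FIFO.
        for fifo in self.fifos:
            if fifo:
                return fifo.popleft()
        return None

sched = DualFifoScheduler()
sched.enqueue(0x12, which=0)
sched.enqueue(0x34, which=1)
assert sched.next_task() == 0x12
assert sched.next_task() == 0x34
assert sched.next_task() is None
```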
Monitoring status registers. To effectively monitor the status registers, we adopt two mechanisms: polling and interrupts. First, a polling mechanism checks the status registers at defined intervals. Second, for critical registers (e.g., the status register of the on-chip buffer/bank shared by the two networks), we use an interrupt-driven mechanism to handle changes immediately: the hardware triggers an interrupt when a register's status changes, prompting the scheduler to respond quickly. Meanwhile, the scheduler maintains a table that tracks all status registers and their current states, updated either through polling or through interrupts. By monitoring and handling status register changes in this way, the scheduler maintains system stability, optimizes resource utilization, and ensures a quick response to dynamic conditions in sDNNGuard.
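The combination of the two update paths and the tracking table can be sketched as follows; the register names follow those of the buffer status registers, while the class and update logic are illustrative:

```python
# Illustrative model of the scheduler's status-register table, updated
# either by a periodic poll or by an interrupt handler.

class StatusMonitor:
    def __init__(self, registers):
        # table tracking every status register's last known state
        self.table = {name: 0 for name in registers}

    def poll(self, read_register):
        # polling path: refresh every entry at a defined interval
        for name in self.table:
            self.table[name] = read_register(name)

    def on_interrupt(self, name, value):
        # interrupt path: critical registers update the table immediately
        self.table[name] = value

hw = {"BankWriteFinish": 0, "NBinReadComplete": 1}
mon = StatusMonitor(hw.keys())
mon.poll(lambda name: hw[name])
assert mon.table["NBinReadComplete"] == 1
mon.on_interrupt("BankWriteFinish", 1)   # shared-buffer register changed
assert mon.table["BankWriteFinish"] == 1
```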
4.2 Elastic On-Chip Buffer
4.2.1 Microarchitecture Implementation.
As shown in Figure 4, the elastic on-chip buffer is mainly used to cache the weights and feature maps of the target and detection networks. It is essentially a large global buffer (GLB) consisting of multiple SRAM banks and their corresponding routing networks. We describe each component in detail as follows.
Multi-bank SRAM routing. Each physical bank is uniquely identified by an index (the high address bits of an SRAM bank). The buffer contains an index array that holds the indices of the physical banks; this array is maintained within the control unit and allows quick lookup during memory allocation and access operations. Multiple SRAM banks are dynamically grouped into a set to form a buffer or GLB slice (e.g., SB_target, NBin_target, and NBout_target of the target network). Whereas the static one-way buffer is allocated at the set level, we use bank-level elastic buffer allocation granularity in the DNN compiler, because the best bank utilization of AlexNet at set level in a DNN accelerator is only 51%; bank-level allocation greatly improves on-chip buffer utilization.
Instructions from the host CPU are received by the scheduler, which uses them to form the routing table for index-array switching and the configuration table for on-chip buffer allocation. According to these tables, the status of the bank switches can be set dynamically, without physically moving or copying data. This allows rapid reconfiguration in response to changing computational demands. Moreover, it supports efficient data reuse between the two networks, as well as the new hardware abstraction, through dynamic exchange.
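A simplified model of this table-driven bank routing is sketched below; the names and the sequential allocation policy are illustrative assumptions:

```python
# Illustrative routing table mapping logical buffer slices (e.g.,
# NBout_target) to physical SRAM bank indices. A buffer "switch" relabels
# banks in the table so an output slice becomes an input slice without
# physically moving or copying any data.

class BankRouter:
    def __init__(self, num_banks):
        self.free = list(range(num_banks))   # unassigned physical banks
        self.routing = {}                    # slice name -> bank indices

    def allocate(self, slice_name, n_banks):
        banks, self.free = self.free[:n_banks], self.free[n_banks:]
        self.routing[slice_name] = banks
        return banks

    def switch(self, src_slice, dst_slice):
        # relabel the banks: data stays in place, only the table changes
        self.routing[dst_slice] = self.routing.pop(src_slice)

r = BankRouter(num_banks=8)
r.allocate("NBout_target", 4)
r.allocate("NBin_detect", 2)
r.switch("NBout_target", "NBin_target")   # OFmaps become next-layer IFmaps
assert r.routing["NBin_target"] == [0, 1, 2, 3]
assert r.routing["NBin_detect"] == [4, 5]
```

Because only the table entries change, the switch models the zero-copy output-to-input conversion described in Section 3.3.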
During execution, the buffer size is dynamically allocated to the target and detection networks according to the following policies: (1) the capacity for the target network to store its OFmaps is guaranteed first; (2) allocation is performed according to the sizes of the OFmaps of the target and detection networks; for example, the input buffer of the detection network (NBin_detect) can be small or even omitted; (3) the spatial structure and usage mode of the buffers for the target and detection networks remain unchanged to allow efficient management of the elastic on-chip buffer. We omit the weight buffers (SB_target and SB_detect) in Figure 4 since their working mode is similar.
Multi-bank SRAM management. Each physical bank contains four status registers (read completion, write completion, full, and empty) that expose its usage status. The status of the input and output buffers of the target and detection networks includes NBinReadComplete, BankWriteDoing, BankWriteFinish, NBinUsing, NBinUnusing, and so on. The scheduler regularly polls these status registers to check the state of each bank. This information is used to resolve data dependencies and thereby control the correct execution of the two networks. By reading the status registers, the scheduler determines whether the target network has completed the computation of the current layer and written all OFmaps to NBout_target; if so, it informs both networks to read the shared feature maps simultaneously as the input data of the next layer. Specifically, after the OFmaps are written to NBout_target, the scheduler checks BankWriteFinish to confirm completion and then signals both the target and detection networks to start reading these maps for the next layer's computation.
4.2.2 Typical Scenarios.
As shown in Figure 5, layer i-1 of the target network produces four OFmaps, which become the four IFmaps of layer i ❶. Layer i then generates eight OFmaps. Assume NBin_target and NBout_target have two and four banks, respectively. If each bank stores one feature map, then in one iteration the accelerator can read two IFmaps and compute the convolution results of parts 0~3. Two iterations are thus required, and all four IFmaps must be loaded for the accelerator to compute the four final OFmaps of layer i. Two additional iterations are then required to generate OFmaps 4~7 ❷. The detection network works in a similar way.
On-chip buffer stores all OFmaps of the target network ❶. The shared feature maps can be read by the target network and the detection network simultaneously, without guaranteeing an order between them. This does not affect the operation of the target network, since the NBout_target buffer can store all of its IFmaps. However, the detection network must still process these feature maps in a timely manner.
On-chip buffer stores only partial OFmaps of the target network ❷. Assume OFmaps 0~3 of layer i of the target network have been computed and written to the output buffer NBout_target. The target network must then wait for the detection network to finish reading these feature maps. The scheduler determines whether the detection network has read OFmaps 0~3 by polling the status registers, after which the target network is informed to move the data to off-chip DRAM. The read path of the IFmaps is from DRAM to the input buffer NBin_target, and the remaining OFmaps 4~7 are handled in a similar way.
4.3 Elastic PE Resource Management
To allow arbitrary reconfiguration, Planaria [11] and SARA [47] adopt architecture designs that are overkill and costlier than necessary. Planaria's compilation time for dynamic architecture fission is very long and its reconfiguration overhead is expensive: compiling ResNet101 on a two-node cluster takes hours or even a day and requires running 853k simulations. SARA requires an additional neural network accelerator to recommend how to configure the systolic array at runtime. To avoid such heavy hardware overhead and support the elastic allocation of computing resources, we add three modules to the existing PE structure, shown in Figure 6. They not only amortize the cost of reconfiguration but also effectively alleviate the load imbalance problem. Each module is described as follows:
— Dual MAC Switch (MS) unit: a group of MS units is added at the input port of each PE to select the activation values or weights of a network input. Each selector is associated with a MAC unit and contains control logic that receives sparsity information and uses it to decide whether the MAC should perform a multiplication for a particular input-weight pair. These selectors are specifically designed to support N:M fine-grained sparsity, selectively activating and computing only a subset of elements in a structured manner; this is controlled through a configuration register.
— Adder Switch (AS) unit: an AS unit is added at the output port of each PE to select partial sums or a computational output of a given network. By dynamically selecting which partial sums to pass through, the AS unit optimizes data flow within the accelerator, reducing unnecessary data movement and enhancing overall execution speed.
— Routing Logic unit: it controls the input path selection for partial sums. The routing logic unit guarantees the consistency of data input paths and output routes belonging to the same network, so that the target network and detection network can operate as two threads. It also assigns the source and destination pairs for each MAC according to the distribution routing bits, which are generated by the CPU core based on the source-destination table entries. The unit uses a table-driven approach to ensure data path consistency, preventing mismatches and potential data corruption. This table can be dynamically updated based on specific layer requirements, yielding optimal data flow and resource allocation.
Constructing Logical DNN Accelerators. Based on the elastic on-chip buffer and the PE resource management mechanism, logical DNN accelerators can be constructed dynamically, as shown in Figure 4, with each one allocated to handle a layer or a network. Multiple logical accelerators can therefore handle multiple networks or layers in parallel. A logical accelerator is composed of one or more groups of hardware resources, each including a subset of the physical MACs (a PE array slice) and physical memory banks (a GLB slice).
Each logical accelerator owns a sub-accelerator configuration table in the routing logic unit, which defines the configuration of hardware resources (the number of MACs and banks) and records the indices of the MACs and memory banks. These indices determine the physical connections between MACs and memory banks through the reconfigurable interconnect fabric. The hardware abstraction of logical DNN accelerators allows us to match the hardware architecture to the shapes of the layers and the multi-branch structure of the network.
Support for N:M fine-grained sparsity. N:M fine-grained sparsity has been used in several typical DNNs at different sparsity levels, including image classification, detection, adversarial defense, and machine translation [68, 69]. Selectors are the essential components supporting N:M fine-grained sparsity: they determine which inputs and weights are used in computations based on the sparsity patterns. They use a decoding mechanism to interpret these patterns and configure the MAC units accordingly, reading the sparsity pattern from a dedicated memory area and setting control signals for each MAC unit. sDNNGuard thus employs a computational paradigm that selectively activates and computes a subset of elements within a structured data representation. This selective approach reduces computation complexity and storage overhead, and offers more efficient computation and storage utilization.
Support for the sparsity of weights and activation values. The sparsity support needs to be lightweight to fit well on a dense DNN accelerator. To that end, we design dual MAC switches to exploit sparsity, which achieves a more streamlined hardware design. They can quickly reconfigure the data path at runtime to adjust the hardware to different computing modes, supporting both dense and sparse DNN workloads and accelerating sparse problems at lower cost.
Compressed weight and activation values are paired by matching indexes and allocated to a MAC array, and the partial sums (Psums) are then accumulated into the corresponding output buffer. During sparse computation, the CPU core serves as the sparsity controller. For each matrix, the CPU core determines the sparse matrix decoding method and how to map it onto sDNNGuard, and it can extract a sufficient number of weight-activation pairs to maintain high MAC array utilization. The CPU core performs index matching, encodes the addresses of valid weight-activation pairs, then decodes the addresses and dispatches the pairs for parallel computation. Counters and tables are also implemented to determine the indices where dense computations are needed. Compared with dedicated sparse control logic, this approach has higher flexibility and adaptability.
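The CPU-side index matching over two Bitmap-compressed streams can be sketched as follows (the packing scheme and all names are illustrative):

```python
# Sketch of index matching for compressed weight/activation streams: a
# (weight, activation) pair is dispatched to the MAC array only at positions
# where BOTH bitmaps contain a 1.

def match_pairs(w_bitmap, w_vals, a_bitmap, a_vals):
    """Walk the two bitmaps in lockstep, keeping per-stream counters into
    the compressed value arrays; emit (index, weight, activation) pairs."""
    pairs, wi, ai = [], 0, 0
    for idx, (wb, ab) in enumerate(zip(w_bitmap, a_bitmap)):
        if wb and ab:
            pairs.append((idx, w_vals[wi], a_vals[ai]))
        wi += wb            # counters advance only past non-zero entries
        ai += ab
    return pairs

w_bitmap, w_vals = [1, 0, 1, 1], [2, 3, 4]     # dense view: [2, 0, 3, 4]
a_bitmap, a_vals = [1, 1, 0, 1], [5, 6, 7]     # dense view: [5, 6, 0, 7]
pairs = match_pairs(w_bitmap, w_vals, a_bitmap, a_vals)
assert pairs == [(0, 2, 5), (3, 4, 7)]
assert sum(w * a for _, w, a in pairs) == 38   # 2*5 + 4*7
```

The running counters wi/ai play the role of the counters and tables mentioned above, locating each compressed value without decompressing the whole stream.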
4.4 Compiler Design
As shown in Figure 7, the software stack of sDNNGuard automatically transforms the target and detection DNN models into a binary executable that runs on the accelerator. It consists of four software components: (1) the converter, (2) the NVDLA compiler with an analytical model, (3) the simulator, and (4) the runtime environment. The converter transforms the DNN's specification into dataflow graphs of the DNN model. The NVDLA compiler with the analytical model takes the dataflow graph and data dependencies as input and outputs a binary executable. The runtime maps these instructions to control signals in the accelerator and establishes the execution schedule.
Analytical model: To maximize the performance of the target and detection networks and minimize off-chip accesses, the compiler jointly optimizes the accelerator architecture and related scheduling to determine the hardware parameters of the two networks. We adopt a heuristic resource optimization search method to schedule computing tasks and perform the resource allocation. The main optimization function of the analytical model is shown in Figure
8; it manages the optimization loop, iterating over a range of PE counts and buffer sizes for both the target and detection networks. The
recordConfiguration function stores or outputs the best configurations found during the optimization. The algorithm’s inputs include the data dependencies of the two DNN models, the dataflow graph, and the sDNNGuard hardware parameters (on-chip cache size, number of PEs, DRAM bandwidth, etc.). It outputs the on-chip buffer size required by each layer of the target and detection networks and the number of PEs, by taking the following steps:
—
Initialization: initialize the estimated operation cycles of the target and detection network, the quantity of PEs and the on-chip buffer size for each layer of two models, and so on;
—
Increment: A greedy algorithm iteratively increases the number of PEs from an appropriate initial value toward the maximum available on sDNNGuard, searching for an appropriate on-chip buffer size at each step;
—
Calculation and estimation: evaluate the implementation cycle according to the allocated resources in a simulator;
—
Reiterate/termination: If the estimated operation period is less than the best value recorded so far, record the selection; if the number of PEs reaches the maximum, terminate; otherwise, repeat from the Increment step.
Furthermore, to narrow the search space over the number of PEs and the on-chip buffer size, we set empirical initial values scaled to the size of the target/detection network.
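The four steps above can be sketched as a search loop. This is a minimal illustration under an assumed cost model: `estimate_cycles` stands in for the cycle-accurate simulator call, and all names, ranges, and the toy cost function are assumptions, not the actual compiler API.

```python
# Hedged sketch of the heuristic PE/buffer search described above.

def search_config(max_pes, buffer_sizes_kb, estimate_cycles):
    """Greedily sweep PE counts and on-chip buffer sizes, keeping the best config."""
    best = None
    best_cycles = float("inf")   # Initialization step: no configuration recorded yet
    pes = 64                     # start from an empirically chosen lower bound
    while pes <= max_pes:        # Increment step: grow PEs toward the maximum
        for buf in buffer_sizes_kb:
            cycles = estimate_cycles(pes, buf)   # Calculation/estimation step
            if cycles < best_cycles:             # Reiterate step: record better configs
                best_cycles = cycles
                best = (pes, buf)
        pes *= 2
    return best, best_cycles     # Termination: PE count reached the maximum

# Toy cost model: more PEs help until bandwidth saturates; bigger buffers help.
cost = lambda p, b: 1_000_000 // min(p, 256) + 200_000 // b
config, cycles = search_config(1024, [64, 128, 256], cost)
# Under this toy model the search settles on (256, 256): extra PEs beyond the
# bandwidth-saturation point no longer reduce the estimated cycle count.
```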
4.5 Orchestrating Communication with Extended AI Instruction Set
4.5.1 Communication Between Target and Detection Network.
During execution, the two networks need to schedule and co-allocate computing and on-chip buffer resources, and they must be controlled synchronously. In addition, the detection network also needs a large amount of the intermediate data generated by the target network. As shown in Table
2, in order to process different data and meet communication requirements, the extended AI instruction set consists of three types of instructions. We use the
\(Control\_Sync\) and the
\(Control\_Event\_Dispatch\) instructions to coordinate communication between the elastic DNN accelerator and the CPU and between the two networks. The scheduler steers the computation flow by monitoring the status register during data communication. An additional
\(Control\_Polling\_Check\) instruction is added to query the status register on the DNN accelerator for the target and detection network processing status (including ChannelProcessStart, ChannelProcessDoing, ChannelProcessDone, etc.), the event processing status (including TaskStart, TaskDoing, TaskSuspension, TaskDone), and the read/write status of the on-chip buffer. In addition, a separate thread queries the event queue to handle special calculations on the CPU core, parses the command packets accordingly, and sends the final processing results and the current status to the scheduler. To support dense or sparse target and detection networks, we also provide the
\(Cfg\_SparseDense\_Mode\) instruction to select the computation mode for hybrid sparse-dense DNN workloads.
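One plausible use of \(Control\_Polling\_Check\) is a scheduler loop that polls until a task completes. The status names come from the text, but the register layout, bit positions, and helper names below are assumptions for illustration only.

```python
# Hedged sketch: polling the accelerator status register until TaskDone.

CHANNEL_STATUS = {0: "ChannelProcessStart", 1: "ChannelProcessDoing", 2: "ChannelProcessDone"}
TASK_STATUS = {0: "TaskStart", 1: "TaskDoing", 2: "TaskSuspension", 3: "TaskDone"}

def poll_until_done(read_status_register, max_polls=100):
    """Issue polling checks until the event status reports TaskDone (or give up)."""
    for _ in range(max_polls):
        raw = read_status_register()      # models one Control_Polling_Check
        task = TASK_STATUS[raw & 0x3]     # assumed: low two bits hold the task status
        if task == "TaskDone":
            return True
    return False

# Fake register that reports TaskDoing twice, then TaskDone.
states = iter([1, 1, 3])
done = poll_until_done(lambda: next(states))
```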
The classical format of the extended AI instruction set consists of an operation code, a target register, an immediate value, and a source register, packed into a 64-bit instruction. The AI instruction set extension is similar to that of existing neural network accelerators [
50], and uses 16 64-bit general-purpose registers. The NVDLA and Bitfusion [
51] instruction sets are 32-bit, and are subsets of the extended AI instructions.
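The 64-bit packing can be illustrated with a small encoder/decoder. The text fixes only the total width and the four fields; the field widths and bit positions below are assumptions chosen for the sketch.

```python
# Hedged sketch of one plausible packing of the 64-bit extended instruction:
# opcode | target register | immediate | source register.

OPCODE_BITS, REG_BITS, IMM_BITS = 8, 4, 48   # assumed widths: 8 + 4 + 48 + 4 = 64

def encode(opcode, rd, imm, rs):
    """Pack the four fields into a single 64-bit word, high bits first."""
    assert rd < 2**REG_BITS and rs < 2**REG_BITS and imm < 2**IMM_BITS
    return (opcode << 56) | (rd << 52) | (imm << 4) | rs

def decode(word):
    """Unpack a 64-bit word back into (opcode, rd, imm, rs)."""
    return ((word >> 56) & 0xFF,
            (word >> 52) & 0xF,
            (word >> 4) & (2**IMM_BITS - 1),
            word & 0xF)

word = encode(0x2A, 3, 0x1000, 7)
fields = decode(word)   # round-trips to (0x2A, 3, 0x1000, 7)
```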
4.5.2 Communication Within Detection Network.
To meet the requirements of effective communication between the elastic DNN accelerator and the CPU core, we use the ATU, which lets the CPU read and write the elastic DNN accelerator’s on-chip buffer, together with the global on-chip buffer and the event-queue communication mechanism, to complete data interaction and task scheduling.
Event queue communication mechanism. In order to quickly respond to accelerator demands, we design an event queue communication mechanism in Figure
9. The scheduler obtains and packages related instructions (containing task start and event ID) for event communication, and sends the command package to the event queue. The event queue and completion queue are implemented as a
First-In-First-Out (
FIFO) data structure, which ensures that events are processed in order. As shown in Figure
9, the command packet consists of seven parts, containing essential data such as event IDs and any task-specific parameters required for execution. The specific details of the last three parts are described in the figure. The CPU core continuously monitors the event queue. Upon detecting a new event, it dequeues the event and decodes the packet to determine the required task. Based on the event type and the contained instructions, the CPU core executes the necessary tasks. After completing the special calculations, the CPU sends the final processing results and the current status to the scheduler. In addition, a sparse-format event is also supported, which instructs the CPU core to perform sparse computations.
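The event-queue handshake above can be sketched as two FIFO queues: the scheduler enqueues command packets, and the CPU core dequeues in order, executes, and posts to the completion queue. The packet fields beyond the event ID and the example task are illustrative assumptions.

```python
# Hedged sketch of the event-queue / completion-queue mechanism.
from collections import deque

event_queue, completion_queue = deque(), deque()

def dispatch(event_id, task, params):
    """Scheduler side: package an event and append it to the FIFO event queue."""
    event_queue.append({"event_id": event_id, "task": task, "params": params})

def serve_one(handlers):
    """CPU side: dequeue the oldest event, run its handler, post the result."""
    packet = event_queue.popleft()
    result = handlers[packet["task"]](**packet["params"])
    completion_queue.append({"event_id": packet["event_id"], "result": result})

dispatch(1, "argmax", {"logits": [0.1, 0.7, 0.2]})
dispatch(2, "argmax", {"logits": [0.9, 0.05, 0.05]})
serve_one({"argmax": lambda logits: logits.index(max(logits))})
# FIFO order guarantees event 1 is served before event 2.
```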
5 Evaluation
5.1 Methodology
Accelerator Implementation. RISC-V and NVDLA are adopted to implement the sDNNGuard accelerator. The logic fabric and scheduler are designed in Verilog, and the related hardware is synthesized with the Synopsys Design Compiler using the SMIC 65nm standard-cell library. The flexible NVDLA is simulated and verified using the Synopsys Verilog Compile Simulator (VCS), and the power consumption is estimated using Synopsys PrimeTime PX based on the simulated Value Change Dump (VCD) file.
Simulator. We evaluated the accelerator architecture by a custom cycle-accurate timing simulator integrated with CACTI [
36] to simulate DRAM access latency. Its input file contains micro-architectural parameters such as the maximum memory bandwidth, on-chip buffer size, the number of PEs, and a description of each layer of the DNN networks. The outputs include the OFmap/IFmap sizes, the DRAM bandwidth requirement, DRAM accesses, SRAM accesses, and the weights of the target and adversarial networks. In addition, a compiler is designed to generate extended AI instructions for the target and detection networks according to these parameters. The average NVDLA PE utilization for a target network is 84.6%, and the detection network model is typically small (3~10 layers). Compared with NVDLA, the PE and on-chip buffer resources modeled by the simulator are increased by 12.5% and 50%, respectively.
Workloads. As summarized in Table
4, we evaluate sDNNGuard using several classical target DNNs as benchmarks. The selected DNNs have been adopted for different applications, including image comprehension, object recognition, and natural language processing; they are popular medium-to-large-scale dense DNN workloads with memory footprints of several hundred megabytes. Moreover, three classical detection networks are selected, covering different dataflows (sequential and parallel) and computing platforms (elastic DNN accelerator and CPU). Two sparse DNN detection networks are selected to evaluate the hybrid sparse-dense DNN accelerator architecture, covering weight sparsity and activation sparsity. Moreover, two types of adversarial detection mechanisms (adversarial example rejection [
3] or purify [
19]) are selected to validate the effectiveness of the sDNNGuard architecture.
Baselines. To make a fair comparison, we re-implement NVDLA as Source-NVDLA (SNVDLA), which consists of a large NVDLA and a small NVDLA to execute the target network and the detection network, respectively. The Elastic NVDLA (ENVDLA) has the same number of PEs and the same on-chip buffer size as the SNVDLA. The evaluation platform configuration is presented in Table
3.
Implementing sparse control logic and sparse storage-format conversion on the CPU. The CPU manages sparsity by determining which data elements (weights or activation values) are non-zero and therefore need to be processed. We use the comprehensive SPARSKIT package [
46] to handle a number of operations on sparse matrices, particularly conversion between various sparse formats. For the bitmap sparse format, a bitmap (binary array) indicates the presence (1) or absence (0) of elements. Each bit corresponds to a potential value in the data structure (e.g., a matrix), where 1 indicates a non-zero value and 0 a zero value. The CPU iterates through the bitmap bit by bit; when a bit is set to 1, the corresponding index in the data structure holds a non-zero value. This approach reduces the overhead of storing indices for non-zero values and allows quick existence checks.
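The bitmap format just described can be sketched as an encode/decode pair; function names here are illustrative, not SPARSKIT routines.

```python
# Hedged sketch of the bitmap sparse format: a 0/1 array marks non-zero
# positions, and a packed list holds only the non-zero values.

def bitmap_encode(dense):
    """Split a dense vector into a 0/1 bitmap plus a packed non-zero value list."""
    bitmap = [1 if v != 0 else 0 for v in dense]
    values = [v for v in dense if v != 0]
    return bitmap, values

def bitmap_decode(bitmap, values):
    """Walk the bitmap bit by bit; each set bit consumes the next packed value."""
    out, k = [], 0
    for bit in bitmap:
        if bit:
            out.append(values[k])
            k += 1
        else:
            out.append(0)
    return out

dense = [0, 3, 0, 0, 7, 1]
bitmap, values = bitmap_encode(dense)      # bitmap [0,1,0,0,1,1], values [3,7,1]
restored = bitmap_decode(bitmap, values)   # round-trips to the dense vector
```

One bit per position replaces one stored index per non-zero, which is where the storage saving over index-based formats comes from at moderate sparsity.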
5.2 Results on sDNNGuard Architecture
Performance. We compare elastic NVDLA against the SNVDLA on typical detection network (FD [
42], SATR [
3], and DISCO [
19]) and six target networks are listed in Table
4. The performance improvement of the elastic NVDLA is shown in Figure
10 (a) and Figure
11. To obtain better performance, the elastic NVDLA dynamically adjusts resource allocation through synchronization and configuration instructions, rather than fixing the resources allocated to the target and detection networks. The average speedup of the elastic NVDLA over the SNVDLA is about 1.42×. Specifically, the dynamic resource management and scheduling mechanism improves performance by about 26.3% over fixed resource allocation, and data communication contributes a further 15.7% improvement. These gains come primarily from the fact that the elastic NVDLA is optimized for DNN techniques through the integration of dedicated functional units and elastic resource management. The extended AI instructions and the new hardware abstraction also improve the utilization of storage and computing resources, which is the principal cause of the elastic NVDLA’s performance improvement.
It can be seen from Figure
10 (b) that the maximum feature map of VGG16 is 6.27 MB. Because this feature map is too large, the intermediate data generated by the target network must be stored in off-chip DRAM. The elastic NVDLA combines control instructions and data communication and implements an elastic on-chip buffer management mechanism, reducing DRAM traffic by 36.2% and increasing the processing performance of VGG16 by 1.61×. Furthermore, to guarantee that the detection network can absorb intermediate data in a short time, the scheduler needs to schedule additional resources to alleviate the storage and computation pressure.
As shown in Figure
13 (a), we also show how the extended AI instruction set is used. Although synchronization and communication instructions are not used frequently, their effect on the cooperative operation of the target and detection networks cannot be denied. The
\(Control\_Polling\_Check\) and
\(Cfg\_Clear\) instructions account for a high proportion because the scheduler needs to access the on-chip buffer status registers and control-process status registers. Because there are no special computing operations in the detection network, the
\(Control\_Event\_Dispatch\) instruction is not required. The extended AI instructions account for about 2.35% of the total instructions. For a fair comparison, the operation period of both the extended AI instructions and the NVDLA instructions is set to 18 cycles.
Area and power. Table
5 lists the area and power parameters. The total area of the elastic NVDLA is 5.1
\(mm^2\) , which is about 4.7% larger than the SNVDLA. The combinational logic (complex interconnect and control unit, mainly the scheduler and routing logic, etc.) occupies 1.02% of the elastic NVDLA’s area. The nonlinear activation unit and pooling unit occupy about 3.7% of the area. Compared with the SNVDLA, the ENVDLA consumes 17% more power. In other words, the scheduler, the elastic on-chip buffer management mechanism, and the elastic PE resource management mechanism consume about 9.9% of the power. The nonlinear activation unit and the pooling unit also account for about 7% of the power.
5.3 Detection Mechanism on sDNNGuard
SNVDLA and RISC-V are connected through the system bus. The performance improvement of the tightly coupled DNN accelerator architecture is shown in Figure
12 (a). As can be seen from the figure, the on-chip buffer connection has higher communication efficiency; its average performance is about 2.04× higher than data transmission over a traditional bus connection. Figure
12 (b) presents an average 67% reduction in off-chip data traffic. We can see that smaller target network models can suffer performance loss when running the detection network NIC (including the SVM algorithm) on the elastic DNN accelerator and CPU. More precisely, the running time of the SVM algorithm on RISC-V is about 3.8 ms. By comparison, the running times of the detection network with AlexNet and GoogLeNet, ignoring the data dependence of the SVM run on the CPU, are 0.97 ms and 1.17 ms, respectively. Thus the DNN accelerator must execute
\(Control\_Sync\) instructions to wait for the CPU to complete its calculations. To reduce the execution time of the detection network, the scheduler could allocate more on-chip buffers and PEs to it, but this would increase the execution time of the target network. Owing to the reduction of target network execution time, when coexisting with AlexNet and GoogLeNet, the NIC execution time can be reduced by 6.3% and 8.5%, respectively. Again, the maximum execution time of ResNet50 is about 9.05 ms, so the current NIC runtime (1.12 ms + 3.8 ms) satisfies the latency demand of the target network. Moreover, if sufficient computing power is used to accelerate the SVM, the latency requirement of the target network can still be met.
In addition, in Figure
13 (b), we show the percentage breakdown of extended AI instruction types across the six benchmarks. Configuration, data, and control instructions account for 33.8%, 23.7%, and 42.5%, respectively. This shows that control instructions account for a high proportion of the communication between the elastic accelerator and the CPU, and that the effective implementation of these instructions is crucial to increasing the throughput of sDNNGuard. Moreover, the computational complexity of the detection algorithm correlates strongly with the allocation and management of on-chip buffers and PEs. The detection method (NIC) (Figure
13 (b)) has higher computational complexity than the detection network (AN) (Figure
13 (a)), so executing the network in Figure
13 (b) requires more instructions.
5.4 Non-DNN Defence Methods on CPU
Several classical non-DNN adversarial example defense methods have been evaluated on RISC-V: Feature Squeezing [
64], Random Forest [
7], Decision Tree [
21], Logistic Regression [
15], SVM [
32], and PCA [
45]. Figure
14 shows that these defense methods can meet the performance requirements of larger target network models like ResNet50 and VGG16, but not of smaller target network models like AlexNet.
As sDNNGuard uses all storage and computing resources to accelerate target networks, some defense methods cannot satisfy the maximum performance demands of small target networks such as AlexNet and TextCNN. The single-frame execution time of ResNet50 is 8.73 ms, which satisfies the demand of adversarial example detection. The execution time of Random Forest on the CPU is too long to satisfy the demands of any of the target networks.
5.5 Comparison with DeepFense
DeepFense [
45] is the first end-to-end online malicious input detection framework; it verifies whether input examples are abnormal in parallel with the victim DNN model. In addition, it achieves real-time performance in resource-constrained scenarios via hardware/software/algorithm co-design and customized acceleration. Our work provides a hardware acceleration framework for online adversarial example detection, and it also has adaptive defense evolution capability against unknown adversarial examples.
To be fair, we implemented DeepFense with eight parallel defenders on sDNNGuard running at 150 MHz. The experimental results show that the performance of DeepFense on sDNNGuard for MNIST and SVHN improves by 1.36× and 1.21×, respectively. DeepFense includes multiple DNN detectors and a non-DNN model, which is a typical deployment scenario for sDNNGuard. Our architecture reduces data movement for DeepFense by making the most of the intermediate data of the target network. Although the overall performance of sDNNGuard is slightly better than DeepFense for online defense against adversarial examples, DeepFense’s dedicated hardware architecture outperforms sDNNGuard for multi-model defenders with more than 10 models. This is mainly because sDNNGuard cannot implement a parallel dedicated architecture and lacks sufficient non-DNN computing capacity.
5.6 Comparison with Other Sparse Accelerators
To reveal the superiority of our work, we compare the performance improvement of SCNN [
40], SparTen [
13], and our sDNNGuard on top of six target networks and two sparse detection networks in Figure
15. SCNN and SparTen are recent
state-of-the-art (
SOTA) designs that exploit weight sparsity and activation sparsity. Their performance results are normalized to that of SCNN. We can see that sDNNGuard outperforms the baselines across all the DNN networks, with an average of 2.51× and 1.73× higher throughput over SCNN and SparTen, respectively. Such improvements stem from (1) the hardware abstraction of logical DNN accelerators, and (2) the reconfigurable memory hierarchy of sDNNGuard, which reduces off-chip memory accesses; both fully utilize the capability of the fused DNN accelerator architecture. Compared with dense workloads, sparse workloads suffer from more data-movement instructions per flop, as well as irregular or non-consecutive memory access patterns. To achieve higher performance on sDNNGuard, it is best to exploit properties of both the sparse weights/activation values and the underlying hardware architecture. SCNN and SparTen use complex hardware to exploit the sparsity of weights and activation values, while sDNNGuard relies on CPU cores to achieve the same function with minimal overhead. In addition, the decoupled design, together with the increased flexibility across PEs, enables sDNNGuard to obtain high throughput and speedups over SCNN and SparTen. Under the same sparsity level, HyDRA and BiP follow the same performance improvement trends in Figure
15. This also shows that these improvements closely track the per-benchmark sparsity.
5.7 Ablation Studies and Sensitivity Analysis
Performance impacts of elastic on-chip buffer and PE resource management. By selectively enabling and disabling the elastic resource management feature, we run target networks and detection networks in Table
4 to observe its impact on the accelerator’s overall performance. Experimental results show that elastic resource management can achieve 1.26× performance improvement over static resource allocation. Dynamic resource management enables the accelerator to adapt its computational strategy on-the-fly, potentially improving throughput and reducing latency by optimizing resource use. For hardware architecture, this underscores the importance of flexible, adaptable designs that can respond to varying computational demands without sacrificing efficiency.
Performance impacts of the extended AI instruction set. The experiment involves running identical workloads on sDNNGuard with the extended AI instruction feature toggled on and off. These instructions are tailored to implement the computing operations of detection networks directly in hardware, allowing rapid and energy-efficient processing of intermediate results; with the feature off, we adopt ordinary strategies to achieve these operations, such as replacing synchronization instructions with waits. The experimental results show that sDNNGuard with the extended AI instruction set obtains a 1.13× performance improvement. By incorporating specialized instructions, the hardware is better equipped to perform complex operations efficiently; an extended AI instruction set directly addresses the need for specialized computation in the face of sophisticated detection networks, reducing computational overhead and latency.
Performance impacts of the number of PE. We use the ratio of
frame per second (
FPS) to show the sensitivity. As the number of PEs increases from 64 to 2048, computing performance does not increase linearly, and PE utilization decreases. The increases in area, power consumption, and cost are not proportional to the increase in performance, so the elastic DNN accelerator does not need a large number of PEs. For example, compared with 64 PEs, the performance of ResNet50 can be improved by up to 6.33×, while the computing resources are increased by 32×. According to the experimental results, when the number of PEs is within 64~256, PE utilization is close to 100%; if the number of PEs continues to increase, utilization gradually decreases. Furthermore, we tested SafeNet [
32], the largest detection network, which is mainly implemented with VGG16. We increased the number of PEs of the DNN accelerator by 43% and achieved the same function as the target network.
Performance impacts of buffer capacity. Especially for small compute-intensive networks, increasing the on-chip buffer size does not significantly improve the overall performance of the detection network, because they are mainly limited by DRAM bandwidth. Compared to 64KB on-chip buffers, GoogLeNet and AlexNet achieved a maximum performance improvement of 14.3%. However, VGG16 and ResNet50 achieved a maximum performance improvement of 2.36×, mainly because the larger on-chip buffers reduce the data traffic between the DNN accelerator and off-chip DRAM.
Performance impacts of the CPU’s LLC size. By varying the size of the LLC, we measured the performance of typical adversarial example defense methods (such as Decision Tree [
21], Logistic Regression [
15], SVM [
32], Random Forest [
7]) on RISC-V, and calculated the mean performance improvement. An LLC size of 0 is used as the baseline for comparison. Within a certain range, more cache blocks and larger LLC sizes lead to significant performance improvements. For a 128-byte cache block size, increasing the LLC size can improve performance by 3.34×. However, when the LLC size exceeds a certain value (such as 256KB), the defense performance no longer improves accordingly. As shown in the figure, the overall performance is most sensitive when the LLC size is 64 KB, mainly because the L1 cache size is set to that value, which enables the processor to exploit these data to enhance model performance.
6 Related Work
Adversarial attacks and defenses. Adversarial example attacks and defenses are still continuously competitive and evolving. From the initial digital-domain attacks [
14,
57] to the later physical-domain attacks [
23,
27], adversarial attacks [
2,
72] have seriously threatened the safety of human life and property. Various adversarial defense methods [
58,
65,
71,
73] have been proposed to defend against various adversarial attacks. Early defenses used machine learning to denoise adversarial examples [
30,
64], and also adopted image transformations to detect adversarial examples [
58]. Feature Squeezing [
64] detects adversarial examples by median smoothing filtering and bit-width compression, and [
30] denoises adversarial perturbations through a high-level representation guided denoiser. As adversarial example attacks become stronger, stronger DNN-based defense methods have also emerged. PixelDefend [
55] utilized PixelCNN [
59] to build its adversarial example detector. Researchers also adopt hybrid approaches combining DNNs and machine learning for stronger adversarial example detection. NIC [
33] distinguishes adversarial and natural samples by training a set of models and an SVM-RBF algorithm to extract DNN invariants. With deeper research, many methods utilizing sparsity for efficient input analysis and detection have been proposed [
6,
38,
41,
67]. In addition to the above defense methods, some special defense methods have also been explored, such as using information theory [
71], distillation [
39], and semantic contradiction [
65]. In our work, through an innovative hardware architecture, sDNNGuard provides an efficient and flexible solution for these methods to detect or defend against adversarial examples, offering stronger security for the application of deep learning models in sensitive domains.
Defensive accelerators against adversarial attacks. To defend against adversarial attacks, pioneering defensive accelerators [
9,
10,
16,
45,
48,
60,
61,
62] have been proposed, but they are still in their infancy and evolving. From the perspective of adversarial example defense algorithms, existing defensive DNN accelerators can be grouped into two categories.
❶ Accelerator with additional adversarial example detector. [
10,
45,
60] run both the victim DNN model and an additional detector simultaneously to defend against adversarial attacks. DeepFense [
45] is the first end-to-end automated framework, which can detect malicious inputs online and verify their legitimacy. DNNGuard [
60] is the first defensive DNN accelerator architecture; it tightly couples an elastic DNN accelerator with a CPU core to efficiently execute the target DNN network and the detection algorithm or DNN network for detecting adversarial attacks. Ptolemy [
10] proposes an algorithm-architecture co-designed framework that detects adversarial attacks at runtime based on canary paths. DNNShield [
48] is a hardware-accelerated defense that relies on dynamic and random sparsification of the DNN model to achieve efficient inference approximation, and it shows the SOTA detection rate.
❷ Accelerator with robust DNN network. These works [
9,
62] exploit the adversarial robustness within one accelerator without adding extra modules. 2-in-1 accelerator [
9] proposes a precision-scalable accelerator for Random Precision Switch algorithm to win both adversarial robustness and computing efficiency. NASGuard [
62] focuses on enabling efficient inference of robust NAS networks. The above works have made pioneering efforts in defensive DNN accelerators, but since they all focus on accelerating dense networks, they are not efficient enough when the networks are sparse. Our work supports acceleration of mixed sparse-dense DNN workloads.
Sparse DNN accelerators. Previous works [
1] exploit sparsity in compressed activation values to reduce latency and improve throughput. However, they work with only one kind of sparsity, weights or activation values, but not both. EIE [
17] exploits both weights and activation values sparsity but only for the
fully connected (
FC) layer. Eyeriss-v2 [
5] clusters the storage and computing resources via a flexible hierarchical mesh on-chip network (HM-NoC), and adds additional sparsity logic to reduce energy consumption and improve performance. SCNN [
40] attempts to fully exploit weights and activation values sparsity for convolution layers. SparTen [
13] adopts greedy balancing to address the load imbalance left unsolved in SCNN, and provides hardware support for sparse execution of both weights and activation values. Griffin [
52] develops a hybrid architecture that takes full advantage of dual sparsity while reducing hardware overheads. These sparse DNN accelerators add complex control logic and increase PE design complexity, which significantly enlarges the overall chip area, whereas sDNNGuard adopts a CPU core to replace the large buffers and sparse control logic.