1 Introduction
In recent years, Deep Neural Networks (DNNs) have achieved remarkable progress on complex tasks such as image recognition and natural language processing. However, these networks come at the cost of high computational and storage requirements, which restrict the deployment of DNNs in applications with limited computational resources and memory. At the same time, the hardware resources needed by state-of-the-art DNNs keep growing in the pursuit of better accuracy. This growth poses a problem in CMOS ASIC design, where on-chip memory is the main bottleneck for energy-efficient computation and cannot scale further to meet the requirements of such growing networks [22]. Additionally, in such systems, the use of off-chip memory incurs both substantial energy and latency costs [47]. These challenging requirements have led to several research efforts focused on solving the memory bottleneck and on investigating alternative computational paradigms such as near-memory and in-memory computing.
Over the past years, several emerging non-volatile devices have been presented as a solution to fulfill the aforementioned DNN requirements. These devices can perform logic and arithmetic operations while serving as the main storage cells of in-memory computing systems. Several non-volatile technologies have been presented and utilized within these in-memory architectures, e.g., resistive random-access memories (RRAM) [46], phase change memories (PCM) [2], magnetic tunnel junctions (MTJs) [42], and ferroelectric field-effect transistors (FeFETs) [32]. In these architectures, crossbars of Non-Volatile Memory (NVM) cells perform matrix-vector multiplication (MVM) with the DNN weights stored within the memory cells. This topology substantially reduces the memory traffic caused by data movement, and the parallel nature of crossbars further accelerates DNN computations.
Despite the aforementioned advantages, crossbar-based in-memory accelerators suffer from several drawbacks. First, NVM usually cannot support high-precision computing due to the limited number of states available in an NVM cell [34, 43]. Second, as these accelerators operate in the analog domain, they require interfaces, including Analog-to-Digital Converters (ADCs) and Digital-to-Analog Converters (DACs), to convert signals from and into the digital domain. The higher the precision of these blocks, the more power, latency, and area they consume, which compromises overall system performance and efficiency. For example, DACs/ADCs can constitute over 85% of the area and power consumption in crossbar-based systems [28].
Moreover, reducing the computation precision degrades the accuracy of the inferred networks and limits the usage of these systems to small networks. Also, since the crossbar is restricted in size, mapping the weights of the different network layers becomes challenging. This can affect overall system utilization, since the mapping should maximize data reuse while maintaining a high degree of operational parallelism.
In our architecture, FeFETs are employed as the main memory cell since they offer several advantages over other embedded resistive memory concepts such as STT-MRAM, PCRAM, or OxRAM. FeFETs, as a storage transistor concept, can act as current-source-like elements, making this implementation much less prone to issues such as IR drop or noise that are observed in memristor-based concepts. FeFETs with a memory window of up to 2 V exhibit a very high on/off current ratio exceeding \(10^5\) and the lowest programming energy (about \(\lt 0.1\) fJ) of any embedded memory concept [5, 38]. The cells have a long-term retention of 10 years, and long-retention multi-bit operation has been demonstrated in FeFETs [25]. Owing to their nature as high-k metal-oxide-semiconductor field-effect transistors (MOSFETs), they also offer the smallest read-out latency, about 1 ns. FeFETs have so far been demonstrated in 28 nm bulk CMOS and 22 nm fully depleted silicon-on-insulator (FDSOI) technology, and they offer a promising scaling roadmap towards smaller technology nodes, which distinguishes them from Flash-based concepts [19].
In this work, we tackle the drawbacks mentioned earlier by further optimizing the different system blocks and by exploiting the opportunities FeFET memory elements offer in the digital and analog domains. We optimize the unit size in the crossbar array, requiring only one FeFET for each AND operation while ensuring accessibility in both the inference and programming modes. Furthermore, we explore the bit-decomposition algorithm and map its operations to our different architecture blocks to accelerate the computation of DNNs and allow for various levels of parallelism.
In summary, our new contributions are the following:
– A novel bit-decomposition-based mixed-signal architecture for the in-memory computation of Convolutional Neural Networks (CNNs) using FeFETs as the memory cells, which allows for low power consumption.
– Complete elimination of DACs and use of lower-precision ADCs through efficient parallel bit-decomposed MAC operations, together with an optimized current-based ADC built for our crossbar operations.
– Completely flexible precision, which supports a wide range of DNNs and allows trading accuracy for throughput and vice versa.
– System immunity against cell variability, with an accuracy loss of less than 2% under highly optimized quantization and at high system utilization.
– Physical implementation in GLOBALFOUNDRIES technology and silicon verification of the core mixed-signal blocks by on-wafer characterization, including corner states.
– An in-depth system-level evaluation of the FELIX architecture across various benchmarks, combining measurements of the taped-out system parts with simulation results to accurately benchmark system performance.
Compared to our prior work [38, 39], we further optimized the crossbar structure and the interface between the crossbar array and the mixed-signal blocks. In addition, we designed and implemented our own optimized current-based ADC to achieve better performance, and we reorganized the architecture components to further minimize data transfer, including the architecture pipeline and the memory mapping. Finally, we extensively evaluate our architecture based on physical implementation measurements as well as simulations, taking into consideration variability and system scaling at different target network precisions.
Based on the performed evaluations, our system outperforms state-of-the-art in-memory architectures using different memory technologies, including FeFET, while showing high system utilization across various networks with different workloads.
This article is structured as follows: In Section 2, we review related work and the current state-of-the-art. In Section 3, we provide background on the basics of the FeFET crossbar design and its mode of operation, as well as the bit-decomposition concept for convolutional neural networks. In Section 4, we introduce our architecture, breaking the system down into its components and elaborating the structure and mode of operation of each. In Section 5, we show the system pipeline and the mapping of the different kernels to memory locations. Finally, we perform an extensive evaluation of our architecture using various benchmarks and different modes of operation, as well as a comparison to state-of-the-art systems.
2 Related Work
Several recent in-memory architectures have been presented to leverage the advances in NVM technologies. These architectures fall into one of three main categories.
In-memory analog-based architectures. Several multi-precision analog-based crossbar structures have been presented that maximize the usage of the analog properties the crossbar structure offers. The analog crossbar is complemented with DACs, which convert the digital input values to analog signals for the crossbar, and ADCs, which convert the final output back to digital form. This method is followed in several architectures such as ISAAC [37] and iCELIA [44], among others [10, 30, 51].
Although these architectures allow high-precision networks to be trained and inferred, the high cost of this approach shows in power consumption and area. For example, the work presented in ISAAC [37] shows that the 8-bit ADCs accounted for 58% of the total power and 31% of the area. Similarly, in iCELIA [44], the ADCs and DACs account for almost 90% of the power consumption. To address this, the work presented in [7] lowers the bit precision of the architecture, which minimizes the cost of these blocks. However, it limits the usage of such an architecture to networks that can tolerate low bit precision: the presented results showed limited accuracy loss on LeNet-5, but for larger networks such as ResNet-18, the accuracy loss reaches 8%.
In-memory binary architectures. Given the aforementioned problems of analog-based architectures and the limited number of states an NVM cell can store, several studies on crossbar in-memory architectures for binarized DNN inference have been presented [3, 8, 40, 48]. Though these architectures show high area and power efficiency with very high device reliability, their usage is limited to the set of networks that can adopt such a binary data representation.
In-memory digital-based architectures. In [20, 29, 50], full-precision, fully digital inference architectures were proposed. However, by activating only a single memory cell at a time, they give up one of the main advantages in-memory architectures offer, namely analog-domain operation. Moreover, the system in [29] showed very poor utilization in shallow layers: a system containing a large number of processing crossbars that serve and benefit very deep layers performs poorly in shallow layers compared to its peak performance. Also, the operating frequency proposed in that work (2 GHz) leads to high power consumption.
4 FeFET Crossbar Accelerator
In this section, we present our FeFET-based in-memory accelerator, which is dedicated to DNNs. We start with a detailed discussion of each block across the various levels of the accelerator, followed by an overview of our architecture. Throughout this section, the different system parameters were selected based on a design space exploration. First, we determine the unrolling factors, such as the kernel unrolling factor, kernel size, input size, and so on, for each layer of various networks. Second, for each system component parameter, we define its relation to area, power, and throughput, both within the architecture itself, in terms of impact on the subsequent blocks and buses, and externally, in terms of data movement and loads. From this information, we build a complete cost function that optimizes the system parameters for the highest level of utilization across the layers of various networks while respecting a total available FeFET capacity of 8 MB. The cost function can be adapted to designer preferences that emphasize certain aspects over others; our chosen parameters, however, do not favor a particular aspect and weight power, area, and throughput equally.
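As a rough illustration of this exploration, the following sketch shows a cost function that weights power, area, and effective throughput (peak throughput times utilization) equally. Only the 256 × 256 design point uses figures reported in this article; the other candidates and the normalization constant are hypothetical and serve purely to illustrate the trade-off.

```python
# Illustrative sketch of the design-space cost function (simplified);
# only the 256x256 point uses figures reported in this article, the
# other candidate points and the 1e14 normalization are hypothetical.

def cost(power_mw, area_mm2, peak_ops, utilization,
         w_power=1.0, w_area=1.0, w_tput=1.0):
    # Lower is better; effective throughput (peak * utilization)
    # enters inversely. Equal weights reflect our chosen setting.
    return (w_power * power_mw + w_area * area_mm2
            + w_tput * 1e14 / (peak_ops * utilization))

candidates = {  # (power [mW], area [mm^2], peak ops/s, avg. utilization)
    "xbar_128x128": (40.0, 3.2, 1.0e12, 0.97),   # hypothetical
    "xbar_256x256": (56.1, 4.9, 2.05e12, 0.93),  # reported configuration
    "xbar_512x512": (95.0, 8.9, 3.8e12, 0.70),   # hypothetical
}
best = min(candidates, key=lambda name: cost(*candidates[name]))
```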
4.1 FeFET Processing Element Design
The main building block of our architecture is the Processing Element (PE), where the analog operations as well as the digital conversions are performed. As shown in Figure 5, a PE consists of the analog FeFET crossbar, mixed-signal blocks (ADCs), and digital blocks (adders and subtractors). We explore the structure and operation of each block as well as its role in the architecture pipeline.
4.1.1 Crossbar Organization and Operation.
As already explained in the background section, a convolution operation can be decomposed into a set of AND operations over the input feature map and the weights. Correspondingly, each cell in the crossbar can perform a one-bit AND operation. The crossbar is organized such that each unit cell stores a single weight bit. The gate, drain, and source of the transistors are connected to a WL, a DL, and a source line (SL), as in Figure 3. The input feature map is fed into the WLs and acts as row activation. The SL is a control signal that specifies the activated columns in each clock cycle. Finally, the DL yields the result of the operation by accumulating the contributions of the activated memory cells. The crossbar is divided into segments of 16 × 16 FeFETs, i.e., 16 WLs × 16 DLs per segment, which improves the hardware utilization of the periphery circuits, such as the ADCs. Each ADC is connected to a column of 16 segments, which share their DLs; since there are 16 segment columns per crossbar, we implemented 16 ADCs.
Every FeFET within such a segment can be programmed individually, as schematically drawn in Figure 3. In order to program the FeFET into one of the two binary states, high \(V_t\) (HVT) or low \(V_t\) (LVT), a short pulse of \(V_{PROG}=0\) V is applied and all the inhibit switches are set to 1 in order to discharge all the SLs of each segment and the DLs of the crossbar to ground. Afterward, a WL voltage of \(V_{prog,WL}={3}\) V is applied while floating the SL/DL. To prevent the other FeFETs within the same row (same WL) from being programmed, the remaining inhibit switches are set to 0 and an inhibit voltage \(V_{INHIBIT,SL/DL}\) is applied via the \(V_{PROG}\) wire to the SLs at the segment level and to the DLs at the crossbar level (Figure 3). An additional inhibit voltage of 1 V is applied to all WLs to prevent \(V_t\) state disturb of the FeFETs sharing the same SL and DL that should not be programmed. The preferred program pulse width is expected to be 100 \(\mu\)s. For erase, we assume a block erase with \(V_{erase,WL}={-2}\) V. Upon inference, the DL_SEL switch of the selected FeFET activates its SL within the segment and its DL within the crossbar and connects the FeFET to the ADC.
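For reference, the programming conditions described above can be summarized as follows; the voltage and timing values are as stated in the text, while the grouping into a configuration record is purely illustrative.

```python
# Programming/erase conditions from the description above; the record
# format is illustrative, the voltage and timing values are as stated.
FEFET_PROGRAMMING = {
    "V_prog_WL":     3.0,     # V, on the selected word line
    "V_prog":        0.0,     # V, short discharge pulse on SLs/DLs
    "V_inhibit_WL":  1.0,     # V, on unselected WLs (Vt-disturb protection)
    "V_erase_WL":   -2.0,     # V, block erase
    "t_prog_pulse":  100e-6,  # s, preferred program pulse width
}
```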
Besides the FeFETs, we use a resistive element, forming a 1FeFET1R bit-cell, to limit the current of the FeFET and reduce the input current variation for the ADC. To this end, a current generator supplies a reference current of about 100 nA, which is mirrored into the FeFET column of each segment via a current mirror with a latency similar to that of the ADC. The reference current of the ADC is likewise designed to be 100 nA. The strongly reduced current variation for an output of "1" in the bit-wise multiplication makes it possible to clearly distinguish the accumulated number of activated FeFETs in the LVT state upon row activation. During inference, only one FeFET per segment is activated and can contribute to the total current. The SL of the activated FeFET is therefore connected to the resistive element/current mirror (Figure 3) and its DL is connected to the ADC. The total current of the 8 activated segments flows into the ADC, which senses the result of the MAC operation.
4.1.2 Current-based ADC.
Unlike common ADCs, our ADC operates in high-side current mode with thermometer coding, as in Figure 3. It consists of a group of stacked current mirrors: the PMOS transistors create the reference currents, while the NMOS transistors forward the measured current to the next stage of the ADC. Each PMOS in the current mirrors creates a reference current of 100 nA.
As long as the input current is smaller than the reference current, the first current mirror maintains a high voltage potential at its output Out1. Out1 is also the input of the ADC and is connected to the segment column. If the input current rises beyond the reference current \(I_{ref}\), the voltage at Out1 drops sharply. When the potential at Out1 drops, the source-gate voltage of N6 increases. This turns on N6, which then provides a bypass for any extra current drawn by the input. If the input current keeps rising and exceeds the set current of the P1/P3 mirror, the voltage at Out2 drops too. The process repeats for all subsequent stages of the ADC. Finally, the ADC is followed by an encoder responsible for buffering the ADC outputs and yielding the final binary result, which is used afterwards.
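Functionally, the thermometer code counts how many 100 nA references the input current exceeds, and the encoder reduces this count to a binary value. A behavioral sketch (our naming, with idealized thresholds) follows.

```python
# Idealized behavior of the 7-stage thermometer-code ADC and its
# encoder. Stage k's output drops to 0 once the input current exceeds
# k reference currents of 100 nA; the encoded result is then simply
# the number of dropped stages.

def adc_stage_outputs(i_in_nA, i_ref_nA=100.0, stages=7):
    return [0 if i_in_nA > k * i_ref_nA else 1
            for k in range(1, stages + 1)]       # [Out1..Out7]

def encode(outputs):
    return sum(1 for out in outputs if out == 0)  # popcount of dropped stages

assert encode(adc_stage_outputs(350.0)) == 3  # ~3 activated LVT cells
```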
The presented current-based ADC is highly efficient in terms of power and latency and has only a small footprint, at the cost of a low resolution. This does not limit its usage in our architecture, as we target low-precision ADCs. In Figure 4(a), the sequential activation of the ADC outputs is presented; for simulation purposes, the time step per activation is set to 10 ns. The simulated and measured ADC outputs with respect to the input current are presented in Figure 4(b), with the simulation results plotted as solid lines and the measurement results as dashed lines.
4.1.3 FeFET Crossbar Section.
Each crossbar is split into sections. In each section, groups of FeFET segments are formed, where each group is connected to a specific ADC. In this configuration, as previously highlighted, only one FeFET is activated per segment at any time. Such an organization limits the number of ADCs within the crossbar to the number of FeFET-segment columns instead of the number of FeFET columns. It also limits the required resolution of the ADCs, as only one FeFET is activated per segment connected to an ADC. This approach drastically reduces the area and power of the crossbar. Furthermore, it allows using the crossbar as memory for various layers instead of dedicating the entire crossbar to a single layer. This limited number of simultaneous activations does not reduce throughput compared to existing architectures, as it allows for a high operating frequency. For example, ISAAC [37] activates the entire \(128 \times 128\) crossbar but can do so only every 100 ns, which is equivalent to 163.84 Giga activations per second. Activating the entire crossbar simultaneously also drastically reduces utilization for networks that do not match the crossbar size. Our architecture, on the other hand, can activate cells every 1 ns, yielding up to 256 Giga activations per second while maintaining achievable parallelization levels, as shown later, which translates into a high level of utilization at different workloads.
In Figure 5, an example of the crossbar section is illustrated. Here, the convolution operation is unrolled by the following factors. The sum over the kernel window and input channels, \(\sum_{i=0}^{k-1}\sum_{j=0}^{k-1}\sum_{c=1}^{C_{in}}\), is parallelized by the number of simultaneously activated FeFETs per group. The sum over the weight bits, \(\sum_{n=0}^{W_p-1}\), is parallelized by the number of groups per crossbar section. For example, in the case of 4-bit weight quantization, our architecture uses only four groups per section.
The number of activated FeFETs per group also defines the required ADC precision. In our architecture, we activate 8 FeFETs at a time, each from a different segment, which corresponds to an 8-state ADC. However, approximating the output to a 3-bit value (7 states) does not affect the networks' accuracy due to bit-level sparsity. As illustrated, the 3-bit results from the ADCs are followed by adders to form a partial result of the MAC operation. Each ADC result is shifted by the value n from Equation (2), which corresponds to the significance of the weight bit computed by that ADC. Note that one of the ADC results is subtracted to maintain the correctness of signed operations, as this ADC computes the operations of the weights' sign bit. The adder yields a partial sum that is routed to the cluster adders (explained in the next section) for further processing.
4.1.4 Overall Processing Element Crossbar.
Within each crossbar, we form four sections that operate in parallel. From a hardware perspective, the four sections have no physical separation; the division is algorithmic, with each section representing a different kernel. The inputs applied to the crossbar rows are shared by all sections: since all four sections receive the same input feature map, they parallelize the computation of different kernels. Each section has its own set of ADCs and adders and yields a partial sum for a different kernel's output feature. In our architecture, we use crossbars of 256 × 256 FeFETs, consisting of 16 × 16 segments of 16 × 16 FeFETs each. Each 3-bit ADC is connected to 16 FeFET segments; in the presented system configuration, only 8 of these segments are activated.
This crossbar organization balances the arithmetic operations between the different blocks while lowering the required data traffic. At any point in time, the crossbar needs only 8 bits of input (in the case of 8 activations), and only four result words of seven bits each need to be routed to the next block for further processing.
In addition to the crossbar sections, a row/input decoder activates the currently computed weight row when an input value of 1 is applied, and a column selection module selects the correct column.
4.2 PE Cluster
As each PE computes a single input-feature-map bit significance at a time, the PE cluster collects the results from its enclosed PEs and accumulates them over time until all bits of the input feature map have been processed. As illustrated in Figure 5, each cluster consists of eight PEs that compute a single-bit operation for each of the 64 input feature maps across four different kernels. The results of the different PEs are then added together: each cluster has four adders (one per kernel) with eight inputs each. The adder results are then shifted left by m bits, where m is the significance of the currently computed input-feature-map bit. The shift value signal is generated by the central control unit.
Finally, these values are accumulated by four separate accumulators to form the final result. The most significant bit of the input feature map is subtracted instead of accumulated to yield a correct signed MAC result, as this bit represents the input sign bit. The number of iterations needed for the accumulation depends solely on the input-feature-map precision. Through this pipeline, the PE cluster further reduces the data traffic by transmitting only four words of 18 bits each when computing 64 MAC operations with 8-bit activations and 4-bit weights.
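The cluster-level accumulation can be sketched as follows (our naming; one kernel shown, with eight PE partial sums per input-bit iteration). It composes directly with the section adder sketched above.

```python
# PE-cluster accumulation for one kernel: per input-bit iteration m,
# the eight PE partial sums are added, shifted left by m, and
# accumulated; the last iteration (input sign bit) is subtracted.

def cluster_accumulate(partials_per_bit, in_bits=8):
    acc = 0
    for m, pe_partials in enumerate(partials_per_bit[:in_bits]):
        bit_sum = sum(pe_partials)            # 8-input adder
        acc += (-bit_sum if m == in_bits - 1 else bit_sum) << m
    return acc
```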
4.3 System Tile
Our system consists of tiles of the presented clusters, in addition to a special tile module. As DNNs are known for their high number of MAC operations, reaching thousands of operations per output feature, a single cluster, which performs only 64 MAC operations at a time, would increase the latency per output feature.
Therefore, to minimize the latency per output feature, we form tiles of clusters that further parallelize MAC operations. The results of the different clusters within a tile are received by the tile module, as shown in Figure 6. Depending on the number of accumulations per output feature of the computed layer, either the results of two or four clusters are added, or the results are accumulated over time when the MAC operation exceeds the capacity of the four clusters. Hence, if only 64 accumulations are needed per output feature, the tile module yields 16 final output features (four features per kernel). If more than 256 accumulations are needed, the accumulation continues until the final value is ready. If pooling follows the computed layer, the yielded features are stored and compared with, or averaged with, the next output features, and only the pooling result is stored. The tile module also has an activation unit that applies the activation function to the output feature (ReLU, tanh, sigmoid, etc.) and performs the final re-quantization to 8 bits.
The multiplexer that selects the output feature, the accumulator, the activation unit, and the pooling unit receive their control signals from the central control unit.
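Putting the tile module's datapath together yields the behavioral sketch below. ReLU and the scale-based re-quantization are illustrative assumptions for this sketch, not a specification of our units.

```python
import numpy as np

# Tile-module post-processing sketch: accumulate cluster results until
# the layer's MAC count is covered, apply the activation, re-quantize
# to 8 bit, and (for pooling layers) keep only the pooled value.
# ReLU and the scale-based re-quantization are illustrative assumptions.

def tile_postprocess(cluster_results, scale=1.0):
    acc = int(np.sum(cluster_results))            # 2/4 clusters, or iterated
    act = max(acc, 0)                             # ReLU as an example
    return int(np.clip(round(act * scale), -128, 127))  # 8-bit output

def max_pool(stored_feature, next_feature):
    return max(stored_feature, next_feature)      # only this result is stored
```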
4.4 Top Level Architecture
The final system is composed of several tiles. Though each PE already performs the computations of four kernels, this parallelization has to be increased further, as convolutional layers usually contain a large number of kernels. We therefore parallelize the computed kernels across the different tiles: if r is the number of system tiles, the overall system parallelizes the kernels by a factor of 4r. If the output-feature-map depth is less than or equal to 4r, the full output-feature-map depth is computed simultaneously.
Such structuring maximizes data reuse: since different tiles compute different kernels, they can all receive the same input feature map. This high input data reuse reduces the data to be loaded from the buffers, the complexity of the loading mechanism, and ultimately the overall data traffic across the whole system. In Figure 5, we illustrate the top-level design of the system, including the central control unit and the buffers.
4.4.1 Buffers.
The tile grid is complemented by a buffer unit that is split into two main partitions: an input-feature-map buffer and an output-feature-map buffer. The input buffer is further split into partitions such that each partition stores the input-feature bits targeting a certain cluster in each tile. This partitioning, together with the mapping explained later, preserves a one-to-one relation between the buffer partitions and the rows of tiles, which simplifies reading from the buffers. The mapping of the partition contents is defined prior to inference through a framework that we built. While buffering the input feature map, words built from the bits of different features are stored, which eases loading from the buffers into the different PEs. The second partition of the buffers holds the output feature map. As it is preceded by the pooling unit within the cluster column, only the result of this unit is stored in the case of pooling layers. This partition is split into two sub-partitions to implement a double-buffering scheme: the first receives and stores the currently computed features while the other writes the features of the previous iteration to external memory, and the two swap roles each cycle. The partition sizes are determined based on a performance analysis to ensure that the system is never limited by input/output feature memory loads. Such a partitioning scheme requires an external memory burst rate (memory bandwidth) of 5.3 GB/s at 100% system utilization.
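The double-buffering scheme amounts to two partitions swapping roles each cycle; a minimal sketch (our naming) follows.

```python
# Minimal sketch of the output-buffer double buffering: one partition
# receives the currently computed features while the other drains the
# previous iteration to external memory; swap() exchanges the roles.

class DoubleBuffer:
    def __init__(self):
        self._write, self._drain = [], []

    def store(self, feature):          # from the tile modules
        self._write.append(feature)

    def drain(self, external_memory):  # to external memory
        external_memory.extend(self._drain)
        self._drain.clear()

    def swap(self):                    # at the end of each cycle
        self._write, self._drain = self._drain, self._write
```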
4.4.2 Control Unit.
The central control unit is responsible for initiating the pipeline and generating the control signals of the different system blocks, such as the shifting and reset signals. At the start of each layer, the control unit receives the layer parameters, including the input-feature-map precision, which defines the number of iterations the crossbar needs to complete the operation. It also receives the input-feature-map depth and the kernel size, which define the required accumulations and which PE-cluster results need to be added. Furthermore, it receives the kernels' mappings in the crossbars, used to activate the corresponding columns and map the inputs to the correct rows; these mappings are static, and the weights are stored in the crossbar sub-arrays accordingly. Finally, it receives the number of kernels, which determines whether the input needs to be applied to the grid more than once to compute different kernels, i.e., when the number of kernels exceeds the kernel parallelization factor.
6 Evaluation
In this section, we first describe the experimental setup and the different benchmarks used, including their accuracies. Then, we evaluate our architecture and explore the experimental results from several aspects, such as accuracy, utilization, power, area, and speed.
6.1 Experimental Setup
We used Cadence Virtuoso to evaluate the mixed-signal blocks, investigating the transient behavior of the ADCs, the associated column, and the current limiter. Process variation of the circuit elements was investigated by Monte Carlo simulation at room temperature. The MAC accuracy at the output of the thermometer-code ADC was analyzed and compared to the expected results. For the suitably small \(\sigma _{V_t}\) of the FeFET in the chosen 22FDX technology, we obtain fault-free MAC computation. We also use Synopsys Design Compiler to evaluate the different digital blocks. Finally, we estimate the full system area and power consumption by combining all the circuit simulations and measurements, and we determine the overall overhead and latency. We built our simulation framework on ProxSim [11], which is based on TensorFlow, and used it to evaluate the CNN accuracies considering our bit decomposition as well as the variability introduced by the FeFET cells and the ADCs.
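As a rough sketch of how such non-idealities enter the accuracy simulation: the actual flow runs in ProxSim, and the Gaussian current-noise model and its sigma below are illustrative assumptions, not measured values.

```python
import numpy as np

# Illustrative injection of FeFET and ADC non-idealities into the
# bit-level MAC simulation. The Gaussian noise model and sigma are
# assumptions for illustration; the real flow runs in ProxSim.

rng = np.random.default_rng(0)

def column_current(active_lvt_cells, i_cell_nA=100.0, sigma_nA=5.0):
    # Each activated LVT cell contributes ~100 nA (current limiter);
    # residual cell-to-cell variation is modeled as Gaussian noise.
    return active_lvt_cells * i_cell_nA + rng.normal(0.0, sigma_nA)

def adc_read(i_in_nA, i_ref_nA=100.0, max_code=7):
    # 3-bit thermometer ADC: number of exceeded references, clipped.
    return int(np.clip(i_in_nA // i_ref_nA, 0, max_code))
```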
In Figure 9, the taped-out wafer of the column segments with the ADC is presented. The total area of the crossbar column is \(51.476 \times 5.651\) \(\mu\)m\(^2\), which makes it hard to distinguish on the pad-scaled wafer. The measurement results with respect to the simulations are given in Figure 4(b). As can be seen, the measurements match the simulations except for the first output (input) of the ADC, where a small leakage results in a lower voltage level.
In Table 2, we summarize the different parameters of our architecture as well as the operating frequency. As we consider only 1,024 PEs in our design, the maximum network size the architecture can support is 64 Mb. Though the architecture can adapt to different weight precisions, we map and compute the different convolutional models at 8-bit/4-bit (input/weight) precision, as it presents a good balance between the benchmarks' accuracy and the system performance. We use a batch size of 1 for the selected benchmarks to represent the worst case of input-side parallelization, as our architecture targets low-power inference applications. We benchmark our system with SqueezeNet [16] and ResNet-18 [15] for ImageNet [12], ResNet-20 [15] and MobileNetV2 [35] for CIFAR-10 [21], a dilated model [45] for Cityscapes [9], and LeNet-5 [23] for MNIST [13]. These benchmarks represent different computational and memory loads, as shown in Table 4.
Finally, we built a mapping framework based on Section 5 that defines the storage locations for the weights of the different benchmarks. We also built a simulation framework based on the different system estimations to measure system utilization and performance across the benchmarks.
6.2 Benchmarks Accuracy
Since our architecture's performance depends on the precision of the inferred networks, we tested the applicability of 8-bit/4-bit (input/weight) quantization based on the optimizations presented in [17]. We measured the classification accuracy over the test dataset for CIFAR-10 and MNIST, and over the validation dataset for ImageNet. For the dilated model, the mean Intersection over Union over the validation dataset is reported as the CNN accuracy.
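For reference, a generic symmetric uniform quantizer in this spirit looks as follows; the actual optimized scheme follows [17], which additionally tunes the quantization per layer, so the sketch below is simplified.

```python
import numpy as np

# Generic symmetric uniform quantization used to illustrate the
# 8-bit/4-bit (input/weight) setting; the optimized scheme of [17]
# additionally tunes the clipping ranges, which is omitted here.

def quantize(x, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax or 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale  # x ~= q * scale

w_q, w_scale = quantize(np.random.randn(64), bits=4)  # 4-bit weights
a_q, a_scale = quantize(np.random.rand(64), bits=8)   # 8-bit activations
```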
As shown in Table 3, compared to the floating-point (Top-1) accuracy, the 8-bit/8-bit quantization maintains the accuracy with a maximum loss of 0.6%. With 8-bit/4-bit, we were still able to maintain the accuracy with a maximum loss of 2%. Despite the highly optimized quantization (8-bit/4-bit), the FeFET variations, and the ADC leakage, our architecture maintains the networks' accuracy with almost no additional loss (<0.2%). The following section shows how such quantization optimization boosts the system performance and exploits the flexible precision the system offers.
6.3 Performance Analysis
In this section, we benchmark our architecture across the presented networks. We also focus on how our system achieves high average utilization across different computational loads, maximizing the achieved performance relative to the peak performance.
For the computational performance, our architecture reaches a peak of 2.05 tera operations per second (TOPS), where each operation is a full MAC operation on an 8-bit input and a 4-bit weight.
As shown in Figure 10, our system exhibits a direct relation between the hardware utilization and the achieved fraction of peak performance, without any effect of model size. This means that our architecture is completely independent of the external memory latency. This efficiency is achieved through the locality of the weights throughout the inference process and through the effective double buffering of inputs and outputs, which masks the memory-related transfers. Additionally, the system utilization depends on the computational load of the network layers, as shown in Table 4: the higher the computational load per output feature, the higher the corresponding utilization. This yields a high overall network utilization, as the layers with high computational loads represent the major bottleneck. However, the utilization can be affected by layer dimensions that do not match the parallelization factor. As explained previously, we also integrate the pooling layers into their preceding layers, so any extra overhead related to their tasks is avoided. Overall, the system shows high flexibility and reaches a utilization of up to 93% across different loads, as shown in Figure 10. Moreover, as shown in Table 5, our architecture maintains high throughput and energy performance across the various benchmarks. The system also extends well to larger numbers of PEs and crossbar sizes while maintaining high computational efficiency; however, to keep the utilization high at such extended dimensions, replicas of the weights need to be stored.
6.4 Energy Efficiency
With an energy efficiency ranging from 1,169 TOPS/W (1-bit/1-bit) to 18.27 TOPS/W (8-bit/8-bit), our architecture shows a substantial edge in energy efficiency, as shown in Figure 12. This performance is attributable to the system's very low operating voltage and current. As with the area efficiency, the presented ADCs, customized for our in-memory architecture, and the elimination of DACs account for a large reduction in power consumption, as shown in Figure 11. We also drastically reduced the number of simultaneously operating cells within the crossbar, so that the majority of the power is consumed by the digital blocks (adders, shifters, etc.). This allows for further power reduction, as these blocks offer various effective approximation opportunities that preserve the network accuracies while increasing energy efficiency.
Compared to state-of-the-art architectures, our design reduces the average power consumption down to 56.1 mW, as shown in Table 2. This corresponds to an energy efficiency of 36.5 TOPS/W (8-bit/4-bit). With this performance, our system outperforms the top-performing state-of-the-art in-memory architectures by a factor of 1.63X, even when comparing their macros against our full system.
6.5 Area Efficiency
We also analyzed the area efficiency of our architecture. This investigation includes the crossbar and the ADC size. In Figure 14, the layout and area of the crossbar with the ADC are illustrated. The total area of the crossbar with the ADC is around \(85 \times 50\) \(\mu\)m\(^2\), which makes it a very competitive design in terms of area efficiency. We also included the adders and accumulators across the PEs and the overall system, the required input/output feature map buffers, and the control unit logic needed for the system operation. A detailed breakdown of the system is shown in Table 2. The design occupies an area of 4.9 mm\(^2\).
Our system allows for completely flexible input and weight precision with only few changes in the ADC output connections. In Figure 13, we show the system performance at different precisions. Ranging from 261.1 TOPS/W/mm\(^2\) for binary operations to 4.08 TOPS/W/mm\(^2\) for 8-bit/8-bit (input/weight) operations, our system shows a very high area efficiency and can accommodate networks of up to 64 Mb.
As shown in Figure 11, this performance is attributable to the large area reduction enabled by the small 3-bit ADCs and the small FeFET cells. Furthermore, the optimized data accumulation in the architecture reduces the data to be transported, which is reflected in the bus area.
6.6 Comparison to DNN Accelerators
In Table 6, we compare our architecture to another FeFET-based architecture. We maintain a more feasible external memory bandwidth and operating frequency without compromising our architecture's performance, as shown in Table 7. Though we occupy a larger PE area, our PE design drastically reduces the number of subsequent adders, accumulators, and network-on-chip requirements, which reduces the final system area. Our PE also includes the area-efficient ADCs, which allow simultaneous activation of FeFETs, whereas the compared architecture can activate only a single FeFET per crossbar column and consequently needs 256 counters and adders per PE. Finally, we require a much lower memory bandwidth due to the very high data reuse and the lower operating frequency compared to the other FeFET architecture.
Furthermore, in Table 7, we compare our architecture to further advanced accelerators, including the purely CMOS-based UNPU and several in-memory architectures that vary in their computational paradigm. We scale the processes of the different architectures to our process for a fair comparison. Despite our system's adoption of more flexible and balanced mixed-signal processing, as shown in the previous section, it still outperforms the presented architectures in energy efficiency. However, PipeLayer and ISAAC show higher area efficiency due to their ability to store more than a single bit per ReRAM cell. In the case of Lattice, the area performance is not fully comparable, since only the area of the PE macro is considered and none of the other system components, such as buffers and the control unit, are included. Also, in our comparison, we scale the entire ReRAM area to our process even though the ReRAM cell does not necessarily scale as well. Our architecture is also lower in peak performance, as we keep the number of PEs at a level that guarantees a high level of utilization at all times, the lack of which represents a major drawback of several presented works.
As previously highlighted, the system also supports various networks of different complexities at low quantization and high accuracy compared to the state-of-the-art. In PipeLayer, for example, the network accuracy deteriorates sharply at lower resolutions, requiring a high quantization precision. Our architecture is also better able to integrate further approximations, whether through flexible quantization or in the various system blocks that can tolerate approximation without affecting the network accuracy. We could not include a full comparison on a specific network in terms of performance and accuracy, as many state-of-the-art architectures refrain from reporting their performance on specific networks for various understandable reasons.