8.1 Experimental Setup
We develop a C++-based cycle-accurate simulator which emulates the functionality of COSMO. The simulator uses the performance and energy characteristics of the hardware, which are obtained from circuit-level simulations for a 45 nm CMOS process technology using Cadence Virtuoso. We use the VTEAM memristor model [
49] for our memory design simulation with
\(R_{ON}\) and
\(R_{OFF}\) of
\(10\, k\Omega\) and
\(10\, M\Omega\), respectively, and a switching delay of 1.1 ns. This switching delay sets COSMO's design cycle, and all COSMO operations have been designed to meet the resulting memory cycle constraint of 1.1 ns. We implement logic operations using digital PIM operations [
28] discussed in brief in Section
2.2 and shown in Figure
1. In doing so, some operations, like NOR, take one cycle while others take multiple cycles; for example, implementing an XOR operation in COSMO takes three cycles. For other operations, like COSMO-addition, we conservatively set the sense-amplifier read time in the simulator to 1.1 ns, even though the actual read is much faster. We simulate a memory block under various conditions for different operations in Virtuoso, record the latency and energy consumption at these settings, and use them in our simulator. The simulator maps the application over many memory blocks and combines the results obtained from the memory block simulation to calculate the latency and the average power consumption of our design.
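To make this aggregation step concrete, the following minimal sketch (our own illustration, not the actual simulator code; names such as BlockOp are hypothetical) combines per-block measurements in the way described above: latencies of blocks operating in parallel are taken as a maximum, sequential steps are summed, and energies always accumulate, from which the average power follows.

```cpp
// Minimal sketch of combining per-block latency/energy figures (as measured in
// Virtuoso) into application-level latency and average power. Illustrative only.
#include <algorithm>
#include <iostream>
#include <vector>

struct BlockOp {
    double latency_ns;  // per-block latency measured in Virtuoso
    double energy_pj;   // per-block energy measured in Virtuoso
};

int main() {
    // One entry per memory block scheduled in a "wave"; blocks within a wave
    // operate in parallel, and waves execute back to back.
    std::vector<std::vector<BlockOp>> waves = {
        {{1.1, 0.4}, {1.1, 0.4}},  // e.g., two blocks doing single-cycle NORs
        {{3.3, 1.2}},              // e.g., one block doing a 3-cycle XOR
    };

    double total_latency_ns = 0.0, total_energy_pj = 0.0;
    for (const auto& wave : waves) {
        double wave_latency = 0.0;
        for (const auto& op : wave) {
            wave_latency = std::max(wave_latency, op.latency_ns);  // parallel blocks
            total_energy_pj += op.energy_pj;                       // energies always add
        }
        total_latency_ns += wave_latency;  // sequential waves add up
    }
    double avg_power_mw = total_energy_pj / total_latency_ns;  // pJ / ns = mW
    std::cout << total_latency_ns << " ns, " << avg_power_mw << " mW\n";
    return 0;
}
```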
We compare the efficiency of the proposed COSMO with a state-of-the-art NVIDIA GTX 1080 Ti GPU. While reporting the execution time for the GPU, we preload the data onto the GPU and report only the GPU execution time. We utilize nvprof with the NVIDIA Visual Profiler (NVVP) to obtain the execution time of individual kernels and consider only the kernels corresponding to application computations when comparing GPU performance with COSMO. The GPU, however, has certain device-side overheads, such as context switching and hardware scheduling, which may increase its execution time.
We evaluate COSMO efficiency on several image processing and learning applications. For image processing, we use four general applications:
Sobel, Robert, Prewitt, and
BoxSharp. We use random images from
Caltech 101 [
57] library. For learning applications, we evaluate COSMO efficiency on DNN and HD computing workloads. As Table
2 shows, we test COSMO efficiency and accuracy on four popular networks running on the large-scale ImageNet dataset [
47]. The GPU evaluations for DNNs were done using their PyTorch [
1] implementations. We used the Brevitas library from Xilinx to obtain their integer models [
66]. For HD computing, we evaluated COSMO accuracy and efficiency on four practical applications, including speech recognition (ISOLET), face detection (FACE), activity recognition (UCIHAR), and security detection (SECURITY). To compare COSMO with GPUs, we developed a GPU-optimized version of the CPU implementation presented in [
35]. The HD similarity check on the GPU implements a dot product between two 10,000-dimensional vectors. This operation is the same for all datasets because the encoded input hypervector has a dimensionality of 10,000 for each dataset. For an inference input, we instantiate
k such operations, where
k is the number of classes. Table
2 compares the baseline accuracy (32-bit integer values) and the quality loss of the applications running on COSMO using 32-bit SM-SC [
91] encoding. Our evaluation shows that COSMO incurs only about 1.5% and 1% quality loss on DNN and HD computing applications, respectively.
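For reference, the structure of this similarity check is sketched below in plain C++ (an illustration of the computation being offloaded, not our GPU-optimized kernel; the function and variable names are ours): one dot product per class over the 10,000-dimensional hypervectors, followed by an argmax over the k scores.

```cpp
// Illustrative sketch of the HD similarity check: a dot product between the
// encoded query hypervector and each of the k class hypervectors.
#include <cstdint>
#include <iostream>
#include <vector>

constexpr int D = 10000;  // hypervector dimensionality, fixed for all datasets

// Returns the index of the class model most similar to the encoded query
// (one dot product per class, k classes in total).
int classify(const std::vector<std::vector<int32_t>>& class_models,  // k x D
             const std::vector<int32_t>& query) {                    // D
    int best_class = 0;
    int64_t best_score = INT64_MIN;
    for (int c = 0; c < static_cast<int>(class_models.size()); ++c) {
        int64_t score = 0;
        for (int d = 0; d < D; ++d)
            score += static_cast<int64_t>(class_models[c][d]) * query[d];
        if (score > best_score) { best_score = score; best_class = c; }
    }
    return best_class;
}

int main() {
    // Toy example with k = 2 classes; real models come from HD training.
    std::vector<std::vector<int32_t>> models(2, std::vector<int32_t>(D, 1));
    for (auto& v : models[1]) v = -1;
    std::vector<int32_t> query(D, 1);
    std::cout << "predicted class: " << classify(models, query) << "\n";
    return 0;
}
```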
8.2 COSMO Tradeoffs
COSMO, and SC in general, depends on the bit-stream length: the longer the bit-stream, the higher the accuracy. However, this increase in accuracy comes at the cost of increased area and energy consumption. As the length increases, more area (both memory and CMOS) is required to store and process the data, requiring more energy. It may also result in higher latency for some operations, like MUX-based additions, whose latency increases linearly with bit-stream length. To evaluate the effect of bit-stream length at the operation level, we generate random inputs for each operation and take the average of 100 such samples. Each accumulation operation inputs 100 stochastic numbers. The results here correspond to unipolar encoding; all other encodings behave similarly with a slight change in accuracy. An increase in bit-stream length has a direct impact on the accuracy, area, and energy at the operation level, while the latency of the design remains the same for all operations except MUX-based addition, Bernstein polynomial, and FSM-based operations, which process each bit sequentially. While COSMO supports MUX-based addition, it uses the proposed COSMO-addition (Section 5.4) by default, whose latency does not scale linearly with the bit-stream length. When implemented with a bit-stream length of 256, all operations have on average a 4\(\times\) improvement in area and energy consumption compared to the corresponding implementation with a bit-stream length of 1,024, while incurring 3.6% quality loss. For the same change in bit-stream length, the latency of MUX-based addition, Bernstein polynomial, and FSM-based operations differs on average by 3.95\(\times\).
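The accuracy side of this tradeoff can be reproduced with a few lines of generic SC code. The sketch below (a standard unipolar SC multiplier written in plain C++, not COSMO's in-memory circuitry) encodes two values as random bit-streams, multiplies them with a bitwise AND, and shows the estimate tightening as the stream length grows.

```cpp
// Unipolar stochastic multiplication: each value in [0,1] is encoded as a
// bit-stream whose fraction of 1s equals the value, and the product is the
// fraction of 1s in the bitwise AND of the two streams. Longer streams give
// lower error. Generic SC illustration.
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

std::vector<uint8_t> encode(double p, int len, std::mt19937& gen) {
    std::bernoulli_distribution bit(p);
    std::vector<uint8_t> s(len);
    for (auto& b : s) b = bit(gen);
    return s;
}

int main() {
    std::mt19937 gen(42);
    const double a = 0.6, b = 0.7;  // exact product = 0.42
    for (int len : {256, 1024, 4096}) {
        auto sa = encode(a, len, gen), sb = encode(b, len, gen);
        int ones = 0;
        for (int i = 0; i < len; ++i) ones += (sa[i] & sb[i]);  // AND-based multiply
        std::cout << "len=" << len << "  product~" << double(ones) / len << "\n";
    }
    return 0;
}
```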
To evaluate the effect at the application level, we implement the general applications listed above using COSMO with an input dataset of size 1 kB. The results shown here use unipolar encoding with AND-based multiplication, COSMO addition, and Maclaurin series-based implementations of the other arithmetic functions. Since the latency of all these operations is independent of the bit-stream length, it does not change with the length. The minor increase in application-level latency with the length is due to the time taken by the stochastic-to-binary conversion circuits; however, this change is negligible. Figure
11 shows the impact of bit-stream length on different applications. On average, both the area and energy consumption of the applications increase by 8\(\times\) when the bit-stream length increases from 512 to 4,096, with an average 6.1 dB PSNR gain. As shown in Figure
12, with a PSNR of 29 dB, the output of the Sobel filter with a bit-stream length of 4,096 is visually similar to that of the exact computation.
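The PSNR metric quoted here follows the standard definition for 8-bit images; a small helper (our own illustrative code, assuming 8-bit pixels, not part of the COSMO toolchain) is sketched below.

```cpp
// Standard PSNR between an exact and an approximate 8-bit image (MAX = 255);
// higher values mean the stochastic result is closer to exact computation.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

double psnr(const std::vector<uint8_t>& exact, const std::vector<uint8_t>& approx) {
    double mse = 0.0;
    for (std::size_t i = 0; i < exact.size(); ++i) {
        double diff = double(exact[i]) - double(approx[i]);
        mse += diff * diff;
    }
    mse /= double(exact.size());
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}

int main() {
    std::vector<uint8_t> exact = {10, 20, 30, 40}, approx = {11, 19, 31, 39};
    std::cout << "PSNR = " << psnr(exact, approx) << " dB\n";  // ~48 dB
    return 0;
}
```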
8.3 Learning on COSMO and GPU
COSMO Configurations: We compare COSMO with the GPU for the DNN and HD computing workloads detailed in Table
2. We use SM-SC encoding with a bit-stream length of 32 [
91] to represent the inputs and weights in DNNs and the value of each dimension in HD computing on COSMO. The evaluation is performed while keeping the COSMO area and technology node the same as those of the GPU. We analyze COSMO in five different configurations to evaluate the impact of the various techniques proposed in this work at the application level. Of these configurations, COSMO-ALL is the best configuration and applies all the stochastic PIM techniques proposed in this work. Compared to COSMO-ALL, COSMO-PC and COSMO-MUX do not implement the new addition technique proposed in Section
5.4 but use conventional PC-based and MUX-based addition/accumulation, respectively. COSMO-NP implements all the techniques except memory bitline segmentation, i.e., it eliminates block partitioning. Finally, COSMO-FX replaces the XNOR operation in COSMO-ALL with the XNOR implementation of [
28].
Comparison of different COSMO Configurations: Comparing different COSMO configurations in Figure 13, we observe that COSMO-PC is on average 240\(\times\) and 647\(\times\) slower than COSMO-ALL for DNNs and HD computing, respectively. This happens because COSMO-PC reads every data point sequentially for accumulation, whereas COSMO-ALL performs a highly parallel single-cycle accumulation. The effect is most visible for DNNs, where COSMO-ALL is 810\(\times\) and 140\(\times\) faster than COSMO-PC for AlexNet and VGG-16, both of which have large FC layers. On the other hand, COSMO-ALL is just 3.8\(\times\) and 7.7\(\times\) better than COSMO-PC for ResNet-18 and GoogleNet, which have one fairly small FC layer each, accumulating \(512\times 1,000\) and \(1,024\times 1,000\) data points, respectively. The latency of COSMO-MUX scales linearly with the bit-stream length. For our 32-bit DNN implementation, COSMO-ALL is 5.1
\(\times\) faster than COSMO-MUX. COSMO-ALL particularly shines over COSMO-MUX in the case of HD computing, where it is 188\(\times\) faster; COSMO-MUX becomes a bottleneck in the similarity check phase, when the products over all dimensions need to be accumulated. COSMO-ALL provides a maximum theoretical speedup of 32\(\times\) over COSMO-NP. In practice, COSMO-ALL is on average 11.9
\(\times\) faster than COSMO-NP for DNNs. Further, COSMO-ALL is 20% faster and 30% more energy efficient than COSMO-FX for DNNs. This shows the benefits of COSMO over previous digital PIM operations.
Comparison with GPU for DNNs: COSMO benefits from three factors: simpler computations due to SC, a high-density storage and processing architecture, and less data movement between processor and memory due to PIM. From Figure 13, we observe that COSMO-ALL is on average \(141\times\) faster than the GPU for DNNs. COSMO latency depends mainly on the convolution operations in a network. As discussed before, while COSMO parallelizes computations over input channels and weight depth in a convolution layer, the convolution of a weight window over an individual input channel still serializes the sliding of windows through the input. This means that the latency of a convolution layer in COSMO is directly proportional to its output size. This is reflected in the results, where COSMO achieves higher acceleration for AlexNet (362
\(\times\)) and ResNet-18 (130
\(\times\)) as compared to VGG-16 (
\(29\!\times\)) and GoogleNet (
\(41\!\times\)). Here, even though ResNet-18 is deeper than VGG-16, its execution is faster because it reduces the size of its convolution outputs significantly faster than VGG-16 does. Also, COSMO-ALL is
\(80\times\) more energy efficient than the GPU. This is due to the low-cost SC operations and the reduced data movement in COSMO, where the DNN models are pre-stored in memory.
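The proportionality noted above, that the latency of a convolution layer tracks its output size, can be written as a simple first-order model (our simplification, assuming each window position takes a roughly fixed number of COSMO cycles \(c\)):
\[
T_{\text{conv}} \;\approx\; c \cdot H_{\text{out}} \cdot W_{\text{out}},
\]
where \(H_{\text{out}} \times W_{\text{out}}\) is the spatial size of the layer's output: the work over input channels and weight depth at each window position is performed in parallel, while the window positions themselves are processed sequentially.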
To put our experimental results in perspective, we also present an analytical comparison between the GPU and COSMO. Here, we consider the ideal performance of 11.34 TOPS/s for the NVIDIA GTX 1080 Ti [2]. We account only for the core TDP of 250 W of the GPU and not the entire system power. This translates to an ideal computational (power) efficiency of 10 GOPS/s/mm\(^2\) (46 GOPS/s/W). In contrast, COSMO-ALL (in a 45 nm process node) has a computational (power) efficiency of 525 GOPS/s/mm\(^2\) (908 GOPS/s/W). For a fair comparison, we normalize COSMO-ALL to the 16 nm process node, the same as the GTX 1080 Ti. While scaling down, we consider area and power changes [84]. We do not reduce COSMO's latency since it depends on the switching behavior of ReRAM, which may not scale proportionally. We observe that COSMO's computational and power efficiencies increase to 4,151 GOPS/s/mm\(^2\) and 3,681 GOPS/s/W, respectively.
However, COSMO has on average a 1.2% accuracy loss compared to GPUs running with 32-bit integer representation, as shown in Table
2. Moreover, COSMO is not meant for DNN training, owing to its stochastic nature, and it cannot perform online model re-training as a GPU can. Instead, we train DNN models on GPUs and load the trained, quantized models onto COSMO. However, this loading happens just once and is amortized over many test inputs. The time for loading the inference model is common to both GPU and COSMO and is excluded from our performance estimates.
Comparison with GPU for HD Computing: COSMO-ALL is on average 156\(\times\) faster than the GPU for HD classification tasks. The computation in an HD classification task is directly proportional to the number of output classes. However, the computations for different classes are independent of each other. The high parallelism (due to the dense architecture and configurable partitioning structure) provided by COSMO makes the execution time of different applications largely insensitive to the number of classes. In the case of the GPU, in contrast, the limited parallelism (4,000 cores in the GPU vs. 10,000 dimensions in HD) makes the latency directly dependent on the number of classes. The energy consumption of COSMO-ALL scales linearly with the number of classes while being on average 2,090\(\times\) more energy efficient than the GPU. This is mainly due to the reduced data movement in COSMO, where the huge class models are pre-stored in the memory. Moreover, HD computing consists of simple bitwise and addition operations. Unlike COSMO, the GPU is not able to fully exploit the simplicity of operations that HD provides. The GPU performs hundreds of thousands of MAC operations to implement the HD dot product, whereas COSMO uses just simple XNORs and a highly efficient accumulation.
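To contrast the two implementations, the sketch below shows the SC view of a single dimension of this dot product (generic bipolar SC arithmetic in plain C++, shown for illustration; COSMO performs the XNOR and the accumulation inside the memory arrays): multiplication reduces to a bitwise XNOR and the per-dimension product to a popcount.

```cpp
// In bipolar SC encoding, multiplying two bit-streams reduces to a bitwise
// XNOR, and the per-dimension product is recovered by counting 1s. Summing
// these products over all 10,000 dimensions yields the similarity score.
#include <bitset>
#include <cstdint>
#include <iostream>

int main() {
    // 32-bit bit-streams for one dimension of the query and one class model.
    uint32_t x = 0xF0F0F0F0u;   // bipolar stream for the query element
    uint32_t w = 0xCCCCCCCCu;   // bipolar stream for the model element
    uint32_t prod = ~(x ^ w);   // XNOR = bipolar stochastic multiply
    int ones = std::bitset<32>(prod).count();
    // Bipolar decode: value = 2 * (ones / 32) - 1.
    double value = 2.0 * ones / 32.0 - 1.0;
    std::cout << "per-dimension product ~ " << value << "\n";
    return 0;
}
```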
8.4 COSMO vs Previous Accelerators
Stochastic Accelerators: We first compare COSMO with four state-of-the-art SC accelerators [
43,
75,
81,
To demonstrate the benefits of COSMO, we first implement their proposed designs on COSMO hardware. Figure 14(a) shows the relative performance per area of COSMO compared to them. We present the results for two configurations. In the first, we use the same bit-stream length and logic for multiplication, addition, and other functions as the corresponding accelerators. In the second, we replace the additions and accumulations in all the designs with COSMO addition. Irrespective of the configuration, COSMO consumes 7.9
\(\times\), 1,134
\(\times\), 474
\(\times\), 2,999
\(\times\) less area as compared to [
43,
75,
81,
91], respectively. When comparing with the previous designs in their original configurations, we observe that COSMO does not perform better than three of the designs [
43,
81,
91]. The high area benefits provided by COSMO are overshadowed by the high-latency addition used in these designs, which requires popcounting each data point either exactly or approximately, both of which require reading out data. Unlike previous accelerators, COSMO uses memory blocks as processing elements. Multiple data read-outs from a memory block must be done sequentially, resulting in high execution times, with COSMO being on average 6.3\(\times\) and at most \(7.9\times\) less efficient. Moreover, the baseline performance figures for these accelerators are optimized for small workloads which do not scale with the complexity and size of operations (a 200-input neuron for [43] and an \(8\times 8\times 8\) MAC unit for [81, 91], while ignoring the overhead of SNGs). However, when COSMO addition is used for these accelerators, COSMO is on average \(11.5\times\) and at most \(20.1\times\) more efficient than these designs.
On the other hand, when the workload size and complexity increase, as in the case of SC-DNN [75], which implements the LeNet-5 [52] neural network for the MNIST dataset [51], COSMO is better than SC-DNN even in its original configuration, being 3.7\(\times\) more efficient for the most accurate SC-DNN design. Further, when COSMO addition and accumulation are used on the same design, COSMO becomes 10.4
\(\times\) more efficient.
DNN Accelerators: We also compare the computational (
\(GOPS/s/mm^2\)) and power efficiency (
\(GOPS/s/W\)) of COSMO with state-of-the-art DNN accelerators [
20,
77,
83]. Here, DaDianNao [
20] is a CMOS-based ASIC design, while ISAAC [
77] and PipeLayer [
83] are ReRAM based PIM designs. Unlike these designs, which have a fixed processing element (PE) size, the high flexibility of COSMO allows it to change the size of its PE according to the workload and operation to be performed. For example, a 3
\(\times\)3 convolution (2000
\(\times\)100 FC layer) is spread over 9 (2000) logical partitions, each of which may further be split into multiple physical partitions as discussed in Section
7.1. As a result, COSMO does not have theoretical figures for computational and power efficiency. However, to compare COSMO with these accelerators, we run the four neural networks shown in Table
2 on COSMO and report their average efficiency in Figure
14(b). We observe that COSMO is more power efficient than all DNN accelerators, being 3.2
\(\times\), 2.4
\(\times\), and 6.3
\(\times\) better than DaDianNao, ISAAC, and PipeLayer, respectively. This is due to three main reasons: reducing the complexity of each operation, reducing the number of intermediate reads and writes to memory, and eliminating power-hungry conversions between the analog and digital domains.
We also observe that COSMO is computationally more efficient than DaDianNao and ISAAC, being 8.3\(\times\) and 1.1\(\times\) better, respectively. This is due to the high parallelism that COSMO provides, processing different input and output channels in parallel. However, COSMO is 2.8\(\times\) less computationally efficient than PipeLayer. This happens because, even though COSMO parallelizes computation within a convolution window, it serializes the sliding of the window over the convolution operation, whereas PipeLayer makes a large number of copies of the weights to parallelize computation within the entire convolution operation. Computational efficiency is also inversely affected by the size of the accelerator, which makes COSMO's comparatively old technology node a hidden overhead in this metric. To give an intuition of the benefits COSMO can provide, we scale all the accelerators to the same technology node, i.e., 28 nm. DaDianNao and PipeLayer are already reported at the 28 nm node. On scaling ISAAC and COSMO to 28 nm, their computational efficiencies increase to 625 \(GOPS/s/mm^2\) and 1,355 \(GOPS/s/mm^2\), respectively. This shows that COSMO can be as computationally efficient as the best DNN accelerator while providing significantly better power efficiency.
Embedded Devices: We compare the power efficiency of COSMO with the state-of-the-art implementations of DNN inference on NVIDIA Tegra Jetson X1 GPU [
65], FPGA [
78], and Edge-TPU [
3]. For the inference task on AlexNet, the Tegra X1 (FPGA) achieves a power efficiency of 45 images/s/W (16 images/s/W), while COSMO achieves 506 images/s/W. For VGG-16, COSMO achieves a power efficiency of 1,112 images/s/W as opposed to 66 images/s/W for the Edge-TPU.
8.5 COSMO and Memory Non-Idealities
Bit-Flips: SC is inherently immune to individual bit-flips in data, and COSMO, being based on SC, enjoys the same immunity. Here, we evaluate the quality loss in COSMO as the number of bit-flips increases. We evaluate the general applications with the same configuration as in Section 8.2 with a bit-stream length of 1,024. The quality loss is measured as the difference between the accuracy with and without bit-flips. Figure 15(a) shows that with 10% bit-flips, the average quality loss is a meager 0.27%. When the bit-flips increase to 25%, applications lose only 0.66% in accuracy.
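The intuition behind this robustness is that every bit in a stochastic stream carries the same weight, so a single flip perturbs the encoded value by only \(1/N\); in a positional binary encoding, a flip of the most significant bit changes the value by half the full range. The small sketch below (our own illustration, not COSMO's fault-injection setup) makes the contrast explicit.

```cpp
// Contrast of a single bit-flip in a unipolar stochastic stream (error 1/N)
// with a flip of the most significant bit in a binary fraction (error 0.5).
#include <cstdint>
#include <iostream>

int main() {
    const int N = 1024;                       // bit-stream length used in this section
    double value = 0.6;                       // fraction of 1s in the stream
    double after_one_flip = value - 1.0 / N;  // a single 1 flipped to 0
    std::cout << "stochastic: " << value << " -> " << after_one_flip
              << " (error " << 1.0 / N << ")\n";

    uint16_t bin = 0x9999;                    // 16-bit binary fraction ~0.6
    uint16_t flipped = bin ^ 0x8000;          // flip the most significant bit
    std::cout << "binary:     " << bin / 65536.0 << " -> "
              << flipped / 65536.0 << " (error 0.5)\n";
    return 0;
}
```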
Memory Lifetime: COSMO uses the switching of ReRAM cells, which are known to have low endurance. Higher switching per cell may result in reduced memory lifetime and increased unreliability. Previous work [
29,
37,
50] uses an iterative process to implement multiplication and other complex operations. The more iterations, the higher the number of operations and hence the per-cell switching count. COSMO reduces this complex iterative process to just one logic gate in the case of multiplication, and breaks down other complex operations into a series of simple operations, thereby achieving a lower switching count per cell. Figure
15(b) shows that for multiplication, COSMO increases the lifetime of memory by 5.9
\(\times\) and 6.6
\(\times\) on an average as compared to APIM [
37] and Imaging [
29], respectively.