Design and Implementation of Convolutional Neural Network Accelerator Based On RISC-V
Yangyang He¹*
¹ School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, 100876, China
* Corresponding author's e-mail: hyy@bupt.edu.cn
Abstract. Internet of Things (IoT) devices face ever-increasing amounts of data and can no longer simply collect data, as they did in the past, and hand computing tasks over to servers in the cloud. The growing computing requirements of the IoT and the diversity of its scenarios demand configurable, flexible customization of processors. This paper conducts an in-depth analysis of configurable customized processors and reviews related work, both domestic and international, on RISC-V chips and on device-side optimization of deep convolutional neural networks. On this basis, it proposes the instruction and hardware design of a customizable deep convolutional neural network accelerator attached to the Rocket-Chip Generator and based on the modular RISC-V instruction set.
1. Introduction
As electronic components gradually approach their physical limits, processor development has slowed in recent years. This stagnation, together with limited network bandwidth, is driving the transition from cloud computing to edge computing: the spare computing capacity of IoT devices is used for front-end data processing and analysis, and only the results are transmitted back to the cloud, relieving the pressure on it. Microprocessors based on RISC-V can be constructed relatively quickly, take full advantage of the simplicity of a reduced instruction set, and can be customized to meet the diverse needs of IoT devices.
At present, research on neural network accelerators falls into three main directions. (1) CPU/GPU platforms: because of their strong computing power and ample memory bandwidth, work here mainly optimizes the neural network software itself, using compression, pruning, quantization, and similar methods. (2) ASICs: specific hardware circuits are designed for specific tasks, implementing the algorithms directly in hardware to accelerate neural networks. (3) FPGAs: programmability is used to adapt the hardware structure to the algorithm. FPGAs offer a higher energy-efficiency ratio than CPUs/GPUs; although their efficiency is lower than that of ASICs, they provide lower cost and more flexible design.
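As an illustration of the quantization approach mentioned for the CPU/GPU direction, the sketch below shows symmetric post-training int8 quantization of a weight tensor. The function name and details are illustrative assumptions, not the API of any cited framework.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor post-training quantization to int8.

    Returns the quantized weights and the scale needed to recover
    approximate float values: w ≈ q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Quantize a random convolution weight tensor and measure the error.
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(q.astype(np.float32) * scale - w).max()
```

Since rounding is to the nearest representable level, the reconstruction error is bounded by half a quantization step (`scale / 2`).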
Many companies have open-sourced neural network device-side inference frameworks for mobile and embedded devices, including Tencent's NCNN, Alibaba's MNN, Xiaomi's MACE, and OPEN AI LAB's Tengine [1,2,3,4]. These frameworks mainly use the Winograd transformation to optimize convolution operations and use the ARM NEON instruction set for parallel computation, optimizing data structures, memory management, and scheduling across multiple cores. Research on accelerating convolutional neural networks therefore focuses on two aspects: improving parallelism and optimizing memory utilization.
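The Winograd transformation reduces the number of multiplications in small convolutions. As an illustration (not the code of any cited framework), the minimal 1-D case F(2,3) computes two outputs of a 3-tap filter with four multiplications instead of the naive six:

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter g applied to a
    4-element input tile d, using 4 multiplies instead of 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

# Compare against the direct sliding-window computation.
d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
```

In practice the frameworks apply the 2-D analogue (e.g. F(2×2, 3×3)) to 3×3 convolution layers, where the filter transform is precomputed once per layer.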
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
ISAEECE 2021 IOP Publishing
Journal of Physics: Conference Series 1871 (2021) 012073 doi:10.1088/1742-6596/1871/1/012073
3. Accelerator design
The convolutional neural network accelerator designed in this paper combines GEMM optimization with a weight-stationary systolic array. The overall architecture of the accelerator is shown in Figure 4. The purple part of the figure is the accelerator, which is connected to the Rocket core through the RoCC interface. The accelerator accesses the L2 cache and DRAM through a DMA module to read feature-map data and write results back. It consists of three parts: the control module, the computation module, and the cache module.
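The GEMM formulation lowers convolution to a single matrix multiply by unrolling input patches into columns (im2col). A minimal single-channel sketch in Python follows; the function names are illustrative, not the accelerator's actual interface.

```python
import numpy as np

def im2col(x, k):
    """Unroll every k×k patch of a single-channel feature map into a
    column, so convolution becomes one matrix multiply (GEMM)."""
    h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((k * k, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv_gemm(x, kern):
    """2-D correlation (stride 1, no padding) via im2col + GEMM."""
    k = kern.shape[0]
    oh, ow = x.shape[0] - k + 1, x.shape[1] - k + 1
    return (kern.ravel() @ im2col(x, k)).reshape(oh, ow)

# 4×4 input, 3×3 all-ones kernel: each output is a 3×3 window sum.
x = np.arange(16, dtype=np.float64).reshape(4, 4)
y = conv_gemm(x, np.ones((3, 3)))
```

With multiple input and output channels, the kernel matrix gains one row per output channel, and the GEMM maps directly onto the systolic array described above.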
[Figure 4. Overall accelerator architecture: an ex_controller and store_controller, activation and pooling units, a temporary buffer (TEMP), and a PE array, connected to the L2 cache and DRAM.]
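In a weight-stationary systolic array, each processing element (PE) keeps its weight in place while activations and partial sums stream through the array. The following is a purely behavioural sketch of one PE column (the real design is hardware; class and function names are illustrative):

```python
class PE:
    """One weight-stationary processing element: it holds its weight,
    multiplies it by the streamed-in activation, and adds the partial
    sum arriving from the neighbouring PE."""
    def __init__(self, weight):
        self.weight = weight

    def step(self, activation, psum_in):
        return psum_in + self.weight * activation

def systolic_dot(weights, activations):
    """Dot product on one column of weight-stationary PEs: the partial
    sum flows through the column, picking up one product per PE."""
    pes = [PE(w) for w in weights]
    psum = 0.0
    for pe, a in zip(pes, activations):
        psum = pe.step(a, psum)
    return psum

out = systolic_dot([0.5, -1.0, 2.0], [1.0, 2.0, 3.0])
```

A full array replicates such columns so that many dot products (the GEMM above) proceed in parallel, with each weight loaded once and reused across the whole feature map.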
4. Conclusion
The Rocket-Chip code with the convolutional neural network accelerator was synthesized, and the resulting power consumption and hardware resource usage are shown in Figure 6.
Acknowledgments
First of all, I would like to thank my teacher; without his guidance and help, this article would not exist. I would also like to thank my classmates for their help during the writing of this article, and finally my relatives for their encouragement and support during this period.
References
[1] OAID. (2020) Tengine. https://github.com/OAID/Tengine.
[2] Tencent. (2020) NCNN. https://github.com/Tencent/ncnn.
[3] Xiaomi. (2020) MACE. https://github.com/XiaoMi/mace.
[4] Alibaba. (2020) MNN. https://github.com/alibaba/MNN.
[5] Chen Y H, Krishna T, Emer J S, et al. (2017) Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1): 127-138.
[6] van de Geijn R. how-to-optimize-gemm. https://github.com/flame/how-to-optimize-gemm/wiki.
[7] Wang Zifeng. (2020) Development of Artificial Intelligence Chip Software Stack and Algorithm Research. Hangzhou Dianzi University.