CN107153873A - A kind of two-value convolutional neural networks processor and its application method - Google Patents
A kind of two-value convolutional neural networks processor and its application method
- Publication number
- CN107153873A CN107153873A CN201710316252.9A CN201710316252A CN107153873A CN 107153873 A CN107153873 A CN 107153873A CN 201710316252 A CN201710316252 A CN 201710316252A CN 107153873 A CN107153873 A CN 107153873A
- Authority
- CN
- China
- Prior art keywords
- data
- binary
- convolution
- elements
- convolution kernel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000013527 convolutional neural network Methods 0.000 title claims abstract description 43
- 238000000034 method Methods 0.000 title claims description 37
- 238000011176 pooling Methods 0.000 claims abstract description 20
- 238000010606 normalization Methods 0.000 claims abstract description 12
- 238000012545 processing Methods 0.000 claims abstract description 9
- 238000004364 calculation method Methods 0.000 claims description 60
- 239000011159 matrix material Substances 0.000 claims description 27
- 238000013500 data storage Methods 0.000 claims description 17
- 238000010586 diagram Methods 0.000 claims description 13
- 238000009825 accumulation Methods 0.000 claims description 9
- 238000004590 computer program Methods 0.000 claims description 5
- 238000006243 chemical reaction Methods 0.000 claims 1
- 238000013528 artificial neural network Methods 0.000 description 23
- 230000008569 process Effects 0.000 description 14
- 230000015654 memory Effects 0.000 description 8
- 238000013506 data mapping Methods 0.000 description 6
- 230000000694 effects Effects 0.000 description 5
- 230000006870 function Effects 0.000 description 4
- 238000013507 mapping Methods 0.000 description 4
- 238000005265 energy consumption Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000013473 artificial intelligence Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 238000001994 activation Methods 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000003139 buffering effect Effects 0.000 description 1
- 230000002950 deficient Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000014509 gene expression Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000013178 mathematical model Methods 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 230000000946 synaptic effect Effects 0.000 description 1
- 239000002699 waste material Substances 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
The present invention provides a binary convolutional neural network processor, comprising: a storage device for data to be calculated, used to store the elements of the data to be convolved and the convolution kernel elements, both in binary form; a binary convolution device, used to perform binary convolution operations on the binary convolution kernel elements and the corresponding elements of the binary data to be convolved; a data scheduling device, used to load the convolution kernel elements and the corresponding elements of the data to be convolved into the binary convolution device; a pooling device, used to pool the results obtained by the convolution; and a normalization device, used to normalize the pooled results.
Description
Technical Field

The present invention relates to the storage and scheduling of data in neural network model computation.
Background Art

With the development of artificial intelligence, technologies involving deep neural networks, and convolutional neural networks in particular, have advanced rapidly in recent years and have found wide application in fields such as image recognition, speech recognition, natural language understanding, weather prediction, gene expression analysis, content recommendation, and intelligent robotics.

A deep neural network can be understood as a computational model containing a large number of data nodes, in which each data node is connected to other data nodes and the connection between nodes is represented by a weight. As deep neural networks continue to develop, their complexity keeps increasing.

To balance complexity against computational effectiveness, the reference Courbariaux M., Hubara I., Soudry D., et al., "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016, proposed the "binary convolutional neural network model" to reduce the complexity of conventional neural networks. In a binary convolutional neural network, the weights, input data, and output data are all in "binary form"; that is, their values are approximated by "1" and "-1", with "1" representing values greater than or equal to 0 and "-1" representing values less than 0. This reduces the bit width of the data operated on in the neural network and thereby greatly reduces the required parameter capacity, making binary convolutional neural networks especially suitable for implementing image recognition, augmented reality, and virtual reality on end devices.

In the prior art, deep neural networks are usually run on general-purpose processors such as central processing units (CPUs) and graphics processing units (GPUs); no dedicated processor for binary convolutional neural networks exists. The computing units of general-purpose processors usually have multi-bit widths, so using them to compute binary neural networks wastes resources.
Summary of the Invention
Therefore, the object of the present invention is to overcome the above defects of the prior art and provide a binary convolutional neural network processor, comprising:

a storage device for data to be calculated, used to store the elements of the data to be convolved and the convolution kernel elements, both in binary form;

a binary convolution device, used to perform binary convolution operations on the binary convolution kernel elements and the corresponding elements of the binary data to be convolved;

a data scheduling device, used to load the convolution kernel elements and the corresponding elements of the data to be convolved into the binary convolution device;

a pooling device, used to pool the results obtained by the convolution; and

a normalization device, used to normalize the pooled results.
Preferably, in the binary convolutional neural network processor, the binary convolution device comprises:

an XNOR gate, which takes as its inputs a binary convolution kernel element and the corresponding element of the binary data to be convolved;

an accumulation device, which takes the output of the XNOR gate as its input and accumulates the XNOR gate outputs to output the result of the binary convolution operation;

wherein the accumulation device comprises an OR gate and/or a Hamming weight calculation unit, wherein

at least one input of the OR gate is the output of the XNOR gate;

at least one input of the Hamming weight calculation unit is the output of the XNOR gate.
Preferably, in the binary convolutional neural network processor, the storage device for data to be calculated is further used to store online the binary-converted convolution kernel and/or data to be convolved.

Preferably, the binary convolutional neural network processor further comprises:

a binarization device, used to convert the obtained convolution kernel and/or data to be convolved into binary form.

Preferably, in the binary convolutional neural network processor, the data scheduling device is provided with a register, used to load, during use, the convolution kernel elements that need to be reused.

Preferably, in the binary convolutional neural network processor according to any of the above, the elements of the data to be convolved and the convolution kernel elements are stored in the storage device for data to be calculated in a layer-interleaved manner.

Preferably, in the binary convolutional neural network processor, the elements of the data to be convolved are stored in the storage device for data to be calculated according to the size of the convolution kernel and the order in which the elements participate in the convolution operation.

Preferably, in the binary convolutional neural network processor, the storage of the elements of the data to be convolved and/or the convolution kernel elements in the storage device for data to be calculated satisfies one or more of the following:

they are stored according to the matrix arrangement order of the convolution kernel and the data to be convolved;

elements at the same position but in different channels of the matrices of the convolution kernel and/or the data to be convolved are stored contiguously in consecutive storage units;

all elements under the same weight of the same convolution kernel, and/or all elements of the sub-matrix of the same data to be convolved used for a convolution operation, are stored in consecutive storage units of the storage device.
The present invention also provides a method of using the binary convolutional neural network processor according to any of the above, comprising:

1) loading the data to be convolved from the storage device for data to be calculated into a register;

2) loading the data to be convolved in the register, together with the elements in the storage device for data to be calculated that need to be multiplied with it, into the binary convolution device to perform the binary convolution operation;

3) pooling the output of the binary convolution device with the pooling device;

4) normalizing the output of the pooling device with the normalization device.

The present invention further provides a computer-readable storage medium storing a computer program which, when executed, implements the above method.
Compared with the prior art, the advantages of the present invention are:

It provides a simplified hardware structure for performing convolution operations, a binary convolutional neural network processor based on this structure, and corresponding calculation methods; by reducing the bit width of the data being computed during operation, it improves computational efficiency and reduces storage capacity and energy consumption.
Brief Description of the Drawings

Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:

Fig. 1 is a schematic diagram of the multi-layer structure of a neural network;

Fig. 2 is a schematic diagram of a convolution calculation in two-dimensional space;

Fig. 3 is a schematic diagram of the hardware structure of a binary convolution device according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the hardware structure of a binary convolution device according to another embodiment of the present invention;

Fig. 5 is a schematic diagram of the hardware structure of a binary convolution device according to yet another embodiment of the present invention;

Figs. 6a-6c are schematic diagrams of the hardware structure of binary convolution devices of the present invention that use a Hamming weight calculation element;

Fig. 7 is a schematic diagram of the storage of multi-channel convolution kernels (weight 0 and weight 1) and data to be convolved according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of the structure of a binary convolutional neural network processor according to an embodiment of the present invention;

Fig. 9 is a schematic diagram of a calculation using the binary convolutional neural network processor according to an embodiment of the present invention;

Fig. 10 is a schematic diagram of a calculation using the binary convolutional neural network processor according to another embodiment of the present invention.
Detailed Description

The present invention is described in detail below in conjunction with the accompanying drawings and specific embodiments.

In computer science, a neural network is a mathematical model patterned after the structure of biological synaptic connections; application systems built from neural networks can realize functions such as machine learning and pattern recognition.

A neural network is structurally divided into multiple layers; Fig. 1 shows a schematic diagram of such a multi-layer structure. Referring to Fig. 1, the first layer of the multi-layer structure is the input layer, the last layer is the output layer, and the remaining layers are hidden layers. When the neural network is used, the original image is fed to the input layer as the input-layer data (in the present invention, "image" and "layer" refer to the raw data to be processed, not merely images obtained by taking photographs in the narrow sense); each layer of the network processes the layer it receives and passes the result to the next layer, and the output of the output layer is taken as the final result.

As mentioned above, to cope with the increasingly complex structure of neural networks, the prior art proposed the concept of a binary convolutional neural network. As the name implies, the operation of a binary convolutional neural network includes performing a "convolution" operation on the input data, as well as operations such as "pooling", "normalization", and "binarization".
Convolution is an important operation in a binary convolutional neural network; its calculation process is described in detail below with reference to Fig. 2.

Fig. 2 shows the calculation process of convolving a "binary" image of size 5 by 5 with a "binary" convolution kernel of size 3 by 3 in two-dimensional space. Referring to Fig. 2, first, each element in rows 1-3 (from top to bottom) and columns 1-3 (from left to right) of the image is multiplied by the corresponding element of the convolution kernel: for example, the element in row 1, column 1 of the kernel (denoted "kernel(1,1)") is multiplied by the element in row 1, column 1 of the image (denoted "image(1,1)"), giving 1×1=1; kernel(1,2) is multiplied by image(1,2), giving 1×0=0; similarly, kernel(1,3) multiplied by image(1,3) gives 1×1=1; and so on until all 9 products have been computed. The 9 results are then added, 1+0+1+0+1+0+0+0+1=4, to give the element in row 1, column 1 of the convolution result, result(1,1). Similarly, computing kernel(1,1) times image(1,2), kernel(1,2) times image(1,3), kernel(1,3) times image(1,4), kernel(2,1) times image(2,2), and so on yields 1+0+0+1+0+0+0+1=3 as result(1,2). In this way the 3-by-3 convolution result matrix shown in Fig. 2 can be calculated.
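To make the sliding-window computation above concrete, the following short Python sketch reproduces the multiply-and-accumulate process of Fig. 2; the image and kernel values are illustrative placeholders, since the exact contents of Fig. 2 are not reproduced here.

```python
# A minimal sketch of the 2D binary convolution of Fig. 2:
# a 3x3 kernel slides over a 5x5 image; at each position the
# 9 element-wise products are summed into one output element.
# The image/kernel values below are illustrative placeholders.

def conv2d_binary(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    result = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0
            for r in range(kh):
                for c in range(kw):
                    acc += kernel[r][c] * image[i + r][j + c]  # "multiply"
            result[i][j] = acc                                 # "accumulate"
    return result

image = [[1, 0, 1, 0, 0],
         [0, 1, 0, 1, 0],
         [1, 0, 1, 0, 1],
         [0, 1, 0, 1, 0],
         [1, 0, 0, 0, 1]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [0, 0, 1]]
print(conv2d_binary(image, kernel))  # 3x3 convolution result matrix
```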
The convolution result obtained as shown in Fig. 2 is buffered and binarized, and then fed into the next layer of the binary convolutional neural network.

The above example shows the "multiply" and the "add" (or "accumulate") operations that make up the convolution calculation.

The inventors realized that, owing to the special properties of binary multiplication, the "multiply" in a binary convolution operation can be replaced by an "exclusive-NOR" operation; that is, a single logic element, an XNOR gate, can perform the "multiply" that in the prior art required a multiplier. It can be seen that binary convolution is simpler than conventional convolution: it requires no complex multiplication such as "2×4". In the "multiply" operation, if any of the operands is "0" the result is "0", and if all operands are "1" the result is "1".

The principle by which an XNOR gate element can replace a multiplier in the present invention is explained in detail below through a concrete example.
When binarized convolution is actually used, the non-binary values z in the image and the convolution kernel are first binarized, namely:

z^b = +1 if z ≥ 0, and z^b = -1 if z < 0,

where a value z greater than or equal to 0 is binarized to "1", representing the symbol "1" used in the convolution operation of Fig. 2, and a value z less than 0 is binarized to "-1", representing the symbol "0" used in Fig. 2.
An "exclusive-NOR" (XNOR) operation is then performed on the binarized values of the image and the convolution kernel. Encoding +1 as logic "1" and -1 as logic "0", the following cases exist:

1 XNOR 1 = 1, corresponding to (+1) × (+1) = +1;
1 XNOR 0 = 0, corresponding to (+1) × (-1) = -1;
0 XNOR 1 = 0, corresponding to (-1) × (+1) = -1;
0 XNOR 0 = 1, corresponding to (-1) × (-1) = +1.

It can be seen from this truth table that, when "multiplying" binarized values, the logic element XNOR gate, which performs the "exclusive-NOR" operation, can be used in place of a multiplier. As is well known in the art, a multiplier is far more complex than a single XNOR gate.

Therefore, the inventors concluded that replacing the multipliers of a conventional processor with XNOR logic gates can greatly reduce the complexity of the devices used in a binary convolutional neural network processor.
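A minimal sketch of the equivalence relied on here, assuming the usual encoding of +1 as logic "1" and -1 as logic "0": XNOR on the bit encoding agrees with multiplication on the ±1 values in all four cases.

```python
# Sketch: XNOR on the bit encoding (+1 -> 1, -1 -> 0) reproduces
# multiplication on {+1, -1} values in all four input combinations.

def xnor(a_bit, b_bit):
    return 1 - (a_bit ^ b_bit)   # exclusive-NOR of two single bits

def to_bit(v):                   # binarization: z >= 0 -> 1, z < 0 -> 0
    return 1 if v >= 0 else 0

def to_value(bit):               # inverse encoding: 1 -> +1, 0 -> -1
    return 1 if bit else -1

for a in (+1, -1):
    for b in (+1, -1):
        assert to_value(xnor(to_bit(a), to_bit(b))) == a * b
print("XNOR matches binary multiplication in all 4 cases")
```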
Furthermore, the inventors realized that, owing to the special properties of binary addition, the "add" in the above binary convolution operation can be replaced by an "OR" operation; that is, a logic element OR gate can replace the adder used in the prior art. This is because the result of the "OR" operation on the outputs of the above XNOR gates can be expressed as G = F1 + F2 + ... + Fn (where "+" denotes the logical OR), finally outputting a single-bit result G, where Fk denotes the output of the k-th XNOR gate and n denotes the total number of XNOR gates whose outputs are used as inputs of the OR gate.

Based on the inventors' above analysis, the present invention provides a binary convolution device that can be used in a binary convolutional neural network processor. It exploits the properties of binary multiplication and addition to simplify the structure of the hardware used in the processor to perform convolution operations, thereby increasing the speed of the convolution operation and reducing the overall energy consumption of the processor.

Fig. 3 shows the hardware structure of a binary convolution device according to an embodiment of the present invention. As shown in Fig. 3, the binary convolution device comprises 9 XNOR gates and 1 OR gate, and the outputs of all 9 XNOR gates are used as the inputs of the OR gate. During a convolution operation, each XNOR gate computes one of n1×w1, n2×w2, ..., n9×w9, producing the outputs F1-F9; the OR gate takes F1-F9 as its inputs and outputs the first element G1 of the convolution result. Similarly, computing with the same convolution kernel over the other regions of the image yields the other elements of the convolution result; this is not repeated here.
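The following is a minimal behavioral sketch of the Fig. 3 datapath, assuming the nine data bits n1-n9 and nine weight bits w1-w9 are already single-bit encoded; it is an illustration of the parallel structure, not the actual circuit.

```python
# Behavioral sketch of Fig. 3: nine XNOR gates in parallel, one OR gate.
# n_bits and w_bits are single-bit encodings (1 for +1, 0 for -1).

def xnor(a, b):
    return 1 - (a ^ b)

def binary_conv_parallel(n_bits, w_bits):
    assert len(n_bits) == len(w_bits) == 9
    F = [xnor(n, w) for n, w in zip(n_bits, w_bits)]  # F1..F9, in parallel
    G1 = 0
    for f in F:                                       # 9-input OR gate
        G1 |= f
    return G1                                         # single-bit result

# Example: one window of data against one kernel (placeholder bits).
print(binary_conv_parallel([1, 0, 1, 0, 1, 0, 0, 0, 1],
                           [1, 1, 1, 0, 1, 0, 0, 0, 1]))
```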
In the embodiment shown in Fig. 3, multiple XNOR gates perform the multiplications in parallel, which increases the rate of the convolution calculation. It should be understood, however, that the hardware structure of the binary convolution device can also be varied within the present invention, as illustrated below through several further embodiments.

Fig. 4 shows the hardware structure of a binary convolution device according to another embodiment of the present invention. As shown in Fig. 4, the binary convolution device comprises 1 XNOR gate, 1 OR gate, and a register; the register stores the output of the OR gate, its stored value is used as one input of the OR gate, and the other input of the OR gate is the output of the XNOR gate. During a convolution operation, as time advances, n1 and w1, n2 and w2, ..., n9 and w9 are applied to the XNOR gate at the first through ninth time steps respectively; at each time step the XNOR gate outputs one of F1, F2, ..., F9 as one input of the OR gate, while the result output by the OR gate at the previous time step, stored in the register, serves as the other input of the OR gate. For example, when the XNOR gate outputs F1 (equal to n1×w1), the pre-stored symbol "0" is read from the register and applied together with F1 as the inputs of the OR gate, which outputs F1; when the XNOR gate outputs F2 (equal to n2×w2), F1 is read from the register and applied together with F2 as the inputs of the OR gate, which outputs F1+F2; and so on until the accumulated result G1 over F1-F9 is output.
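For contrast with the parallel structure, here is a behavioral sketch of the serial datapath of Fig. 4, under the assumption that the register is initialized to the pre-stored "0": one XNOR gate is reused over nine time steps, and a one-bit register feeds the previous OR result back as the second OR input.

```python
# Behavioral sketch of Fig. 4: one XNOR gate and one 2-input OR gate
# reused over nine time steps, with a 1-bit register holding the
# running OR result (assumed initialized to 0 before the first step).

def xnor(a, b):
    return 1 - (a ^ b)

def binary_conv_serial(n_bits, w_bits):
    register = 0                       # pre-stored "0"
    for n, w in zip(n_bits, w_bits):   # time steps 1..9
        f = xnor(n, w)                 # F_k from the single XNOR gate
        register = register | f        # 2-input OR with fed-back value
    return register                    # G1 after the ninth step

print(binary_conv_serial([1, 0, 1, 0, 1, 0, 0, 0, 1],
                         [1, 1, 1, 0, 1, 0, 0, 0, 1]))
```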
In the embodiment shown in Fig. 4, increasing the reuse of the XNOR gate and the OR gate reduces the number of components used; moreover, this scheme uses an OR gate with only two inputs, whose hardware complexity is even lower.

Fig. 5 shows the hardware structure of a binary convolution device according to yet another embodiment of the present invention. Like the embodiment shown in Fig. 4, it uses only one XNOR gate, one OR gate, and one register; the difference is that in Fig. 5 the output of the XNOR gate is stored in a register that can hold multiple result bits simultaneously, and the individual results in the register are used as the inputs of the OR gate. This embodiment is used similarly to the embodiment of Fig. 4, in that the XNOR gate is reused; the difference is that in Fig. 5 the result output by the XNOR gate at each time step is stored in the register capable of holding multiple result bits simultaneously, and after all of F1-F9 have been obtained, the OR gate performs the "OR" operation to output G1.

In the embodiments provided in Figs. 3, 4, and 5 of the present invention, an OR gate is used to implement the "add" or "accumulate" function, and the inputs of the OR gate all come from the outputs of XNOR gates, so the final results output from the OR gate are all single-bit values; this simplifies the calculation process and increases the operation rate. The hardware structure provided by this scheme is especially suitable for dedicated processors for binary neural networks, because a binary neural network uses the values "1" and "-1" to represent its weights and data, its computation involves a large number of multiply and add operations, and reducing the bit width of the operands effectively reduces the computational complexity.

However, because the above schemes that use an OR gate to implement the "add" or "accumulate" function are all single-bit computations, they introduce a certain degree of error. The present invention therefore also provides an alternative scheme in which a Hamming weight calculation element replaces the OR gate shown in Figs. 3, 4, and 5 to implement the "add" or "accumulate" function. Figs. 6a-6c show hardware structures with a Hamming weight calculation element. In this alternative scheme, the Hamming weight calculation element takes the outputs of the XNOR gates as its input and outputs the number of logic "1"s in that data, i.e., the Hamming weight. This scheme is similar to the OR-gate scheme above and likewise simplifies the calculation process; in addition, it achieves an exact summation.
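The following sketch contrasts the two accumulation options for illustration: the OR reduction yields a single-bit (and therefore lossy) result, while the Hamming weight, i.e., the population count of the XNOR outputs, gives the exact number of matching positions.

```python
# Sketch contrasting the two accumulation elements: a single-bit OR
# (lossy) versus a Hamming-weight/popcount unit (exact summation).

def xnor_bits(n_bits, w_bits):
    return [1 - (n ^ w) for n, w in zip(n_bits, w_bits)]

def accumulate_or(xnor_out):          # OR gate: 1 iff any product is 1
    result = 0
    for f in xnor_out:
        result |= f
    return result

def accumulate_hamming(xnor_out):     # Hamming weight: count of logic 1s
    return sum(xnor_out)

F = xnor_bits([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])  # F = [1, 0, 1, 0, 1]
print(accumulate_or(F))       # 1  (some product was 1, detail lost)
print(accumulate_hamming(F))  # 3  (exactly three matching positions)
```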
The inventors found that, with the binary convolution device provided by the present invention, every "multiply" and every "accumulate" operates on single-bit data, and the data output by the device is likewise single-bit. This property is particularly well suited to a "layer-interleaved data mapping" for storing and scheduling the data involved in the convolution operation and the results it produces, thereby reducing the number of data loads and exploiting data locality to improve data reuse.

The "layer-interleaved data mapping" of the present invention means that the elements of the convolution kernel and of the data to be convolved are stored, following the channel direction, one after another in each row of the storage device; that is, data is stored in the storage device with the layers interleaved, so that two adjacent data elements come from different channels rather than the same channel. As shown in Fig. 7, in the present invention the kernel elements and data elements on the same z-axis correspond to the same "channel"; that is, elements with the same z value belong to the same channel.

To describe the data layout more concretely, Fig. 7 takes convolution kernel weight 0 and convolution kernel weight 1 of size (x,y,z) = 2*2*2, together with data to be convolved of size (x,y,z) = 2*3*2, as an example to explain in detail the layer-interleaved data mapping for binary convolutional neural networks provided by the present invention. Referring to Fig. 7, the elements of weight 0 and weight 1 are each divided into four groups according to their spatial positions: the four groups of weight 0 are Az, Bz, Cz, and Dz, with z being 0 or 1 as shown; the four groups of weight 1 are az, bz, cz, and dz, with z being 0 or 1 as shown.

Referring to Fig. 7, according to an embodiment of the present invention, convolution kernel weight 0, convolution kernel weight 1, and the elements of the data to be convolved can be stored in the following manner.

In Fig. 7, for convenience of illustration and according to the size and stride of each convolution kernel, the elements of the three-dimensional matrices of weight 0 and weight 1 are divided by channel into two two-dimensional matrices; for example, weight 0 is divided into a two-dimensional matrix composed of A0, B0, C0, D0 and a two-dimensional matrix composed of A1, B1, C1, D1. Similarly, the elements of the three-dimensional matrix of the data to be convolved are divided by channel into two two-dimensional matrices, namely one composed of X0, Y0, Z0, P0, Q0, R0 and one composed of X1, Y1, Z1, P1, Q1, R1.

When convolution kernel weight 0 is stored, the elements A0, A1, B0, B1, C0, C1, D0, and D1 of weight 0 are stored in sequence, 8 bits in total, in one row of consecutive storage units of the weight storage device. It can be seen that in the storage units any two adjacent elements come from different channels: for example, A0 and A1 come from different channels, as do A1 and B0. This is the layer-interleaved storage described above.

When convolution kernel weight 1 is stored, the elements a0, a1, b0, b1, c0, c1, d0, and d1 of weight 1 are stored in sequence, 8 bits in total, in another row of consecutive storage units of the weight storage device. As with weight 0, adjacent elements come from different channels.

In the weight storage device, weight elements at the same x and y positions (e.g., A0 and A1) are stored in sequence as adjacent elements; after the elements at one (x, y) position have been stored, the next group of weight elements with the same x and y (e.g., B0 and B1) is stored, and so on until the other weight elements of the convolution kernel have all been stored.

The data to be convolved can be stored according to the size of the convolution kernel and the order in which the data elements participate in the convolution operation. Referring to the convolution calculation rules shown in Fig. 2, the calculation first involves AzXz, BzYz, CzPz, and DzQz, and then AzYz, BzZz, CzQz, and DzRz. Therefore, when the elements of the data to be convolved are stored, in addition to the layer-interleaved layout, the rules of the convolution calculation should also be considered, so that the data elements participating in the calculation are stored in order; for example, Xz, Yz, Pz, Qz are stored in one row or column of consecutive storage units, and Yz, Zz, Qz, Rz are stored in another row or column of consecutive storage units.

Referring to Fig. 7, X0, X1, Y0, Y1, P0, P1, Q0, Q1 are stored in sequence in one column of consecutive storage units of the data storage device, and Y0, Y1, Z0, Z1, Q0, Q1, R0, R1 are stored in sequence in another column of consecutive storage units.

Similarly to the storage of the kernel elements, in the data storage device the data elements at the same x and y positions (e.g., X0 and X1) are grouped together and stored in sequence as adjacent elements; after the elements at one (x, y) position have been stored, the next group with the same x and y (e.g., Y0 and Y1) is stored, and so on until the other data elements within the sub-matrix of the data to be convolved that matches the kernel size (e.g., the one marked with dashed lines in Fig. 7) have all been stored.

Although in the example shown in Fig. 7 the convolution kernel and the data to be convolved both have 2 channels, it should be understood that in the present invention convolution kernels and data to be convolved with more than 2 channels can also be stored in the layer-interleaved manner.
Preferably, during storage, consecutive storage units of the storage device are filled in sequence; that is, storage follows the matrix arrangement order of the convolution kernel and the data to be convolved.

Preferably, elements at the same position but in different channels of the matrices of the convolution kernel and/or the data to be convolved are stored contiguously in consecutive storage units of the storage device.

Preferably, all elements under the same weight of the same convolution kernel, and/or all elements of the sub-matrix of the same data to be convolved used for a convolution operation, are stored in consecutive storage units of the storage device.
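To make the layout concrete, here is a small illustrative sketch that flattens a kernel of shape (x, y, z) = 2*2*2 into the channel-interleaved word A0, A1, B0, B1, C0, C1, D0, D1 described above; the position-major, channel-minor ordering is the essence of the layer-interleaved mapping, and the symbolic element names stand in for actual bit values.

```python
# Sketch of the layer-interleaved mapping of Fig. 7: elements are laid
# out position-major, channel-minor, so adjacent entries in a storage
# word come from different channels (A0, A1, B0, B1, C0, C1, D0, D1).

def interleave_layers(tensor):
    """tensor[ch][y][x] -> flat list ordered by (y, x), channel fastest."""
    channels = len(tensor)
    height, width = len(tensor[0]), len(tensor[0][0])
    word = []
    for y in range(height):
        for x in range(width):
            for ch in range(channels):   # channel varies fastest
                word.append(tensor[ch][y][x])
    return word

# weight 0 of Fig. 7: channel 0 holds A0, B0, C0, D0; channel 1 holds
# A1, B1, C1, D1 (symbolic names in place of actual bit values).
weight0 = [[["A0", "B0"], ["C0", "D0"]],
           [["A1", "B1"], ["C1", "D1"]]]
print(interleave_layers(weight0))
# ['A0', 'A1', 'B0', 'B1', 'C0', 'C1', 'D0', 'D1']  -- one 8-bit word
```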
In Fig. 7, for convenience of explanation, the weight storage device and the data storage device are shown as storage devices separate from each other. It should be understood, however, that in the present invention the weight storage device and the data storage device may be placed on different memories or in different regions of the same memory, for example stored together on the storage device for data to be calculated.

Moreover, those skilled in the art should understand that the storage described in the above embodiments can be completed offline, outside the processor, prior to the computation of the binary neural network, or completed online on the processor, for example in the processor's on-chip memory; it can also be stored in the form of a computer program and executed by the processor.

Storing each convolution kernel and each element of the data to be convolved with the above layer-interleaved data mapping of the present invention reduces the number of data loads and improves the data reuse rate.

It should also be understood that the purpose of using the above "layer-interleaved data mapping" to store the kernel elements and the corresponding elements of the data to be convolved is to facilitate reading, so that the inputs of the binary convolution device can be determined quickly and conveniently. Therefore, any scheme that establishes a mapping relationship between the storage locations of the kernel elements and the storage locations of the corresponding elements of the data to be convolved can be used to store them.

For example, when the length of a row of consecutive storage units is less than 8 bits, e.g., only 4 bits, A0, A1, B0, B1, C0, C1, D0, and D1 of weight 0 are stored in a folded manner: A0, A1, B0, B1 are stored in one row of consecutive storage units, and C0, C1, D0, D1 are stored in another row of consecutive storage units.

When convolution is performed with the kernel elements and the corresponding data elements stored in the above manner, it is advantageous to execute in a single-instruction multiple-data (SIMD) fashion, i.e., a single instruction loads multiple stored data items into the computing unit. The method of loading and computing the stored data is described in detail in the following embodiments. In this way the bit width of the computing unit, and hence its hardware overhead, can be reduced.
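A sketch of the SIMD idea under the bit-packed layout: once a storage word holds several interleaved single-bit elements, one word-wide XNOR followed by one Hamming weight (population count) processes all of them at once. The 8-bit word width and packing order used here are illustrative assumptions, not the exact layout of Fig. 9.

```python
# Sketch of word-level SIMD on bit-packed data: one XNOR over a whole
# 8-bit storage word plus one popcount replaces 8 separate
# multiply-accumulate steps. Word width and packing order are assumed.

WORD_BITS = 8

def pack(bits):                       # pack 8 single-bit elements
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word

def simd_binary_dot(data_word, weight_word):
    xnor_word = ~(data_word ^ weight_word) & ((1 << WORD_BITS) - 1)
    return bin(xnor_word).count("1")  # Hamming weight of the whole word

data   = pack([1, 0, 1, 1, 0, 0, 1, 1])
weight = pack([1, 1, 1, 0, 0, 0, 1, 0])
print(simd_binary_dot(data, weight))  # matching positions, in one step
```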
Combining the binary convolution device described above with the above ways of storing and accessing the kernel elements and the elements of the data to be convolved makes it possible to provide a dedicated processor for binary convolutional neural networks with a small computing-unit bit width and a relatively simple hardware structure.

Referring to Fig. 8, according to an embodiment of the present invention, a binary convolutional neural network processor 10 is provided, comprising:

a data scheduling device 101, a storage device for data to be calculated 102, a binary convolution device 103, a pooling device 104, a normalization device 105, and a binarization device 106.
The storage device for data to be calculated 102 stores the convolution kernel elements in binary form and the data to be convolved in binary form. As described above, the storage scheme should be able to reflect the mapping relationship between the kernel elements used in the convolution calculation and the corresponding elements of the data to be convolved; for example, the kernel elements and the data to be convolved are stored in the layer-interleaved manner, and the data to be convolved is stored according to the kernel size and the order in which its elements participate in the convolution operation. For specific storage schemes, refer to the foregoing embodiments.

The data scheduling device 101 loads the kernel elements and the corresponding elements of the data to be convolved into the binary convolution device according to the mapping relationship. For example, a register is provided in the data scheduling device 101, and the kernel elements that need to be reused are loaded into the register during use.

The binary convolution device 103 performs binary convolution operations on the binary kernel elements and the corresponding elements of the binary data to be convolved. The binary convolution device 103 may adopt any of the structures of the foregoing embodiments, using XNOR gates to multiply the kernel elements by the corresponding elements of the data to be convolved, and using an OR gate or a Hamming weight calculation element to accumulate the products.

The pooling device 104 pools the results obtained by the convolution.

The normalization device 105 normalizes the pooled results to speed up the parameter training of the neural network.

In some embodiments of the present invention, the convolution kernel and/or the data to be convolved for the binary convolution operation can be obtained online from a data source. Since the obtained data is not necessarily binarized, in these embodiments a binarization device 106 may also be provided in the binary convolutional neural network processor 10 to convert the obtained data into binary form. Furthermore, the binary-converted data may be stored online by the storage device for data to be calculated 102.

It should be understood that, for embodiments in which the convolution kernel and/or the data to be convolved have already been stored offline in the storage device 102 before the convolutional neural network computation is performed, the binarization device 106 need not be provided in the binary convolutional neural network processor 10.
The computation process using the binary convolutional neural network processor 10 shown in Fig. 8 is described in detail below through specific embodiments with reference to Figs. 9 and 10.

Fig. 9 shows the computation process using the above binary convolutional neural network processor according to an embodiment of the present invention. Fig. 9 uses the same symbols as Fig. 7 for the kernel elements and the elements of the data to be convolved, e.g., X0, X1, A0, A1. In the weight storage matrix, one storage word of one row stores all the elements of one convolution kernel; as shown in the figure, the storage word is 8 bits wide and each element occupies 1 bit. Similarly, a storage word of the matrix of data to be convolved is also 8 bits wide. In addition, in Fig. 9 the bit widths of the XNOR gates and of the register bank are both 2 bits. During the computation, the principle that data belonging to the same convolution kernel is accumulated in the same accumulator is followed. The computation proceeds as follows:
Step 1: load the two most significant bits of the data to be convolved (i.e., X0 and X1) into the register bank;

Referring to the convolution diagram shown in Fig. 2, in Fig. 9 the elements X0 and X1 of the data to be convolved will be used repeatedly to compute A0X0, B0X0, A1X1, and B1X1 in the subsequent steps, so these 2 bits of the data to be convolved need to be stored in the register.

Step 2: load the data to be convolved from the register bank and the first two weight bits of the first row of the weight matrix (A0 and A1) into the XNOR gates;

Step 3: with the addition unit, perform an OR operation on, or compute the Hamming weight of, the results of the XNOR gates;

As described above, the OR operation or the Hamming weight computation achieves the effect of "addition"; in this step, A0X0 and A1X1 are computed.

Step 4: feed the result of the addition unit into accumulator 0;

Accumulator 0 accumulates the data belonging to the same convolution kernel.
Step 5: load the data to be convolved from the register bank and the first two weight bits of the second row of the weight matrix (a0 and a1) into the XNOR gates;

Step 6: the addition unit performs an OR operation on, or computes the Hamming weight of, the results of the XNOR gates, yielding a0X0 and a1X1.

Step 7: feed the result of the addition unit into accumulator 1; in the same way, X0 and X1 are computed in turn against the first two weight bits of the designated eight rows of the weight storage array;
Step 8: similarly to the preceding steps, load the third and fourth bits of the data to be convolved (Y0 and Y1) into the register bank;

Step 9: load the data to be convolved from the register bank and the third and fourth weight bits of the first row of the weight matrix (B0 and B1) into the XNOR gates;

Step 10: with the addition unit, perform an OR operation on, or compute the Hamming weight of, the results of the XNOR gates;

Step 11: feed the result of the addition unit into accumulator 0, the accumulator of the same kernel; thereafter, similarly to steps 5 to 7, the data located in the same columns, such as b0 and b1, are computed in turn against Y0 and Y1;
Step 12: when an accumulator has obtained the data of one output layer, load the accumulator result into the buffer unit;

Step 13: after the buffer unit has obtained the complete data of the output layer, load the output data into the pooling unit for the pooling operation;

Step 14: load the result of the pooling operation into the batch normalization unit for the batch normalization operation;

Step 15: load the batch-normalized result into the binarization unit for the binarization operation.
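The following compact sketch walks through the spirit of steps 1-11 under assumed widths and bit values: each 2-bit slice of the data to be convolved is loaded once into a register, reused against the matching 2-bit slice of every kernel row, and the Hamming weight of each XNOR result is added into the accumulator belonging to that kernel.

```python
# Compact sketch of steps 1-11: each 2-bit slice of the data to be
# convolved is loaded once into a register, reused against the matching
# 2-bit slice of every kernel row, and each partial sum is accumulated
# per kernel (accumulator 0 for kernel 0, accumulator 1 for kernel 1).
# Widths and bit values are illustrative assumptions.

def xnor2(a, b):                      # 2-bit-wide XNOR + Hamming weight
    return sum(1 - (x ^ y) for x, y in zip(a, b))

# each kernel row is a storage word, shown here as 2-bit slices:
# kernel 0 = A0 A1 B0 B1 ..., kernel 1 = a0 a1 b0 b1 ...
kernels = [[(1, 0), (1, 1)],          # kernel 0: (A0, A1), (B0, B1)
           [(0, 1), (1, 0)]]          # kernel 1: (a0, a1), (b0, b1)
data_slices = [(1, 1), (0, 1)]        # (X0, X1), then (Y0, Y1)

accumulators = [0] * len(kernels)
for step, data in enumerate(data_slices):
    register = data                   # steps 1 / 8: load slice once
    for k, kernel in enumerate(kernels):
        # steps 2-7 / 9-11: XNOR against this kernel's matching slice,
        # then accumulate into that kernel's own accumulator
        accumulators[k] += xnor2(register, kernel[step])

print(accumulators)                   # one partial result per kernel
```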
It can be seen that, with the mapping relationship described above between the storage locations of the kernel elements and the storage locations of the corresponding elements of the data to be convolved, the corresponding elements to be convolved can be determined quickly and fed into the XNOR gates.

When the bit width of the storage unit is smaller than the matrix bit width shown in Fig. 9, the kernel elements and the elements of the data to be convolved can also be stored by folding the matrix into blocks, as shown in Fig. 10. Fig. 10 likewise uses the same symbols as Fig. 7 for the kernel elements and the data elements; the difference from Fig. 9 is that, when data belonging to the same block of the data to be convolved needs to be read, the position at which that data is stored in the register bank must also be taken into account.

It can be seen from the embodiments of the present invention that, based on the properties of binarized operations, the present invention provides a simplified hardware structure for performing convolution operations, a binary convolutional neural network processor based on that structure, and corresponding calculation methods; by reducing the bit width of the data being computed during operation, it improves computational efficiency and reduces storage capacity and energy consumption.

Moreover, the present invention adopts the layer-interleaved data mapping for data storage and computation, which simplifies the retrieval of the data to be convolved and of the kernel data during the convolution calculation, reduces hardware overhead, and improves data utilization.

It should be noted that not all of the steps described in the above embodiments are necessary; those skilled in the art may make appropriate omissions, substitutions, modifications, and the like according to actual needs.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail above with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solutions of the present invention do not depart from the spirit and scope of the technical solutions of the present invention, and all such modifications and replacements shall fall within the scope of the claims of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710316252.9A CN107153873B (en) | 2017-05-08 | 2017-05-08 | A kind of two-value convolutional neural networks processor and its application method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710316252.9A CN107153873B (en) | 2017-05-08 | 2017-05-08 | A kind of two-value convolutional neural networks processor and its application method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107153873A true CN107153873A (en) | 2017-09-12 |
CN107153873B CN107153873B (en) | 2018-06-01 |
Family
ID=59794343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710316252.9A Active CN107153873B (en) | 2017-05-08 | 2017-05-08 | A kind of two-value convolutional neural networks processor and its application method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107153873B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105005774A (en) * | 2015-07-28 | 2015-10-28 | 中国科学院自动化研究所 | Face relative relation recognition method based on convolutional neural network and device thereof |
CN105354568A (en) * | 2015-08-24 | 2016-02-24 | 西安电子科技大学 | Convolutional neural network based vehicle logo identification method |
CN105975931A (en) * | 2016-05-04 | 2016-09-28 | 浙江大学 | Convolutional neural network face recognition method based on multi-scale pooling |
Cited By (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019055224A1 (en) * | 2017-09-14 | 2019-03-21 | Xilinx, Inc. | System and method for implementing neural networks in integrated circuits |
EP3682378A1 (en) * | 2017-09-14 | 2020-07-22 | Xilinx, Inc. | System and method for implementing neural networks in integrated circuits |
US10839286B2 (en) | 2017-09-14 | 2020-11-17 | Xilinx, Inc. | System and method for implementing neural networks in integrated circuits |
CN107657312A (en) * | 2017-09-18 | 2018-02-02 | 东南大学 | Towards the two-value real-time performance system of voice everyday words identification |
CN108205704B (en) * | 2017-09-27 | 2021-10-29 | 深圳市商汤科技有限公司 | Neural network chip |
CN108205704A (en) * | 2017-09-27 | 2018-06-26 | 深圳市商汤科技有限公司 | A kind of neural network chip |
CN109754061A (en) * | 2017-11-07 | 2019-05-14 | 上海寒武纪信息科技有限公司 | Execution method of convolution expansion instruction and related products |
CN109754061B (en) * | 2017-11-07 | 2023-11-24 | 上海寒武纪信息科技有限公司 | Execution method of convolution expansion instruction and related product |
US11531889B2 (en) | 2017-11-10 | 2022-12-20 | Institute Of Computing Technology, Chinese Academy Of Sciences | Weight data storage method and neural network processor based on the method |
CN107977704A (en) * | 2017-11-10 | 2018-05-01 | 中国科学院计算技术研究所 | Weighted data storage method and the neural network processor based on this method |
CN107977704B (en) * | 2017-11-10 | 2020-07-31 | 中国科学院计算技术研究所 | Weight data storage method and neural network processor based on the method |
CN107967132B (en) * | 2017-11-27 | 2020-07-31 | 中国科学院计算技术研究所 | An adder and multiplier for a neural network processor |
CN107967132A (en) * | 2017-11-27 | 2018-04-27 | 中国科学院计算技术研究所 | A kind of adder and multiplier for neural network processor |
US12056595B2 (en) | 2017-12-05 | 2024-08-06 | Samsung Electronics Co., Ltd. | Method and apparatus for processing convolution operation in neural network using sub-multipliers |
CN109871936B (en) * | 2017-12-05 | 2024-03-08 | 三星电子株式会社 | Method and apparatus for processing convolution operations in a neural network |
CN109871936A (en) * | 2017-12-05 | 2019-06-11 | 三星电子株式会社 | Method and apparatus for handling the convolution algorithm in neural network |
CN108108811A (en) * | 2017-12-18 | 2018-06-01 | 北京地平线信息技术有限公司 | Convolutional calculation method and electronic equipment in neutral net |
CN109978148B (en) * | 2017-12-28 | 2020-06-23 | 中科寒武纪科技股份有限公司 | Integrated circuit chip device and related product |
CN109978148A (en) * | 2017-12-28 | 2019-07-05 | 北京中科寒武纪科技有限公司 | Integrated circuit chip device and Related product |
CN109993286A (en) * | 2017-12-29 | 2019-07-09 | 深圳云天励飞技术有限公司 | Computational method of sparse neural network and related products |
CN110110283A (en) * | 2018-02-01 | 2019-08-09 | 北京中科晶上科技股份有限公司 | A kind of convolutional calculation method |
CN108829610B (en) * | 2018-04-02 | 2020-08-04 | 浙江大华技术股份有限公司 | Memory management method and device in neural network forward computing process |
CN108647777A (en) * | 2018-05-08 | 2018-10-12 | 济南浪潮高新科技投资发展有限公司 | A kind of data mapped system and method for realizing that parallel-convolution calculates |
CN110147873B (en) * | 2018-05-18 | 2020-02-18 | 中科寒武纪科技股份有限公司 | Convolutional neural network processor and training method |
CN110147873A (en) * | 2018-05-18 | 2019-08-20 | 北京中科寒武纪科技有限公司 | The processor and training method of convolutional neural networks |
CN108681773A (en) * | 2018-05-23 | 2018-10-19 | 腾讯科技(深圳)有限公司 | Accelerated method, device, terminal and the readable storage medium storing program for executing of data operation |
US11599785B2 (en) | 2018-11-13 | 2023-03-07 | International Business Machines Corporation | Inference focus for offline training of SRAM inference engine in binary neural network |
US11797851B2 (en) | 2018-11-13 | 2023-10-24 | International Business Machines Corporation | Inference focus for offline training of SRAM inference engine in binary neural network |
CN110046705A (en) * | 2019-04-15 | 2019-07-23 | 北京异构智能科技有限公司 | Device for convolutional neural networks |
CN110033085A (en) * | 2019-04-15 | 2019-07-19 | 北京异构智能科技有限公司 | Tensor processor |
CN110046705B (en) * | 2019-04-15 | 2022-03-22 | 广州异构智能科技有限公司 | Apparatus for convolutional neural network |
CN110059805A (en) * | 2019-04-15 | 2019-07-26 | 北京异构智能科技有限公司 | Method for two value arrays tensor processor |
CN110033085B (en) * | 2019-04-15 | 2021-08-31 | 广州异构智能科技有限公司 | Tensor processor |
CN110033086A (en) * | 2019-04-15 | 2019-07-19 | 北京异构智能科技有限公司 | Hardware accelerator for neural network convolution algorithm |
CN110033086B (en) * | 2019-04-15 | 2022-03-22 | 广州异构智能科技有限公司 | Hardware accelerator for neural network convolution operations |
CN110263809B (en) * | 2019-05-16 | 2022-12-16 | 华南理工大学 | Pooling feature map processing method, target detection method, system, device and medium |
CN110263809A (en) * | 2019-05-16 | 2019-09-20 | 华南理工大学 | Pond characteristic pattern processing method, object detection method, system, device and medium |
CN111985602A (en) * | 2019-05-24 | 2020-11-24 | 华为技术有限公司 | Neural network computing device, method and computing device |
CN110265002B (en) * | 2019-06-04 | 2021-07-23 | 北京清微智能科技有限公司 | Speech recognition method, apparatus, computer equipment, and computer-readable storage medium |
CN110265002A (en) * | 2019-06-04 | 2019-09-20 | 北京清微智能科技有限公司 | Audio recognition method, device, computer equipment and computer readable storage medium |
CN111126579B (en) * | 2019-11-05 | 2023-06-27 | 复旦大学 | In-memory computing device suitable for binary convolutional neural network computation |
CN111126579A (en) * | 2019-11-05 | 2020-05-08 | 复旦大学 | An in-memory computing device suitable for binary convolutional neural network computing |
CN111340208B (en) * | 2020-03-04 | 2023-05-23 | 开放智能机器(上海)有限公司 | Vectorization calculation depth convolution calculation method and device |
CN111340208A (en) * | 2020-03-04 | 2020-06-26 | 开放智能机器(上海)有限公司 | Depth convolution calculation method and device for vectorization calculation |
CN112596912A (en) * | 2020-12-29 | 2021-04-02 | 清华大学 | Acceleration operation method and device for convolution calculation of binary or ternary neural network |
Also Published As
Publication number | Publication date |
---|---|
CN107153873B (en) | 2018-06-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107153873B (en) | A kind of two-value convolutional neural networks processor and its application method | |
CN107203808B (en) | A kind of two-value Convole Unit and corresponding two-value convolutional neural networks processor | |
CN108427990B (en) | Neural network computing system and method | |
CN110263925B (en) | A hardware acceleration implementation device for forward prediction of convolutional neural network based on FPGA | |
Shafiee et al. | ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars | |
CN107844826B (en) | Neural network processing unit and processing system comprising same | |
CN109543140B (en) | A Convolutional Neural Network Accelerator | |
EP3407266B1 (en) | Artificial neural network calculating device and method for sparse connection | |
US11880768B2 (en) | Method and apparatus with bit-serial data processing of a neural network | |
US9886377B2 (en) | Pipelined convolutional operations for processing clusters | |
CN107169563B (en) | Processing system and method applied to two-value weight convolutional network | |
CN108665063B (en) | Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator | |
WO2022037257A1 (en) | Convolution calculation engine, artificial intelligence chip, and data processing method | |
CN111831254A (en) | Image processing acceleration method, image processing model storage method and corresponding device | |
CN106970896A (en) | The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented | |
CN107145939A (en) | A neural network optimization method and device | |
CN107256424B (en) | Three-value weight convolution network processing system and method | |
JP2021510219A (en) | Multicast Network On-Chip Convolutional Neural Network Hardware Accelerator and Its Behavior | |
CN107423816A (en) | A kind of more computational accuracy Processing with Neural Network method and systems | |
US20230297819A1 (en) | Processor array for processing sparse binary neural networks | |
JP2018116469A (en) | Arithmetic system and arithmetic method for neural network | |
CN110766127B (en) | Neural network computing special circuit and related computing platform and implementation method thereof | |
KR102038390B1 (en) | Artificial neural network module and scheduling method thereof for highly effective parallel processing | |
Sommer et al. | Efficient hardware acceleration of sparsely active convolutional spiking neural networks | |
CN111105023A (en) | Data stream reconstruction method and reconfigurable data stream processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||