
CN109754062A - Execution method of convolution expansion instruction and related products


Info

Publication number: CN109754062A (application); CN109754062B (grant)
Authority: CN (China)
Prior art keywords: activation, convolution, instruction, computing device, address
Legal status: Granted, currently Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201711086019.2A
Other languages: Chinese (zh)
Other versions: CN109754062B
Inventor: not disclosed (不公告发明人)
Current and original assignee: Shanghai Cambricon Information Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Shanghai Cambricon Information Technology Co Ltd; priority to CN201711086019.2A

Landscapes

  • Complex Calculations (AREA)

Abstract

The present disclosure provides a method for executing a convolution extension instruction and related products. A computing device reads the convolution extension instruction from memory and obtains the instruction's input data, convolution kernel, and auxiliary operation. The convolution extension instruction includes an opcode and an operation field; the operation field includes registers and an auxiliary field, where the registers determine the address of the input data and the address of the convolution kernel, and the auxiliary field identifies the auxiliary operation. The computing device performs the convolution operation and the auxiliary operation on the data at the input data address and the convolution kernel address. The technical solution provided by the present disclosure has the advantages of reducing the amount of computation and reducing power consumption.

Description

Execution Method of Convolution Extension Instruction and Related Products

Technical Field

The present disclosure relates to the technical field of neural networks, and in particular to a method for executing a convolution extension instruction and related products.

Background

A convolutional neural network (CNN) is an efficient recognition algorithm that has been widely applied in recent years to pattern recognition, image processing, and other fields. It features a simple structure, few training parameters, strong adaptability, and robustness to translation, rotation, and scaling. Because the feature detection layers of a CNN/DNN learn from training data, explicit feature extraction is avoided when using a CNN/DNN; features are learned implicitly from the training data. Moreover, because neurons on the same feature map share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over fully connected networks.

Applications involving convolution operations are very common in the computer field. The present disclosure focuses on convolutional neural networks; the mainstream devices currently capable of performing such operations are as follows.

In the prior art, one known solution for performing convolutional neural network operations is to use a general-purpose processor, which executes general-purpose instructions through a general-purpose register file and general-purpose functional units. One disadvantage of this approach is that a single general-purpose processor is designed mainly for scalar computation, so its performance on convolutional neural network operations is low. When multiple general-purpose processors execute in parallel, the communication among them can in turn become a performance bottleneck.

In another prior art, a graphics processing unit (GPU) is used for vector computation: convolutional neural network operations are performed by executing general-purpose SIMD instructions with a general-purpose register file and general-purpose stream processing units. However, in this solution the GPU's on-chip cache is too small, so large-scale convolutional neural network operations require continuous off-chip data transfers, and off-chip bandwidth becomes the main performance bottleneck.

Summary

Embodiments of the present disclosure provide a method for executing a convolution extension instruction, the convolution extension instruction itself, and related products, which can alleviate performance bottlenecks and reduce power consumption.

In a first aspect, an embodiment of the present disclosure provides a method for executing a convolution extension instruction, the method comprising the following steps:

The computing device reads the convolution extension instruction from memory and obtains the instruction's input data, convolution kernel, and activation operation.

The convolution extension instruction includes an opcode and an operation field. The opcode includes the identifier of the convolution extension instruction. The operation field includes a convolution subfield and an activation subfield: the convolution subfield includes the address storing the input data and the address of the convolution kernel, and the activation subfield includes the identification code of the activation operation or the address of the activation operation's interpolation table.

The computing device performs a convolution operation on the input data and the convolution kernel to obtain an intermediate result, and performs the activation operation indicated by the activation subfield on the intermediate result to obtain the final result of the instruction.

Optionally, the activation operation includes: a convolutional neural network Maxout operation, a convolutional neural network PReLU operation, a convolutional neural network RReLU operation, a convolutional neural network Leaky ReLU operation, a nonlinear activation operation, or a linear activation operation.

Optionally, where the activation subfield includes the address of the activation operation's interpolation table, performing the activation operation on the intermediate result through the activation subfield to obtain the final result of the instruction includes:

The computing device fetches the interpolation table corresponding to the interpolation table address of the activation operation, and performs the activation operation on the intermediate result using the interpolation table to obtain the final result of the instruction.

Optionally, where the activation subfield includes the identification code of the activation operation, performing the activation operation on the intermediate result through the activation subfield to obtain the final result of the instruction includes:

The computing device identifies the identification code to determine the activation operation, reads the interpolation table of the activation operation, and performs the activation operation on the intermediate result using the interpolation table to obtain the final result of the instruction.

Optionally, the computing device performing a convolution operation on the input data and the convolution kernel to obtain an intermediate result includes:

The master operation module of the computing device splits the input data into multiple parts to obtain multiple input sub-blocks, distributes the input sub-blocks to multiple slave operation modules, and sends the convolution kernel to the slave operation modules; the slave operation modules multiply, in parallel, their input sub-blocks with the convolution kernel to obtain multiple partial results, and the master operation module of the computing device splices the partial results to obtain the intermediate result.

In a second aspect, a computing device is provided, the computing device comprising: a memory, an operation unit, an interconnection module, a controller unit, and a data access unit;

wherein the operation unit includes an adder and a multiplier;

the controller unit is configured to read the convolution extension instruction from the memory and obtain the instruction's input data, convolution kernel, and activation operation;

the convolution extension instruction includes an opcode and an operation field, the opcode including the identifier of the convolution extension instruction, and the operation field including a convolution subfield and an activation subfield, where the convolution subfield includes the address storing the input data and the address of the convolution kernel, and the activation subfield includes the identification code of the activation operation or the address of the activation operation's interpolation table;

the data access unit is configured to fetch the input data and the convolution kernel corresponding to the input data address and the convolution kernel address;

the operation unit is configured to perform a convolution operation on the input data and the convolution kernel to obtain an intermediate result, and to perform the activation operation indicated by the activation subfield on the intermediate result to obtain the final result of the instruction.

Optionally, the activation operation includes: a convolutional neural network Maxout operation, a convolutional neural network PReLU operation, a convolutional neural network RReLU operation, a convolutional neural network Leaky ReLU operation, a nonlinear activation operation, or a linear activation operation.

Optionally, where the activation subfield includes the address of the activation operation's interpolation table:

the data access unit is configured to fetch the interpolation table corresponding to the interpolation table address of the activation operation;

the operation unit is configured to perform the activation operation on the intermediate result using the interpolation table to obtain the final result of the instruction.

Optionally, where the activation subfield includes the identification code of the activation operation, the operation unit further includes an activation operator;

the controller unit is configured to identify the identification code of the activation operation to determine the activation operation;

the activation operator is configured to fetch the interpolation table of the activation operation and perform the activation operation on the intermediate result using the interpolation table to obtain the final result of the instruction.

Optionally, the operation unit further includes a master operation module and multiple slave operation modules, the master operation module including an adder and a multiplier, and each slave operation module including an adder and a multiplier;

the master operation module is configured to split the input data into multiple parts to obtain multiple input sub-blocks, distribute the input sub-blocks to the multiple slave operation modules, and send the convolution kernel to the multiple slave operation modules; the multiple slave operation modules are configured to multiply, in parallel, the input sub-blocks with the convolution kernel to obtain multiple partial results; and the master operation module is configured to splice the multiple partial results to obtain the intermediate result.

In a third aspect, a computer-readable storage medium is provided, which stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method provided in the first aspect.

In a fourth aspect, a computer program product is provided, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform the method of the first aspect.

In a fifth aspect, a chip is provided, the chip comprising the computing device provided in the second aspect.

In a sixth aspect, a chip package structure is provided, the chip package structure comprising the chip provided in the fifth aspect.

In a seventh aspect, a board card is provided, the board card comprising the chip package structure provided in the sixth aspect.

In an eighth aspect, an electronic device is provided, the electronic device comprising the board card provided in the seventh aspect.

It can be seen that embodiments of the present disclosure implement both the convolution operation and the activation operation with a single instruction, and therefore have the advantages of reducing computation time and saving power consumption.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present disclosure more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic structural diagram of a computing device provided by the present disclosure.

FIG. 2 is a schematic block diagram of an interconnection module provided by an embodiment of the present disclosure.

FIG. 2a is a schematic block diagram of the master operation module in an apparatus for performing the forward operation of a convolutional neural network, provided by an embodiment of the present disclosure.

FIG. 2b is a schematic block diagram of a slave operation module in the apparatus for performing the forward operation of a convolutional neural network, provided by an embodiment of the present disclosure.

FIG. 3 is a flowchart of a convolutional neural network computing device executing a convolution transformation instruction, provided by an embodiment of the present disclosure.

FIG. 3a is a schematic diagram of a convolution kernel provided by an embodiment of the present disclosure.

FIG. 3b is a schematic diagram of input data provided by an embodiment of the present disclosure.

FIG. 3c is a schematic diagram of the movement of a convolution kernel provided by an embodiment of the present disclosure.

FIG. 3d is a schematic diagram of another movement of a convolution kernel provided by an embodiment of the present disclosure.

FIG. 3e is a schematic diagram of yet another movement of a convolution kernel provided by an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.

The terms "first", "second", "third", "fourth", and the like in the description, claims, and drawings of the present disclosure are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

The following takes a convolution operation instruction as an example to describe how a convolution instruction is executed. The convolution instruction can be applied in neural networks; of course, in practical applications it can also be applied to other computing scenarios, and the present disclosure does not limit the specific implementation scenario of the convolution instruction. The convolution operation instruction may also be called a convolutional neural network instruction. For the convolution instruction, the formula it actually needs to execute can be S = s(Σ w·xi + b): the convolution kernel w (which may include multiple data elements) is multiplied by the input data xi and the products are summed; the bias b may then be added, according to the actual computation, to obtain a preliminary result h, and an activation operation s(h) may be performed on the preliminary result to obtain the final output result S. From this formula, the computation topology can be derived as: multiplier - adder - activation operator.
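
To make the data flow concrete, below is a minimal NumPy sketch of what such a fused instruction computes for one sliding-window position: multiply, accumulate, add the bias, then activate. The function name and the use of ReLU as the example activation are illustrative assumptions, not details from the patent.

```python
import numpy as np

def fused_conv_activate(x, w, b, act=lambda h: np.maximum(h, 0.0)):
    """Sketch of S = s(sum(w * xi) + b) for one sliding-window position.

    x   : input patch, same shape as the kernel
    w   : convolution kernel (may contain multiple data elements)
    b   : scalar bias
    act : activation s(.); ReLU is used here only as an example
    """
    h = np.sum(w * x) + b   # multiplier -> adder (preliminary result h)
    return act(h)           # -> activation operator (final result S)

patch = np.arange(9, dtype=np.float64).reshape(3, 3)
kernel = np.full((3, 3), 0.1)
print(fused_conv_activate(patch, kernel, b=-2.0))
```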

For an existing convolution instruction, performing an activation operation requires multiple instructions. Taking the above formula as an example: first, a convolution operation instruction is needed to obtain the preliminary result h, and then an activation instruction performs the activation operation on h. That is, at least two instructions are required to obtain the result S of the above formula. This approach requires a larger number of convolution instructions, and because the chip or computing device must repeatedly fetch the data, it incurs more computational overhead and higher power consumption.

The present disclosure provides a computing device. As shown in FIG. 1, the computing device includes: a storage medium 111, a register unit 112, an interconnection module 113, an operation unit 114, a controller unit 115, and a data access unit 116.

The operation unit 114 may include a multiplier and an adder; of course, the operation unit may also include at least one of a comparator, an activation operator, and an OP transformer.

The interconnection module 113 is configured to control the connection relationships of the operators in the operation unit 114 so that at least two kinds of operators form different computation topologies.

The register unit 112 is configured to store operation instructions, the addresses of the input data and the convolution kernel in the storage medium, and the computation topology corresponding to the convolution instruction.

The storage medium 111 may be off-chip memory; of course, in practical applications it may also be on-chip memory. It is used to store the input data and the convolution kernel, which may specifically be vectors, matrices, or multidimensional data.

The controller unit 115 is configured to extract an operation instruction (specifically, a convolution instruction), the operation field corresponding to the operation instruction, and the first computation topology corresponding to the operation instruction from the register unit 112; decode the operation instruction into an execution instruction used to control the operation unit to perform the operation; transmit the operation field to the data access unit 116; and transmit the computation topology to the interconnection module 113.

The data access unit 116 is configured to extract the input data and the convolution kernel corresponding to the operation field from the storage medium 111 and transmit them to the operation unit 114.

The interconnection module 113 is configured to form the first computation topology by controlling the connection relationships of the operators in the operation unit 114.

The operation unit 114 is configured to invoke the operators according to the first computation topology and the execution instruction, perform the operation on the data block to obtain an operation result, and transmit the operation result to the data access unit for storage in the storage medium.

The operation instruction may, as shown in FIG. 1, include an operation field and an opcode. Taking the convolution operation instruction as an example, the operation field may include a convolution subfield and an activation subfield, as shown in Table 1, where register number 0, register number 1, register number 2, and register number 3 (optionally, each register may also be a register file) may constitute the convolution subfield, and register number 4 may be the activation subfield.

Table 1:

When the activation subfield holds an activation function interpolation table address, the computing device can dispense with a dedicated activation operator, and supplying the interpolation table address also saves decoder parsing overhead, reduces the amount of computation, and saves chip power and area. The specific implementation is described in detail below. If CONV_ACTIVATE contains an activation function interpolation table address, the CONV_ACTIVATE instruction obtains the result of the convolution operation (i.e., the intermediate result) after performing the convolution, then fetches the interpolation table corresponding to the activation function interpolation table address and performs the activation operation on the convolution result with that table to obtain the final result directly. In this way, the CONV_ACTIVATE instruction only needs to be read once, and execution does not require a separate activation operator, so it has the advantages of low instruction-parsing overhead, reduced computation, and reduced hardware requirements. If CONV_ACTIVATE instead contains an activation function opcode, the CONV_ACTIVATE instruction obtains the result of the convolution operation after performing the convolution, parses the activation function opcode to determine the corresponding activation function, and sends that activation function to the activation operator; the activation operator fetches the interpolation table according to the activation function and performs the activation operation on the convolution result. This requires parsing instructions multiple times and a separate activation operator to perform the activation operation.
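
As a rough illustration of the table-based activation path described above, the sketch below approximates an activation function by linear interpolation over a precomputed table, showing how an activation can be applied to the convolution's intermediate result without a dedicated activation operator. The table layout and the choice of sigmoid are assumptions for illustration; the patent does not specify the table format.

```python
import numpy as np

# Hypothetical interpolation table: sample points of a sigmoid on [-8, 8].
xs = np.linspace(-8.0, 8.0, 257)
table = 1.0 / (1.0 + np.exp(-xs))   # values such a table might hold

def activate_via_table(intermediate, xs, table):
    """Apply an activation by interpolating the table, standing in for
    'perform the activation operation on the intermediate result using
    the interpolation table'."""
    return np.interp(intermediate, xs, table)

conv_result = np.array([-3.0, 0.0, 2.5])   # intermediate result of the convolution
print(activate_via_table(conv_result, xs, table))
```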

The operation instruction may also, as shown in Table 2, include: the opcode CONV_AC_OP, register number 0, register number 1, register number 2, register number 3, register number 4 (each register optionally a register file), and an auxiliary opcode. Register number 0, register number 1, register number 2, and register number 3 may constitute the convolution subfield, register number 4 may be the activation subfield, and the auxiliary (OP) opcode may be the OP subfield, as shown in Table 2.

Table 2:

The above operation instructions may form a convolution instruction set, which includes the convolutional neural network CONV, CONV_ACTIVATE, and CONV_OP instructions with different functions, as well as the CONFIG, IO, NOP, JUMP, and MOVE instructions.

The auxiliary opcode shown in Table 1 and Table 2 may specifically encode a computation operation and the operator connection relationships. Taking the OP operation as an example, there are multiple kinds of OP operations; suppose 1 denotes transposition and 0 denotes conjugation, and suppose the auxiliary opcode is 4 bits wide (in practical applications it may also have another width, such as 6 or 8 bits). For the auxiliary opcode of CONV_OP, the value 1111 can denote a transposition operation, and the objects on which the transposition may be performed include the input data, the convolution kernel, and the preliminary result. Here, suppose the second bit of 1111 indicates whether the OP operation is performed on the input data, the third bit indicates whether it is performed on the convolution kernel, and the fourth bit indicates whether it is performed on the preliminary result, with 1 meaning the OP operation is performed and 0 meaning it is not. Of course, other operations are also possible in practical applications.
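
A small sketch of how the 4-bit auxiliary opcode described above could be decoded, under the stated assumptions (bit 2 controls the input data, bit 3 the convolution kernel, bit 4 the preliminary result; 1 means apply the OP). The field names are illustrative, not from the patent.

```python
def decode_aux_opcode(aux):
    """Decode a hypothetical 4-bit auxiliary opcode, e.g. aux = 0b1111.

    Counting from the most significant of the 4 bits, as in the text:
    bit 2 -> apply OP to the input data,
    bit 3 -> apply OP to the convolution kernel,
    bit 4 -> apply OP to the preliminary result.
    """
    return {
        "op_on_input":  bool(aux & 0b0100),
        "op_on_kernel": bool(aux & 0b0010),
        "op_on_result": bool(aux & 0b0001),
    }

print(decode_aux_opcode(0b1111))  # all three objects get the OP operation
```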

In one embodiment, the CONV_ACTIVATE instruction includes:

a convolution-activation instruction, according to which the device fetches input data and a convolution kernel of a set size from specified addresses in memory (preferably a scratchpad memory), performs the convolution operation in the convolution operation unit, and then applies the activation function to the output; the set size may be defined by the manufacturer or the user.

The convolution-activation instruction may specifically include:

a convolutional neural network Maxout instruction: the device fetches input data and a convolution kernel of a set size from specified addresses in memory (preferably a scratchpad memory), performs the convolution operation in the convolution operation unit, and then applies Maxout activation to the output; the set size may be defined by the manufacturer or the user. The specific form of the convolutional neural network Maxout instruction may be to add a Maxout interpolation table or a Maxout opcode in register number 4 of the operation field of the CONV_ACTIVATE instruction.

For Maxout, the mathematical expression can be:

hi = max over j of zij, where zij = xT Wij + bij

where hi denotes the output result of Maxout, Wij denotes the convolution kernel, bij denotes the bias, and xT denotes the transpose of the input data.
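
A minimal NumPy sketch of the Maxout expression above, assuming k affine pieces zij = xT Wij + bij per output unit, with the maximum taken over j; the shapes and names are illustrative assumptions.

```python
import numpy as np

def maxout(x, W, b):
    """hi = max_j (x^T W_ij + b_ij).

    x : input vector, shape (d,)
    W : weights, shape (units, k_pieces, d)
    b : biases,  shape (units, k_pieces)
    """
    z = np.einsum('d,ikd->ik', x, W) + b   # z_ij for every unit i and piece j
    return z.max(axis=1)                   # maximum over the k pieces

x = np.array([1.0, -2.0])
W = np.random.randn(3, 4, 2)               # 3 output units, 4 pieces each
b = np.zeros((3, 4))
print(maxout(x, W, b).shape)               # (3,)
```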

a convolutional neural network PReLU instruction, used to apply PReLU activation to the output of the computing device according to the instruction: the device fetches input data and a convolution kernel of a set size from specified addresses in the scratchpad memory, performs the convolution operation in the convolution operation unit, and then applies PReLU activation to the output. The specific form of the convolutional neural network PReLU instruction may be to add a PReLU interpolation table or a PReLU opcode in register number 4 of the operation field of the CONV_ACTIVATE instruction.

a convolutional neural network RReLU instruction, used to apply RReLU activation to the output of the computing device according to the instruction: the device fetches input data and a convolution kernel of a set size from specified addresses in the scratchpad memory, performs the convolution operation in the convolution operation unit, and then applies RReLU activation to the output. The specific form of the convolutional neural network RReLU instruction may be to add an RReLU interpolation table or an RReLU opcode in register number 4 of the operation field of the CONV_ACTIVATE instruction.

a convolutional neural network Leaky ReLU instruction, used to apply Leaky ReLU activation to the output of the computing device according to the instruction: the device fetches input data and a convolution kernel of a set size from specified addresses in the scratchpad memory, performs the convolution operation in the convolution operation unit, and then applies Leaky ReLU activation to the output. The specific form of the convolutional neural network Leaky ReLU instruction may be to add a Leaky ReLU interpolation table or a Leaky ReLU opcode in register number 4 of the operation field of the CONV_ACTIVATE instruction.

For ReLU, the mathematical expression is: f(x) = max(0, x).

For Leaky ReLU, RReLU, and PReLU, the mathematical expression can be:

f(x) = αx (x < 0), f(x) = x (x ≥ 0);

For the above mathematical expression, different values of α correspond to Leaky ReLU, RReLU, or PReLU: when α > 0 it is PReLU; when α < 0 it is Leaky ReLU; and when α is a random number drawn from a Gaussian distribution it is RReLU.
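
The piecewise expression above is easy to state in code. The sketch below implements f(x) = αx for x < 0 and f(x) = x otherwise, so a single routine covers Leaky ReLU, PReLU, and RReLU; the variants differ only in how α is chosen, as the text describes.

```python
import numpy as np

def leaky_family(x, alpha):
    """f(x) = alpha * x for x < 0, f(x) = x for x >= 0.

    The same formula serves Leaky ReLU, PReLU, and RReLU; only the way
    alpha is obtained differs among the variants."""
    return np.where(x < 0, alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 3.0])
print(leaky_family(x, alpha=0.01))                 # fixed alpha
print(leaky_family(x, alpha=np.random.normal()))   # randomly drawn alpha (RReLU-style)
```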

The CONV_ACTIVATE instruction may also include other operation instructions that perform nonlinear or linear activation operations.

In one embodiment, the CONV_OP instruction includes:

a convolution transformation instruction, according to which the device fetches input data and a convolution kernel of a set size from specified addresses in memory (preferably a scratchpad memory), applies a transformation to the input data and/or the convolution kernel in the OP (conjugation or transposition) operation unit, performs the convolution operation in the convolution operation unit, and then transforms the output; the set size and the OP type may be defined by the manufacturer or the user.

The convolution transformation instruction specifically includes:

a convolutional neural network Reshape instruction, used to apply a Reshape operation to the output of the computing device according to the instruction: the device fetches input data and a convolution kernel of a set size from specified addresses in memory (preferably a scratchpad memory), performs a reshape (dimension reordering, e.g., nchw -> chwn) operation in the OP operation unit, performs the convolution operation in the convolution operation unit, and then applies a reshape operation to the output; the set size may be defined by the manufacturer or the user.

So-called dimension reordering refers to reordering the four dimensions of the input data of the convolution operation and of the convolution kernel.
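
Dimension reordering such as nchw -> chwn is a pure index permutation; a sketch with NumPy's transpose:

```python
import numpy as np

x = np.zeros((8, 3, 32, 32))   # input stored in N, C, H, W order
# nchw -> chwn: move axis 0 (N) to the end, keeping C, H, W in order
y = x.transpose(1, 2, 3, 0)
print(y.shape)                  # (3, 32, 32, 8)
```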

For the M convolution kernels shown in FIG. 3a, each kernel is a 5*3*3 three-dimensional data block, so its operation window is also a 5*3*3 three-dimensional data block. For the M convolution kernels shown in FIG. 3a, KH and KW indicate that the dimension corresponding to KH is the H dimension of the input data and the dimension corresponding to KW is the W dimension of the input data. The gray blocks in FIGS. 3c, 3d, and 3e are the data used by each position of the sliding operation window; the window may slide first along H and then along W, or first along W and then along H. Specifically, for convolution, the computation at each sliding-window position is an inner product between the data block indicated by the gray blocks and each of the M convolution kernel data blocks shown in "FIG. 3a convolution 1 - convolution kernel"; the convolution outputs one value per convolution kernel at each sliding-window position, i.e., M output values per sliding window. In FIGS. 3a-3e, each small square denotes one value, which may also be called a weight. The numbers used in the figures are for illustration only; in practice the dimension sizes may be arbitrary (including the case where some dimension equals 1, in which case the four-dimensional data block automatically becomes a three-dimensional data block: for example, when the number of samples computed at the same time is 1, the input data is a three-dimensional data block, and when the number of convolution kernels is 1, the convolution kernel data is a three-dimensional data block). The chip device is used to perform the convolution operation between the input data B and the convolution kernel A.

For a convolutional layer, its weights (all the convolution kernels) are as shown in "FIG. 3a convolution 1 - convolution kernel". Denote the number of convolution kernels by M; each kernel consists of C matrices of KH rows and KW columns, so the weights of the convolutional layer can be represented as a four-dimensional data block with dimensions M, C, KH, KW. The input data of the convolutional layer is a four-dimensional data block consisting of N three-dimensional data blocks, each of which consists of C feature matrices of H rows and W columns (i.e., a data block with dimensions N, C, H, W).

a convolutional neural network Pad instruction, used to apply a pad operation to the output of the computing device according to the instruction: the device fetches input data and a convolution kernel of a set size from specified addresses in memory (preferably a scratchpad memory), applies a pad (peripheral expansion) operation to the convolution kernel in the OP operation unit, and then performs the convolution operation in the convolution operation unit; the set size may be defined by the manufacturer or the user. The specific form of the convolutional neural network Pad instruction may be to add a Pad opcode in the auxiliary opcode of the operation field of the CONV_OP or CONV_AC_OP instruction.

Peripheral expansion means adding N rings of data around the periphery of the convolution kernel, where N is a positive integer. N may be 1, in which case the instruction format is unchanged. A "ring" means that the original two-dimensional data block of size kh*kw is expanded to (kh+2N)*(kw+2N) by padding its periphery.

If N is greater than 1, either an operation field (register 5) is added to the instruction format to store the value of N, i.e., a register 5 is added to the operation field of CONV_OP for this purpose; or, if the instruction format is kept unchanged, the method of executing the instruction changes: the CONFIG instruction is used to load the value of N before the CONV instruction is executed, and the pad operation is executed before the CONV instruction.

In addition, the padded data may be all zeros, which is the most basic pad operation.

Optionally, the padded data may be randomly distributed 0s and 1s. In this case, the opcode is changed to conv-pad-random, and the method has one additional step: a random number generator generates the values the pad needs to fill, a total of (kh+2N)*(kw+2N) - kh*kw values.
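
A sketch of the two pad variants described above for a 2-D kh*kw kernel: zero padding of the periphery with N rings, and a conv-pad-random-style variant that fills the (kh+2N)*(kw+2N) - kh*kw new positions with random 0/1 values. The function names are assumptions.

```python
import numpy as np

def pad_kernel_zero(k, n):
    """Expand a kh*kw kernel to (kh+2n)*(kw+2n) with n rings of zeros."""
    return np.pad(k, pad_width=n, mode='constant', constant_values=0)

def pad_kernel_random(k, n, rng=np.random.default_rng()):
    """conv-pad-random style: the added periphery is random 0/1 values."""
    kh, kw = k.shape
    out = rng.integers(0, 2, size=(kh + 2 * n, kw + 2 * n)).astype(k.dtype)
    out[n:n + kh, n:n + kw] = k    # keep the original kernel in the middle
    return out

k = np.ones((3, 3))
print(pad_kernel_zero(k, 1).shape)   # (5, 5)
print(pad_kernel_random(k, 1))
```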

a convolutional neural network Crop instruction, used to apply a crop operation to the output of the computing device according to the instruction: the device fetches input data and a convolution kernel of a set size from specified addresses in memory (preferably a scratchpad memory), applies a crop (size cropping) operation to the input in the OP operation unit, and then performs the convolution operation in the convolution operation unit; the set size may be defined by the manufacturer or the user.

Size cropping is defined as extracting a two-dimensional data block of size H1*W1 from the original two-dimensional data block of size H*W, where H1 and W1 are less than or equal to H and W, respectively.
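
Size cropping is a simple slice. A sketch, taking the H1*W1 block from the top-left corner; the patent does not specify the crop offset, so the offset arguments here are assumptions:

```python
import numpy as np

def crop2d(x, h1, w1, top=0, left=0):
    """Extract an h1*w1 block from an H*W block; h1 <= H, w1 <= W."""
    return x[top:top + h1, left:left + w1]

x = np.arange(36).reshape(6, 6)   # original H*W = 6*6
print(crop2d(x, 4, 4).shape)      # (4, 4)
```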

a convolutional neural network Dilate instruction, used to apply a dilate operation to the output of the computing device according to the instruction: the device fetches input data and a convolution kernel of a set size from specified addresses in memory (preferably a scratchpad memory), applies a dilate (internal zero-insertion) operation to the convolution kernel in the OP operation unit, and then performs the convolution operation in the convolution operation unit; the set size may be defined by the manufacturer or the user.

Dilate (internal zero-insertion) is defined as follows: for a convolution kernel of size kh*kw, zeros or random numbers are inserted uniformly or randomly inside the kernel (the pad described above acts on the periphery), which "dilutes" the convolution kernel and can enhance its feature extraction effect.
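
A sketch of the uniform zero-insertion variant of dilate: insert d zeros between adjacent kernel elements, expanding a kh*kw kernel to (kh + (kh-1)*d) * (kw + (kw-1)*d). The parameter name d and the output size formula are assumptions for illustration.

```python
import numpy as np

def dilate_kernel(k, d):
    """Uniformly insert d zeros between adjacent elements of a kh*kw kernel."""
    kh, kw = k.shape
    out = np.zeros((kh + (kh - 1) * d, kw + (kw - 1) * d), dtype=k.dtype)
    out[::d + 1, ::d + 1] = k   # original values land on a strided grid
    return out

k = np.arange(1, 10).reshape(3, 3)
print(dilate_kernel(k, 1))      # 5*5 kernel with zeros interleaved
```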

The CONV_OP instruction may also include other transformation instructions, for example applying BLAS transformations to the input and the weights.

The above instruction set includes the convolutional neural network CONV_AC_OP instruction with different functions, as well as the CONFIG, IO, NOP, JUMP, and MOVE instructions.

In one embodiment, CONV_AC_OP can implement any combination of the CONV, ACTIVATE, and OP operations through the setting of the auxiliary opcode.

FIG. 2 schematically shows one implementation of the interconnection module 113: an H-tree module. The interconnection module 113 forms the data path between the master operation module 5 and the multiple slave operation modules 6 and is a binary tree path composed of multiple nodes. Each node sends upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the merged data to its upstream node. For example, at the start of the computation of the convolutional neural network, the neuron data in the master operation module 5 is sent to each slave operation module 6 through the interconnection module; after the computation of the slave operation modules 6 is completed, the neuron values output by each slave operation module are assembled level by level in the interconnection module into a complete vector composed of neurons. For example, suppose the device contains N slave operation modules in total. The input data Xi is sent to the N slave operation modules, and each slave operation module convolves the input data Xi with the convolution kernel corresponding to that slave operation module to obtain a scalar; the scalar data of the slave operation modules are merged by the interconnection module into an intermediate vector containing N elements. Suppose the convolution window traverses a total of A*B input data blocks Xi (A in the X direction and B in the Y direction, where X and Y are coordinate axes of a three-dimensional orthogonal coordinate system); then the above convolution operation is performed for the A*B blocks Xi, and all the resulting vectors are merged in the master operation module to obtain a three-dimensional intermediate result of size A*B*N.
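
A behavioral sketch of the scatter/merge pattern just described: the master broadcasts one window's input to N slaves, each slave produces one scalar (the inner product with its own kernel), the interconnection merges the N scalars into an N-element intermediate vector, and repeating this over A*B window positions yields an A*B*N result. This models only the data flow, not the H-tree topology; all shapes and names are assumptions.

```python
import numpy as np

def conv_via_slaves(windows, kernels):
    """windows : shape (A, B, *kshape)  -- input patch per window position
    kernels : shape (N, *kshape)        -- one kernel per slave module
    returns : A*B*N result, one scalar per (window position, slave)."""
    A, B = windows.shape[:2]
    N = kernels.shape[0]
    out = np.empty((A, B, N))
    for a in range(A):
        for b in range(B):
            patch = windows[a, b]
            # each "slave" computes one inner product; the merge step
            # assembles the N scalars into an N-element vector
            out[a, b] = np.array([np.sum(patch * k) for k in kernels])
    return out

windows = np.random.randn(4, 3, 5, 3, 3)   # A=4, B=3 positions, 5*3*3 patches
kernels = np.random.randn(8, 5, 3, 3)      # N=8 slave modules / kernels
print(conv_via_slaves(windows, kernels).shape)   # (4, 3, 8)
```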

FIG. 2a shows an example block diagram of the structure of the master operation module 5 in the apparatus for performing the forward operation of a convolutional neural network according to an embodiment of the present disclosure. As shown in FIG. 2a, the master operation module 5 includes a first operation unit 51, a first data dependency determination unit 52, and a first storage unit 53.

The first operation unit 51 includes a vector addition unit 511 and an activation unit 512. The first operation unit 51 receives control signals from the controller unit and completes the various operation functions of the master operation module 5. The vector addition unit 511 implements the bias-addition operation in the forward computation of the convolutional neural network: this component adds the bias data element-wise to the intermediate result to obtain the biased result, and the activation unit 512 performs the activation function operation on the biased result. The bias data may be read in from an external address space or stored locally.

The first data dependency determination unit 52 is the port through which the first operation unit 51 reads and writes the first storage unit 53, and it guarantees read/write consistency of the data in the first storage unit 53. The first data dependency determination unit 52 is also responsible for sending the data read from the first storage unit 53 to the slave operation modules through the interconnection module 4, while the output data of the slave operation modules 6 is sent directly to the first operation unit 51 through the interconnection module 4. Instructions output by the controller unit 2 are sent to the first operation unit 51 and the first data dependency determination unit 52 to control their behavior.

The first storage unit 53 is used to buffer the input data and output data used by the master operation module 5 in the computation process.

FIG. 2b shows an example block diagram of the structure of a slave operation module 6 in the apparatus for performing the forward operation of a convolutional neural network according to an embodiment of the present disclosure. As shown in FIG. 2b, each slave operation module 6 includes a second operation unit 61, a data dependency determination unit 62, a second storage unit 63, and a third storage unit 64.

The second operation unit 61 receives the control signals sent by the controller unit 2 and performs the convolution operation. The second operation unit includes an OP transformation unit 808, a vector multiplication unit 611, and an accumulation unit 612, which are respectively responsible for the OP transformation, the vector multiplication, and the accumulation in the convolution operation.

The second data dependency determination unit 62 is responsible for the read and write operations on the second storage unit 63 during the computation. Before performing a read or write operation, the second data dependency determination unit 62 first ensures that there is no read/write consistency conflict among the data used by the instructions. For example, all control signals sent to the data dependency unit 62 are stored in an instruction queue inside the data dependency unit 62; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction cannot execute until the write instruction it depends on has been executed.

The second storage unit 63 buffers the input data and the output scalar data of the slave operation module 6.

The third storage unit 64 buffers the convolution kernel data needed by the slave operation module 6 in the computation process.

图3是本披露实施例提供的卷积神经网络运算装置执行卷积变换指令的流程图,如图3所示,执行卷积神经网络指令的过程包括,这里的卷积神经网络指令以:CONV_AC_OP为例,在实际应用中,也可以为其他的扩展指令,例如 CONV_ACTIVATE或CONV_OP指令:如该扩展指令为CONV_OP指令时,只需要执行 OP操作即可,无需执行S9中对偏置数据执行激活操作,即如该扩展指令为 CONV_OP指令时,该偏置数据即为最终的输出结果。如该扩展指令为 CONV_ACTIVATE指令时,该运算装置无需OP模块,步骤S7中也无需执行OP变换。FIG. 3 is a flowchart of a convolutional neural network computing device according to an embodiment of the present disclosure executing a convolutional transformation instruction. As shown in FIG. 3 , the process of executing a convolutional neural network instruction includes, where the convolutional neural network instruction is: CONV_AC_OP For example, in practical applications, it can also be other extended instructions, such as CONV_ACTIVATE or CONV_OP instruction: if the extended instruction is a CONV_OP instruction, only the OP operation needs to be performed, and there is no need to perform the activation operation on the offset data in S9. , that is, if the extended instruction is the CONV_OP instruction, the offset data is the final output result. If the extended instruction is the CONV_ACTIVATE instruction, the operation device does not need an OP module, and also does not need to perform OP conversion in step S7.

在步骤S1,在寄存器单元112的首地址处预先存入一条IO指令。In step S1, an IO instruction is pre-stored at the first address of the register unit 112.

在步骤S2,运算开始,控制器单元115从寄存器单元112的首地址读取该条IO指令,根据译出的控制信号,数据访问单元116从存储介质111读取相应的所有卷积神经网络运算指令,并将其缓存在寄存器单元112中。In step S2, the operation starts, the controller unit 115 reads the IO instruction from the first address of the register unit 112, and according to the decoded control signal, the data access unit 116 reads all corresponding convolutional neural network operations from the storage medium 111 instruction and cache it in register unit 112.

在步骤S3,控制器单元115从寄存器单元11读入下一条IO指令,根据译出的控制信号,数据访问单元116从存储介质111读取主运算模块5需要的所有数据(例如,包括输入数据、用于作快速的激活函数运算的插值表、用于配置运算器件参数的常数表、偏置数据等)至主运算模块5的第一存储单元53。In step S3, the controller unit 115 reads the next IO instruction from the register unit 11, and according to the decoded control signal, the data access unit 116 reads all the data (for example, including the input data) required by the main operation module 5 from the storage medium 111 , an interpolation table for fast activation function operation, a constant table for configuring the parameters of the operation device, offset data, etc.) to the first storage unit 53 of the main operation module 5 .

In step S4, the controller unit 115 reads the next IO instruction from the register unit 112; according to the decoded control signal, the data access unit 116 reads from the storage medium 111 the convolution kernel data needed by the slave operation modules 6.

In step S5, the controller unit 115 reads the next CONFIG instruction from the register unit 112; according to the decoded control signal, the device configures the various constants needed for the computation of this neural network layer. For example, the first operation unit 51 and the second operation unit 61 configure the values of their internal registers according to the parameters in the control signal; the parameters include, for example, the data required by the activation function, as well as the constants required by the OP operation, such as the N of pad, the H1 and W1 of crop, the dimension order of reshape, and so on.
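The following minimal sketch groups the configuration constants named above into a single record; the field names, types, and default values are illustrative assumptions, not the device's actual register layout.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class LayerConfig:
    # Constants for the OP operation, as enumerated in step S5.
    pad_n: int = 0            # N of pad: border width added on each side
    crop_h1: int = 0          # H1 of crop: number of rows to keep
    crop_w1: int = 0          # W1 of crop: number of columns to keep
    reshape_order: Tuple[int, ...] = (0, 1, 2, 3)  # dimension order of reshape
    # Address of the interpolation table used by the activation function.
    activation_table_addr: int = 0

cfg = LayerConfig(pad_n=1, crop_h1=224, crop_w1=224,
                  reshape_order=(0, 2, 3, 1), activation_table_addr=0x1000)
```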

In step S6, the controller unit 115 then reads the next CONV_AC_OP instruction from the register unit 112; according to the decoded control signal, the master operation module 5 first sends the input data inside the convolution window to each slave operation module 6 through the interconnection module 113, where it is saved into the second storage unit 63 of the slave operation module 6; afterwards, the convolution window is moved according to the instruction.
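The window movement can be pictured as a generator that yields successive convolution-window slices of the input in scan order; the window size and stride used here are assumptions chosen for illustration.

```python
import numpy as np

def convolution_windows(x: np.ndarray, kh: int, kw: int, stride: int = 1):
    # Yield each convolution-window slice of a 2-D input in scan order,
    # mimicking the master module moving the window per the instruction.
    for i in range(0, x.shape[0] - kh + 1, stride):
        for j in range(0, x.shape[1] - kw + 1, stride):
            yield x[i:i + kh, j:j + kw]

x = np.arange(25, dtype=np.float32).reshape(5, 5)
for window in convolution_windows(x, kh=3, kw=3, stride=2):
    pass  # each window's data would be broadcast to the slave modules
```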

In step S7, according to the control signal decoded from the CONV_AC_OP instruction, the operation unit 61 of the slave operation module 6 reads the convolution kernel from the third storage unit 64 and the input data from the second storage unit 63; the OP module applies the OP transformation to the input data and the convolution kernel, after which the operation unit 61 of the slave operation module 6 performs the convolution of the OP-transformed input data and the OP-transformed convolution kernel and returns the intermediate result through the interconnection module 113.
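As a hedged sketch of what one slave module does in this step, the code below applies a pad-style OP transformation and then a plain 2-D convolution; the concrete transformation and the naive convolution routine are assumptions made for illustration, not the device's datapath.

```python
import numpy as np

def op_transform(x: np.ndarray, pad_n: int) -> np.ndarray:
    # One possible OP transformation using step S5's constants:
    # zero-pad the tensor by pad_n on each spatial border.
    return np.pad(x, pad_n, mode="constant") if pad_n else x

def convolve2d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    # Naive valid-mode 2-D convolution in the correlation form
    # commonly used by CNN hardware.
    h = x.shape[0] - k.shape[0] + 1
    w = x.shape[1] - k.shape[1] + 1
    out = np.empty((h, w), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

x = np.arange(16, dtype=np.float32).reshape(4, 4)
k = np.ones((3, 3), dtype=np.float32)
intermediate = convolve2d(op_transform(x, pad_n=1), k)  # one slave's result
```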

In step S8, in the interconnection module 113, the intermediate results returned by the slave operation modules 6 are spliced stage by stage into a complete intermediate vector.
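The stage-by-stage splicing can be modelled as a reduction tree that concatenates the partial results pairwise at each level; the tree shape and the use of plain concatenation are illustrative assumptions.

```python
import numpy as np

def splice_stage_by_stage(partials):
    # Pairwise, level-by-level concatenation, mimicking an interconnect
    # tree that merges neighbouring results at each stage.
    level = list(partials)
    while len(level) > 1:
        nxt = [np.concatenate([level[i], level[i + 1]])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:       # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

parts = [np.full(2, i, dtype=np.float32) for i in range(4)]
vector = splice_stage_by_stage(parts)  # the complete intermediate vector
```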

In step S9, the main operation module 5 obtains the intermediate vector returned by the interconnection module 113; the convolution window traverses all the input data, and the main operation module splices all the returned vectors into an intermediate result. (Optionally,) according to the control signal decoded from the CONV_AC_OP instruction, the bias data is read from the first storage unit 53 and added to the intermediate result through the vector addition unit 511 to obtain the biased result; the main operation module 5 reads the interpolation table corresponding to the activation function interpolation table address held in register number 4 of CONV_AC_OP, performs the activation operation on the biased result with the interpolation table to obtain the final output data, and writes the final output data back into the first storage unit 53.
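For this step, here is a minimal sketch of the bias addition followed by an interpolation-table activation; the piecewise-linear lookup and the sigmoid-shaped table are assumptions about how such a table is typically used, not the patented lookup scheme.

```python
import numpy as np

def activate_by_table(x, xs, ys):
    # Piecewise-linear activation: xs are sample points of the input
    # range and ys the precomputed activation values (the interpolation
    # table); np.interp interpolates linearly between table entries.
    return np.interp(x, xs, ys)

# Hypothetical table approximating a sigmoid on [-6, 6].
xs = np.linspace(-6.0, 6.0, 65)
ys = 1.0 / (1.0 + np.exp(-xs))

intermediate = np.array([-1.0, 0.0, 2.5], dtype=np.float32)
bias = np.array([0.5, 0.5, 0.5], dtype=np.float32)

biased = intermediate + bias                # vector addition unit 511
output = activate_by_table(biased, xs, ys)  # activation via the table
```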

In step S10, the controller unit 115 then reads the next IO instruction from the instruction storage unit; according to the decoded control signal, the data access unit 116 stores the output data in the first storage unit 53 to the designated address in the external address space, and the operation ends.

An embodiment of the present disclosure further provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps of any of the methods for implementing a convolution extended instruction described in the above method embodiments.

An embodiment of the present disclosure further provides a computer program product, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute some or all of the steps of any of the methods for implementing a convolution extended instruction described in the above method embodiments.

Another embodiment of the present disclosure further discloses a chip, which includes the neural network computing device of the above embodiment (as shown in FIG. 1).

Another embodiment of the present disclosure further discloses a chip package structure, which includes the above chip.

Another embodiment of the present disclosure further discloses a board card, which includes the above chip package structure.

Another embodiment of the present disclosure further discloses an electronic device, which includes the above board card.

The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a cloud server, a camera, a video camera, a projector, a watch, earphones, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound scanner, and/or an electrocardiograph.

It should be noted that, for the sake of brevity, each of the foregoing method embodiments is described as a series of combined actions; however, those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because according to the present disclosure, some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a logical functional division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present disclosure; the description of the above embodiments is only intended to help understand the method of the present disclosure and its core idea. Meanwhile, those of ordinary skill in the art may, based on the idea of the present disclosure, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (13)

1. A method for executing a convolution extended instruction, characterized in that the method comprises the following steps:
a computing device reads the convolution extended instruction from a memory and obtains the input data, the convolution kernel, and the activation operation of the convolution extended instruction;
the convolution extended instruction comprises an operation code and an operation domain; the operation code comprises an identifier of the convolution extended instruction; the operation domain comprises a convolution subdomain and an activation subdomain, the convolution subdomain comprising an address storing the input data and an address of the convolution kernel, and the activation subdomain comprising an identification code of the activation operation or an interpolation table address of the activation operation;
the computing device performs a convolution operation on the input data and the convolution kernel to obtain an intermediate result, and performs, through the activation subdomain, an activation operation on the intermediate result to obtain a final result of the instruction.
2. The method according to claim 1, characterized in that:
the activation operation comprises a convolutional neural network Maxout operation, a convolutional neural network PReLU operation, a convolutional neural network RReLU operation, a convolutional neural network Leaky ReLU operation, a nonlinear activation operation, or a linear activation operation.
3. The method according to claim 1, characterized in that, if the activation subdomain comprises the interpolation table address of the activation operation, performing, through the activation subdomain, the activation operation on the intermediate result to obtain the final result of the instruction comprises:
the computing device extracts the interpolation table corresponding to the interpolation table address of the activation operation, and performs the activation operation on the intermediate result with the interpolation table to obtain the final result of the instruction.
4. The method according to claim 1, characterized in that, if the activation subdomain comprises the identification code of the activation operation, performing, through the activation subdomain, the activation operation on the intermediate result to obtain the final result of the instruction comprises:
the computing device recognizes the identification code of the activation operation to determine the activation operation, reads the interpolation table of the activation operation, and performs the activation operation on the intermediate result with the interpolation table to obtain the final result of the instruction.
5. The method according to claim 1, characterized in that the computing device performing the convolution operation on the input data and the convolution kernel to obtain the intermediate result comprises:
a main operation module of the computing device splits the input data into multiple parts to obtain multiple pieces of input sub-data, distributes the multiple pieces of input sub-data to multiple slave operation modules, and sends the convolution kernel to the multiple slave operation modules; the multiple slave operation modules perform, in parallel, the multiplication of the input sub-data and the convolution kernel to obtain multiple sub-results; and the main operation module of the computing device splices the multiple sub-results to obtain the intermediate result.
6. A computing device, characterized in that the computing device comprises: a memory, an operation unit, an interconnection module, a controller unit, and a data access unit;
wherein the operation unit comprises an adder and a multiplier;
the controller unit is configured to read the convolution extended instruction from the memory and obtain the input data, the convolution kernel, and the activation operation of the convolution extended instruction;
the convolution extended instruction comprises an operation code and an operation domain; the operation code comprises an identifier of the convolution extended instruction; the operation domain comprises a convolution subdomain and an activation subdomain, the convolution subdomain comprising an address storing the input data and an address of the convolution kernel, and the activation subdomain comprising an identification code of the activation operation or an interpolation table address of the activation operation;
the data access unit is configured to obtain the input data and the convolution kernel corresponding to the address of the input data and the address of the convolution kernel;
the operation unit is configured to perform a convolution operation on the input data and the convolution kernel to obtain an intermediate result, and to perform, through the activation subdomain, an activation operation on the intermediate result to obtain a final result of the instruction.
7. The computing device according to claim 6, characterized in that:
the activation operation comprises a convolutional neural network Maxout operation, a convolutional neural network PReLU operation, a convolutional neural network RReLU operation, a convolutional neural network Leaky ReLU operation, a nonlinear activation operation, or a linear activation operation.
8. The computing device according to claim 6, characterized in that, if the activation subdomain comprises the interpolation table address of the activation operation:
the data access unit is configured to extract the interpolation table corresponding to the interpolation table address of the activation operation;
the operation unit is configured to perform the activation operation on the intermediate result with the interpolation table to obtain the final result of the instruction.
9. The computing device according to claim 6, characterized in that, if the activation subdomain comprises the identification code of the activation operation, the operation unit further comprises an activation operator;
the controller unit is configured to recognize the identification code of the activation operation to determine the activation operation;
the activation operator is configured to fetch the interpolation table of the activation operation and to perform the activation operation on the intermediate result with the interpolation table to obtain the final result of the instruction.
10. The computing device according to claim 8, characterized in that the operation unit further comprises a main operation module and multiple slave operation modules, the main operation module comprising an adder and a multiplier, and each slave operation module comprising an adder and a multiplier;
the main operation module is configured to split the input data into multiple parts to obtain multiple pieces of input sub-data, to distribute the multiple pieces of input sub-data to the multiple slave operation modules, and to send the convolution kernel to the multiple slave operation modules; the multiple slave operation modules are configured to perform, in parallel, the multiplication of the input sub-data and the convolution kernel to obtain multiple sub-results; and the main operation module is configured to splice the multiple sub-results to obtain the intermediate result.
11. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method according to any one of claims 1 to 5.
12. A computer program product, characterized in that the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method according to any one of claims 1 to 5.
13. An electronic device, characterized in that the electronic device comprises a processor, and the processor comprises the computing device according to any one of claims 6 to 10.