TW201928793A

TW201928793A - Neural network forward operation method and related product having the advantages of small calculation amount and low power consumption

Info

Publication number: TW201928793A
Application number: TW107144040A
Authority: TW
Inventors: 劉少禮; 宋新開; 王秉睿; 張堯; 胡帥
Original assignee: 大陸商北京中科寒武紀科技有限公司
Priority date: 2017-12-14
Filing date: 2018-12-07
Publication date: 2019-07-16
Also published as: CN109961131A; CN109961131B; TWI767098B

Abstract

The present disclosure provides a neural network forward operation method performed on an integrated circuit chip device, the neural network comprising multiple layers, wherein the method includes the following steps: receiving a first calculation instruction, and parsing the first calculation instruction to obtain the first operation of the first calculation instruction included in the ith layer of the forward operation and the input data and weight data corresponding to the first calculation instruction; determining a first complexity of the first operation according to the input data, the weight data, and the first operation, and determining, according to the first complexity, the first data type of the input data and the weight data when performing the first operation, the first data type comprising: a floating point type or a fixed point type; and applying the input data and the weight data to a first operation included in the first layer of the forward operation using the first data type. The technical solution provided by the present disclosure has the advantages of small calculation amount and low power consumption.

Description

Neural network forward calculation method and related products

The present disclosure relates to the field of neural networks, and in particular to a neural network forward operation method and related products.

The artificial neural network (ANN) has been a research hotspot in the field of artificial intelligence since the 1980s. It abstracts the neuron network of the human brain from the perspective of information processing, builds a simple model, and forms different networks according to different connection schemes. In engineering and academia it is often simply called a neural network or neural-like network. A neural network is a computational model consisting of a large number of interconnected nodes (also called neurons). Existing neural network operations rely on a CPU (Central Processing Unit) or GPU (Graphics Processing Unit) to implement the forward operation of the neural network; such a forward operation involves a large amount of calculation and high power consumption.

The embodiments of the present disclosure provide a neural network forward operation method and related products, which can improve the processing speed and efficiency of a computing device.

In a first aspect, a neural network forward operation method performed on an integrated circuit chip device is provided. The neural network includes multiple layers, and the method includes the following steps:

receiving a first calculation instruction, and parsing the first calculation instruction to obtain the first operation that the first calculation instruction includes in the i-th layer of the forward operation, together with the input data and weight data corresponding to the first calculation instruction; i is an integer greater than or equal to 1, and if i is greater than or equal to 2, the input data is the output data of the (i-1)-th layer;

determining a first complexity of the first operation according to the input data, the weight data, and the first operation, and determining, according to the first complexity, the first data type of the input data and the weight data when the first operation is performed, the first data type being a floating-point type or a fixed-point type;

performing the first operation included in the first layer of the forward operation on the input data and the weight data in the first data type.

In a second aspect, an integrated circuit chip device is provided. The integrated circuit chip device is configured to perform the forward operation of a neural network that includes multiple layers. The device includes a processing circuit and an external interface;

the external interface is configured to receive a first calculation instruction;

the processing circuit is configured to parse the first calculation instruction to obtain the first operation that the first calculation instruction includes in the i-th layer of the forward operation, together with the input data and weight data corresponding to the first calculation instruction; i is an integer greater than or equal to 1, and if i is greater than or equal to 2, the input data is the output data of the (i-1)-th layer;

the processing circuit is further configured to determine a first complexity of the first operation according to the input data, the weight data, and the first operation, and to determine, according to the first complexity, the first data type of the input data and the weight data when the first operation is performed, the first data type being a floating-point type or a fixed-point type;

the processing circuit is further configured to perform the first operation included in the i-th layer of the forward operation on the input data and the weight data in the first data type.

In a third aspect, a neural network computing device is provided. The neural network computing device includes one or more of the integrated circuit chip devices provided in the second aspect.

In a fourth aspect, a combined processing device is provided. The combined processing device includes the neural network computing device provided in the third aspect, a universal interconnection interface, and a general-purpose processing device;

the neural network computing device is connected to the general-purpose processing device through the universal interconnection interface.

In a fifth aspect, a chip is provided. The chip integrates the device of the second aspect, the device of the third aspect, or the device of the fourth aspect.

In a sixth aspect, an electronic device is provided. The electronic device includes the chip of the fourth aspect.

It can be seen that the embodiments of the present disclosure provide a data conversion operation circuit that converts the type of a data block before the operation is performed, which saves transmission resources and calculation resources and therefore offers low power consumption and a small amount of calculation.

To enable those skilled in the art to better understand the solution of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person having ordinary knowledge in the technical field on the basis of the embodiments of the present disclosure, without creative effort, fall within the scope of protection of the present disclosure.

In the method provided by the first aspect, determining the first data type of the input data and the weight data when the first operation is performed according to the first complexity includes:

comparing the first complexity with a preset threshold; if the first complexity is higher than the preset threshold, the first data type is determined to be the fixed-point type, and if the first complexity is lower than or equal to the preset threshold, the first data type is determined to be the floating-point type.
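The threshold comparison above can be sketched in a few lines of Python. This is only an illustration of the stated rule; the names `DataType` and `choose_data_type` are hypothetical and do not come from the disclosure, and the threshold value is whatever the implementer presets.

```python
from enum import Enum


class DataType(Enum):
    """The two candidate data types named in the method (hypothetical enum)."""
    FLOAT = "floating-point"
    FIXED = "fixed-point"


def choose_data_type(complexity: float, threshold: float) -> DataType:
    """Pick the first data type from the first complexity.

    Follows the rule stated above: strictly higher than the preset
    threshold selects fixed-point, lower than or equal selects floating-point.
    """
    return DataType.FIXED if complexity > threshold else DataType.FLOAT
```

For example, with a threshold of 5, a complexity of 10 would select the fixed-point type, while a complexity of exactly 5 would select the floating-point type.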

In the method provided by the first aspect, after determining the first data type of the input data and the weight data when the first operation is performed according to the first complexity, the method further includes:

determining the second data type to which the input data and the weight data belong; if the second data type differs from the first data type, converting the input data of the second data type and the weight data of the second data type into input data of the first data type and weight data of the first data type.
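A minimal sketch of this conversion step follows, under several assumptions that are not part of the disclosure: data types are tagged with the strings `"float"` and `"fixed"`, fixed-point values are stored as integers scaled by `2**frac_bits`, and the helper name `convert_if_needed` is hypothetical.

```python
def convert_if_needed(values, current_type, target_type, frac_bits=8):
    """Convert a data block only when its current (second) data type
    differs from the chosen (first) data type.

    Assumption: fixed-point values are plain integers that represent
    value * 2**frac_bits; the disclosure does not prescribe this encoding.
    """
    if current_type == target_type:
        return list(values)  # already in the first data type, nothing to do
    scale = 1 << frac_bits
    if target_type == "fixed":
        return [round(v * scale) for v in values]
    if target_type == "float":
        return [v / scale for v in values]
    raise ValueError(f"unknown data type: {target_type!r}")


# Both the input data and the weight data are converted in the same way.
input_fixed = convert_if_needed([0.5, -1.25], "float", "fixed")
weight_fixed = convert_if_needed([2.0, 0.75], "float", "fixed")
```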

In the method provided by the first aspect, if the first operation is a convolution operation, the input data is the convolution input data and the weight data is the convolution kernel,

First complexity = α * C * kW * kW * M * N * W * C * H;

where α is the convolution coefficient, with a value greater than 1; C, kW, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data;

if the first complexity is greater than the set threshold, it is determined whether the convolution input data and the convolution kernel are floating-point data; if the convolution input data and the convolution kernel are not floating-point data, the convolution input data is converted into floating-point data, the convolution kernel is converted into floating-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the floating-point data type.
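The complexity formula for the convolution case is a plain product, which the following sketch evaluates. The function name `conv_complexity` and the tuple-based argument layout are assumptions made for illustration; the dimension order simply mirrors the formula as written.

```python
import math


def conv_complexity(alpha, kernel_dims, input_dims):
    """First complexity of a convolution.

    kernel_dims: the four kernel dimension values, in the order listed
                 in the formula (C, kW, kW, M).
    input_dims:  the four input dimension values (N, W, C, H).
    alpha:       the convolution coefficient (stated to be greater than 1).
    """
    if len(kernel_dims) != 4 or len(input_dims) != 4:
        raise ValueError("both operands are described by four dimensions")
    return alpha * math.prod(kernel_dims) * math.prod(input_dims)


# Example: 4 kernels of 3 channels and 3x3 taps, one 8x8 sample with 3 channels.
complexity = conv_complexity(2, (3, 3, 3, 4), (1, 8, 3, 8))
```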

In the method provided by the first aspect, if the first operation is a matrix-multiply-matrix operation, the input data is the first matrix of the matrix-multiply-matrix operation and the weight is the second matrix of the matrix-multiply-matrix operation;

First complexity = β * F * G * E * F; where β is the matrix coefficient, with a value greater than or equal to 1, F and G are the row and column values of the first matrix, and E and F are the row and column values of the second matrix;

if the first complexity is greater than the set threshold, it is determined whether the first matrix and the second matrix are floating-point data; if the first matrix and the second matrix are not floating-point data, the first matrix is converted into floating-point data, the second matrix is converted into floating-point data, and the matrix-multiply-matrix operation is then performed on the first matrix and the second matrix in the floating-point data type.

In the method provided by the first aspect, if the first operation is a matrix-multiply-vector operation, the input data is the first matrix of the matrix-multiply-vector operation and the weight is the vector of the matrix-multiply-vector operation;

First complexity = β * F * G * F; where β is the matrix coefficient, with a value greater than or equal to 1, F and G are the row and column values of the first matrix, and F is the column value of the vector;

if the first complexity is greater than the set threshold, it is determined whether the first matrix and the vector are floating-point data; if the first matrix and the vector are not floating-point data, the first matrix is converted into floating-point data, the vector is converted into floating-point data, and the matrix-multiply-vector operation is then performed on the first matrix and the vector in the floating-point data type.
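The two matrix formulas can be evaluated the same way. In this sketch the parameter names (`rows1`, `cols1`, `rows2`, `cols2`, `vec_cols`) are hypothetical stand-ins for the letters F, G, and E used in the formulas; they are kept separate here only so that each factor is visible.

```python
def matmul_complexity(beta, rows1, cols1, rows2, cols2):
    """First complexity of matrix-multiply-matrix:
    beta * (rows and columns of the first matrix)
         * (rows and columns of the second matrix)."""
    return beta * rows1 * cols1 * rows2 * cols2


def matvec_complexity(beta, rows1, cols1, vec_cols):
    """First complexity of matrix-multiply-vector:
    beta * (rows and columns of the first matrix) * (column value of the vector)."""
    return beta * rows1 * cols1 * vec_cols


# Example values chosen only for illustration.
mm = matmul_complexity(1, 4, 5, 5, 6)
mv = matvec_complexity(2, 4, 5, 1)
```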

In the method provided by the first aspect, the i-th layer may further include one or any combination of the following operations: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.

In the device provided by the second aspect, the processing circuit is specifically configured to compare the first complexity with a preset threshold; if the first complexity is higher than the preset threshold, the computing device determines that the first data type is the fixed-point type, and if the first complexity is lower than or equal to the preset threshold, the computing device determines that the first data type is the floating-point type.

In the device provided by the second aspect, the integrated circuit chip device further includes a data type conversion circuit;

the processing circuit is further configured to determine the second data type to which the input data and the weight data belong and, if the second data type differs from the first data type, to send a conversion command to the data type conversion circuit;

the data type conversion circuit is configured to convert, according to the conversion command, the input data of the second data type and the weight data of the second data type into input data of the first data type and weight data of the first data type.

In the device provided by the second aspect, if the first operation is a convolution operation, the input data is the convolution input data and the weight data is the convolution kernel,

the processing circuit is configured to calculate the first complexity, where first complexity = α * C * kW * kW * M * N * W * C * H;

the processing circuit is further configured to determine, if the first complexity is greater than the set threshold, whether the convolution input data and the convolution kernel are floating-point data; if the convolution input data and the convolution kernel are not floating-point data, the convolution input data is converted into floating-point data, the convolution kernel is converted into floating-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the floating-point data type.

In the device provided by the second aspect, if the first operation is a matrix-multiply-matrix operation, the input data is the first matrix of the matrix-multiply-matrix operation and the weight is the second matrix of the matrix-multiply-matrix operation;

the processing circuit is configured to calculate the first complexity;

the processing circuit is further configured to determine, if the first complexity is greater than the set threshold, whether the first matrix and the second matrix are floating-point data; if the first matrix and the second matrix are not floating-point data, the first matrix is converted into floating-point data, the second matrix is converted into floating-point data, and the matrix-multiply-matrix operation is then performed on the first matrix and the second matrix in the floating-point data type.

In the device provided by the second aspect, if the first operation is a matrix-multiply-vector operation, the input data is the first matrix of the matrix-multiply-vector operation and the weight is the vector of the matrix-multiply-vector operation;

the processing circuit is further configured to determine, if the first complexity is greater than the set threshold, whether the first matrix and the vector are floating-point data; if the first matrix and the vector are not floating-point data, the first matrix is converted into floating-point data, the vector is converted into floating-point data, and the matrix-multiply-vector operation is then performed on the first matrix and the vector in the floating-point data type.

In the device provided by the second aspect, the i-th layer may further include one or any combination of the following operations: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.

FIG. 1 shows a forward operation of a neural network provided by an embodiment of the present disclosure. Each layer uses its own input data and weights to compute the corresponding output data according to the operation rule specified by the type of the layer;

The forward operation of a neural network (also called inference) processes the input data of each layer, layer by layer, and obtains the output data after a certain amount of calculation. It has the following characteristics:

Input of a layer:

The input of a layer can be the input data of the neural network;

The input of a layer can be the output of other layers;

The input of a layer can be the output of the same layer at the previous time step (the case of a recurrent neural network);

A layer can obtain input from several of the above input sources at the same time;

Output of a layer:

The output of a layer can serve as the output result of the neural network;

The output of a layer can be the input of other layers;

The output of a layer can be the input of the same layer at the next time step (the case of a recurrent neural network);

A layer can send its output to several of the above output directions;

Specifically, the types of operations of the layers in the neural network include but are not limited to the following:

Convolutional layer (that is, performing a convolution operation);

Fully connected layer (that is, performing a fully connected operation);

Normalization (regularization) layer: including the LRN (Local Response Normalization) layer, the BN (Batch Normalization) layer, and other types;

Pooling layer;

Activation layer: including but not limited to the Sigmoid layer, ReLU layer, PReLu layer, LeakyReLu layer, and Tanh layer;

Backward operation of a layer: the backward operation of each layer has two parts. One part uses the output data gradient (possibly in sparse representation) and the input data (possibly in sparse representation) to compute the gradient of the weights (used to update the weights of this layer in the "weight update" step). The other part uses the output data gradient (possibly in sparse representation) and the weights (possibly in sparse representation) to compute the input data gradient (used as the output data gradient of the next layer in the backward operation, so that that layer can perform its own backward operation);

The backward operation proceeds in the opposite order to the forward operation, propagating gradients backward starting from the last layer.

In an optional solution, the output data gradient used in the backward calculation of a layer can come from:

The gradient returned by the final loss function (or cost function) of the neural network;

The input data gradients of other layers;

The input data gradient of the same layer at the previous time step (the case of a recurrent neural network);

A layer can obtain output data gradients from several of the above sources at the same time;

After the backward operation of the neural network has been performed, the gradients of the weights of each layer have been computed. In this step, the first input buffer and the second input buffer of the device store the weights of the layer and the gradients of the weights, respectively, and the weight gradients are then used in the arithmetic unit to update the weights;

The operations described above are all operations of a single layer of the neural network. For a multi-layer neural network, the implementation is as follows. In the forward operation, after the previous layer of the artificial neural network has finished executing, the operation instruction of the next layer takes the output data computed in the arithmetic unit as the input data of the next layer (or performs certain operations on that output data before using it as the input data of the next layer), and at the same time replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of the previous layer of the artificial neural network has finished, the operation instruction of the next layer takes the input data gradient computed in the arithmetic unit as the output data gradient of the next layer (or performs certain operations on that input data gradient before using it as the output data gradient of the next layer), and at the same time replaces the weights with the weights of the next layer; (this is illustrated in the following figures, in which dashed arrows indicate the backward operation, solid arrows indicate the forward operation, and the label below each figure indicates its meaning)
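The forward half of this multi-layer procedure can be sketched as a simple loop in which each layer consumes the previous layer's output and swaps in its own weights. The helpers `forward`, `scale`, and `add_bias` are hypothetical toy stand-ins, not operations defined by the disclosure.

```python
def forward(layers, network_input):
    """Run a forward pass layer by layer.

    layers: sequence of (operation, weights) pairs; each operation is a
            callable taking (input_data, weights) and returning output data.
    The output of one layer becomes the input of the next, and the weights
    are replaced with those of the next layer at each step.
    """
    data = network_input
    for operation, weights in layers:
        data = operation(data, weights)
    return data


# Two toy element-wise "layers" used only to exercise the loop.
def scale(data, w):
    return [x * w for x in data]


def add_bias(data, b):
    return [x + b for x in data]


result = forward([(scale, 2), (add_bias, 1)], [1, 2, 3])
```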

Representation of fixed-point data

Fixed-point conversion means converting the data representation of a data block in the network into a data representation with a specific, fixed decimal point position (mapped to the arrangement of the 0/1 bits of the data on the circuit device);

In an optional solution, multiple data items are grouped into a data block that is converted as a whole using the same fixed-point representation;

FIG. 1a shows a specific representation of a short-bit fixed-point data structure for storing data according to an embodiment of the present invention. One bit represents the sign, M bits represent the integer part, and N bits represent the fractional part. Compared with the 32-bit floating-point representation, the short-bit fixed-point representation used in the present invention not only occupies fewer bits; for data of the same layer and the same type in the neural network, such as all the weight data of the first convolutional layer, it additionally provides a flag, Point location, that records the position of the decimal point, so that the precision of the representation and the representable data range can be adjusted according to the distribution of the actual data.

Floating-point numbers are represented with 32 bits; in this technical solution, using fixed-point numbers reduces the number of bits per value, thereby reducing both the amount of data transmitted and the amount of data operated on.
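As a rough illustration of a shared decimal point position for a whole data block, the sketch below quantizes a list of values to signed integers with a common `point_location` (the number of fractional bits). The two's-complement integer encoding, the saturation on overflow, the 16-bit default width, and the function names are all assumptions made for this example; the disclosure only specifies a sign bit, M integer bits, N fractional bits, and a Point location flag.

```python
def quantize_block(values, total_bits=16, point_location=8):
    """Quantize a block of values with one shared decimal point position.

    point_location is the number of fractional bits (N); the remaining
    total_bits - 1 - point_location bits hold the integer part (M).
    Values outside the representable range are saturated (an assumption).
    """
    scale = 1 << point_location
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return [max(lo, min(hi, round(v * scale))) for v in values]


def dequantize_block(codes, point_location=8):
    """Recover approximate real values from the fixed-point codes."""
    scale = 1 << point_location
    return [c / scale for c in codes]


codes = quantize_block([1.5, -0.25])
restored = dequantize_block(codes)
```

Raising `point_location` gives finer precision at the cost of a smaller representable range, which is the trade-off the Point location flag is described as controlling.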

The input data is shown in FIG. 2a (N samples, each sample with C channels, and the feature map of each channel having height H and width W), and the weights, that is, the convolution kernels, are shown in FIG. 2b (M convolution kernels, each with C channels and with height KH and width KW). The rule of the convolution operation is the same for all N samples of the input data, so the process is explained below for a single sample. On one sample, each of the M convolution kernels performs the same operation; each convolution kernel produces one planar feature map, so the M convolution kernels finally produce M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product operation is performed at each planar position of the sample, and the kernel is then slid along the H and W directions. For example, FIG. 2c shows a convolution kernel performing the inner product operation at the lower right corner of one sample of the input data; FIG. 2d shows the convolution position slid one cell to the left, and FIG. 2e shows the convolution position slid one cell upward.
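The per-sample procedure described above (one inner product per planar position, repeated for each of the M kernels) can be written directly as nested loops. This sketch assumes stride 1 and no padding, neither of which is stated in the text, and uses nested Python lists; the function name `conv_single_sample` is hypothetical.

```python
def conv_single_sample(sample, kernels):
    """Direct convolution of one sample.

    sample:  nested list indexed [c][h][w]       (C channels, H x W each)
    kernels: nested list indexed [m][c][kh][kw]  (M kernels, C channels, KH x KW)
    Returns M feature maps indexed [m][i][j].
    Assumes stride 1 and no padding (illustrative choices).
    """
    channels = len(sample)
    height, width = len(sample[0]), len(sample[0][0])
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    feature_maps = []
    for kernel in kernels:                      # one feature map per kernel
        fmap = []
        for i in range(height - kh + 1):        # slide along H
            row = []
            for j in range(width - kw + 1):     # slide along W
                acc = 0
                for c in range(channels):       # inner product over C, KH, KW
                    for di in range(kh):
                        for dj in range(kw):
                            acc += sample[c][i + di][j + dj] * kernel[c][di][dj]
                row.append(acc)
            fmap.append(row)
        feature_maps.append(fmap)
    return feature_maps


sample = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # C=1, H=W=3
kernels = [[[[1, 0], [0, 1]]]]                 # M=1, C=1, KH=KW=2
maps = conv_single_sample(sample, kernels)
```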

When the first operation is a convolution operation, the input data is the convolution input data and the weight data is the convolution kernel,

Specifically, the convolution can be processed with the chip structure shown in FIG. 3a. When the first complexity is greater than the set threshold, the data conversion operation circuit of the main processing circuit (also called the main unit) can convert the data in some or all of the convolution kernels of the weights into fixed-point data. The control circuit of the main processing circuit sends the data in some or all of the convolution kernels of the weights to those basic processing circuits (also called basic units) that are directly connected to the main processing circuit through the horizontal data input interface (for example, the gray-filled vertical data path at the top of FIG. 3b);

In an optional solution, the control circuit of the main processing circuit sends the data of one convolution kernel in the weights to a given basic processing circuit one number or a portion of the numbers at a time; (for example, for a given basic processing circuit, the 1st transmission sends the 1st number of row 3, the 2nd transmission sends the 2nd number of row 3, the 3rd transmission sends the 3rd number of row 3, and so on; or the 1st transmission sends the first two numbers of row 3, the 2nd transmission sends the 3rd and 4th numbers of row 3, the 3rd transmission sends the 5th and 6th numbers of row 3, and so on;)

In another optional solution, the control circuit of the main processing circuit sends the data of several convolution kernels in the weights to a given basic processing circuit, one number or a portion of the numbers from each kernel at a time; (for example, for a given basic processing circuit, the 1st transmission sends the 1st number of each of rows 3, 4, and 5, the 2nd transmission sends the 2nd number of each of rows 3, 4, and 5, the 3rd transmission sends the 3rd number of each of rows 3, 4, and 5, and so on; or the 1st transmission sends the first two numbers of each of rows 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4, and 5, the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4, and 5, and so on;)

主處理電路的控制電路把輸入數據按照卷積的位置進行劃分，主處理電路的控制電路將輸入數據中的部分或全部卷積位置中的數據發送到通過竪向數據輸入介面直接與主處理電路相連的那些基礎處理電路（例如，圖3b中基礎處理電路陣列左側的灰色填充的橫向數據通路）；The control circuit of the main processing circuit divides the input data according to the position of the convolution. The control circuit of the main processing circuit sends the data in some or all of the convolution positions in the input data to the main processing circuit directly through the vertical data input interface. Connected basic processing circuits (for example, the gray-filled horizontal data path to the left of the basic processing circuit array in Figure 3b);

In an alternative, the control circuit of the main processing circuit sends the data of one convolution position in the input data to a given basic processing circuit, one number or a portion of the numbers per transmission (for example, for a given basic processing circuit, the 1st transmission sends the 1st number of column 3, the 2nd transmission sends the 2nd number of column 3, the 3rd transmission sends the 3rd number of column 3, and so on; or the 1st transmission sends the first two numbers of column 3, the 2nd transmission sends the 3rd and 4th numbers of column 3, the 3rd transmission sends the 5th and 6th numbers of column 3, and so on);

In another alternative, the control circuit of the main processing circuit sends the data of several convolution positions in the input data to a given basic processing circuit, one number or a portion of the numbers from each position per transmission (for example, for a given basic processing circuit, the 1st transmission sends the 1st number of each of columns 3, 4, and 5, the 2nd transmission sends the 2nd number of each of columns 3, 4, and 5, the 3rd transmission sends the 3rd number of each of columns 3, 4, and 5, and so on; or the 1st transmission sends the first two numbers of each of columns 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of those columns, the 3rd transmission sends the 5th and 6th numbers of each of those columns, and so on);
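The transmission orders described above, for weight rows and for input-data columns alike, can be sketched as a simple chunked schedule. This is only an illustration: the function name `transmission_schedule`, the parameter `per_send`, and the sample values are hypothetical and do not come from the disclosure.

```python
from typing import Iterator, List, Sequence


def transmission_schedule(
    vectors: Sequence[Sequence[float]], per_send: int = 1
) -> Iterator[List[List[float]]]:
    """Yield, for each transmission, the slice of every selected row/column sent."""
    if per_send < 1:
        raise ValueError("per_send must be at least 1")
    length = max((len(v) for v in vectors), default=0)
    for start in range(0, length, per_send):
        # One transmission carries `per_send` consecutive numbers of each vector.
        yield [list(v[start:start + per_send]) for v in vectors]


# Hypothetical rows (or columns) 3, 4 and 5, six numbers each.
vectors_3_4_5 = [
    [31, 32, 33, 34, 35, 36],
    [41, 42, 43, 44, 45, 46],
    [51, 52, 53, 54, 55, 56],
]

one_at_a_time = list(transmission_schedule(vectors_3_4_5, per_send=1))
two_at_a_time = list(transmission_schedule(vectors_3_4_5, per_send=2))
```

With `per_send=1` the first transmission carries the first number of each of the three vectors; with `per_send=2` the second transmission carries their 3rd and 4th numbers, matching the two orderings given in the examples.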

After a basic processing circuit receives weight data, it transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in FIG. 3b); after a basic processing circuit receives input data, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in FIG. 3b);

Each basic processing circuit performs operations on the data it receives;

In an alternative, the basic processing circuit computes the product of one or more pairs of data items at a time, and then accumulates the results in a register and/or on-chip cache;

In an alternative, the basic processing circuit computes the inner product of one or more pairs of vectors at a time, and then accumulates the results in a register and/or on-chip cache;

After the basic processing circuit computes a result, it may transmit the result out through its data output interface;

In an alternative, the computed result may be the final result or an intermediate result of the inner product operation;

Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result toward a basic processing circuit that can output directly to the main processing circuit (for example, in FIG. 3b, the basic processing circuits in the bottom row output their results directly to the main processing circuit, and the other basic processing circuits pass their results downward through their vertical output interfaces).

After a basic processing circuit receives a computed result from another basic processing circuit, it transmits that data to another basic processing circuit connected to it or to the main processing circuit;

Results are output in the direction from which they can reach the main processing circuit directly (for example, the basic processing circuits in the bottom row output their results directly to the main processing circuit, and the other basic processing circuits pass their results downward through their vertical output interfaces);

The main processing circuit receives the inner product results from the basic processing circuits and thereby obtains the output result.
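As a rough illustration of the compute-then-forward behaviour described above, the following sketch models each basic processing circuit as holding one inner product result, with only the bottom row connected to the main processing circuit. The one-result-per-circuit model and the names `inner_product` and `drain_to_main` are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Multiply element pairs and accumulate, as a basic circuit would in a register."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def drain_to_main(
    results: Sequence[Sequence[float]],
) -> List[Tuple[int, int, int, float]]:
    """Return (hops, row, col, value) in the order the main circuit receives them.

    `hops` counts how many times a result is passed downward before it reaches
    the bottom row, which outputs directly to the main processing circuit.
    """
    n_rows = len(results)
    received = []
    for hops in range(n_rows):
        row = n_rows - 1 - hops
        for col, value in enumerate(results[row]):
            received.append((hops, row, col, value))
    return received


# A hypothetical 2 x 2 array of basic processing circuits.
grid = [
    [inner_product([1, 2], [3, 4]), inner_product([1, 2], [5, 6])],
    [inner_product([0, 1], [3, 4]), inner_product([0, 1], [5, 6])],
]
arrivals = drain_to_main(grid)
```

The bottom row's results arrive with zero hops, and each row above it needs one more downward pass.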

Referring to FIG. 4a, which shows a matrix-by-matrix multiplication: when the first operation is a matrix-by-matrix multiplication, the input data is the first matrix of that multiplication and the weight is the second matrix of that multiplication;

Referring to FIG. 4b, the matrix-by-matrix multiplication is carried out using the device shown in FIG. 3b;

The following describes the multiplication of a matrix S of size M rows by L columns and a matrix P of size L rows by N columns (each row of S has the same length as each column of P, as shown in FIG. 2d), where the neural network computing device has K basic processing circuits:

Step S401b: when the first complexity is greater than a set threshold, the main processing circuit converts the matrix S and the matrix P into fixed-point data; the control circuit of the main processing circuit distributes each row of the matrix S to one of the K basic processing circuits, and the basic processing circuit stores the received data in its on-chip cache and/or registers. Specifically, the data may be sent to those of the K basic processing circuits that are connected to the main processing circuit.

In an alternative, if the number of rows M of S satisfies M <= K, the control circuit of the main processing circuit distributes one row of S to each of M basic processing circuits;

In an alternative, if the number of rows M of S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of S to each basic processing circuit.

Mi rows of S are distributed to the i-th basic processing circuit, and the set of these Mi rows is denoted Ai; FIG. 2e shows the computation to be performed on the i-th basic processing circuit.
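The row distribution can be sketched as follows. The disclosure does not say which rows go to which circuit, so the round-robin assignment here is an assumption chosen for simplicity, and `distribute_rows` is a hypothetical name.

```python
from typing import List


def distribute_rows(m: int, k: int) -> List[List[int]]:
    """Assign row indices 0..m-1 of S to k basic processing circuits.

    Returns a list where entry i is Ai, the set of rows held by circuit i,
    so Mi == len(result[i]). Round-robin order is an assumption.
    """
    if m < 0 or k < 1:
        raise ValueError("need m >= 0 and k >= 1")
    groups: List[List[int]] = [[] for _ in range(k)]
    for row in range(m):
        groups[row % k].append(row)
    return groups


few_rows = distribute_rows(m=3, k=5)   # M <= K: M circuits get one row each
many_rows = distribute_rows(m=7, k=3)  # M > K: each circuit gets one or more rows
```

When M <= K, only M circuits receive a row and the rest stay idle; when M > K, every circuit receives at least one row.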

In an alternative, in each basic processing circuit, for example the i-th basic processing circuit:

The matrix Ai distributed by the main processing circuit is received and stored in the registers and/or on-chip cache of the i-th basic processing circuit; this reduces the amount of data transmitted afterwards, improves computational efficiency, and lowers power consumption.

Step S402b: the control circuit of the main processing circuit transmits the parts of the matrix P to each basic processing circuit by broadcast;

In an alternative, each part of the matrix P may be broadcast only once into the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of P obtained in that single broadcast to complete the inner product operations for every row of the matrix Ai. Reuse in this embodiment means that the basic processing circuit uses the same data repeatedly in its computation; for example, reusing the data of the matrix P means using that data multiple times.

In an alternative, the control circuit of the main processing circuit may broadcast the parts of the matrix P multiple times into the registers or on-chip cache of each basic processing circuit; the i-th basic processing circuit does not reuse the data of P obtained in each broadcast, and completes the inner product operations for the rows of the matrix Ai in separate passes;

In an alternative, the control circuit of the main processing circuit may broadcast the parts of the matrix P multiple times into the registers or on-chip cache of each basic processing circuit; the i-th basic processing circuit partially reuses the data of P obtained in each broadcast to complete the inner product operations for the rows of the matrix Ai;
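One way to compare the three broadcast options is to count how many numbers of P the main processing circuit has to send. The model below is an interpretation, not something the disclosure states: it assumes each full broadcast of P serves `rows_per_broadcast` rows of Ai, so full reuse corresponds to serving all Mi rows from one broadcast and no reuse to serving one row per broadcast. The function names are hypothetical.

```python
import math


def broadcast_count(max_rows_per_circuit: int, rows_per_broadcast: int) -> int:
    """Number of times P must be broadcast under the assumed reuse model."""
    if max_rows_per_circuit < 1 or rows_per_broadcast < 1:
        raise ValueError("both arguments must be at least 1")
    return math.ceil(max_rows_per_circuit / rows_per_broadcast)


def broadcast_volume(l: int, n: int, max_rows_per_circuit: int,
                     rows_per_broadcast: int) -> int:
    """Total numbers broadcast for an L x N matrix P."""
    return l * n * broadcast_count(max_rows_per_circuit, rows_per_broadcast)


# Hypothetical sizes: P is 4 x 2, and the busiest circuit holds Mi = 6 rows.
full_reuse = broadcast_volume(4, 2, 6, rows_per_broadcast=6)
partial_reuse = broadcast_volume(4, 2, 6, rows_per_broadcast=2)
no_reuse = broadcast_volume(4, 2, 6, rows_per_broadcast=1)
```

Under this model the single-broadcast option moves the least data but requires each circuit to hold P for all of its rows, while the multi-broadcast options trade extra traffic for less data held at once.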

In an alternative, each basic processing circuit, for example the i-th basic processing circuit, computes the inner product of the data of the matrix Ai and the data of the matrix P;

Step S403b: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits the accumulated result back to the main processing circuit.

In an alternative, the basic processing circuit may transmit the partial sum obtained from each inner product operation back to the main processing circuit for accumulation;

In an alternative, the partial sums obtained from the inner product operations performed by the basic processing circuit may instead be stored in the registers and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit once accumulation is complete;

In an alternative, the partial sums obtained from the inner product operations performed by the basic processing circuit may in some cases be stored and accumulated in the registers and/or on-chip cache of the basic processing circuit and in other cases be transmitted to the main processing circuit for accumulation, with the result transmitted back to the main processing circuit once accumulation is complete.
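Steps S401b to S403b can be put together in a small software model. Several details are assumptions made only so the sketch runs: the fixed-point format (signed 16-bit with `FRAC_BITS` fractional bits), the round-robin row assignment, the conversion of results back to floating point for inspection, and the use of M*L*N as a stand-in complexity measure. None of these is specified in the passage above.

```python
from typing import List

Matrix = List[List[float]]
FRAC_BITS = 8  # assumed number of fractional bits; not given in the source


def to_fixed(value: float, frac_bits: int = FRAC_BITS) -> int:
    """Quantize to a signed 16-bit fixed-point integer (assumed format)."""
    scaled = round(value * (1 << frac_bits))
    return max(-32768, min(32767, scaled))


def matmul_on_array(s: Matrix, p: Matrix, k: int,
                    complexity: float, threshold: float) -> Matrix:
    """Model of S (M x L) times P (L x N) on k basic processing circuits."""
    use_fixed = complexity > threshold
    if use_fixed:
        # S401b: convert both matrices to fixed-point data.
        s = [[to_fixed(v) for v in row] for row in s]
        p = [[to_fixed(v) for v in row] for row in p]
    # S401b: distribute rows of S; circuit i holds rows i, i + k, ... (assumed).
    assignments = [list(range(i, len(s), k)) for i in range(k)]
    n_cols = len(p[0])
    out: Matrix = [[0.0] * n_cols for _ in s]
    # S402b: P is "broadcast" simply by being visible to every circuit here.
    for rows in assignments:  # one iteration per basic processing circuit
        for r in rows:
            for c in range(n_cols):
                acc = 0
                for j in range(len(p)):
                    acc += s[r][j] * p[j][c]  # S403b: accumulate locally
                out[r][c] = acc
    if use_fixed:
        # Rescale for inspection only; a product carries 2 * FRAC_BITS bits.
        scale = 1 << (2 * FRAC_BITS)
        out = [[v / scale for v in row] for row in out]
    return out


S = [[1.5, 2.0], [0.25, -1.0], [3.0, 0.5]]
P = [[2.0, 0.5], [1.0, -4.0]]
complexity = len(S) * len(P) * len(P[0])  # hypothetical measure: M * L * N

fixed_result = matmul_on_array(S, P, k=2, complexity=complexity, threshold=10)
float_result = matmul_on_array(S, P, k=2, complexity=complexity, threshold=100)
```

The sample values are exactly representable with 8 fractional bits, so both paths give the same product; with other inputs the fixed-point path would show quantization error.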

Referring to FIG. 4c, which is a schematic diagram of a matrix-by-vector multiplication: when the first operation is a matrix-by-vector multiplication, the input data is the first matrix of that multiplication and the weight is the vector of that multiplication;

Referring to FIG. 4d, which provides a method for implementing matrix-by-vector multiplication, the method may specifically include:

Step S401: the data conversion operation circuit of the main processing circuit converts each row of the matrix S into fixed-point data, and the control circuit of the main processing circuit distributes each row to one of the K basic processing circuits; the basic processing circuit stores the received distributed data in its on-chip cache and/or registers;

In an alternative, if the number of rows M of the matrix S satisfies M <= K, the control circuit of the main processing circuit distributes one row of S to each of M of the K basic processing circuits;

In an alternative, if the number of rows M of the matrix S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of S to each basic processing circuit.

The set of rows of S distributed to the i-th basic processing circuit is denoted Ai and contains Mi rows; FIG. 2c shows the computation to be performed on the i-th basic processing circuit.

In an alternative, each basic processing circuit, for example the i-th basic processing circuit, may store the received distributed data, for example the matrix Ai, in its registers and/or on-chip cache; this reduces the amount of distributed data transmitted afterwards, improves computational efficiency, and lowers power consumption.

Step S402: the data type operation circuit of the main processing circuit converts the vector P into fixed-point data, and the control circuit of the main processing circuit transmits the parts of the fixed-point vector P to the K basic processing circuits by broadcast;

In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P only once into the registers or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of P obtained in that single broadcast to complete the inner product operations for every row of the matrix Ai. This reduces the amount of data transmitted by repeatedly sending the vector P from the main processing circuit to the basic processing circuits, improves execution efficiency, and lowers transmission power consumption.

In an alternative, the control circuit of the main processing circuit may broadcast the parts of the vector P multiple times into the registers or on-chip cache of each basic processing circuit; the i-th basic processing circuit does not reuse the data of P obtained in each broadcast, and completes the inner product operations for the rows of the matrix Ai in separate passes. This reduces the amount of vector P data in a single transmission inside the basic processing circuit, allows smaller caches and/or registers in the basic processing circuit, improves execution efficiency, lowers transmission power consumption, and reduces cost.

In an alternative, the control circuit of the main processing circuit may broadcast the parts of the vector P multiple times into the registers or on-chip cache of each basic processing circuit; the i-th basic processing circuit partially reuses the data of P obtained in each broadcast to complete the inner product operations for the rows of the matrix Ai. This reduces the amount of data transmitted from the main processing circuit to the basic processing circuits, also reduces the amount of data transmitted inside the basic processing circuit, improves execution efficiency, and lowers transmission power consumption.

Step S403: the inner product operator circuits of the K basic processing circuits compute the inner products of the data of the matrix S and the vector P; for example, the i-th basic processing circuit computes the inner product of the data of the matrix Ai and the data of the vector P;

Step S404: the accumulator circuits of the K basic processing circuits accumulate the inner product results to obtain an accumulated result, and transmit the accumulated result back to the main processing circuit in fixed-point form.

In an alternative, the partial sum obtained from each inner product operation performed by the basic processing circuit may be transmitted back to the main processing circuit for accumulation (a partial sum is a portion of the accumulated result; for example, if the accumulated result is F1*G1 + F2*G2 + F3*G3 + F4*G4 + F5*G5, a partial sum may be the value of F1*G1 + F2*G2 + F3*G3); this reduces the amount of computation inside the basic processing circuit and improves its operating efficiency.

In an alternative, the partial sums obtained from the inner product operations performed by the basic processing circuit may instead be stored in the registers and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit once accumulation is complete; this reduces the amount of data transmitted between the basic processing circuit and the main processing circuit, improves operating efficiency, and lowers data transmission power consumption.

In an alternative, the partial sums obtained from the inner product operations performed by the basic processing circuit may in some cases be stored and accumulated in the registers and/or on-chip cache of the basic processing circuit and in other cases be transmitted to the main processing circuit for accumulation, with the result transmitted back to the main processing circuit once accumulation is complete; this reduces the amount of data transmitted between the basic processing circuit and the main processing circuit, improves operating efficiency, lowers data transmission power consumption, reduces the amount of computation inside the basic processing circuit, and improves the operating efficiency of the basic processing circuit.
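The partial-sum options for step S404 differ mainly in where the additions happen and how many values cross from a basic processing circuit to the main processing circuit. The sketch below uses the five-term example from the text with made-up values for F and G; the function names and the chunk size of 3 are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def partial_sums(f: Sequence[float], g: Sequence[float], chunk: int) -> List[float]:
    """Split the inner product of f and g into partial sums of `chunk` terms."""
    if len(f) != len(g):
        raise ValueError("vectors must have the same length")
    if chunk < 1:
        raise ValueError("chunk must be at least 1")
    return [
        sum(x * y for x, y in zip(f[i:i + chunk], g[i:i + chunk]))
        for i in range(0, len(f), chunk)
    ]


def accumulate_on_main(partials: Sequence[float]) -> Tuple[float, int]:
    """Every partial sum is sent to the main circuit, which adds them up."""
    return sum(partials), len(partials)  # (result, values transferred)


def accumulate_locally(partials: Sequence[float]) -> Tuple[float, int]:
    """The basic circuit adds the partial sums itself and sends one value."""
    return sum(partials), 1


F = [1, 2, 3, 4, 5]   # hypothetical values for F1..F5
G = [6, 7, 8, 9, 10]  # hypothetical values for G1..G5

parts = partial_sums(F, G, chunk=3)  # F1*G1+F2*G2+F3*G3, then F4*G4+F5*G5
on_main = accumulate_on_main(parts)
local = accumulate_locally(parts)
```

Both options yield the same accumulated result; accumulating on the main processing circuit transfers one value per partial sum, whereas local accumulation transfers a single value.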

The present disclosure further provides an integrated circuit chip device for performing a forward operation of a neural network, the neural network comprising multiple layers, and the device comprising a processing circuit and an external interface;

The processing circuit is configured to parse a first calculation instruction to obtain the first operation that the first calculation instruction comprises at the i-th layer of the forward operation, together with the input data and weight data corresponding to the first calculation instruction. The value of i may be 1, in which case the input data may be the original input data; when i is greater than or equal to 2, the input data may be the output data of the previous layer, that is, of layer i-1.
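The rule for choosing each layer's input can be shown with a minimal forward loop. The layer functions here are placeholders invented for the example; they stand in for whatever operation the parsed instruction specifies.

```python
from typing import Callable, List, Sequence

Layer = Callable[[Sequence[float]], List[float]]


def forward(layers: Sequence[Layer],
            original_input: Sequence[float]) -> List[List[float]]:
    """Run the layers in order and return the output of every layer."""
    outputs: List[List[float]] = []
    layer_input = list(original_input)  # layer 1 reads the original input data
    for layer in layers:
        layer_output = layer(layer_input)
        outputs.append(layer_output)
        layer_input = layer_output      # layer i reads the output of layer i-1
    return outputs


def scale_by_two(x: Sequence[float]) -> List[float]:  # placeholder layer
    return [2 * v for v in x]


def add_one(x: Sequence[float]) -> List[float]:       # placeholder layer
    return [v + 1 for v in x]


per_layer = forward([scale_by_two, add_one], [1.0, 2.0])
```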

The present disclosure further discloses a neural network computing device comprising one or more chips as shown in FIG. 3a or FIG. 3b, which obtains data to be processed and control information from other processing devices, performs the specified neural network operation, and passes the result to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one chip as shown in FIG. 3a or FIG. 3b is included, the chips may be linked and exchange data through a specific structure, for example by being interconnected over a PCIe bus, to support larger-scale neural network operations. In that case, the chips may share one control system or have independent control systems; they may share memory, or each accelerator may have its own memory. Their interconnection may use any interconnection topology.

The neural network computing device has high compatibility and can be connected to various types of servers through a PCIe interface.

The present disclosure further discloses a combined processing device comprising the neural network computing device described above, a universal interconnection interface, and other processing devices (that is, general-purpose processing devices). The neural network computing device interacts with the other processing devices to jointly complete operations specified by the user. FIG. 5a is a schematic diagram of the combined processing device.

The other processing devices include one or more types of general-purpose or special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network computing device and external data and control, which includes moving data and performing basic control of the neural network computing device such as starting and stopping it; the other processing devices may also cooperate with the neural network computing device to complete computing tasks.

The universal interconnection interface is used to transmit data and control instructions between the neural network computing device and the other processing devices. The neural network computing device obtains the required input data from the other processing devices and writes it into the on-chip storage of the neural network computing device; it may obtain control instructions from the other processing devices and write them into the on-chip control cache of the neural network computing device; it may also read the data in the storage module of the neural network computing device and transmit it to the other processing devices.

As shown in FIG. 5b, the structure optionally further includes a storage device for storing data required by this computing unit/computing device or by other computing units, which is particularly suitable for data to be processed that cannot be held entirely in the internal storage of the neural network computing device or of the other processing devices.

The combined processing device can serve as the system on chip (SoC) of devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the die area of the control portion, increasing processing speed, and lowering overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the device, for example a camera, display, mouse, keyboard, network card, or Wi-Fi interface.

Referring to FIG. 5c, which is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure: as shown in FIG. 5c, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.

The present disclosure does not limit the specific structure of the neural network chip package structure 11. Optionally, as shown in FIG. 5d, the neural network chip package structure 11 includes a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.

The present disclosure does not limit the specific form of the neural network chip 111. The neural network chip 111 includes, but is not limited to, a neural network die integrating a neural network processor, and the die may be made of silicon, germanium, a quantum material, a molecular material, or the like. Depending on the actual situation (for example, a harsh environment) and on application requirements, the neural network die may be packaged so that most of the die is enclosed, with the pins on the die brought out to the exterior of the package structure through conductors such as gold wires for electrical connection to outer layers.

The present disclosure does not limit the specific structure of the neural network chip 111; optionally, see the device shown in FIG. 1a.

The present disclosure does not limit the types of the first substrate 13 and the second substrate 113; each may be a printed circuit board (PCB) or printed wiring board (PWB), or another type of circuit board. The material of the PCB is likewise not limited.

The second substrate 113 of the present disclosure carries the neural network chip 111. The neural network chip package structure 11, obtained by connecting the neural network chip 111 to the second substrate 113 through the second electrical and non-electrical connection device 112, protects the neural network chip 111 and facilitates further packaging of the neural network chip package structure 11 with the first substrate 13.

The packaging method of the second electrical and non-electrical connection device 112, and the structure corresponding to that packaging method, are not limited; a suitable packaging method may be selected, and simply adapted, according to the actual situation and application requirements, for example Flip Chip Ball Grid Array Package (FCBGAP), Low-profile Quad Flat Package (LQFP), Quad Flat Package with Heat sink (HQFP), Quad Flat Non-lead Package (QFN), or Fine-pitch Ball Grid Array (FBGA).

Flip chip packaging is suitable where the packaged area is tightly constrained or where the design is sensitive to wire inductance and signal transmission time. Alternatively, wire bonding may be used to reduce cost and increase the flexibility of the package structure.

A ball grid array (BGA) provides more pins with a short average lead length, and therefore supports high-speed signal transmission; the package may instead use a Pin Grid Array (PGA), Zero Insertion Force (ZIF), Single Edge Contact Connection (SECC), Land Grid Array (LGA), or the like.

Optionally, the neural network chip 111 and the second substrate 113 are packaged using Flip Chip Ball Grid Array packaging; a schematic diagram of a specific neural network chip package structure is shown in FIG. 6. As shown in FIG. 6, the neural network chip package structure includes a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.

The pads 22 are connected to the neural network chip 21, and solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, connecting the neural network chip 21 to the second substrate 24 and thereby packaging the neural network chip 21.

The pins 26 connect to circuitry external to the package structure (for example, the first substrate 13 on the neural network processor board 10), enabling transfer of external and internal data so that the neural network chip 21, or the neural network processor corresponding to it, can process the data. The present disclosure does not limit the type or number of pins; different pin forms may be chosen according to the packaging technology and arranged according to certain rules.

Optionally, the neural network chip package structure further includes an insulating filler placed in the gaps between the pads 22, solder balls 23, and connection points 25 to prevent interference between solder balls.

The insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.

Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating the heat generated when the neural network chip 21 operates. The heat dissipation device may be a metal sheet with good thermal conductivity, a heat sink, or a cooler such as a fan.

舉例來說，如圖6a所示，神經網絡芯片封裝結構11包括：神經網絡芯片21、焊盤22、焊球23、第二基板24、第二基板24上的連接點25、引腳26、絕緣填充物27、散熱膏28和金屬外殼散熱片29。其中，散熱膏28和金屬外殼散熱片29用於散髮神經網絡芯片21運行時的熱量。For example, as shown in FIG. 6a, the neural network chip package structure 11 includes: a neural network chip 21, a pad 22, a solder ball 23, a second substrate 24, a connection point 25, a pin 26, The insulating filler 27, the heat dissipation paste 28, and the metal case heat sink 29. Among them, the heat dissipation paste 28 and the metal shell heat sink 29 are used to dissipate heat during the operation of the neural network chip 21.

Optionally, the neural network chip package structure 11 further includes a reinforcing structure that is connected to the pad 22 and embedded in the solder ball 23, to strengthen the connection between the solder ball 23 and the pad 22.

The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.

The present disclosure does not limit the specific form of the first electrical and non-electrical connection device 12 either; reference may be made to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be packaged by soldering, or the second substrate 113 and the first substrate 13 may be connected by a connecting wire or in a pluggable manner, which makes it easier to replace the first substrate 13 or the neural network chip package structure 11 later.

Optionally, the first substrate 13 includes an interface for a memory unit that expands the storage capacity, for example synchronous dynamic random access memory (SDRAM) or double data rate SDRAM (DDR). Expanding the memory improves the processing capability of the neural network processor.

The first substrate 13 may further include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) bus interface, and so on. These interfaces carry data between the package structure and external circuits, which can improve operation speed and ease of use.

The neural network processor is packaged as the neural network chip 111, the neural network chip 111 is packaged as the neural network chip package structure 11, and the neural network chip package structure 11 is packaged as the neural network processor board 10. The board exchanges data with an external circuit (for example, a computer motherboard) through an interface on the board (a slot or a plug). In other words, the function of the neural network processor is realized directly by using the neural network processor board 10, which also protects the neural network chip 111. Other modules may also be added to the neural network processor board 10, which broadens the application range of the neural network processor and improves its operation efficiency.

In one embodiment, the present disclosure provides an electronic device that includes the neural network processor board 10 or the neural network chip package structure 11 described above.

The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a camera, a video camera, a projector, a watch, headphones, mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.

The specific embodiments described above explain the purpose, technical solutions, and beneficial effects of the present disclosure in further detail. It should be understood that they are only specific embodiments of the present disclosure and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.

A, B, S‧‧‧matrix

P‧‧‧vector

S401b, S402b, S403b, S401, S402, S403, S404‧‧‧steps

10‧‧‧neural network processor board

11‧‧‧neural network chip package structure

12‧‧‧first electrical and non-electrical connection device

13‧‧‧first substrate

111‧‧‧neural network chip

112‧‧‧second electrical and non-electrical connection device

113‧‧‧second substrate

1111‧‧‧storage unit

1112‧‧‧direct memory access unit

1113‧‧‧instruction cache unit

1114‧‧‧weight cache unit

1115‧‧‧input neuron cache unit

1116‧‧‧output neuron cache unit

1117‧‧‧control unit

1118‧‧‧operation unit

21‧‧‧neural network chip

22‧‧‧pad

23‧‧‧solder ball

24‧‧‧second substrate

25‧‧‧connection point on the second substrate 24

26‧‧‧pin

27‧‧‧insulating filler

28‧‧‧thermal paste

29‧‧‧metal-case heat sink

FIG. 1 is a schematic diagram of the forward operation of a neural network.

FIG. 1a is a schematic structural diagram of a fixed-point data type.

FIG. 2a is a schematic diagram of convolution input data.

FIG. 2b is a schematic diagram of a convolution kernel.

FIG. 2c is a schematic diagram of an operation window of a three-dimensional data block of the input data.

FIG. 2d is a schematic diagram of another operation window of a three-dimensional data block of the input data.

FIG. 2e is a schematic diagram of yet another operation window of a three-dimensional data block of the input data.

FIG. 3a is a schematic structural diagram of a neural network chip.

FIG. 3b is a schematic structural diagram of another neural network chip.

FIG. 4a is a schematic diagram of a matrix multiplied by a matrix.

FIG. 4b is a flowchart of a method for multiplying a matrix by a matrix.

FIG. 4c is a schematic diagram of a matrix multiplied by a vector.

FIG. 4d is a flowchart of a method for multiplying a matrix by a vector.

FIG. 5a is a schematic structural diagram of a combined processing device further disclosed by the present disclosure.

FIG. 5b is another schematic structural diagram of a combined processing device further disclosed by the present disclosure.

FIG. 5c is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure.

FIG. 5d is a schematic structural diagram of a neural network chip package structure provided by an embodiment of the present disclosure.

FIG. 5e is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure.

FIG. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.

FIG. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.

Claims

A neural network forward operation method executed on an integrated circuit chip device, the neural network comprising multiple layers, wherein the method comprises the following steps: receiving a first calculation instruction, and parsing the first calculation instruction to obtain a first operation of the first calculation instruction included in the i-th layer of the forward operation, together with input data and weight data corresponding to the first calculation instruction, where i is an integer greater than or equal to 1 and, if i is greater than or equal to 2, the input data is the output data of the (i-1)-th layer; determining a first complexity of the first operation according to the input data, the weight data, and the first operation, and determining, according to the first complexity, a first data type of the input data and the weight data when performing the first operation, the first data type comprising a floating-point type or a fixed-point type; and performing, on the input data and the weight data in the first data type, the first operation included in the i-th layer of the forward operation.

The method according to claim 1, wherein determining, according to the first complexity, the first data type of the input data and the weight data when performing the first operation comprises: comparing the first complexity with a preset threshold; if the first complexity is higher than the preset threshold, determining that the first data type is the fixed-point type; and if the first complexity is lower than or equal to the preset threshold, determining that the first data type is the floating-point type.

The method according to claim 2, wherein, after determining according to the first complexity the first data type of the input data and the weight data when performing the first operation, the method further comprises: determining a second data type to which the input data and the weight data belong; and, if the second data type differs from the first data type, converting the input data belonging to the second data type and the weight data belonging to the second data type into input data belonging to the first data type and weight data belonging to the first data type.
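The threshold comparison and the conditional type conversion described in the two preceding claims can be illustrated with a minimal Python sketch. This is not the claimed circuit. The names `DataType`, `select_data_type`, and `convert_if_needed`, as well as the 8-bit fractional fixed-point scaling, are assumptions introduced only for illustration; the claims do not specify a fixed-point layout.

```python
from enum import Enum


class DataType(Enum):
    FLOAT = "floating-point"
    FIXED = "fixed-point"


def select_data_type(complexity: float, threshold: float) -> DataType:
    """Fixed point when complexity is above the threshold, floating point otherwise."""
    return DataType.FIXED if complexity > threshold else DataType.FLOAT


def to_fixed(values, frac_bits=8):
    # Assumed layout: scale by 2**frac_bits and round to the nearest integer.
    scale = 1 << frac_bits
    return [round(v * scale) for v in values]


def to_float(values, frac_bits=8):
    # Inverse of the assumed fixed-point layout above.
    scale = 1 << frac_bits
    return [v / scale for v in values]


def convert_if_needed(values, current: DataType, target: DataType, frac_bits=8):
    """Convert only when the current (second) type differs from the target (first) type."""
    if current is target:
        return values
    if target is DataType.FIXED:
        return to_fixed(values, frac_bits)
    return to_float(values, frac_bits)
```

For instance, `select_data_type(10, 5)` selects the fixed-point type, and floating-point inputs would then pass through `convert_if_needed` before the operation runs.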

The method according to claim 1, wherein, if the first operation is a convolution operation, the input data is convolution input data and the weight data is a convolution kernel; the first complexity = α * C * kW * kW * M * N * W * C * H, where α is a convolution coefficient with a value greater than 1, C, kW, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data; if the first complexity is greater than a set threshold, it is determined whether the convolution input data and the convolution kernel are floating-point data; if the convolution input data and the convolution kernel are not floating-point data, the convolution input data is converted into floating-point data, the convolution kernel is converted into floating-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the floating-point data type.
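As a worked illustration of the convolution complexity formula, the following Python sketch multiplies the coefficient α by the four kernel dimensions and the four input dimensions. The function name `conv_complexity` and the tuple layout are assumptions. The claim text lists kW twice among the kernel dimensions, so the sketch simply takes four kernel values without interpreting which of them is a height or a width.

```python
from math import prod


def conv_complexity(alpha, kernel_dims, input_dims):
    """First complexity = alpha * (product of 4 kernel dims) * (product of 4 input dims)."""
    if alpha <= 1:
        raise ValueError("the convolution coefficient must be greater than 1")
    if len(kernel_dims) != 4 or len(input_dims) != 4:
        raise ValueError("kernel and input must each have four dimensions")
    return alpha * prod(kernel_dims) * prod(input_dims)


# Hypothetical sizes: kernel (C=3, 3, 3, M=4) and input (N=1, W=8, C=3, H=8).
example = conv_complexity(2, (3, 3, 3, 4), (1, 8, 3, 8))
```

With these hypothetical sizes the kernel product is 108 and the input product is 192, so the complexity is 2 * 108 * 192 = 41472.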

The method according to claim 1, wherein, if the first operation is a matrix-multiply-matrix operation, the input data is a first matrix of the matrix-multiply-matrix operation and the weight data is a second matrix of the matrix-multiply-matrix operation; the first complexity = β * F * G * E * F, where β is a matrix coefficient with a value greater than or equal to 1, F and G are the row and column values of the first matrix, and E and F are the row and column values of the second matrix; if the first complexity is greater than a set threshold, it is determined whether the first matrix and the second matrix are floating-point data; if the first matrix and the second matrix are not floating-point data, the first matrix is converted into floating-point data, the second matrix is converted into floating-point data, and the matrix-multiply-matrix operation is then performed on the first matrix and the second matrix in the floating-point data type.

The method according to claim 1, wherein, if the first operation is a matrix-multiply-vector operation, the input data is a first matrix of the matrix-multiply-vector operation and the weight data is a vector of the matrix-multiply-vector operation; the first complexity = β * F * G * F, where β is a matrix coefficient with a value greater than or equal to 1, F and G are the row and column values of the first matrix, and F is the column value of the vector; if the first complexity is greater than a set threshold, it is determined whether the first matrix and the vector are floating-point data; if the first matrix and the vector are not floating-point data, the first matrix is converted into floating-point data, the vector is converted into floating-point data, and the matrix-multiply-vector operation is then performed on the first matrix and the vector in the floating-point data type.
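The two matrix complexity formulas in the preceding claims can be sketched in the same way. This is a minimal illustration under stated assumptions: the function names `matmul_complexity` and `matvec_complexity` are hypothetical, and each function multiplies β by the row and column values named in the corresponding claim without checking that the shapes are compatible for multiplication.

```python
def _check_beta(beta):
    if beta < 1:
        raise ValueError("the matrix coefficient must be greater than or equal to 1")


def matmul_complexity(beta, first_shape, second_shape):
    """beta * rows1 * cols1 * rows2 * cols2 for the matrix-multiply-matrix case."""
    _check_beta(beta)
    rows1, cols1 = first_shape
    rows2, cols2 = second_shape
    return beta * rows1 * cols1 * rows2 * cols2


def matvec_complexity(beta, matrix_shape, vector_len):
    """beta * rows * cols * vector length for the matrix-multiply-vector case."""
    _check_beta(beta)
    rows, cols = matrix_shape
    return beta * rows * cols * vector_len
```

For a hypothetical 2x3 first matrix and 3x2 second matrix with β = 1, `matmul_complexity(1, (2, 3), (3, 2))` gives 1 * 2 * 3 * 3 * 2 = 36.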

The method according to any one of claims 1 to 6, wherein the i-th layer further includes one or any combination of a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.

An integrated circuit chip device for performing a forward operation of a neural network, the neural network comprising multiple layers, wherein the device comprises a processing circuit and an external interface; the external interface is configured to receive a first calculation instruction; the processing circuit is configured to parse the first calculation instruction to obtain a first operation of the first calculation instruction included in the i-th layer of the forward operation, together with input data and weight data corresponding to the first calculation instruction, where i is an integer greater than or equal to 1 and, if i is greater than or equal to 2, the input data is the output data of the (i-1)-th layer; the processing circuit is further configured to determine a first complexity of the first operation according to the input data, the weight data, and the first operation, and to determine, according to the first complexity, a first data type of the input data and the weight data when performing the first operation, the first data type comprising a floating-point type or a fixed-point type; and the processing circuit is further configured to perform, on the input data and the weight data in the first data type, the first operation included in the i-th layer of the forward operation.

The integrated circuit chip device according to claim 8, wherein the processing circuit is specifically configured to compare the first complexity with a preset threshold; if the first complexity is higher than the preset threshold, the computing device determines that the first data type is the fixed-point type; and if the first complexity is lower than or equal to the preset threshold, the computing device determines that the first data type is the floating-point type.

The integrated circuit chip device according to claim 9, wherein the integrated circuit chip device further comprises a data type conversion circuit; the processing circuit is further configured to determine a second data type to which the input data and the weight data belong and, if the second data type differs from the first data type, to send a conversion command to the data type conversion circuit; and the data type conversion circuit is configured to convert, according to the conversion command, the input data belonging to the second data type and the weight data belonging to the second data type into input data belonging to the first data type and weight data belonging to the first data type.

The integrated circuit chip device according to claim 8, wherein, if the first operation is a convolution operation, the input data is convolution input data and the weight data is a convolution kernel; the processing circuit is configured to calculate the first complexity, the first complexity = α * C * kW * kW * M * N * W * C * H, where α is a convolution coefficient with a value greater than 1, C, kW, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data; the processing circuit is further configured to determine, if the first complexity is greater than a set threshold, whether the convolution input data and the convolution kernel are floating-point data and, if the convolution input data and the convolution kernel are not floating-point data, to convert the convolution input data into floating-point data, convert the convolution kernel into floating-point data, and then perform the convolution operation on the convolution input data and the convolution kernel in the floating-point data type.

The integrated circuit chip device according to claim 8, wherein, if the first operation is a matrix-multiply-matrix operation, the input data is a first matrix of the matrix-multiply-matrix operation and the weight data is a second matrix of the matrix-multiply-matrix operation; the processing circuit is configured to calculate the first complexity, the first complexity = β * F * G * E * F, where β is a matrix coefficient with a value greater than or equal to 1, F and G are the row and column values of the first matrix, and E and F are the row and column values of the second matrix; the processing circuit is further configured to determine, if the first complexity is greater than a set threshold, whether the first matrix and the second matrix are floating-point data and, if the first matrix and the second matrix are not floating-point data, to convert the first matrix into floating-point data, convert the second matrix into floating-point data, and then perform the matrix-multiply-matrix operation on the first matrix and the second matrix in the floating-point data type.

The integrated circuit chip device according to claim 8, wherein, if the first operation is a matrix-multiply-vector operation, the input data is a first matrix of the matrix-multiply-vector operation and the weight data is a vector of the matrix-multiply-vector operation; the processing circuit is configured to calculate the first complexity, the first complexity = β * F * G * F, where β is a matrix coefficient with a value greater than or equal to 1, F and G are the row and column values of the first matrix, and F is the column value of the vector; the processing circuit is further configured to determine, if the first complexity is greater than a set threshold, whether the first matrix and the vector are floating-point data and, if the first matrix and the vector are not floating-point data, to convert the first matrix into floating-point data, convert the vector into floating-point data, and then perform the matrix-multiply-vector operation on the first matrix and the vector in the floating-point data type.

The integrated circuit chip device according to any one of claims 8-13, wherein the i-th layer further includes one or any combination of a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.

A neural network computing device, wherein the neural network computing device includes one or more integrated circuit chip devices according to any one of claims 8-14.

A combined processing device, wherein the combined processing device includes: the neural network computing device according to claim 15, a universal interconnection interface, and a universal processing device; the neural network computing device is connected to the universal processing device through the universal interconnection interface.

A chip, wherein the chip integrates the device according to any one of claims 8-14.

An electronic device, wherein the electronic device includes the chip according to claim 17.