KR102372869B1

KR102372869B1 - Matrix operator and matrix operation method for artificial neural network

Info

Publication number: KR102372869B1
Application number: KR1020190092932A
Authority: KR
Inventors: 정기석; 박상수
Original assignee: 한양대학교 산학협력단
Priority date: 2019-07-31
Filing date: 2019-07-31
Publication date: 2022-03-08
Also published as: KR20210014897A; WO2021020848A2; WO2021020848A3

Abstract

The present invention provides a matrix operator and a matrix operation method for an artificial neural network. The matrix operator includes a first buffer that receives and stores a first matrix, which is the multiplicand matrix; a second buffer that receives and stores a second matrix, which is the multiplier matrix by which the first matrix is multiplied; and an operation unit. The operation unit receives a plurality of elements selected sequentially, column by column, from the first matrix and, corresponding to the selected column of the first matrix, a plurality of elements selected sequentially, row by row, from the second matrix. It multiplies each element of the selected column of the first matrix by every element of the selected row of the second matrix, and cumulatively adds the multiplication results for the sequentially selected columns of the first matrix and rows of the second matrix to obtain a result matrix, which is the matrix product of the first matrix and the second matrix. This can increase computational efficiency and speed and reduce power consumption.

Description

MATRIX OPERATOR AND MATRIX OPERATION METHOD FOR ARTIFICIAL NEURAL NETWORK

The present invention relates to an artificial neural network module and a scheduling method thereof, and more particularly to an artificial neural network module for high-efficiency computation and a scheduling method thereof.

Recently, artificial neural networks, which mimic the way the human brain recognizes patterns and process information in a brain-like manner, have been applied in a variety of fields.

Such artificial neural networks require training on massive amounts of data, and this process involves a large number of addition and multiplication operations. A chip that performs computation for an artificial neural network must therefore include many arithmetic circuits such as MAC (multiply-accumulate) operators.

Accordingly, a new class of hardware accelerators specialized for deep learning with artificial neural networks has recently attracted considerable attention. Deep learning accelerators have been proposed in different forms depending on the usage environment and purpose. For example, GPUs (Graphics Processing Units) are mainly used in performance-oriented servers and workstations, whereas edge devices such as smartphones, which prioritize low power, mainly use dedicated hardware designed with an FPGA (Field Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit), that is, an NPU (Neural Processing Unit).

However, because they are dedicated hardware, many of the accelerators released to date lack the flexibility to handle the various kinds of layers and tensors used in different artificial neural networks. This makes it difficult for them to support the wide range of deep learning applications and models currently in use.

Meanwhile, using a large number of arithmetic units in a variable manner complicates the control circuit. As a result, some arithmetic units go unused and remain idle while the operations of the artificial neural network are performed, which causes inefficiency and may consume additional, unnecessary power.

Korean Patent Publication No. 10-2019-0055447 (published May 23, 2019)

An object of the present invention is to provide a matrix operator and a matrix operation method that can increase computational efficiency and reduce power consumption by performing multiplication and addition operations simultaneously in parallel using a pipeline technique.

To achieve the above object, a matrix operator for an artificial neural network according to an embodiment of the present invention includes: a first buffer that receives and stores a first matrix, which is the multiplicand matrix; a second buffer that receives and stores a second matrix, which is the multiplier matrix by which the first matrix is multiplied; and an operation unit that receives a plurality of elements selected sequentially, column by column, from the first matrix, receives a plurality of elements selected sequentially, row by row, from the second matrix in correspondence with the selected column of the first matrix, multiplies each element of the selected column of the first matrix by every element of the selected row of the second matrix, and cumulatively adds the multiplication results for the sequentially selected columns of the first matrix and rows of the second matrix to obtain a result matrix, which is the matrix product of the first matrix and the second matrix.

The operation unit may include a plurality of arithmetic processing lanes, each of which receives a corresponding one of the elements of the selected column of the first matrix together with all elements of the selected row of the second matrix, multiplies them, and adds each element-wise product to the accumulated value of the previous products to obtain the elements of one row of a partial accumulation matrix.

While adding an element-wise product to the accumulated value of the previous products, each of the arithmetic processing lanes may receive the element of the column of the first matrix and all elements of the row of the second matrix that are selected next in a predetermined order, and perform the multiplication on them.

Each of the arithmetic processing lanes may include a plurality of process elements, and each process element may include: a multiplier that receives and multiplies the corresponding element of the selected column of the first matrix and a corresponding one of the elements of the selected row of the second matrix; an adder that updates the accumulated value by adding the product output from the multiplier to the accumulated value of the products of the previously applied elements; and an accumulation register that stores the accumulated value updated by the adder.

When the i-th column (where i is a natural number) of the first matrix is selected in the first buffer, the second buffer may select the i-th row of the second matrix.

The matrix operator may be implemented as an artificial neural network module that performs the operation designated for at least one of the layers of an artificial neural network, in which case the first matrix may be a feature map applied to the at least one layer and the second matrix may be a kernel predetermined for the at least one layer.

To achieve the above object, a matrix operation method for an artificial neural network according to another embodiment of the present invention includes: receiving and storing a first matrix, which is the multiplicand matrix, and a second matrix, which is the multiplier matrix by which the first matrix is multiplied; receiving a plurality of elements selected sequentially, column by column, from the first matrix and a plurality of elements selected sequentially, row by row, from the second matrix in correspondence with the selected column of the first matrix; multiplying each element of the selected column of the first matrix by every element of the selected row of the second matrix; and cumulatively adding the multiplication results for the sequentially selected columns of the first matrix and rows of the second matrix to obtain a result matrix, which is the matrix product of the first matrix and the second matrix.

Accordingly, the matrix operator and matrix operation method for an artificial neural network according to embodiments of the present invention allow the multiplication and addition operations of the matrix multiplications performed in the layers of an artificial neural network to be carried out simultaneously in parallel, which increases computational efficiency and reduces power consumption. In particular, the pipeline technique lets the addition (accumulation) proceed while the next multiplication is being performed, which maximizes computational efficiency.
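The overlap of multiplication and accumulation can be illustrated with a small software model of a single multiply-accumulate pipeline. This is only a sketch under assumptions that the text does not state: the function name `pipelined_mac` is hypothetical, and each multiply and each add is assumed to take exactly one cycle.

```python
def pipelined_mac(a_values, b_values):
    """Model a two-stage multiply/accumulate pipeline.

    Assumption (not from the patent text): the multiply stage and the
    add stage each take one cycle and can work in the same cycle.
    Returns (accumulated sum, number of cycles used).
    """
    pairs = list(zip(a_values, b_values))
    acc = 0
    pending = None   # product latched between the multiply and add stages
    idx = 0
    cycles = 0
    while idx < len(pairs) or pending is not None:
        # Add stage: accumulate the product produced in the previous cycle.
        if pending is not None:
            acc += pending
        # Multiply stage: in the same cycle, multiply the next pair.
        if idx < len(pairs):
            a, b = pairs[idx]
            pending = a * b
            idx += 1
        else:
            pending = None
        cycles += 1
    return acc, cycles


acc, cycles = pipelined_mac([1, 2, 3], [4, 5, 6])
```

In this idealized model, k element pairs finish in k + 1 cycles, because every addition except the last is hidden behind the following multiplication.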

FIG. 1 shows the general structure of an example of an artificial neural network.
FIG. 2 illustrates a general matrix multiplication algorithm.
FIG. 3 shows the number of multiplication and addition operations required by the matrix multiplication algorithm of FIG. 2.
FIG. 4 shows a schematic structure of a matrix operator according to an embodiment of the present invention.
FIG. 5 shows a detailed configuration of the arithmetic processing lane of FIG. 4.
FIG. 6 shows a detailed configuration of the process element of FIG. 5.
FIG. 7 shows a matrix multiplication algorithm according to an embodiment of the present invention.
FIG. 8 shows a matrix operation method according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the content described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described. To describe the present invention clearly, parts irrelevant to the description are omitted, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", "module", and "block" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 shows the general structure of an example of an artificial neural network.

FIG. 1 illustrates a convolutional neural network (hereinafter CNN) as a representative example of an artificial neural network. In particular, it shows the general structure of LeNet-5, a representative convolutional neural network used for optical character recognition and developed for recognizing postal codes on mail and for digit recognition.

LeNet-5 receives, for example, a 32 × 32 input image (Input), repeatedly performs convolution and sub-sampling operations to extract feature maps (f.map), and, based on the feature values extracted from the feature maps, selects the value corresponding to the most probable of the predetermined classes.

Because LeNet-5 is a neural network developed for digit recognition, it can, for example, classify the feature values into the digits 0 to 9 and select one of those digits as the result.

LeNet-5 is described below as an example, but other convolutional neural networks and artificial neural networks are fundamentally similar, and the concept of the present invention is not limited to LeNet-5 and can be applied to a variety of artificial neural networks.

In FIG. 1, C denotes a convolution layer, S a sub-sampling layer, and FC a fully-connected layer, and the number following C, S, or FC is the layer index. That is, LeNet-5 includes three convolution layers (C1, C2, C3), two sub-sampling layers (S1, S2), and two fully-connected layers (FC1, FC2).

In a CNN such as that shown in FIG. 1, each of the convolution layers (C1, C2, C3) receives a feature map (f.map) (or the input image (Input)) and performs a convolution operation between the received feature map (or input image) and the kernel predetermined for that layer. A convolution operation consists of many multiplication and addition operations. Multiplication operations are also performed in the fully-connected layers (FC1, FC2).

In other words, the CNN shown in FIG. 1 must perform a large number of multiplication and addition operations. Artificial neural networks other than CNNs are likewise built fundamentally on multiplication and addition operations.

In general, an artificial neural network converts the feature map (f.map) (or input image (Input)) and the kernel into matrices according to a predetermined algorithm such as im2col (image blocks into columns), and performs matrix multiplication and matrix addition on them to improve computation speed.
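As a rough illustration of how an im2col-style transformation turns a convolution into a matrix product, the following sketch unrolls image patches into the rows of a matrix and multiplies them by a flattened kernel. The function names, the stride of 1, the absence of padding, and the patch-per-row layout are all assumptions made for this example; the text does not specify them.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one row.

    Assumptions for this sketch: stride 1, no padding, one patch per row.
    """
    h, w = len(image), len(image[0])
    rows = []
    for top in range(h - kh + 1):
        for left in range(w - kw + 1):
            rows.append([image[top + dy][left + dx]
                         for dy in range(kh) for dx in range(kw)])
    return rows


def conv_as_matmul(image, kernel):
    """Compute a convolution (without kernel flip) as patch matrix x kernel vector."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)          # feature map as a matrix
    return [sum(p * q for p, q in zip(patch, flat_kernel))
            for patch in patches]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv_as_matmul(image, kernel)   # one output value per patch
```

Here the patch matrix plays the role of the feature-map matrix and the flattened kernel plays the role of the kernel matrix, so the whole convolution reduces to multiplications followed by additions.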

As is well known, matrix multiplication multiplies designated elements of each matrix and then adds all the products to obtain the result. The element-wise multiplications can be carried out in parallel as a single multiplication step, whereas adding the element-wise products requires repeated operations, the number of which depends on how many values must be added.

FIG. 2 illustrates a general matrix multiplication method, and FIG. 3 shows the number of multiplication and addition operations required by the matrix multiplication method of FIG. 2.

As shown in FIG. 2, matrix multiplication multiplies the elements of a row (#1, #2, #3, ..., #m) of the m × k matrix A by the corresponding elements of a column (&1, &2, &3, ..., &n) of the k × n matrix B and adds the products to obtain one element (#, &) of the matrix C, which is the product of A and B.

Here the matrix A is called the multiplicand matrix and the matrix B is called the multiplier matrix.
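The conventional row-times-column scheme can be sketched in a few lines. This is a plain software illustration rather than the hardware of FIG. 2; the function name `matmul_inner_product` and the sample matrices are chosen for this example only.

```python
def matmul_inner_product(A, B):
    """Conventional matrix product: each C[i][j] is row i of A times column j of B."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):                 # row #i of the multiplicand matrix A
        for j in range(n):             # column &j of the multiplier matrix B
            # k element-wise products, which must then all be summed
            products = [A[i][p] * B[p][j] for p in range(k)]
            C[i][j] = sum(products)
    return C


A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
C = matmul_inner_product(A, B)
```

Note that every output element needs its own reduction over k products, which is the step the following paragraphs identify as the bottleneck.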

Referring to FIG. 3, consider the operation for obtaining the element (#1, &1) in row 1, column 1 of the matrix C. Following the usual rule of matrix multiplication, the multiplicand matrix A is read row by row and the multiplier matrix B is read column by column; the row read from A and the column read from B are multiplied element by element, and all the products are added.

For example, the elements of the first row (#1) of A are first multiplied by the corresponding elements of the first column (&1) of B. Here k is assumed to be 16 as an example. If the matrix operator has 16 process elements (Processing Element; hereinafter PE), the 16 elements of the first row (#1) of A and the 16 elements of the first column (&1) of B can be multiplied simultaneously in parallel. That is, multiplying the elements of the first row (#1) of A by the elements of the first column (&1) of B takes only one parallel operation.

However, the sum of the 16 products obtained from this element-wise multiplication cannot be computed in a single operation. In general, an arithmetic element such as a process element (PE) is configured to receive two inputs and perform a multiplication or an addition. Therefore, as shown in FIG. 3, the 16 products must first be added in pairs, and the resulting sums must again be added in pairs, repeatedly. This means that four rounds of addition are needed for 16 products, so the addition of the products takes far longer than the element-wise multiplication itself.

Moreover, because the element-wise products are what must be added, the multiplication and the additions are performed sequentially; the multiplication and addition operations must be carried out as separate steps. Consequently, as shown in FIG. 3, multiplying 16 element pairs requires a total of five operations: one multiplication step followed by four addition steps.
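The step count for k = 16 can be checked with a small model of pairwise (two-input) addition. The function name `tree_reduce` and the sample values are made up for this sketch; only the two-input adder constraint and k = 16 come from the text.

```python
def tree_reduce(values):
    """Sum a non-empty list using only two-input additions.

    Returns (total, number of addition rounds), where all additions
    within one round are assumed to run in parallel.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Add neighbouring pairs; an odd leftover is carried to the next round.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


row = list(range(1, 17))       # 16 sample elements of a row of A
col = [2] * 16                 # 16 sample elements of a column of B
products = [a * b for a, b in zip(row, col)]   # one parallel multiply step
total, add_rounds = tree_reduce(products)
total_steps = 1 + add_rounds   # multiply step plus the addition rounds
```

For 16 products the reduction goes 16, 8, 4, 2, 1, giving four addition rounds and five steps in total, matching the count stated above.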

This is the computation needed to obtain the value of a single element (for example, c_0,0) of the matrix C. Assuming there are m arithmetic processing lanes, each containing 16 process elements (PE), the computation must be repeated for the whole matrix C a number of times proportional to the sizes of the matrices A and B, so n × 5 operations are required in total.

To accelerate matrix computation, it is therefore very important to reduce the time spent on the addition operations.

FIG. 4 shows a schematic structure of a matrix operator according to an embodiment of the present invention, FIG. 5 shows a detailed configuration of the arithmetic processing lane of FIG. 4, and FIG. 6 shows a detailed configuration of the process element of FIG. 5.

Referring to FIGS. 4 to 6, the matrix operator 100 according to this embodiment includes an operation control unit 110, a first buffer unit 120, a second buffer unit 130, and an operation unit 140, and, as described above, may be used as a module of an artificial neural network.

The operation control unit 110 receives the matrices on which operations are to be performed in each layer of the artificial neural network. These matrices may be at least one feature map (or input image) applied to a layer and at least one kernel designated for each layer.

The operation control unit 110 selects, from the received matrices, the two matrices to be operated on and transfers them, together with an operation command, to the first buffer unit 120 and the second buffer unit 130. The matrix A, the multiplicand matrix for at least one feature map applied to a layer of the artificial neural network, is applied to the first buffer unit 120, and the matrix B, the multiplier matrix for the kernel designated for that layer, is applied to the second buffer unit 130.

The operation control unit 110 may also receive the result of the matrix operation from the operation unit 140 and transfer it to a memory (not shown) or the like for storage.

The operation control unit 110 applies the selected matrices together with an operation command so that the elements of each matrix can be selected, and the matrix multiplication performed, according to the matrix multiplication algorithm of this embodiment described below.

According to the operation command, the first buffer unit 120 selects elements column by column from the matrix A applied by the operation control unit 110 and transfers them to the operation unit 140. Likewise, the second buffer unit 130 selects elements row by row from the applied matrix B according to the operation command and transfers them to the operation unit 140.

The operation unit 140 may include a plurality of arithmetic processing lanes (SIMDL). As shown in FIG. 5, each arithmetic processing lane (SIMDL) may include a plurality of process elements (PE) and a SIMD unit (SIMDU).

To increase computational efficiency, recent matrix operators commonly use the SIMD (Single Instruction Multiple Data) technique, which allows complex operations to be processed in a batch with a single instruction. In the SIMD technique, a plurality of process elements (PE) apply the same (or a similar) operation to multiple data items and process them simultaneously; it is used mainly in vector processors.

In the SIMD technique, a number of instruction sets, each able to process multiple data with a single instruction, are stored in order to maximize instruction efficiency. Each stored instruction set causes the plurality of process elements (PE) to perform operations simultaneously in parallel using data-level parallelism (DLP). That is, the SIMD unit (SIMDU) causes the plurality of process elements (PE) to perform the same designated operation in parallel on the elements supplied row by row or column by column from the first and second buffer units 120 and 130.

The SIMD unit (SIMDU) may be implemented as hardware within the operation unit 140, or as software that specifies the operations performed by the operation unit 140. In some cases it may also be implemented within the operation control unit 110.

Each of the process elements (PE) receives, from the first and second buffer units 120 and 130, those elements (a, b) of the matrices A and B that are to be operated on together, and performs a multiplication or an addition. Each process element (PE) may be implemented as a MAC (multiply-accumulate) operator.

Referring to FIG. 6, each process element (PE) may include a multiplier (MUL), an adder (ADD), and an accumulation register (ACC).

The multiplier (MUL) multiplies the element (a) of the matrix A applied from the first buffer unit 120 by the element (b) of the matrix B applied from the second buffer unit 130 and outputs the product to the adder (ADD). The adder (ADD) adds the output of the multiplier (MUL) to the previously computed accumulated partial sum stored in the accumulation register (ACC), thereby updating the accumulated partial sum. The accumulation register (ACC) stores the updated accumulated partial sum output from the adder (ADD).
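The behaviour of one process element can be modelled in software as a multiply-accumulate cell. The class and method names below are hypothetical, and the register is assumed to start at zero; only the MUL, ADD, and ACC roles come from the description above.

```python
class ProcessElement:
    """Software model of one PE: multiplier, adder and accumulation register."""

    def __init__(self):
        self.acc = 0                   # accumulation register ACC (assumed to reset to 0)

    def step(self, a, b):
        product = a * b                # multiplier MUL
        self.acc = self.acc + product  # adder ADD updates the partial sum held in ACC
        return self.acc


pe = ProcessElement()
for a, b in [(1, 5), (2, 7)]:          # successive (a, b) element pairs
    pe.step(a, b)
```

After the two steps the register holds 1*5 + 2*7, the running partial sum of the products fed to the element so far.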

FIG. 7 shows a matrix multiplication algorithm according to an embodiment of the present invention.

The matrix multiplication algorithm of FIG. 7 is described below with reference to FIGS. 4 to 6.

In a general matrix multiplication operation, as shown in FIG. 2, elements are selected in units of columns (&1, &2, ..., &k) from the A matrix, which is the multiplicand matrix, and in units of columns (&1, &2, ..., &n) from the B matrix, which is the multiplier matrix; corresponding elements among the selected elements are multiplied together, and the products are then all added to complete the multiplication operation.

In contrast, in the matrix multiplication method according to the present embodiment shown in FIG. 7, the first buffer 120 selects elements in units of columns (&1, &2, ..., &k) from the A matrix, which is the multiplicand matrix, the second buffer 130 selects elements in units of rows (#1, #2, ..., #k) from the B matrix, which is the multiplier matrix, and the selected elements are transmitted to the operation unit 140.

Each of the plurality of operation processing lanes (SIMDL) of the operation unit 140 receives corresponding elements of the A matrix and the B matrix, multiplies them together, and cumulatively adds the products.

In particular, in the present embodiment, each of the plurality of operation processing lanes (SIMDL) multiplies one element among the elements selected in units of columns (&1, &2, ..., &k) from the A matrix by the plurality of elements selected in units of rows (#1, #2, ..., #k) from the B matrix.

For example, when the first column (&1) of the A matrix and the first row (#1) of the B matrix are selected, the first operation processing lane among the plurality of operation processing lanes (SIMDL) receives the a element (a_{0,0}) in the first row of the first column (&1) of the A matrix and the b elements (b_{0,0}, b_{0,1}, ..., b_{0,n}) of the first row (#1) of the B matrix, and multiplies the a element (a_{0,0}) by each of the b elements (b_{0,0}, b_{0,1}, ..., b_{0,n}).

At this time, in each of the plurality of process elements (PE) of the first operation processing lane, the multiplier (MUL) receives the a element (a_{0,0}) and the corresponding one of the b elements (b_{0,0}, b_{0,1}, ..., b_{0,n}), multiplies them, and passes the product to the adder (ADD). Since there is no previously computed multiplication result, that is, no accumulated value previously stored in the accumulation register (ACC), the adder (ADD) passes the output of the multiplier (MUL) unchanged to the accumulation register (ACC), where it is stored.

That is, the first operation processing lane obtains the products of the element (a_{0,0}) in the first row and first column of the A matrix and the elements (b_{0,0}, b_{0,1}, ..., b_{0,n}) of the first row (#1) of the B matrix as the element values (c^0_{0,0}, c^0_{0,1}, ..., c^0_{0,n}) of the first row (#1) of the first cumulative matrix (C^0). The obtained element values (c^0_{0,0}, c^0_{0,1}, ..., c^0_{0,n}) of the first row (#1) are stored in the accumulation registers (ACC) of the respective process elements (PE).

Meanwhile, the second operation processing lane receives the element (a_{1,0}) in the second row of the first column (&1) of the A matrix and the b elements (b_{0,0}, b_{0,1}, ..., b_{0,n}) of the first row (#1) of the B matrix, multiplies the a element (a_{1,0}) by each of the b elements (b_{0,0}, b_{0,1}, ..., b_{0,n}), and obtains and stores the products as the element values (c^0_{1,0}, c^0_{1,1}, ..., c^0_{1,n}) of the second row (#2) of the first cumulative matrix (C^0).

In the same manner, the m-th operation processing lane multiplies the a element (a_{m,0}) in the m-th row of the first column (&1) of the A matrix by the b elements (b_{0,0}, b_{0,1}, ..., b_{0,n}) of the first row (#1) of the B matrix, and obtains and stores the products as the element values (c^0_{m,0}, c^0_{m,1}, ..., c^0_{m,n}) of the m-th row (#m) of the first cumulative matrix (C^0).

That is, the plurality of operation processing lanes (SIMDL) of the operation unit 140 multiply all elements of the first column (&1) of the A matrix by all elements of the first row (#1) of the B matrix simultaneously, in a single operation.
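The first step described above amounts to an outer product of the first column of A and the first row of B. The following sketch illustrates this on small example matrices; the function name `outer_step` and the sample values are assumptions made for illustration, and the nested loops run sequentially in software whereas the description has the lanes and process elements operate in parallel.

```python
def outer_step(a_col, b_row):
    # Lane p holds a_col[p]; the PEs of that lane multiply it by
    # every element of b_row, giving row p of the cumulative matrix.
    return [[a * b for b in b_row] for a in a_col]


# Example matrices (illustrative values): A is 3 x 2, B is 2 x 3.
A = [[1, 2],
     [3, 4],
     [5, 6]]
B = [[7, 8, 9],
     [10, 11, 12]]

a_col0 = [row[0] for row in A]  # first column of A: [1, 3, 5]
b_row0 = B[0]                   # first row of B: [7, 8, 9]

C0 = outer_step(a_col0, b_row0)
print(C0)  # [[7, 8, 9], [21, 24, 27], [35, 40, 45]]
```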

Thereafter, the first buffer 120 delivers the a elements (a_{0,1}, a_{1,1}, ..., a_{m,1}) of the second column (&2) of the A matrix to the respective corresponding operation processing lanes among the plurality of operation processing lanes (SIMDL), and the second buffer 130 delivers the b elements (b_{1,0}, b_{1,1}, ..., b_{1,n}) of the second row (#2) of the B matrix to each of the plurality of operation processing lanes (SIMDL).

Accordingly, in each process element (PE) of each operation processing lane (SIMDL), the multiplier (MUL) multiplies the supplied a element of the second column (&2) (one of a_{0,1}, a_{1,1}, ..., a_{m,1}) by the corresponding b element among the b elements (b_{1,0}, b_{1,1}, ..., b_{1,n}) of the second row (#2). The adder (ADD) then adds the product output from the multiplier (MUL) to the previously obtained accumulated value (c^0_{0,0}, c^0_{0,1}, ..., c^0_{0,n}) stored in the accumulation register (ACC) to obtain the element values (c^1_{0,0}, c^1_{0,1}, ..., c^1_{0,n}) of the first row (#1) of the second cumulative matrix (C^1), and the obtained element values (c^1_{0,0}, c^1_{0,1}, ..., c^1_{0,n}) are stored back in the accumulation register (ACC).

Similarly, the second operation processing lane receives the element (a_{1,1}) in the second row of the second column (&2) of the A matrix and the elements (b_{1,0}, b_{1,1}, ..., b_{1,n}) of the second row (#2) of the B matrix, multiplies the a element (a_{1,1}) by each of the elements (b_{1,0}, b_{1,1}, ..., b_{1,n}), adds the products to the accumulated values (c^0_{1,0}, c^0_{1,1}, ..., c^0_{1,n}) stored in the accumulation registers (ACC), and obtains and stores the sums as the element values (c^1_{1,0}, c^1_{1,1}, ..., c^1_{1,n}) of the second row (#2) of the second cumulative matrix (C^1).

The m-th operation processing lane multiplies the a element (a_{m,1}) in the m-th row of the second column (&2) of the A matrix by the b elements (b_{1,0}, b_{1,1}, ..., b_{1,n}) of the second row (#2) of the B matrix, adds the accumulated values (c^0_{m,0}, c^0_{m,1}, ..., c^0_{m,n}), and obtains and stores the sums as the element values (c^1_{m,0}, c^1_{m,1}, ..., c^1_{m,n}) of the m-th row (#m) of the second cumulative matrix (C^1).

In this way, the operation unit 140 sequentially receives the elements up to the k-th column (&k) of the A matrix and the k-th row (#k) of the B matrix, multiplies the supplied elements, and adds the products to the previously computed accumulated values, finally obtaining the k-th cumulative matrix (C^k), which is the matrix multiplication result of the A matrix and the B matrix.
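The accumulation over all columns of A and rows of B can be sketched as a sum of outer products. This is a minimal software illustration under stated assumptions: the function name `matmul_outer` and the example matrices are introduced here, indices are 0-based, and the loops over lanes and process elements are sequential stand-ins for what the description performs in parallel.

```python
def matmul_outer(A, B):
    m, k, n = len(A), len(B), len(B[0])
    # Accumulation registers, one per (lane p, PE q); assumed to start at zero.
    C = [[0] * n for _ in range(m)]
    for i in range(k):            # i-th column of A paired with i-th row of B
        for p in range(m):        # lane p receives a single element A[p][i]
            a = A[p][i]
            for q in range(n):    # PE q of lane p receives B[i][q]
                C[p][q] += a * B[i][q]   # multiply, then accumulate
    return C


A = [[1, 2],
     [3, 4],
     [5, 6]]
B = [[7, 8, 9],
     [10, 11, 12]]

C = matmul_outer(A, B)
print(C)  # [[27, 30, 33], [61, 68, 75], [95, 106, 117]]
```

After the last iteration the accumulated values equal the ordinary matrix product of A and B.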

To describe the above matrix multiplication algorithm again from the viewpoint of each of the process elements (PE) of the operation processing lanes (SIMDL): each of the process elements (PE) of the p-th SIMD lane sequentially multiplies each of the a elements (a_{p,0}, a_{p,1}, ..., a_{p,k}) of the corresponding p-th row of the A matrix by the b elements (b_{0,p}, b_{1,p}, ..., b_{k,p}) of the corresponding p-th column of the B matrix, and adds each product to the accumulated value of the previous products, thereby obtaining the values of the elements (c^k_{p,0}, c^k_{p,1}, ..., c^k_{p,n}) of the p-th row of the k-th cumulative matrix (C^k).
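The per-element accumulation can be written as a recurrence. The following is a sketch in the document's own index convention (indices running from 0 to k); the index q, denoting the column handled by a given process element within lane p, and the initial value c^{-1}_{p,q} = 0 are labels introduced here for illustration.

```latex
c^{i}_{p,q} = c^{i-1}_{p,q} + a_{p,i}\, b_{i,q},
\qquad c^{-1}_{p,q} = 0
\quad\Longrightarrow\quad
c^{k}_{p,q} = \sum_{i=0}^{k} a_{p,i}\, b_{i,q}
```

Unrolling the recurrence gives the usual inner-product definition of one element of the product matrix, which is why the final cumulative matrix equals the matrix product.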

The calculation performed by each process element (PE) is the same as the process of calculating one element (#, &) of the C matrix in the general matrix multiplication operation shown in FIG. 2. In the algorithm of FIG. 2, however, the operation unit must simultaneously receive the elements of the A matrix and the B matrix required to calculate one element (#, &) of the C matrix, multiply them, and then repeatedly perform addition on the multiplication results. Consequently, as shown in FIG. 3, each time an element (#, &) of the C matrix is obtained, one multiplication operation must be followed by multiple addition operations.

In the matrix multiplication algorithm according to the present embodiment shown in FIG. 7, however, the sequentially accumulated cumulative matrices (C^0, C^1, ..., C^k) are obtained, so the C matrix, which is the product of the A matrix and the B matrix, can be obtained with only k multiplication operations and k addition operations.

In particular, in each of the process elements (PE) of the operation processing lanes (SIMDL), while the adder (ADD) adds the product previously output from the multiplier (MUL) to the accumulated value stored in the accumulation register (ACC), the multiplier (MUL) can receive the a element and the b element to be multiplied next and perform the multiplication. That is, the multiplier (MUL) and the adder (ADD) can operate simultaneously in a pipelined manner.

This means that obtaining the C matrix, which is the product of the A matrix and the B matrix, takes a computation time of k+1 rather than 2k.
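The k+1 figure can be illustrated with a small cycle-count simulation of a two-stage multiply/add pipeline. This is a simplified sketch, assuming each stage takes exactly one cycle; the function name `pipelined_mac` and the variable `pending` (standing in for a register between MUL and ADD) are hypothetical and not taken from the description.

```python
def pipelined_mac(pairs):
    """Return (accumulated sum, cycles) for a 2-stage MUL/ADD pipeline."""
    acc = 0          # accumulation register (ACC)
    pending = None   # product produced by MUL in the previous cycle
    cycles = 0
    i = 0
    while i < len(pairs) or pending is not None:
        # ADD stage: consume the product from the previous cycle, if any.
        if pending is not None:
            acc += pending
        # MUL stage: in the same cycle, multiply the next pair, if any.
        if i < len(pairs):
            a, b = pairs[i]
            pending = a * b
            i += 1
        else:
            pending = None
        cycles += 1
    return acc, cycles


pairs = [(1, 4), (2, 5), (3, 6)]       # k = 3 element pairs
acc, cycles = pipelined_mac(pairs)
print(acc, cycles)                     # 32 4  (k + 1 cycles)
print(2 * len(pairs))                  # 6     (2k cycles without overlap)
```

Only the final addition is not overlapped with a multiplication, which accounts for the single extra cycle.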

That is, the matrix multiplication time can be greatly reduced because almost no separate time is required for the addition operations.

FIG. 8 illustrates a matrix operation method according to an embodiment of the present invention.

The matrix operation method of FIG. 8 is described with reference to FIGS. 4 to 7. First, the two matrices to be multiplied are obtained (S10). One of the two matrices is the multiplicand matrix of size m × k, referred to as the A matrix, and the other is the multiplier matrix of size k × n, referred to as the B matrix. Here, the A matrix may be all or part of a feature map (f.map) (or input image (Input)) input to each layer of the artificial neural network, and the B matrix may be all or part of the kernel predetermined for each layer.

Once the two matrices to be operated on are obtained, the i-th column is selected from the A matrix, which is the multiplicand matrix (S20), and the i-th row is selected from the B matrix, which is the multiplier matrix (S30). Here, the initial value of i is 1, so the first column of the A matrix and the first row of the B matrix are selected first.

Then, each of the elements (a_{0,i}, a_{1,i}, ..., a_{m,i}) of the selected i-th column of the A matrix is multiplied by all of the elements (b_{i,0}, b_{i,1}, ..., b_{i,n}) of the selected i-th row of the B matrix to obtain m × n multiplication results (S40).

Once the m × n multiplication results are obtained, they are added to the previously obtained accumulated values (S50). If there are no previously obtained accumulated values, the multiplication results are taken as the initial accumulated values; if there are, the result of adding the multiplication results to the previously obtained accumulated values is stored as the updated accumulated values (S60).

Then, it is determined whether i is smaller than k, the number of columns of the A matrix (equivalently, the number of rows of the B matrix) (S70). If i is smaller than k (i < k), i is changed to i+1 (S80), and the column and row following the previously selected ones are selected from the A matrix and the B matrix (S20).

However, if i is greater than or equal to k, the cumulative matrix of size m × n composed of the stored accumulated values is output as the C matrix, which is the matrix multiplication result of the A matrix and the B matrix (S90).
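The flow of steps S10 to S90 can be sketched as a single function. This is an illustrative software rendering only: the function name `matrix_multiply_flow` and the list-of-lists representation are assumptions, the counter i starts at 1 as in the description (so list indices use i - 1), and k is assumed to be at least 1.

```python
def matrix_multiply_flow(A, B):
    # S10: obtain the m x k multiplicand A and the k x n multiplier B.
    m, k, n = len(A), len(B), len(B[0])
    acc = None                                   # no accumulated values yet
    i = 1                                        # initial value of i is 1
    while True:
        col = [A[p][i - 1] for p in range(m)]    # S20: i-th column of A
        row = B[i - 1]                           # S30: i-th row of B
        # S40: multiply each column element by every row element (m x n results).
        products = [[a * b for b in row] for a in col]
        # S50/S60: accumulate, or take the products as the initial values.
        if acc is None:
            acc = products
        else:
            acc = [[acc[p][q] + products[p][q] for q in range(n)]
                   for p in range(m)]
        if i < k:                                # S70: more columns/rows left?
            i += 1                               # S80
        else:
            return acc                           # S90: output the C matrix


A = [[1, 2],
     [3, 4],
     [5, 6]]
B = [[7, 8, 9],
     [10, 11, 12]]

result = matrix_multiply_flow(A, B)
print(result)  # [[27, 30, 33], [61, 68, 75], [95, 106, 117]]
```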

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

110: operation control unit
120: first buffer unit
130: second buffer unit
140: operation unit
SIMDL: operation processing lane
PE: process element
SIMDU: SIMD unit
MUL: multiplier
ADD: adder
ACC: accumulation register

Claims

A matrix operator comprising:
a first buffer for receiving and storing a first matrix that is a multiplicand matrix;
a second buffer for receiving and storing a second matrix that is a multiplier matrix by which the first matrix is multiplied; and
an operation unit that receives a plurality of elements sequentially selected in units of columns from the first matrix, receives a plurality of elements sequentially selected in units of rows from the second matrix in correspondence with the column selected in the first matrix, multiplies each element of the column selected in the first matrix by all elements of the row selected in the second matrix, and cumulatively adds the multiplication results between the sequentially selected columns of the first matrix and rows of the second matrix to obtain a result matrix that is the matrix multiplication result of the first matrix and the second matrix,
wherein the operation unit comprises a plurality of operation processing lanes, each of which receives a corresponding one of the elements of the column selected in the first matrix, multiplies it by all elements of the row selected in the second matrix, and adds the inter-element multiplication results to the accumulated values of previous multiplication results to obtain the elements of a row of a cumulative matrix,
wherein each of the plurality of operation processing lanes, while adding an inter-element multiplication result to the accumulated value of previous multiplication results, receives elements of a column of the first matrix and all elements of a row of the second matrix according to a predetermined order and performs a multiplication operation, and
wherein the second buffer selects the i-th row of the second matrix when the i-th column of the first matrix (where i is a natural number) is selected in the first buffer.

delete

2. The matrix operator of claim 1, wherein each of the plurality of operation processing lanes comprises a plurality of process elements, and each of the plurality of process elements comprises:
a multiplier that receives and multiplies a corresponding one element of the selected column of the first matrix and a corresponding one of the plurality of elements of the selected row of the second matrix;
an adder that updates the accumulated value by adding the multiplication result output from the multiplier to the accumulated value in which the multiplication results of previously supplied elements have been added; and
an accumulation register that stores the accumulated value updated by the adder.

delete

The matrix operator of claim 1, wherein the matrix operator is implemented as an artificial neural network module that performs a specified operation for at least one layer among a plurality of layers of an artificial neural network,
the first matrix is a feature map applied to the at least one layer, and the second matrix is a kernel predetermined for the at least one layer.

A matrix operation method comprising:
receiving and storing a first matrix that is a multiplicand matrix and a second matrix that is a multiplier matrix by which the first matrix is multiplied;
receiving a plurality of elements sequentially selected in units of columns from the first matrix and a plurality of elements sequentially selected in units of rows from the second matrix in correspondence with the column selected in the first matrix;
multiplying each element of the column selected in the first matrix by all elements of the row selected in the second matrix; and
cumulatively adding the multiplication results between the sequentially selected columns of the first matrix and rows of the second matrix to obtain a result matrix that is the matrix multiplication result of the first matrix and the second matrix,
wherein the obtaining of the result matrix obtains the elements of a partial cumulative matrix by adding the inter-element multiplication results obtained in the multiplying step to the accumulated values in which the previous corresponding inter-element multiplication results have been accumulated,
wherein, while an inter-element multiplication result is being added to the accumulated value of previous multiplication results, each of a plurality of operation processing lanes receives elements of a column of the first matrix and all elements of a row of the second matrix according to a predetermined order and performs a multiplication operation, and
wherein, in the receiving of the selected plurality of elements, when the i-th column of the first matrix (where i is a natural number) is selected, the plurality of elements of the i-th row of the second matrix are received.

delete