KR102726930B1

KR102726930B1 - Variable bit-precision multiplier-accumulator structure for deep neural network operation

Info

Publication number: KR102726930B1
Application number: KR1020210163890A
Authority: KR
Inventors: 신동엽; 임태범; 임용석
Original assignee: 한국전자기술연구원
Priority date: 2020-11-25
Filing date: 2021-11-25
Publication date: 2024-11-06

Abstract

The present invention relates to a variable bit-precision multiplier-accumulator that receives first data, second data, and third data of up to n bits, where n is the maximum data width that can be processed, and adds the third data to the product of the first data and the second data. The multiplier-accumulator includes an input processing unit that divides the first data, the second data, and the third data according to their bit-precision information, and a multiply-accumulate unit that determines the number of multiplication operations according to the number of bits of the divided first, second, and third data and multiplies the first data by the second data.

Description

{Variable bit-precision multiplier-accumulator structure for deep neural network operation}

The present invention relates to a variable bit-precision multiplier-accumulator structure for deep neural network operations, and more particularly to a variable-bit accumulator structure in which the precision and the accumulation results can be varied.

In general, a deep neural network consists of a very large number of operations, but each operation affects the accuracy of the network's result differently. Some operations can be performed at low bit precision with little effect on accuracy, whereas others cause a large drop in accuracy unless they are performed at high bit precision.

That is, the bit precision required differs from one neural network parameter to another. The range of data to be processed and the required bit precision also differ greatly between neural network models, such as CNN (Convolutional Neural Network) and NLP (Natural Language Processing) models. If the bit precision can be varied, neural network operations can therefore be performed more efficiently (less computation for the same accuracy) than when the same bit precision is used for every operation.

Because deep neural network operations consist of repeated multiply-accumulate operations between input data and neural network weights, a multiplier-accumulator that supports variable bit precision is essential for handling these different bit precisions. Most deep neural network hardware processes data with many multiplier-accumulators arranged in an array such as a systolic array, so a variable multiplier-accumulator structure designed with such array operation in mind can keep hardware efficiency as high as possible even when the bit precision is varied.

Through a very large number of operations, deep neural networks achieve high accuracy in applications across many fields. However, because they involve millions to billions of parameters and a correspondingly large amount of computation, hardware implementations consume a great deal of power and energy and have long processing times. The most effective way to reduce parameter storage and computation is to reduce the bit precision of the input data. Halving the bit precision of the input data can reduce the power and energy the hardware consumes for the same operation to 1/4. Inference in particular often runs on mobile or edge devices and requires lower bit precision than training, so various methods that greatly reduce bit precision are becoming widespread.
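The 1/4 figure is consistent with a common first-order model in which the cost of an array multiplier grows with its number of partial-product bits, n × n for n-bit operands. The Python sketch below is only that back-of-envelope model. The function names and the quadratic cost assumption are illustrative and are not taken from the patent, and real power figures also depend on adders, registers, and leakage.

```python
def partial_product_bits(bits: int) -> int:
    """Partial-product bits in a bits x bits array multiplier (assumed cost proxy)."""
    return bits * bits


def relative_cost(bits: int, max_bits: int = 16) -> float:
    """Cost of a bits-wide multiply relative to a max_bits-wide multiply,
    under the assumed quadratic model."""
    return partial_product_bits(bits) / partial_product_bits(max_bits)
```

Under this assumption, `relative_cost(8)` evaluates to 0.25 and `relative_cost(4)` to 0.0625 for a 16-bit maximum width.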

When the bit precision is lowered only as far as accuracy is not significantly degraded, the bit precision of the parameters ends up differing between neural network models and between layers within a network. For example, the VGG-16 network, which has high redundancy, tolerates lower bit precision than MobileNet, which is designed to be relatively compact, and even within the VGG-16 model some layers can be taken to lower bit precision while others cannot.

Because deep neural network operations consist of repeated multiply-accumulate operations between input data and neural network weights, a multiplier-accumulator that supports variable bit precision is essential for handling these different bit precisions.

A multiplier-accumulator that supports only a fixed bit precision can still handle a different bit precision by filling the unused bits with zeros, but hardware utilization is then very low during low bit-precision operations, and power efficiency also suffers, for example because of leakage power. A new multiplier-accumulator structure that can change bit precision efficiently is needed so that low bit-precision operation achieves hardware efficiency (e.g., throughput per unit area, power efficiency) similar to that of operation at the maximum bit precision.

In addition, most deep neural network hardware processes data with many multiplier-accumulators arranged in an array such as a systolic array.

As the bit precision changes, more filter operations can also be performed at once. However, if a multiplier-accumulator processes several multiply-accumulate operations but sums them all and can output only a single multiply-accumulate result, efficiency at the level of the whole array may drop. What is needed is therefore a new multiplier-accumulator structure that can vary the bit precision and, at the same time, adjust the number of results it produces with array processing in mind. Such a structure keeps the hardware utilization of each multiplier-accumulator high by processing more multiply-accumulate operations in parallel as the bit precision decreases, and it also raises the efficiency of the overall deep neural network hardware built from an array of multiplier-accumulators.

In view of the above needs, the problem to be solved by the present invention is to provide a variable bit-precision multiplier-accumulator structure that can be applied in common to various deep neural network operations requiring different bit precisions.

Another problem to be solved by the present invention is to provide a variable bit-precision multiplier-accumulator structure that can be applied to any deep neural network processing device (mobile, edge, or server) that optimizes bit precision to improve hardware power efficiency.

To solve the above technical problem, the variable bit-precision multiplier-accumulator of the present invention receives first data, second data, and third data of up to n bits, where n is the maximum data width that can be processed, and adds the third data to the product of the first data and the second data. The multiplier-accumulator includes an input processing unit that divides the first data, the second data, and the third data according to their bit-precision information, and a multiply-accumulate unit that determines the number of multiplication operations according to the number of bits of the divided first, second, and third data and multiplies the first data by the second data.

In an embodiment of the present invention, the divided data may be n-bit, n/2-bit, or n/4-bit data.

In an embodiment of the present invention, the multiplication results of the first data and the second data are determined by the number of bits of the divided data, and the third data may be accumulated onto each of the multiplication results.

In an embodiment of the present invention, the multiplication results of the first data and the second data are determined by the number of bits of the divided data, and either all the multiplication results are added together and n-bit third data is then accumulated, or the multiplication results are added in pairs and n/2-bit third data is then accumulated onto each pair sum.

Because the bit precision and the multiply-accumulate configuration can both be varied, the present invention can be applied to a variety of deep neural network models, which improves its versatility.

The present invention also improves the efficiency of array processing based on the multiply-accumulate scheme as the bit precision changes, which improves hardware power efficiency.

FIG. 1 is a block diagram of a variable bit-precision multiplier-accumulator according to a preferred embodiment of the present invention.
FIG. 2 is an operational diagram illustrating how the number of multiplication operations changes with bit precision.
FIG. 3 shows examples of variable accumulation.
FIG. 4 is a configuration diagram of a hardware accelerator that processes actual deep neural network operations using the variable bit-precision multiplier-accumulator of the present invention as its basic operation unit.
FIG. 5 is an example diagram of the number of filters processed by the accelerator as the number of operations of the multiplier-accumulator of the present invention is varied.
FIG. 6 is a graph comparing accuracy against processing time (latency) for a deep neural network model using the variable bit-precision multiplier-accumulator of the present invention and one using a conventional multiplier-accumulator.

To provide a full understanding of the configuration and effects of the present invention, preferred embodiments are described below with reference to the accompanying drawings. The present invention is not limited to the embodiments disclosed below; it can be implemented in various forms and modified in various ways. The description of these embodiments is provided only so that the disclosure of the present invention is complete and so that a person having ordinary skill in the art to which the invention belongs is fully informed of its scope. In the accompanying drawings, components are shown larger than their actual size for convenience of explanation, and the proportions of the components may be exaggerated or reduced.

Terms such as "first" and "second" may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a "first component" may be called a "second component," and similarly a "second component" may be called a "first component." Singular expressions include the plural unless the context clearly indicates otherwise. Unless otherwise defined, terms used in the embodiments of the present invention are to be interpreted with the meanings commonly known to a person of ordinary skill in the art.

Hereinafter, a variable bit-precision multiplier-accumulator structure according to a preferred embodiment of the present invention is described in detail with reference to the drawings.

FIG. 1 is a block diagram of a variable bit-precision multiplier-accumulator according to a preferred embodiment of the present invention.

Referring to FIG. 1, the variable bit-precision multiplier-accumulator (1) of the present invention includes: an input register (10) that receives data (A, B, C) and a mode signal (MODE); an input processing unit (20) that checks the bit information of the data (A, B, C) received through the input register (10) and divides and combines the data according to their bit width for output; a multiply-accumulate unit (40) that receives the output of the input processing unit (20) through a first pipeline register (30) and performs multiply-accumulate operations in a manner that varies with the bit precision; and an output processing unit (60) that receives the result of the multiply-accumulate unit (40) through a second pipeline register (50), converts it into the output data format, and outputs it through an output register (70).

The configuration and operation of the variable bit-precision multiplier-accumulator of the present invention configured as above are described in more detail below.

First, the input processing unit (20) receives the data (A, B, C) through the input register (10).

In the drawing, the data (A, B, C) are all shown as 16 bits wide, but this width may be changed to 32 bits, 64 bits, and so on as needed. That is, the embodiment is described on the basis of 16 bits, but the width may be changed as needed.

Here, the 16-bit width of the data (A, B, C) should be understood as the maximum width. That is, the data (A, B, C) are at most 16 bits wide, and 8-bit or 4-bit data may also be input.

This means that the bit precision is variable, so the structure can be applied to deep neural network models with different precisions.

The mode signal (MODE) may carry the mode of the input data, that is, information on its number of bits.

According to the number of bits of the input data, the input processing unit (20) converts the data (A, B, C) into the format that the multiply-accumulate unit (40) needs for its operation and outputs the result.

This processing is described in more detail later.

The data (A, B, C) processed by the input processing unit (20) are provided to the multiply-accumulate unit (40) through the first pipeline register (30).

The multiply-accumulate unit (40) adds the data (C) to the product of the two data (A, B), and the number of multiply-accumulate operations it performs varies with the number of bits of the data (A, B, C).

FIG. 2 is an operational diagram illustrating how the number of multiplication operations changes with bit precision.

(a) of FIG. 2 shows an example of multiplication for n-bit (e.g., 16-bit) input data (A, B), where the operation is completed with a single multiplication (A×B).

(b) of FIG. 2 shows an example of multiplication for n/2-bit (e.g., 8-bit) input data (A0, A1, B0, B1), where two multiplications (A0×B0, A1×B1) are performed.

Likewise, for n/4-bit data, four multiplications are performed, as shown in (c) of FIG. 2.

In this way, the multiply-accumulate unit (40) varies the number of multiplications according to the number of bits of the input data.

The divided data (A0, A1, A2, A3, B0, B1, B2, B3) are produced by the input processing unit (20), which divides the original data (A, B) and provides the resulting pieces.
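The division and per-lane multiplication of FIG. 2 can be sketched as a small functional model in Python. This is an illustration under stated assumptions rather than the patented circuit: the function names are hypothetical, the operands are treated as unsigned, and the lowest-order field is assumed to be lane 0 (the patent does not specify signedness or lane ordering).

```python
def split_lanes(word: int, n: int, lanes: int) -> list[int]:
    """Split an n-bit word into `lanes` equal fields, lowest field first
    (assumed ordering)."""
    if lanes not in (1, 2, 4) or n % lanes:
        raise ValueError("lanes must be 1, 2, or 4 and must divide n")
    width = n // lanes
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(lanes)]


def lane_products(a: int, b: int, n: int = 16, lanes: int = 1) -> list[int]:
    """Multiply matching lanes of a and b: one product for n-bit data,
    two for n/2-bit data, four for n/4-bit data."""
    a_parts = split_lanes(a, n, lanes)
    b_parts = split_lanes(b, n, lanes)
    return [x * y for x, y in zip(a_parts, b_parts)]
```

For example, `lane_products(0x1234, 0x1111, 16, 4)` treats each operand as four 4-bit values and returns `[4, 3, 2, 1]`, while `lanes=1` returns the single full-width product.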

The multiply-accumulate unit (40) computes A×B+C, and the way C is accumulated can also be varied.

FIG. 3 shows examples of variable accumulation of the data C.

In (a) of FIG. 3, for example, four multiplications are performed on n/4-bit data (A0, A1, A2, A3, B0, B1, B2, B3), and n/4-bit data (C0, C1, C2, C3) are added to the four multiplication results, one to each.

Alternatively, as shown in (b) of FIG. 3, the four multiplication results may be added together in pairs, and n/2-bit data (C0, C1) may then be accumulated onto the two pair sums, one to each.

That is, (b) of FIG. 3 yields two multiply-accumulate results.

As another example, as shown in (c) of FIG. 3, the results of the four multiplications on the n/4-bit data (A0, A1, A2, A3, B0, B1, B2, B3) may all be added together, and the n-bit accumulation data (C) may then be added.

In this case, there is a single operation result.

This reconfiguration can be implemented by modifying a carry-save adder.
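The three accumulation configurations of FIG. 3 can be modelled functionally as follows. This Python sketch only reproduces the arithmetic; it does not model the carry-save adder hardware. The function name is hypothetical, and pairing adjacent products (0 with 1, 2 with 3) is an assumption, since the text does not say which products form a pair.

```python
def accumulate(products: list[int], c: list[int]) -> list[int]:
    """Combine four lane products with the third data C.

    len(c) == 4: each product gets its own C (four results, as in FIG. 3 (a)).
    len(c) == 2: products are summed in adjacent pairs (assumed pairing),
                 then each pair sum gets a C (two results, as in FIG. 3 (b)).
    len(c) == 1: all products are summed, then C is added
                 (one result, as in FIG. 3 (c)).
    """
    if len(products) != 4 or len(c) not in (1, 2, 4):
        raise ValueError("expected four products and one, two, or four C values")
    group = len(products) // len(c)  # products merged into each result
    return [
        sum(products[i * group:(i + 1) * group]) + c[i]
        for i in range(len(c))
    ]
```

With products `[4, 3, 2, 1]`, passing four C values gives four results, two C values give two results, and a single C value gives one result, which matches the result counts described for FIG. 3.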

The result of the multiply-accumulate operation is then converted into the desired data format by the output processing unit (60) and output through the output register (70).

FIG. 4 is a configuration diagram of a hardware accelerator (100) that processes actual deep neural network operations using the variable bit-precision multiplier-accumulator of the present invention as its basic operation unit.

Referring to FIG. 4, the accelerator (100) implements a MAC array (130) that receives data from an activation buffer (120) and weights from a weight input unit (110) and performs multiply-accumulate operations.

The MAC array (130) may be a two-dimensional arrangement of the multiplier-accumulators (1) of the present invention. For example, the MAC array (130) may be implemented by arranging the multiplier-accumulators (1) 256 wide and 256 high.

In the accelerator structure, data is accumulated along each column of the array, and the accumulation unit (140) performs the final accumulation, so that each column processes one filter.

When the MAC array (130) is a 256×256 array as in the example above, it has 256 columns and can therefore process up to 256 filter operations.
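The column-wise accumulation can be illustrated with a purely functional Python model of one pass through the array. The function name is hypothetical, and the model assumes that the activation for row r is shared by every multiplier-accumulator in that row; the text states only that data accumulates along each column and that each column yields one filter result. Systolic timing and the separate accumulation unit (140) are not modelled.

```python
def mac_array_pass(activations: list[int], weights: list[list[int]]) -> list[int]:
    """Functional model of one pass; weights[r][c] sits at row r, column c.

    Partial sums accumulate down each column, giving one output per column
    (one filter per column). Row-wise sharing of activations is an assumption.
    """
    rows, cols = len(weights), len(weights[0])
    if len(activations) != rows:
        raise ValueError("one activation per row is expected")
    outputs = [0] * cols
    for r in range(rows):
        for c in range(cols):
            outputs[c] += activations[r] * weights[r][c]
    return outputs
```

For a 3×2 example, `mac_array_pass([1, 2, 3], [[1, 0], [0, 1], [2, 2]])` returns `[7, 8]`, one value per column.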

FIG. 5 is an example diagram of the number of filters processed by the accelerator (100) as the number of operations of the multiplier-accumulator (1) of the present invention is varied.

Assume that the MAC array (130) contains an N×N array of the multiplier-accumulators (1) of the present invention and that each multiplier-accumulator (1) in the accelerator (100) processes n/4-bit data. If each multiplier-accumulator can process only one multiply-accumulate operation, only N filter operations can be processed. If its structure can instead be changed to process two or four multiply-accumulate operations, the array can process 2N or 4N filter operations simultaneously.

Processing more filters is not always beneficial, but depending on the characteristics of a layer in the deep neural network there are cases where processing more filters is better. A structure that can be switched to process many filters simultaneously in those cases, and to process fewer filters (handling several operations of the same filter instead) when it is more beneficial to finish a small number of filter operations quickly, is far more efficient than a structure that can only perform a fixed number of filter operations.

(a) of FIG. 5 shows an example in which 4N filters are processed through multiply-accumulate operations on n/4-bit data, and (b) and (c) of FIG. 5 show example configurations in which 2N and N filters are processed through multiply-accumulate operations on n/2-bit and n-bit data, respectively.
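The relationship between data width and filter count in FIG. 5 reduces to simple arithmetic, sketched below in Python. The function name is hypothetical, and the sketch assumes the configuration in which every multiplier-accumulator emits one separate result per lane; configurations that merge lane results inside a multiplier-accumulator would process fewer filters.

```python
def filters_per_pass(columns: int, data_bits: int, max_bits: int = 16) -> int:
    """Filters handled at once when every MAC emits one result per lane
    (assumed configuration)."""
    if data_bits not in (max_bits, max_bits // 2, max_bits // 4):
        raise ValueError("data_bits must be n, n/2, or n/4")
    lanes = max_bits // data_bits
    return columns * lanes
```

With 256 columns and a 16-bit maximum width, this gives 256 filters for 16-bit data, 512 for 8-bit data, and 1024 for 4-bit data, corresponding to N, 2N, and 4N.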

FIG. 6 is a graph comparing accuracy against processing time (latency) for a deep neural network model using the variable bit-precision multiplier-accumulator of the present invention and one using a conventional multiplier-accumulator.

MobileNet was used as the deep neural network model, and the graph confirms that quantization using the variable bit-precision multiplier-accumulator of the present invention shortens processing time while maintaining the same accuracy.

That is, the result indicates that applying different bit precisions to different parameters within a single neural network model is more efficient than applying one fixed bit precision to all parameters.

As described above, the present invention can reconfigure the multiplier-accumulator according to the bit precision of the data, so it can be applied to various deep neural network models and can also improve processing speed.

Although embodiments according to the present invention have been described above, they are merely exemplary, and a person of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible. Accordingly, the true technical scope of protection of the present invention should be determined by the following claims.

10: Input register 20: Input processing unit
30: First pipeline register 40: Multiply-accumulate unit
50: Second pipeline register 60: Output processing unit
70: Output register

Claims

A variable bit-precision multiplier-accumulator that receives first data, second data, and third data of n bits, n being the maximum data width that can be processed, and adds the third data to the product of the first data and the second data, the multiplier-accumulator comprising:
an input processing unit that divides the first data, the second data, and the third data according to bit-precision information of the first data, the second data, and the third data; and
a multiply-accumulate unit that determines the number of multiplication operations according to the number of bits of the first data, second data, and third data divided by the input processing unit and performs multiplication of the first data and the second data,
wherein the multiplication results of the first data and the second data
are determined by the number of bits of the divided data, and
wherein either all the multiplication results are added together and the n-bit third data is then accumulated, or
the multiplication results are added in pairs and n/2-bit third data is then accumulated onto each resulting sum.

The variable bit-precision multiplier-accumulator of claim 1,
wherein the divided data is n-bit, n/2-bit, or n/4-bit data.

(Deleted)

An accelerator comprising a MAC array in which the variable bit-precision multiplier-accumulators of claim 1 are arranged in a matrix of rows and columns.

The accelerator of claim 5,
wherein the MAC array quantizes and processes a different number of filters depending on the number of bits of the input data.

The accelerator of claim 6,
further comprising an accumulation unit that processes a filter by performing a final accumulation of the operation results accumulated along the column direction of the MAC array.