KR20180122021A

KR20180122021A - How to operate a digital computer to reduce computational complexity associated with dot products between large vectors

Info

Publication number: KR20180122021A
Application number: KR1020187030590A
Authority: KR
Inventors: 빈센조 리구오리
Original assignee: 오션 로직 피티와이 리미티드
Priority date: 2016-03-29
Filing date: 2017-03-23
Publication date: 2018-11-09
Also published as: CN109074350A; WO2017165904A2; WO2017165904A3

Abstract

본 발명은 각각의 벡터가 N 개의 성분에 의해 특징 지워지는 제1 벡터와 제2 벡터 사이의 스칼라 곱에 대한 근삿값을 연산하기 위해 데이터 처리 시스템을 동작시키는 방법을 포함한다. 이 방법은 제1 벡터를 N 개의 성분 및 N 개의 성분의 절댓값들의 합계인 정수 K에 의해 특징 지워지는 피라미드 정수 벡터인 제3 벡터로 대체하는 단계 및 제1 벡터와 제2 벡터 사이의 스칼라 곱에 대한 근삿값을 제공하기 위해 제2 벡터와 제3 벡터의 스칼라 곱을 연산하는 단계를 포함한다. 제2 벡터와 제3 벡터의 스칼라 곱을 연산하는 단계는 K 번의 덧셈과 1번의 부동 소수점 곱셈에 의해 수행될 수 있다.The present invention includes a method of operating a data processing system to calculate an approximation to a scalar product between a first vector and a second vector, wherein each vector is characterized by N components. The method includes the steps of replacing the first vector with a third vector, which is a pyramidal integer vector characterized by an integer K, which is the sum of the N components and the N component excerpt values, and a scalar multiplication between the first vector and the second vector And computing a scalar product of the second vector and the third vector to provide an approximation for the second vector. The step of calculating the scalar product of the second vector and the third vector may be performed by K addition and 1 floating point multiplication.

Description

How to operate a digital computer to reduce computational complexity associated with dot products between large vectors

본 출원은 2016년 3월 29일에 출원되고 참조로서 여기에 통합되어 있는 호주 가출원 제2016901146호에 대한 우선권을 주장한다.This application claims priority to U.S. Provisional Application No. 2016901146, filed on March 29, 2016, incorporated herein by reference.

디지털 신호 처리는 보편화되어 가고 있다. 예를 들어, 저역 통과 필터들, 대역 통과 필터들, 및 고역 통과 필터들과 같은 신호 필터들은 아날로그 신호를 디지털화하여 디지털 값들의 시퀀스를 생성함으로써 구현되고 있다. 여기서, 디지털 값들은 유한 임펄스 필터를 통해 처리되는데, 이 경우 신호 값들의 벡터와 필터 계수들의 벡터의 스칼라 곱 또는 "점(dot)" 곱이 매시간 필터링된 신호를 나타내는 디지털 값을 제공한다.Digital signal processing is becoming common. For example, signal filters such as low pass filters, band pass filters, and high pass filters are implemented by digitizing an analog signal to generate a sequence of digital values. Here, the digital values are processed through a finite impulse filter, where the scalar product or " dot " product of the vector of signal values and the vector of filter coefficients provides a digital value representing the hourly filtered signal.

2개의 N 성분 벡터인 A와 B의 스칼라 곱은 다음과 같이 정의된다.The scalar product of two N component vectors, A and B, is defined as:

.

여기서, A_i 및 B_i는 각각 A 및 B의 i번째 성분이며, N은 각 벡터에서의 성분의 개수이다. 이러한 연산을 수행하는데 내재된 연산 작업 부하는 N 번의 곱셈과 N-1 번의 덧셈이다. 곱하기에 대한 연산 부하는 더하기에 대한 연산 부하보다 상당히 크다. 정수 곱셈 및 덧셈과 같은 간단한 경우에, 곱셈은 N 번의 덧셈과 N-1 번의 시프트(shift)에 의해 수행된다. 부동 소수점 숫자와 같이 더 복잡한 경우에, 연산 부하는 훨씬 더 크다. 그러므로 애플리케이션이 2개의 큰 벡터들의 스칼라 곱을 요구하는 경우, 연산 작업 부하는 기본 필터 또는 그 밖의 기타 모델들에 대해 실질적인 제한을 둘 수 있다.Where A _i and B _i are the i-th components of A and B, respectively, and N is the number of components in each vector. The computational workload inherent in performing these operations is N multiplications and N-1 additions. The operation load for multiplication is considerably larger than the operation load for addition. In simple cases such as integer multiplication and addition, the multiplication is performed by N additions and N-1 shifts. In more complex cases such as floating point numbers, the computational load is much larger. Therefore, when an application requires a scalar multiplication of two large vectors, the computational workload can place practical constraints on the underlying filter or other models.

예를 들어, 디지털 필터의 품질은 통상적으로 관련 벡터의 크기에 따라 증가한다. 50개의 필터 계수를 갖는 대역 통과 필터는 500개의 필터 계수를 갖는 필터보다 상당히 더 적은 대역외 억제(out-of-band rejection)를 갖는다. 공교롭게도, 더 큰 필터에 내재된 연산 작업 부하는 이러한 우수한 필터를 덜 매력적이게 만들 수 있다. 이와 유사하게, 신경망들에 기반한 알고리즘들을 활용하는 패턴 인식 시스템들은 다수의 스칼라 곱들을 수행하는 문제에 직면해 있다.For example, the quality of a digital filter typically increases with the size of the associated vector. A bandpass filter with 50 filter coefficients has significantly less out-of-band rejection than a filter with 500 filter coefficients. Coincidentally, the computational workload inherent in larger filters can make these superior filters less attractive. Similarly, pattern recognition systems that utilize algorithms based on neural networks are faced with the problem of performing multiple scalar multiplications.

본 발명은 각각의 벡터가 N 개의 성분에 의해 특징 지워지는 제1 벡터와 제2 벡터 사이의 스칼라 곱에 대한 근삿값을 연산하기 위해 데이터 처리 시스템을 동작시키는 방법을 포함한다. 이 방법은 제1 벡터를 N 개의 성분 및 N 개의 성분의 절댓값들의 합계인 정수 K에 의해 특징 지워지는 피라미드 정수 벡터인 제3 벡터로 대체하는 단계 및 제1 벡터와 제2 벡터 사이의 스칼라 곱에 대한 근삿값을 제공하기 위해 제2 벡터와 제3 벡터의 스칼라 곱을 연산하는 단계를 포함한다.The present invention includes a method of operating a data processing system to calculate an approximation to a scalar product between a first vector and a second vector, wherein each vector is characterized by N components. The method includes the steps of replacing the first vector with a third vector, which is a pyramidal integer vector characterized by an integer K, which is the sum of the N components and the N component excerpt values, and a scalar multiplication between the first vector and the second vector And computing a scalar product of the second vector and the third vector to provide an approximation for the second vector.

본 발명의 일 양태에서, 제2 벡터와 제3 벡터의 스칼라 곱을 연산하는 단계는 제2 벡터와 제3 벡터의 스칼라 곱을 제공하기 위해 제3 벡터에서의 성분들 중 대응하는 하나의 성분에 의해 지정된 다수의 횟수만큼 제2 벡터의 각 성분을 레지스터에 더하는 단계를 포함한다.In one aspect of the present invention, computing the scalar product of the second vector and the third vector comprises computing a scalar multiplication of a second vector by a third vector, which is specified by a corresponding one of the components in the third vector to provide a scalar multiplication of the second vector and the third vector. And adding each component of the second vector to the register a number of times.

본 발명의 다른 양태에서, 제2 벡터는 제2 벡터 길이에 의해 특징 지워지고, 제3 벡터는 제3 벡터 길이에 의해 특징 지워지며, 제2 벡터와 제3 벡터의 스칼라 곱은 제2 벡터 길이와 제3 벡터 길이에서의 차이가 제1 벡터와 제2 벡터의 스칼라 곱에 대한 근삿값을 제공하기 위해 정정된다.In another aspect of the invention, the second vector is characterized by a second vector length, wherein the third vector is characterized by a third vector length, and the scalar product of the second vector and the third vector is a second vector length and The difference in three vector lengths is corrected to provide an approximation to the scalar multiplication of the first vector and the second vector.

본 발명의 다른 양태에서, 제1 벡터와 제2 벡터의 스칼라 곱에 대한 근삿값은 제1 벡터와 제2 벡터의 스칼라 곱에 있는 허용 오차에 의해 특징 지워지며, 제2 벡터와 제3 벡터의 스칼라 곱이 제1 벡터와 제2 벡터의 스칼라 곱과 허용 오차 미만으로 상이한 최솟값을 갖도록 K가 선택된다. 일 양태에서, K<N이다.In another aspect of the invention, the approximate value for the scalar multiplication of the first vector and the second vector is characterized by the tolerance in the scalar multiplication of the first vector and the second vector, and the scalar multiplication of the second vector and the third vector, K is selected such that the product has a minimum value that is less than the tolerance and the scalar product of the first vector and the second vector. In one aspect, K < N.

본 발명의 다른 양태에서, 제1 벡터의 성분들은 소정 개수의 비트들(n)을 요구하는 수치 표현에 의해 특징 지워지고, K<nN이다.In another aspect of the invention, the components of the first vector are characterized by a numerical expression requiring a predetermined number of bits (n), where K < nN.

본 발명의 실시 예에 따르면, 필터 계수 벡터는 유형의 피라미드 정수 성분 벡터에 의해 근사화될 수 있어, 연산 복잡도가 축소될 수 있다.According to an embodiment of the present invention, the filter coefficient vector can be approximated by a pyramidal integer component vector of the type, and the computational complexity can be reduced.

도 1은 A와 S 사이의 스칼라 곱을 연산하기 위한 종래의 연산 하드웨어를 예시한다.
도 2는 정수 성분 벡터와 S의 스칼라 곱을 연산하기 위한 하드웨어를 예시한다.
도 3은 정수 값들의 분포를 예시한다.
도 4는 근사화 벡터(approximation vector)의 성분들의 주파수 분포를 예시한다.
도 5는 필터들 각각의 감쇠를 예시한다.FIG. 1 illustrates conventional computing hardware for computing a scalar product between A and S. FIG.
Figure 2 illustrates hardware for computing the scalar product of an integer component vector and S;
Figure 3 illustrates the distribution of integer values.
Figure 4 illustrates the frequency distribution of the components of an approximation vector.
Figure 5 illustrates the attenuation of each of the filters.

본 발명이 그 이점을 제공하는 방식은 시간의 함수로서 신호 세기를 나타내는 디지털 값들의 시퀀스를 제공하기 위해 디지털화된 신호로부터 도출된 동일 길이의 벡터와 필터 계수 벡터의 스칼라 곱을 생성하는 연산 엔진에서 유한 응답 필터로서 구현된 간단한 대역 통과 필터를 참조하여 더 쉽게 이해될 수 있다. 필터 계수 벡터를 A = [A₁, A₂, ..., A_N]로 나타낸다. 일반적으로, 신호는 디지털 값들 D_i(여기서, i는 1 내지 N_D로서 N_D>>N임)의 시퀀스로 구성된다. 신호는 N 개의 셀 길이인 레지스터로 시프트 되는 것으로 보일 수 있다. 이 레지스터의 값들은 신호 벡터 S=[S₁, S₂, ?, S_N]에 대응한다. A와 S 사이의 각각의 스칼라 곱이 연산된 이후, 레지스터의 내용이 소정의 개수의 셀에 의해 시프트 되고, D_i의 새로운 값들이 레지스터로 시프트 된다. 이러한 방식으로 생성된 스칼라 곱들의 시퀀스는 필터로부터의 출력 디지털 신호를 형성한다.The manner in which the invention provides its advantages is based on the fact that in an arithmetic engine that produces a scalar product of a vector of the same length and a filter coefficient vector derived from a digitized signal to provide a sequence of digital values representing the signal strength as a function of time, Can be more easily understood with reference to a simple band pass filter implemented as a filter. The filter coefficient vector is denoted by A = [A ₁ , A ₂ , ..., A _N ]. Generally, the signal is composed of a sequence of digital values D _i , where i is 1 to N _D and N _D >> N. The signal may appear to be shifted into a register that is N cell lengths. The values of this register correspond to the signal vector S = [S ₁ , S ₂ ,?, S _N ]. After each scalar multiplication between A and S is computed, the contents of the register are shifted by a predetermined number of cells and the new values of D _i are shifted into the registers. The sequence of scalar products generated in this manner forms the output digital signal from the filter.

실시간 애플리케이션들에서, 스칼라 곱은 레지스터로 시프트 되어 그 레지스터를 시프트 하는 신호 값들의 다음 집합을 생성하는데 필요한 시간보다 더 적은 시간 내에 연산되어야 한다. 병렬 처리에 의해 시간 제약들이 완화될 수 있지만, 추가 하드웨어의 비용은 속도 개선을 위한 이러한 옵션이 사용될 수 있는 한도를 제한할 수 있다.In real-time applications, the scalar product must be computed in less time than is needed to shift to a register to produce the next set of signal values that shift the register. Although time constraints can be mitigated by parallel processing, the cost of additional hardware can limit the extent to which such an option for speed improvement can be used.

본 발명은 관심 있는 몇몇 경우에 벡터 A가 2개의 관심 특성이 있는 동일한 길이의 근사화 벡터 A'로 대체될 수 있다는 관찰에 기초한다. 첫째, A'와 S의 스칼라 곱은 A와 S의 스칼라 곱에 대한 적절한 근삿값이다. 둘째, A'와 S의 스칼라 곱은 A와 S의 스칼라 곱보다 상당히 더 적은 연산 리소스를 요구한다.The present invention is based on the observation that in some cases of interest the vector A can be replaced by an approximation vector A 'of the same length with two interesting properties. First, the scalar product of A 'and S is the approximate approximation to the scalar multiplication of A and S. Second, the scalar multiplication of A 'and S requires significantly less computational resources than the scalar multiplication of A and S.

A를 근사화할 때 용인될 수 있는 오차 정도는 스칼라 곱이 사용되고 있는 특정 애플리케이션에 의존한다. 일부 센서의 출력을 디지털화함으로써 생성된 디지털 시퀀스에 대해 동작하는 유한 임펄스 응답으로서 구현되는 종래의 디지털 필터를 고려한다. 신호가 필터링 되는 것은 디지털 시퀀스(D_i)에 대한 시간의 함수로서 센서 아날로그 출력 전압을 변환하는데 사용되는 아날로그-디지털 변환기(ADC)에 의해 유입되는 디지털화 잡음 및 센서 내의 잡음으로부터 초래되는 일정 레벨의 잡음을 가질 것이다. 필터로부터의 출력 신호는 다른 디지털 시퀀스(F_i)로 구성된다. 컴포넌트들이 입력 디지털 시퀀스의 서브 시퀀스인 벡터와 A의 스칼라 곱을 형성함으로써 각각의 출력 시퀀스 값이 획득된다. 예를 들어, 벡터 [S_k, S_k ₊ ₁, ..., S_k _+N- ₁]와 A의 스칼라 곱에 의해 출력 신호 값(F_k)이 획득된다. 출력 신호는 또한 입력 신호의 잡음으로부터 초래되는 일정 레벨의 잡음을 가질 것이다. A가 A'로 대체되면, 추가 잡음이 출력 신호로 유입될 것이다. 추가 잡음이 입력 신호 내의 잡음으로부터 초래되는 잡음보다 훨씬 작다면, 추가 잡음은 출력 신호의 정확성을 크게 바꾸지 않을 것이다. 예를 들어, A'가 선택되어 A에 대한 근사화에 의해 유입되는 추가 잡음이 신호에 의해 유입되는 잡음의 25퍼센트 미만이 되면, 이러한 근사화는 출력 잡음 진폭에 거의 영향을 미지치 않을 것이다.The degree of tolerance that can be tolerated when approximating A depends on the particular application in which the scalar multiplication is being used. Consider a conventional digital filter implemented as a finite impulse response that operates on a digital sequence generated by digitizing the output of some sensors. Constant level of the noise resulting from the noise in the digitized noise and sensor coming to digital converter (ADC) - The signals are filtered analog that is used to convert the sensor analog output voltage as a function of time for the digital sequences (D _i) . The output signal from the filter is composed of another digital sequence (F _i). Each output sequence value is obtained by forming a scalar multiplication of A with the vector whose components are the subsequences of the input digital sequence. For example, the output signal value F _k is obtained by scalar multiplication of A [S _k , S _k ₊ ₁ , ..., S _k _{+ N-} ₁ ] with A. The output signal will also have a certain level of noise resulting from the noise of the input signal. If A is replaced by A ', additional noise will be introduced into the output signal. If the additional noise is much less than the noise resulting from the noise in the input signal, the additional noise will not significantly change the accuracy of the output signal. For example, if A 'is selected and the additional noise introduced by the approximation to A is less than 25 percent of the noise introduced by the signal, then this approximation will have little effect on the output noise amplitude.

스칼라 곱에 부과되는 연산 부하를 축소하기 위한 하나의 방법은 벡터 성분들 모두가 정수인 근사화 벡터를 이용하는 것이다. 이러한 벡터는 다음의 논의에서 정수 성분 벡터라고 지칭될 것이다. 입력 데이터 스트림이 정수의 시퀀스라면, 스칼라 곱은 정수 곱셈들과 덧셈들만을 이용하여 연산될 수 있는데, 이들 곱셈 및 덧셈은 실수 곱셈 및 덧셈보다 상당히 더 적은 연산 리소스를 요구한다. ADC가 매우 큰 전압 차를 갖지 않는 센서 출력에 대해 동작함으로써 입력 신호가 생성되면, 입력 데이터 스트림은 이러한 정수 데이터 스트림이 되도록 조절될 수 있다.One way to reduce the computational load imposed on the scalar product is to use an approximation vector, where all of the vector components are integers. Such a vector will be referred to as an integer component vector in the following discussion. If the input data stream is a sequence of integers, the scalar product can be computed using integer multiplications and additions, which require significantly less computational resources than real multiplication and addition. If an input signal is generated by the ADC operating on a sensor output that does not have a very large voltage difference, the input data stream can be adjusted to be such an integer data stream.

A를 위한 정수 성분 벡터를 생성하기 위한 하나의 방법은 A를 스케일 조정하여 A에서 가장 큰 성분의 절댓값이 원하는 최대 사이즈의 정수에 의해 적절히 근사화되게 하는 것이다. 스케일 조정된 벡터의 성분들은 정수들로 반올림 된다. 본 개시의 목적들을 위해, 이러한 유형의 양자화 방식은 스칼라 양자화라고 지칭될 것인데, 이 경우 성분들은 벡터의 다른 성분에 대한 고려 없이 한 번에 하나씩 근사화된다. 이러한 간단한 알고리즘이 정수 성분 벡터를 생성하는 반면에, 정수 성분 벡터는 근사화에 의해 유입되는 잡음을 축소하거나 연산 복잡도를 축소하는 관점에서 반드시 최상의 정수 성분 벡터일 필요는 없다. 더구나, 정수 성분 벡터가 A'인 경우조차, 스칼라 곱은 여전히 N 번의 곱셈과 N-1 번의 덧셈을 요구한다. 신호 벡터가 실수들의 벡터라면, 하나의 정수와 하나의 실수를 포함하는 곱셈들을 갖는 연산 절감이 훨씬 더 적다.One way to generate an integer component vector for A is to scale A so that the absolute value of the largest component in A is properly approximated by the desired maximum size integer. The components of the scaled vector are rounded up to integers. For purposes of this disclosure, this type of quantization scheme will be referred to as scalar quantization, in which the components are approximated one at a time without consideration of the other components of the vector. While this simple algorithm produces an integer component vector, the integer component vector does not necessarily have to be the best integer component vector in terms of reducing the noise introduced by the approximation or reducing the computational complexity. Moreover, even if the integer component vector is A ', the scalar product still requires N multiplications and N-1 additions. If the signal vector is a vector of real numbers, there is much less computation savings with multiplications involving one integer and one real number.

본 발명은 스칼라 곱들을 요구하는 많은 실상의 문제들에서 정수 성분 벡터에 의해 근사화될 벡터의 계수들이 특정한 통계 특성을 갖는다는 관찰을 이용한다. 통계적 분포를 갖는 숫자들의 집합으로서 벡터의 성분들을 고려한다. 통계적 분포가 라플라시안(Laplacian)이면, 더 적합한 정수 성분 벡터가 발견될 수 있다. 나아가, 상이한 근사화 오차들을 갖는 복수의 정수 성분 벡터들이 생성될 수 있으며, 주어진 오차에 대해 최상의 연산 축소를 제공하는 하나의 정수 성분 벡터가 결정될 수 있다. 관심 대상이 되는 많은 문제는 성분들이 대략 라플라시안인 통계적 분포를 갖는 벡터 A를 포함한다.The present invention takes advantage of the observation that the coefficients of a vector to be approximated by an integer component vector in many real problems that require scalar multiplications have certain statistical properties. Consider the components of the vector as a set of numbers with a statistical distribution. If the statistical distribution is Laplacian, a more suitable integer component vector can be found. Further, a plurality of integer component vectors having different approximation errors may be generated, and one integer component vector that provides the best computational reduction for a given error may be determined. A number of problems of interest include vector A with statistical distributions in which the components are approximately laplacian.

본 개시의 목적을 위해, A의 성분들은 실질적으로 라플라시안 분포를 갖는다고 가정한다. A를 스케일 조정하고 전술된 성분들을 반올림함으로써 생성되는 간단한 정수 성분 벡터(A')를 고려한다. A'의 정수 성분들의 절댓값들의 합을 K라고 나타낸다. N과 K는 N 차원 공간에서 정수 성분 벡터들의 집합을 정의한다. 본 개시의 목적을 위해, 이들 벡터는 피라미드 벡터라고 지칭될 것이다. 이러한 벡터들의 집합 및 이들을 생성하는 방법들은 벡터 양자화 분야에서 공지되어 있으며, 따라서 여기에서는 상세히 설명되지 않을 것이다. 본 개시의 목적을 위해, 성분들의 통계적 분포가 라플라시안이고 양자화된 성분들이 K로 합산되는 절댓값들을 갖는다면, 이러한 벡터들의 집합은 정수 성분 벡터들에 의해 N 차원 공간에서 벡터들을 나타내는데 적합하다는 점에 유의해야 한다. 따라서, 주어진 K에 대해, A의 방향에 가장 가까운 N 차원 공간에서의 방향을 갖는 이 집합의 정수 성분 벡터는 이 특정 K를 갖는 A'에 대해 최선의 선택을 제공한다. 또한, A'와 몇몇 다른 벡터들의 스칼라 곱이 K 번의 덧셈을 이용하여 성취될 수 있음을 알 수 있다. K가 증가함에 따라, 이 집합 내의 벡터들의 개수 또한 증가한다. 그러므로 K를 증가시킴으로써 A에 대한 더 나은 근삿값이 발견될 수 있다. 그러므로 A를 A'에 의해 근사화함으로써 생성되는 오차와 연산 작업 부하 사이에 트레이드오프가 존재한다.For purposes of this disclosure, it is assumed that the components of A have a substantially Laplacian distribution. Consider a simple integer component vector (A ') that is generated by scaling A and rounding the above-mentioned components. The sum of the absolute values of the integer components of A 'is denoted by K. N and K define a set of integer component vectors in N-dimensional space. For purposes of this disclosure, these vectors will be referred to as pyramid vectors. A set of such vectors and methods for generating them are well known in the field of vector quantization and will not be described in detail here. For purposes of this disclosure, it is noted that if the statistical distribution of components is laplacian and the quantized components have extreme values summed to K, then such a set of vectors is suitable for representing vectors in N-dimensional space by integer component vectors Should be. Thus, for a given K, this set of integer component vectors with direction in the N-dimensional space closest to the direction of A provides the best choice for A 'with this particular K. [ It can also be seen that the scalar multiplication of A 'and some other vectors can be achieved using K additions. As K increases, the number of vectors in this set also increases. Therefore, by increasing K, a better approximation of A can be found. There is therefore a tradeoff between the error produced by approximating A by A 'and the computational workload.

A'에 의한 A의 근사화에 의해 유입되는 오차는 많은 경우에 측정될 수 있다. 예를 들어, 필터의 경우에, 필터의 대역 통과와 대역외 억제는 각각의 K 값을 위한 근사화 필터에 대해 결정될 수 있다. 허용 가능한 통과 대역 및 대역외 억제를 제공하는 최소 K 값이 이용될 수 있다. 오차 범위(error bound)가 알려지면, 본 발명의 근사화는 예상 벡터들(S)의 몇몇 샘플을 위해 시도될 수 있다. 오차 범위가 충족되면, 근사화가 이용될 수 있다.The error introduced by the approximation of A by A 'can be measured in many cases. For example, in the case of a filter, the bandpass and out-of-band suppression of the filter may be determined for an approximation filter for each K value. A minimum K value that provides an acceptable passband and out-of-band suppression may be used. Once an error bound is known, an approximation of the present invention may be attempted for some samples of the predicted vectors S. If the error range is met, an approximation can be used.

이하, A와 S 사이의 스칼라 곱을 연산하기 위한 종래의 연산 하드웨어를 예시하는 도 1을 참조한다. 도면을 간략히 하기 위해, 도면으로부터 제어 회로망이 생략되었다. 일반적으로, 초기에 0으로 설정된 레지스터(14)가 존재한다. 1 내지 N의 i마다, 레지스터(14)의 현재 내용에 승산기 출력을 더하는 가산기(13)에 출력이 입력되는 승산기(12)에 S_i 및 A_i가 입력된다. 이 공정이 끝나면, 레지스터(14)는 스칼라 곱을 저장한다.Reference is now made to Fig. 1, which illustrates a conventional computing hardware for computing a scalar product between A and S. In order to simplify the drawing, the control network is omitted from the drawing. Generally, there is a register 14 initially set to zero. S _i and A _i are input to the multiplier 12 to which the output is inputted to the adder 13 which adds the multiplier output to the current contents of the register 14 every i of 1 to N. [ At the end of this process, the register 14 stores the scalar product.

위에서 언급했듯이, K에 의해 특성화된 정수 성분 벡터를 이용하여 스칼라 곱을 연산하기 위한 연산 작업 부하는 K 번의 덧셈이다. 이하, 정수 성분 벡터와 S의 스칼라 곱을 연산하기 위한 하드웨어를 예시하는 도 2를 참조한다. 다시 도면을 간략히 하기 위해, 도면으로부터 제어 회로망이 생략되었다. 연산이 시작되면, 레지스터(14)는 다시 0으로 설정된다. A'의 i번째 성분에 의한 스칼라 곱(A'_i*S_i)에 대한 기여도를 고려한다. 이 기여도는 곱셈에 이어서 레지스터(14)의 내용에 기여도를 더함으로써 생성될 수 있다. 다른 방법으로서, 기여도는 가산기(16)를 이용하여 레지스터(14)의 내용에 S_i를 A'_i 번 더함으로써 생성될 수 있다. 본 개시의 목적을 위해, x를 누산기에 j 번 더함으로써 정수(j) 및 제2 숫자(x)를 포함하는 곱셈을 수행하는 하드웨어 컴포넌트가 다음의 설명에서 반복 가산 승산기(repeated add multiplier)라고 지칭될 것이다. J가 음수이면, 반복 가산 승산기들은 누산기에서 x를 뺀다. A'_i의 절댓값들의 합이 K이기 때문에, 총 K 번의 덧셈이 필요하다. x는 부동 소수점 숫자 또는 정수일 수 있다.As mentioned above, the computational workload for computing a scalar multiplication using an integer component vector characterized by K is K additions. Reference is now made to Fig. 2, which illustrates hardware for computing the scalar product of an integer component vector and S. To simplify the drawing again, the control network is omitted from the drawing. When the operation is started, the register 14 is again set to zero. The scalar product (A ') by the i-th component of A'_i* S_i) Of the contribution to the project. This contribution can be generated by adding a contribution to the contents of the register 14 following the multiplication. Alternatively, the contribution may be added to the contents of register 14 using adder 16,_iA '_i And the like. For purposes of this disclosure, a hardware component that performs a multiplication involving an integer (j) and a second number (x) by adding x to the accumulator j times is referred to in the following description as a repeated add multiplier Will be. If J is negative, the iterative additive multipliers subtract x from the accumulator. A '_iSince the sum of the absolute values of K is K, a total of K additions is required. x can be a floating-point number or an integer.

스칼라 곱은 N 번의 곱셈을 요구한다. 각 성분의 기여도가 반복 가산 승산기에 의해 제공되면, 덧셈의 총 횟수는 K이며, 레지스터는 0으로 재설정된다고 가정한다. A의 성분들의 거짓수(mantissa)를 나타내는데 필요한 비트의 개수가 n이면, 각각의 곱셈은 적어도 n 번의 덧셈 및 n-1 번의 시프트 동작을 요구할 것이다. 또한, 추가적인 N-1 번의 덧셈 동작이 존재할 것이다. 그러므로 총 작업 부하는 시프트 동작들로 인해 Nn+N-1 번의 덧셈보다 클 것이다. 따라서, K가 Nn+N-1 미만이면, 피라미드 벡터 근사화 및 반복적인 가산 곱셈을 이용함으로써 순수 연산 절감이 달성될 것이다.The scalar multiplication requires N multiplications. Assuming that the contribution of each component is provided by the iterative adder, the total number of additions is K, and the register is reset to zero. If the number of bits required to represent the mantissa of the components of A is n, each multiplication will require at least n additions and n-1 shift operations. In addition, there will be an additional N-1 addition operations. Therefore, the total workload will be larger than Nn + N-1 additions due to shift operations. Thus, if K is less than Nn + N-1, pure computational savings will be achieved by using pyramid vector approximation and iterative additive multiplication.

A의 스칼라 양자화가 이용됨에도 불구하고, 통상적으로, n은 적어도 16일 것이며, 그러므로 곱셈들을 행함으로써 발생되는 연산 부하는 16번의 덧셈 및 15번의 시프트를 요구할 것이다. 어느 방법이 적어도 연산 작업 부하를 요구하는지는 A'에서 성분들의 사이즈에 의존한다. 반대로, A'가 많은 작은 성분, 즉 -15와 15 사이의 정수 값을 갖는 S와 A'의 스칼라 곱은 곱셈 동작을 위해 반복 가산 승산기를 이용함으로써 훨씬 적은 시간에 구현될 수 있다. 따라서, 다수의 작은 성분들을 갖도록 A'를 선택하는 것이 유리하다. K<16*N이면, 순수 절감이 달성될 것이다.Even though scalar quantization of A is used, n will typically be at least 16, so the computational load generated by doing multiplications will require 16 additions and 15 shifts. Which method requires at least a computational workload depends on the size of the components in A '. Conversely, a scalar multiplication of S and A 'with a small number of small components, A' having an integer value between -15 and 15, can be implemented in a much less time by using a repeated adder multiplier for the multiplication operation. Therefore, it is advantageous to select A 'to have a large number of small components. If K <16 * N, net savings will be achieved.

특정한 N 및 K에 대해 정의된 벡터들의 집합 내의 정수 성분 벡터들은 일반적으로 A의 길이와 상이한 몇몇 소정의 길이를 갖는다는 점에 유의해야 한다. N과 K에 의해 정의된 집합으로부터 획득된 벡터를 Y로 나타낸다. 일반적으로, Y는 A와 상이한 길이를 가질 것이다. 그러므로,It should be noted that the integer component vectors in the set of vectors defined for a particular N and K generally have some predetermined length different from the length of A. Y is the vector obtained from the set defined by N and K. Generally, Y will have a length different from A. therefore,

A'=cY.A '= cY.

여기서, c는 Y의 길이에 대한 A의 길이의 비이다. 그러므로,Where c is the ratio of the length of A to the length of Y. therefore,

.

스칼라 곱은 위에서 설명된 바와 같이 K-1 번의 덧셈으로서 수행될 수 있는 N 개의 곱뿐 아니라 1 번의 곱셈을 요구한다.The scalar multiplication requires one multiplication as well as N multiplications that can be performed as K-1 addition, as described above.

앞서 설명된 정수 성분 벡터들을 이용하여 스칼라 곱을 수행하는 연산 작업 부하가 1번의 실수 곱셈 및 K-1 번의 덧셈을 위한 시간에 의해 한정되는 동안, 실제로 작업 부하는 이 한도 미만일 수 있다. 전술한 바와 같이, A'_i의 절댓값이 최대 A'_i 값을 저장하는데 요구되는 비트들의 개수와 관련된 값 미만이면, 그 A'_i와 대응하는 성분 S_i의 곱은 기존의 곱셈 대신에 A'_i 번의 덧셈 또는 뺄셈을 이용하여 반복 가산 승산기에서 수행될 수 있다. K-1 번의 덧셈 한도는 제1 영이 아닌 S 성분의 값을 이용하여 초기에 설정되는 누산기를 이용하여 곱들을 연산하기 위한 반복 가산 승산기를 이용함으로써 스칼라 곱에 대한 A'_i*S_i 번의 기여 모두를 수행하는 것에 대응한다. 반복 가산 승산기의 누산기가 스칼라 곱 연산을 시작할 때 0으로 설정되어 있으면, K 번의 덧셈이 요구된다. 그러나 몇몇 경우에는, i의 일부 값들을 위한 A'_i는 이러한 컷오프보다 큰 절댓값들을 갖는다. 이들 경우에, 종래의 곱셈을 수행하는 것이 연산 측면에서 더 효율적이다. 본 발명의 일 양태에서, A'의 성분들은 그룹화되어 작은 A'_i 값들을 갖는 성분들 및 S_i가 반복 가산 승산기로 보내지고 더 큰 절댓값을 갖는 성분들이 종래의 승산기로 보내진다.While the computational workload performing the scalar multiplication using the integer component vectors described above is limited by the time for one real multiplication and K-1 addition, the actual workload may actually be less than this limit. As described above, if it is less than the value related to the number of bits required to 'the absolute value of _i max A' A stores the value of _i, and A 'in place of the conventional multiplication product of the component S _i corresponding to the _i A' _i Lt; / RTI > can be performed in the iterative adder multiplier using < RTI ID = 0.0 > a < / RTI > The addition limit of K-1 is A ' _i * S _i contribution to the scalar multiplication by using an iterative adder to calculate the products using the accumulator initially set using the value of the S component other than the first zero Lt; / RTI > If the accumulator of the iterative adder multiplier is set to 0 when starting the scalar multiplication, K additions are required. In some cases, however, A ' _i for some values of _i have larger cut-off values than these cutoffs. In these cases, performing the conventional multiplication is more efficient in terms of operation. In one aspect of the invention, the components of A 'are grouped so that components with small A' _i values and S _i are sent to the iterative adder and the components with larger offsets are sent to conventional multipliers.

알려지지 않은 벡터를 이용하여 스칼라 곱을 실행하고 스칼라 곱에서 충분한 정확성을 여전히 제공하는 경우 가장 낮은 연산 작업 부하를 제공하는 근사화 벡터를 찾기 위한 연산 부하 로드는 하나의 스칼라 곱을 수행할 때 내재하는 연산 작업 부하보다 훨씬 더 크다. 그러므로 본 발명의 방법은 동일한 벡터 A를 이용하는 다수의 스칼라 곱들이 수행될 상황들에 가장 적합하다. 이런 타입의 상황의 많은 예시가 업계에 공지되어 있다. 유한 임펄스 응답 필터들 및 신경망 기반 패턴 인식은 본 발명의 이점들이 상당한 절감들 및/또는 결과에서의 상당한 개선을 제공할 수 있는 상황들의 예시이다.If you perform scalar multiplication using unknown vectors and still provide sufficient accuracy in scalar multiplication, the computational load load to find the approximate vector that provides the lowest computational workload is less than the computational workload inherent in performing one scalar multiplication Much bigger. The method of the present invention is therefore most suitable for situations where multiple scalar multiplications using the same vector A are to be performed. Many examples of this type of situation are known in the art. Finite impulse response filters and neural network based pattern recognition are examples of situations in which the benefits of the present invention can provide significant savings in significant savings and / or results.

이하, 유한 임펄스 응답 필터의 예는 더 나은 필터 특성을 가질 뿐 아니라 종래의 유한 임펄스 응답 필터와 동일한 연산 작업 부하를 갖는 유한 임펄스 응답 필터를 구현하는데 본 발명의 축소된 연산 작업 부하가 이용될 수 있는 방식을 예시하기 위해 특정 대역 통과 필터를 참조하여 더 상세히 설명될 것이다. 57개의 "탭들" 즉, N=57인 220Hz와 400Hz 사이의 통과 대역을 갖는 종래의 유한 임펄스 응답 대역 통과 필터를 고려한다. 종래의 구현 예에서, 유한 임펄스 응답 필터는 57번의 부동 소수점 곱셈 및 56번의 부동 소수점 덧셈들을 요구하여 출력 신호 스트림의 하나의 샘플을 생성한다. 부동 소수점 계수들을 스케일 조정하고 계수들을 가장 가까운 정수로 반올림한 후에 실수 필터 계수들을 정수들로 대체함으로써 부동 소수점 곱셈들은 제거될 수 있다. 도 3에는 정수 값들의 분포가 도시되어 있다. 계수들은 각 계수를 나타내기 위해 16-비트 정수를 요구하며, 본질적으로 모든 필터 계수는 스칼라 곱들 내의 모든 곱셈들에 대해 너무 커서 반복 가산 승산기들로 대체될 수 없다는 점에 유의한다. 따라서, 스칼라 곱에서의 각 곱셈은 16번의 덧셈 및 15번의 시프트를 요구한다. 추가적으로, 곱셈 결과들을 합산하기 위해 요구되는 56번의 덧셈들이 존재한다. 그러므로 하나의 필터링된 성분을 생성하기 위해 스칼라 곱을 수행할 연산 작업 부하는 968(= 16*57+56) 번의 덧셈 및 855(= 15*57) 번의 시프트 동작이다.Hereinafter, an example of a finite impulse response filter not only has better filter characteristics but also implements a finite impulse response filter having the same computational workload as a conventional finite impulse response filter, Will be described in more detail with reference to a specific bandpass filter to illustrate the scheme. Consider a conventional finite impulse response bandpass filter with 57 " taps ", i.e. N = 57 passbands between 220 Hz and 400 Hz. In a conventional implementation, the finite impulse response filter requires 57 floating-point multiplications and 56 floating-point additions to generate one sample of the output signal stream. Floating point multiplications can be eliminated by scaling the floating-point coefficients and rounding the coefficients to the nearest integer and then replacing the real filter coefficients with integers. The distribution of integer values is shown in FIG. Note that the coefficients require a 16-bit integer to represent each coefficient, and essentially all filter coefficients are too large for all multiplications in the scalar products so that they can not be replaced by iterative addition multipliers. Thus, each multiplication in the scalar multiplication requires 16 additions and 15 shifts. Additionally, there are 56 additions required to sum the multiplication results. Therefore, the computational workload to perform the scalar multiplication to produce one filtered component is 968 (= 16 * 57 + 56) times of addition and 855 (= 15 * 57) shift operations.

스칼라 곱들을 연산하는데 사용될 근사화 벡터에 도달하기 위해 원래의 필터 계수들이 N=197 및 K=999로 특징 지워지는 피라미드 벡터들의 집합으로부터 가장 가까운 피라미드 벡터로 대체된 동일한 통과 대역을 위한 197-탭 필터를 고려한다. 이러한 필터는 다음의 설명에서 근사화 필터라고 지칭될 것이다. 도 4에는 근사화 벡터의 성분들의 주파수 분포가 도시되어 있다. 57-탭 필터를 위한 가중치의 분포와 대조적으로, 계수들 중 13개를 제외한 전부는 20 미만의 절댓값을 갖는다. 그러므로 스칼라 곱을 연산하는데 필요한 곱셈들은 반복 가산 승산기를 이용하여 성취될 수 있다. 반복 가산 승산기에 의해 곱셈들 모두가 수행되면, 총 작업 부하는 999번의 덧셈이다. 그러므로 근사화 필터는 피라미드 벡터 근사화 없이 57-탭 필터보다 다소 적은 연산 작업 부하를 갖는다.A 197-tap filter for the same passband replaced with the closest pyramid vector from the set of pyramid vectors whose original filter coefficients are characterized by N = 197 and K = 999 to reach the approximation vector to be used to compute the scalar products . Such a filter will be referred to as an approximation filter in the following description. Figure 4 shows the frequency distribution of the components of the approximation vector. In contrast to the distribution of the weights for the 57-tap filter, all but 13 of the coefficients have an exemption of less than 20. Hence, the multiplications needed to compute the scalar multiplication can be achieved using a repetitive addition multiplier. If all of the multiplications are performed by the iterative adder, the total work load is 999 additions. Therefore, the approximation filter has slightly less computational workload than the 57-tap filter without pyramid vector approximation.

이하, 필터들 각각의 감쇠를 예시하는 도 5를 참조한다. 197-탭 근사화 필터는 54로 도시되어 있고, 57-탭 필터는 52로 도시되어 있다. 근사화 필터는 통과 대역 밖에서 57-탭 필터보다 더 가파른 감소를 갖는다. 추가로, 근사화 필터의 경우 대역외 신호 억제는 대략 20dB 우수하다.Reference is now made to Fig. 5 which illustrates the attenuation of each of the filters. The 197-tap approximation filter is shown at 54 and the 57-tap filter is shown at 52. The approximation filter has a more steep reduction than the 57-tap filter outside the passband. In addition, the out-of-band signal suppression for the approximation filter is approximately 20 dB better.

전술된 예시는 근사화 필터에서의 모든 곱셈들이 반복 가산 승산기에 의해 수행된다고 가정한다. 그러나 전술된 바와 같이, 더 큰 필터 계수들에 대한 곱셈들이 종래의 가산 및 시프트 승산기에서 수행되는 반면, 나머지 곱셈들이 반복 가산 승산기에서 수행되는 실시예 또한 가능하다. 근사화 필터의 성분들은 부호 비트와 7 비트에 의해 나타낼 수 있다. 그러므로 종래의 곱셈은 7번의 덧셈과 6번의 시프트를 요구한다. 절댓값들이 16보다 큰 성분들 모두가 종래의 곱셈을 활용하면, 작업 부하는 대략 500번의 덧셈으로 축소될 수 있다.The above-described example assumes that all the multiplications in the approximation filter are performed by the iterative addition multiplier. However, as described above, it is also possible that the multiplications for the larger filter coefficients are performed in the conventional adder and shift multipliers, while the remaining multiplications are performed in the iterative adder. The components of the approximation filter can be represented by sign bits and 7 bits. Therefore, conventional multiplication requires 7 additions and 6 shifts. If all of the components whose offset values are greater than 16 utilize conventional multiplication, the workload can be reduced to approximately 500 additions.

앞선 예시에서, K>N이다. 그러나 K<N인 실시예들 또한 가능하다. 앞서 설명된 바와 같이, K값이 클수록 원시 벡터 A의 방향과 A에 대한 피라미드 벡터 근삿값 A' 사이의 차이는 줄어든다. 그러나 특정 애플리케이션이 A와 A' 사이의 차이를 허용하면, 연산 절감은 매우 클 수 있다. K<N이면, 적어도 A'의 N-K 성분들이 0일 것이다. K=N-1인 경우, 덧셈들의 개수가 종래의 스칼라 곱에서의 덧셈 개수와 갖지만, 곱셈들이 아무런 시간을 요구하지 않는다. 그러므로 시간 연산 부하 절감은 스칼라 곱을 연산하기 위한 일반적인 방법에 비해 16배 이상일 수 있다.In the preceding example, K > N. However, embodiments where K < N are also possible. As described above, the larger the K value, the smaller the difference between the direction of the source vector A and the pyramid vector A 'for A. However, if a particular application allows a difference between A and A ', the computational savings can be very large. If K < N, at least the N-K components of A 'will be zero. If K = N-1, the number of additions is with the number of additions in the conventional scalar multiplication, but the multiplications do not require any time. Therefore, the time computation load savings can be up to 16 times that of the usual method for computing scalar multiplications.

본 발명은 특히 신경망에 유용하다. 신경망에서, 각 층은 복수의 신경 노드 또는 프로세서를 포함한다. 각각의 신경 노드는 기본적으로 신경망의 다음 층으로 전달되는 출력을 제공하도록 비선형 함수에 의해 처리되는 출력을 가진 필터이다. 제1 층은 필터들에 의해 처리되고 있는 신호와 유사한 복수의 센서 입력을 수신한다. 각 층은 센서 입력들의 서브 집합을 포함하는 벡터에 대해 동작한다. 필터 계수들은 신경망이 서로 구별되는 다양한 아이템으로 "도시된" 훈련 기간에 결정되며, 필터 계수들은 최종 계층이 다양한 클래스를 정확하게 식별할 때까지 조정된다. 본 발명은 많은 신경망 구현 패턴 인식 문제에 대한 최종 필터 계수들의 통계적 분포가 대략적으로 라플라시안이라는 관찰에 기초한다. 그러므로 이들 필터 계수 벡터는 앞서 설명된 유형의 피라미드 정수 성분 벡터에 의해 근사화될 수 있으며 전술된 바와 같이 연산 복잡도가 축소될 수 있다. 최적 K 값은 학습 집합의 분류 시에 오차들이 관찰되는 지점까지 초기 K 값을 축소함으로써 결정될 수 있다.The present invention is particularly useful for neural networks. In a neural network, each layer comprises a plurality of neural nodes or processors. Each neural node is essentially a filter with an output that is processed by a nonlinear function to provide an output that is delivered to the next layer of the neural network. The first layer receives a plurality of sensor inputs similar to the signal being processed by the filters. Each layer operates on a vector comprising a subset of the sensor inputs. The filter coefficients are determined during the training period " shown " as a variety of items in which the neural network is distinct from each other, and the filter coefficients are adjusted until the final layer correctly identifies the various classes. The present invention is based on the observation that the statistical distribution of the final filter coefficients for many neural network implementation pattern recognition problems is approximately laplacian. These filter coefficient vectors can therefore be approximated by a pyramidal integer component vector of the type described above and the computational complexity can be reduced as described above. The optimal K value can be determined by reducing the initial K value to the point at which errors are observed in classifying the learning set.

본 발명의 스칼라 곱 근사화는 누산기 및 반복 가산 승산기를 제공할 수 있는 임의의 데이터 처리 시스템에 구현될 수 있다. 본 발명은 특히 그래픽 프로세서들과 같이 병렬로 실행될 수 있는 다수의 프로세서를 갖는 데이터 프로세서에 적합한데, 각각의 프로세서는 다른 프로세서들과 병렬로 스칼라 곱의 하나의 항을 생성할 수 있다. 또한, 본 발명에 따라 스칼라 곱을 구현하는데 필요한 간단한 하드웨어로 인해, 본 발명은 필드 프로그래머블 게이트 어레이들 및 특수 처리 칩들에 구현될 수 있다.The scalar product approximation of the present invention may be implemented in any data processing system capable of providing an accumulator and a repeated adder multiplier. The present invention is particularly well suited for a data processor having multiple processors that may be executed in parallel, such as graphics processors, wherein each processor can generate a term of a scalar multiplication in parallel with other processors. Further, due to the simple hardware required to implement scalar multiplication in accordance with the present invention, the present invention can be implemented in field programmable gate arrays and special processing chips.

본 발명의 앞서 설명된 실시예들은 본 발명의 다양한 양태들을 예시하기 위해 제공되었다. 다만, 상이한 특정 실시예들에 도시된 본 발명의 상이한 양태들은 본 발명의 다른 실시예들을 제공하기 위해 결합될 수 있다는 점이 이해되어야 한다. 추가적으로, 본 발명에 대한 다양한 변형예들은 앞선 설명 및 첨부 도면들로부터 명확해질 것이다. 따라서, 본 발명은 다음의 청구항의 범위에 의해서만 제한되어야 한다.The foregoing embodiments of the invention have been presented to illustrate various aspects of the invention. It should be understood, however, that the different aspects of the invention illustrated in the different specific embodiments may be combined to provide other embodiments of the invention. In addition, various modifications to the present invention will become apparent from the foregoing description and the accompanying drawings. Accordingly, the invention should be limited only by the scope of the following claims.

Claims

CLAIMS What is claimed is: 1. A method of operating a data processing system for computing an approximation to a scalar product between a first vector and a second vector characterized by N components, each characterized by an absolute value,
Replacing the first vector with a third vector, which is a pyramidal integer vector characterized by N components, wherein each component of the pyramidal integer vector is multiplied by an integer K, which is a sum of absolute values and an absolute value of the N components, Characterized step; And
And computing a scalar product of the second vector and the third vector to provide the approximation for the scalar product between the first vector and the second vector.

2. The method of claim 1, wherein computing the scalar product of the second vector and the third vector comprises computing a scalar product of a corresponding one of the components in the third vector to provide a scalar multiplication of the second vector and the third vector. &Lt; And adding each component of the second vector to a register a number of times specified by a component of the second vector.

3. The method of claim 2, wherein the second vector is characterized by a second vector length, the third vector characterized by a third vector length, and the scalar product of the second vector and the third vector is characterized by the second Wherein a difference between the vector length and the third vector length is corrected to provide the approximate value for the scalar multiplication of the first vector and the second vector.

2. The method of claim 1 wherein the approximation of the scalar product of the first vector and the second vector is characterized by a tolerance in the scalar product of the first vector and the second vector, And K is chosen so that the scalar multiplication of the third vector has a different minimum value than the scalar product of the first vector and the second vector less than the tolerance.

2. The method of claim 1 wherein K < N.

2. The method of claim 1, wherein the components of the first vector are characterized by a numerical representation requiring a predetermined number of bits (n), wherein K < nN.