TWI841330B

TWI841330B - Gaussian elimination computing system and gaussian elimination computing method

Info

Publication number: TWI841330B
Application number: TW112112623A
Authority: TW
Inventors: 陳博仁; 楊宗賢; 王珽; 劉宗德
Original assignee: 國立臺灣大學
Priority date: 2023-03-31
Filing date: 2023-03-31
Publication date: 2024-05-01

Abstract

A Gaussian elimination computing system and a Gaussian elimination computing method are provided. The Gaussian elimination computing system includes a control circuit, a systolic array, and a memory. The control circuit receives an operation matrix. The systolic array includes a square array formed by a plurality computing cells. The systolic array is configured to perform a matrix decomposition operation to the operation matrix, to decompose the operation matrix into an upper triangular matrix and a lower triangular matrix. The memory is configured with an operation data block with the same size of the operation matrix, for storing the upper triangular matrix and the lower triangular matrix after decomposition.

Description

Gaussian elimination calculation system and Gaussian elimination calculation method

本發明是有關於一種系統及方法，且特別是有關於一種高斯消去計算系統及高斯消去計算方法。 The present invention relates to a system and method, and in particular to a Gaussian elimination calculation system and a Gaussian elimination calculation method.

在後量子密碼學(Post-quantum cryptography)中，不同於量子密碼學，後量子密碼學使用現有二進制的電腦，不依靠量子力學，它依靠的是密碼學家認為無法被量子電腦有效解決的計算難題來進行有效的加密。 In post-quantum cryptography, unlike quantum cryptography, post-quantum cryptography uses existing binary computers and does not rely on quantum mechanics. It relies on computational problems that cryptographers believe cannot be effectively solved by quantum computers to perform effective encryption.

本發明提供一種高斯消去計算系統及高斯消去計算方法，其可以節省進行矩陣分解運算時的記憶體需求。 The present invention provides a Gaussian elimination calculation system and a Gaussian elimination calculation method, which can save memory requirements when performing matrix decomposition operations.

本發明的高斯消去計算系統包括控制電路、脈動陣列及記憶體。控制電路接收運算矩陣。脈動陣列(systolic array)包括多個運算胞形成的方陣，脈動陣列經組態用以對運算矩陣進行矩陣分解運算，以將運算矩陣分解為下三角矩陣及上三角矩陣。記憶體配置有與運算矩陣相同的大小的運算資料區塊，用以儲存分解後的下三角矩陣及上三角矩陣。 The Gaussian elimination calculation system of the present invention includes a control circuit, a systolic array and a memory. The control circuit receives a calculation matrix. The systolic array includes a square matrix formed by a plurality of calculation cells, and the systolic array is configured to perform a matrix decomposition operation on the calculation matrix to decompose the calculation matrix into a lower triangular matrix and an upper triangular matrix. The memory is configured with a calculation data block of the same size as the calculation matrix to store the decomposed lower triangular matrix and upper triangular matrix.

本發明的高斯消去計算方法包括接收運算矩陣；將運算矩陣輸入至脈動陣列以進行矩陣分解運算，並將運算矩陣分解為下三角矩陣及上三角矩陣；以及於記憶體中配置運算矩陣相同的大小的運算資料區塊，並將分解後的下三角矩陣及上三角矩陣儲存於運算資料區塊。 The Gaussian elimination calculation method of the present invention includes receiving a calculation matrix; inputting the calculation matrix into a pulse array to perform matrix decomposition operation, and decomposing the calculation matrix into a lower triangular matrix and an upper triangular matrix; and configuring a calculation data block of the same size as the calculation matrix in a memory, and storing the decomposed lower triangular matrix and upper triangular matrix in the calculation data block.

基於上述，本發明的高斯消去計算系統及高斯消去計算方法可以有效率地利用記憶體，改善計算時所需的效率以及硬體需求。 Based on the above, the Gaussian elimination calculation system and Gaussian elimination calculation method of the present invention can efficiently utilize memory and improve the efficiency and hardware requirements required for calculation.

1、3:高斯消去計算系統 1, 3: Gaussian elimination calculation system

10、30:控制電路 10, 30: Control circuit

11、31:脈動陣列 11, 31: Pulsating array

12、32、303:記憶體 12, 32, 303: Memory

33、34、1102、MUX1~MUX3:多工器 33, 34, 1102, MUX1~MUX3: Multiplexer

110:運算胞 110: Operation Cells

110A:第一運算胞 110A: First operation cell

110B:第二運算胞 110B: Second operation cell

300:第一控制電路 300: First control circuit

301:第二控制電路 301: Second control circuit

302:排序電路 302: Sorting circuit

304:位址記憶體 304: Address memory

500~503:資料區塊 500~503: Data block

1100:偵錯電路 1100: Debug circuit

1101:或閘 1101: or gate

1103、FF:正反器 1103, FF: Flip-flop

A、B、A[0]~A[4]、B[0]~B[4]:輸入 A, B, A[0]~A[4], B[0]~B[4]: Input

C、C[0]~C[4]:輸出 C, C[0]~C[4]: output

CB1~CB3:行群組矩陣 CB1~CB3: row group matrix

addr1~addr5:位址記錄器 addr1~addr5: address recorder

data_in[0]~data_in[4]:輸入值 data_in[0]~data_in[4]: input value

data_out:輸入列 data_out: input row

data_out[0]~data_out[4]:輸出值 data_out[0]~data_out[4]: output value

data_in:輸出列 data_in: output column

ext_contr:控制器 ext_contr: controller

ext_en、ext_en[0]~ext_en[4]:外部致能訊號 ext_en, ext_en[0]~ext_en[4]: external enable signal

ext_op、ext_op[0]~ext_op[4]:外部控制訊號 ext_op, ext_op[0]~ext_op[4]: external control signal

fail_in、fail_out:偵錯判斷結果 fail_in, fail_out: Debugging results

FF11~FF19、FF21~FF29:移位暫存器 FF11~FF19, FF21~FF29: shift register

int_op:內部控制訊號 int_op: internal control signal

M:運算矩陣 M: Operation matrix

check_en、op_in、op_in[0]~op_in[4]、op_out、OPA、 perm_op、perm_op[0]~perm_op[4]:控制訊號 check_en, op_in, op_in[0]~op_in[4], op_out, OPA, perm_op, perm_op[0]~perm_op[4]: control signal

P:排序矩陣 P: sorting matrix

pivot_en:致能訊號 pivot_en: Enable signal

R:輸入矩陣 R: Input matrix

S、S[0]~S[4]:選擇訊號 S, S[0]~S[4]: Select signal

S90~S92:步驟 S90~S92: Steps

T:目標矩陣 T: Target Matrix

trig:觸發訊號 trig: trigger signal

U12~U15、U23~U25、U34~U35、U45:單元 U12~U15, U23~U25, U34~U35, U45: unit

圖1為本發明實施例一高斯消去計算系統的方塊示意圖。 Figure 1 is a block diagram of the Gaussian elimination calculation system of the first embodiment of the present invention.

圖2A為本發明實施例一脈動陣列的電路方塊圖。 Figure 2A is a circuit block diagram of a pulse array in accordance with an embodiment of the present invention.

圖2B、2C分別為圖2A中的第二運算胞及第一運算胞的真值表。 Figures 2B and 2C are the truth tables of the second and first operation cells in Figure 2A, respectively.

圖2D中繪示了圖2A中第一運算胞中的偵錯電路的電路示意圖。 FIG2D shows a schematic diagram of the debugging circuit in the first operation cell in FIG2A.

圖3A為本發明實施例一高斯消去計算系統的方塊示意圖。 FIG3A is a block diagram of a Gaussian elimination calculation system according to the first embodiment of the present invention.

圖3B中繪示了圖3A中的第一控制電路的電路方塊圖。 FIG3B shows a circuit block diagram of the first control circuit in FIG3A.

圖3C中繪示了圖3A中排序電路的電路方塊圖。 FIG3C shows a circuit block diagram of the sequencing circuit in FIG3A.

圖3D中繪示了圖3A中部分多工器的電路方塊圖。 FIG3D shows a circuit block diagram of a portion of the multiplexer in FIG3A.

圖4A~圖4Z用來說明一脈動陣列對第一行群組矩陣進行矩陣分解運算的操作流程。 Figures 4A to 4Z are used to illustrate the operation flow of a pulse array performing matrix decomposition operations on the first row group matrix.

圖5A~圖5C繪示了圖3A中的脈動陣列對記憶體進行寫入過程的示意圖。 Figures 5A to 5C show schematic diagrams of the process of the pulse array in Figure 3A writing into the memory.

圖6A~圖6D繪示了對運算矩陣進進行矩陣分解運算過程的示意圖。 Figure 6A to Figure 6D show schematic diagrams of the matrix decomposition operation process for the operation matrix.

圖7A~圖7C繪示了如何將下三角矩陣的反矩陣及排序矩陣乘上目標矩陣的計算過程的示意圖。 Figures 7A to 7C show schematic diagrams of the calculation process of how to multiply the inverse matrix and the sorting matrix of the lower triangular matrix by the target matrix.

圖8A~圖8C繪示了如何依據上三角矩陣的反矩陣來進行計算並獲得公鑰矩陣的示意圖。 Figures 8A to 8C show schematic diagrams of how to calculate and obtain the public key matrix based on the inverse matrix of the upper triangular matrix.

圖9繪示了本發明實施例一高斯消去計算方法的流程圖。 FIG9 shows a flow chart of the Gaussian elimination calculation method of the first embodiment of the present invention.

圖1為本發明實施例一高斯消去計算系統1的方塊示意圖。在一些實施例中，高斯消去計算系統1可應用於後量子密碼學(Post-Quantum Cryptography)中，用以接收輸入矩陣R並依據相對應的加密演算法來產生出加密用的公鑰。舉例來說，加密演算法可例如是McEliece、BIKE、FrodoKEM、HQC、NTRU Prime或是其他適合的加密演算法。高斯消去計算系統1所接收的輸入矩陣R可以是透過隨機生成的亂數矩陣，輸入矩陣R可被輸入矩陣 R分割成為運算矩陣M及目標矩陣T。在運算矩陣M及目標矩陣T中，運算矩陣M可被高斯消去計算系統1用來計算反矩陣M^-1，並進一步基於反矩陣M^-1來對目標矩陣T進行運算以獲得公鑰。在一些實施例中，高斯消去計算系統1可在計算在運算矩陣M的反矩陣M^-1的過程中，實時地判斷運算矩陣M是否為滿秩(full rank)或是判斷運算矩陣M是否為可逆(invertible)的。換句話說，高斯消去計算系統1不需要等到反矩陣M^-1的完整計算過程結束之後才能判斷其是否為滿秩或反矩陣M^-1是否存在。相反的，當判斷出運算矩陣M並不是滿秩或其為不可逆的情況下，高斯消去計算系統1可以在計算到一半的過程中放棄針對運算矩陣M的運算，達到早期放棄(early-abortion)的效果，舉例來說，高斯消去計算系統1可以利用Naive Abortion、Square Inversion、PLU分解(PLU decomposition)或其他適合的演算方式來在運算過程中實現早期放棄。另一方面來說，高斯消去計算系統1在運算的過程中，可以節省記憶體的使用空間，使得高斯消去計算系統1可以利用與運算矩陣M的大小相同的記憶體區塊來進行運算，避免記憶體容量以及電路面積上的額外負擔。 FIG1 is a block diagram of a Gaussian elimination computing system 1 of the first embodiment of the present invention. In some embodiments, the Gaussian elimination computing system 1 can be applied to post-quantum cryptography to receive an input matrix R and generate a public key for encryption according to a corresponding encryption algorithm. For example, the encryption algorithm may be McEliece, BIKE, FrodoKEM, HQC, NTRU Prime or other suitable encryption algorithms. The input matrix R received by the Gaussian elimination computing system 1 may be a random matrix generated randomly, and the input matrix R may be divided by the input matrix R into an operation matrix M and a target matrix T. In the operation matrix M and the target matrix T, the operation matrix M can be used by the Gaussian elimination calculation system 1 to calculate the inverse matrix M ^-1 , and further operate the target matrix T based on the inverse matrix M ^-1 to obtain the public key. In some embodiments, the Gaussian elimination calculation system 1 can determine in real time whether the operation matrix M is full rank or whether the operation matrix M is invertible during the process of calculating the inverse matrix M ^- ^{1 of the operation matrix M. In other words, the Gaussian elimination calculation system 1 does not need to wait until the complete calculation process of the inverse matrix M-1} is completed before determining whether it is full rank or whether the inverse matrix M ^-1 exists. On the contrary, when it is determined that the operation matrix M is not full rank or is irreversible, the Gaussian elimination calculation system 1 can abandon the operation on the operation matrix M halfway through the calculation process to achieve the effect of early-abortion. For example, the Gaussian elimination calculation system 1 can use Naive Abortion, Square Inversion, PLU decomposition or other suitable calculation methods to achieve early abandonment in the calculation process. On the other hand, the Gaussian elimination calculation system 1 can save memory space during the calculation process, so that the Gaussian elimination calculation system 1 can use the memory block of the same size as the operation matrix M to perform calculations, avoiding the additional burden on memory capacity and circuit area.

大致而言，高斯消去計算系統1將會執行並計算下方的公式(1)~公式(3)來依據輸入矩陣R計算出公鑰矩陣PK。 Generally speaking, the Gaussian elimination calculation system 1 will execute and calculate the following formulas (1) to (3) to calculate the public key matrix PK based on the input matrix R.

PM=LU (1) PM = LU (1)

U ^-1．L ^-1．P．M=I (2) U ^-1 ． L ^-1 . P. M = I (2)

PK=U ^-1．L ^-1．P．T (3) 其中P為排序矩陣，M為運算矩陣，L為下三角矩陣，U為上三角矩陣，U^-1為上三角矩陣的反矩陣，L^-1為下三角矩陣的反矩陣，I為單位矩陣，PK為公鑰矩陣，T為目標矩陣。詳細來說，藉由PLU分解，運算矩陣M可以被分解成如公式(1)及公式(2)中所表示的關係。進一步，藉由將公式(2)中將運算矩陣M還原成單位矩陣的過程重現在目標矩陣T上，即可如公式(3)所示，透過運算矩陣M的矩陣分解過程來對目標矩陣T進行運算，並獲得出公鑰矩陣PK。 PK = U ^-1 ． L ^-1 ． P ． T (3) Where P is the sorting matrix, M is the operation matrix, L is the lower triangular matrix, U is the upper triangular matrix, U ^-1 is the inverse matrix of the upper triangular matrix, L ^-1 is the inverse matrix of the lower triangular matrix, I is the unit matrix, PK is the public key matrix, and T is the target matrix. In detail, by PLU decomposition, the operation matrix M can be decomposed into the relationship expressed in formula (1) and formula (2). Furthermore, by reproducing the process of restoring the operation matrix M to the unit matrix in formula (2) on the target matrix T, the target matrix T can be operated through the matrix decomposition process of the operation matrix M as shown in formula (3), and the public key matrix PK can be obtained.

詳細而言，高斯消去計算系統1包括控制電路10、脈動陣列11及記憶體12。控制電路10可經組態以接收n×m(也就是n列m行)的輸入矩陣R，並將該輸入矩陣R區分為n×n的運算矩陣M及n×(m-n)的目標矩陣T，其中n<m。脈動陣列11耦接控制電路10並接收運算矩陣M。在本實施例中，脈動陣列11可對運算矩陣M進行例如是PLU分解的矩陣分解運算，以將運算矩陣M分解為排序矩陣P、下三角矩陣L及上三角矩陣U。另外，記憶體12可配置有與運算矩陣M相同大小的運算資料區塊，因此，脈動陣列11可以將分解出的下三角矩陣L及上三角矩陣U儲存在記憶體12中與運算矩陣M相同的大小的運算資料區塊中。 In detail, the Gaussian elimination calculation system 1 includes a control circuit 10, a pulse array 11, and a memory 12. The control circuit 10 can be configured to receive an n×m (i.e., n columns and m rows) input matrix R, and divide the input matrix R into an n×n operation matrix M and an n×(m-n) target matrix T, where n<m. The pulse array 11 is coupled to the control circuit 10 and receives the operation matrix M. In this embodiment, the pulse array 11 can perform a matrix decomposition operation such as PLU decomposition on the operation matrix M to decompose the operation matrix M into a sorting matrix P, a lower triangular matrix L, and an upper triangular matrix U. In addition, the memory 12 can be configured with a computing data block of the same size as the computing matrix M. Therefore, the pulse array 11 can store the decomposed lower triangular matrix L and upper triangular matrix U in the computing data block of the same size as the computing matrix M in the memory 12.

在一些實施例中，脈動陣列11可以是利用脈動陣列(systolic array)、脈動網路(systolic network)、脈動線(systolic line)或其他適合的運算架構所實現的。在本實施例接下來的說明中，脈動陣列11將會以基於脈動線架構下進行PLU來分解作為主要說明的範例，但應理解的是上述實施方式不應做為限縮高斯消去計算系統1及脈動陣列11的依據。 In some embodiments, the pulsating array 11 can be implemented using a systolic array, a systolic network, a systolic line, or other suitable computing architectures. In the following description of this embodiment, the pulsating array 11 will be mainly described by decomposing the pulsating array 11 using the PLU based on the systolic line architecture, but it should be understood that the above implementation should not be used as a basis for limiting the Gaussian elimination computing system 1 and the pulsating array 11.

在一些實施例中，當控制電路10接收到n×m的輸入矩陣R時，控制電路10可先將輸入矩陣R拆分為n×n的運算矩陣M及n×(m-n)的目標矩陣T。接著，控制電路10可再將運算矩陣M等分為多個n×p的行群組矩陣，並藉由將行群組矩陣分別輸入至脈動陣列11，因而計算出整體運算矩陣M的反矩陣M^-1。 In some embodiments, when the control circuit 10 receives an n×m input matrix R, the control circuit 10 may first split the input matrix R into an n×n operation matrix M and an n×(mn) target matrix T. Then, the control circuit 10 may further split the operation matrix M into a plurality of n×p row group matrices, and input the row group matrices into the pulse array 11 respectively, thereby calculating the inverse matrix M ^-1 of the overall operation matrix M.

圖2A為本發明實施例一脈動陣列11的電路方塊圖。在一些實施例中，因應於進行具有n×p大小的行群組矩陣的計算，脈動陣列11可以例如具有圖2A所繪示的電路方塊結構。脈動陣列11可例如是接收具有輸入值data_in[0]~data_in[4]的輸入列data_in，並依據外部控制訊號ext_en及控制訊號op_in來產生具有輸出值data_out[0]~data_out[4]的輸出列。如圖2A所示，脈動陣列11具有方陣結構，並具有多個運算胞110A、110B以陣列方式設置，在以下的說明中將以5×5大小的脈動陣列11為例來說明高斯消去計算系統1的操作過程，本領域具通常知識者當可依據不同設計需求對脈動陣列11的尺寸進行適應性地調整。更具體來說，脈動陣列11中具有第一運算胞110A及第二運算胞110B。在脈動陣列11的主對角線(也就是左上至右下的對角線)上設置的是第一運算胞110A，而在脈動陣列11的主對角線之外的位置上設置的則是第二運算胞110B。每個運算胞110可以例如是依據米利型有限狀態機(mealy state machine)所實現的電路結構，其包括有適當的邏輯閘及正反器互相耦接來實現其功能，因而在運算胞110每次被時脈訊號所觸發時，依據接收到的輸入資料值以及其本身所儲存的資料值來產生輸出資料值。 FIG2A is a circuit block diagram of a pulse array 11 according to an embodiment of the present invention. In some embodiments, in response to performing calculations on a row group matrix having a size of n×p, the pulse array 11 may have, for example, a circuit block structure as shown in FIG2A. The pulse array 11 may, for example, receive an input row data_in having input values data_in[0] to data_in[4], and generate an output row having output values data_out[0] to data_out[4] according to an external control signal ext_en and a control signal op_in. As shown in FIG. 2A , the pulse array 11 has a square matrix structure and has a plurality of operation cells 110A and 110B arranged in an array. In the following description, a 5×5 size pulse array 11 is used as an example to illustrate the operation process of the Gaussian elimination calculation system 1. A person skilled in the art can adaptively adjust the size of the pulse array 11 according to different design requirements. More specifically, the pulse array 11 has a first operation cell 110A and a second operation cell 110B. The first operation cell 110A is arranged on the main diagonal line of the pulse array 11 (i.e., the diagonal line from the upper left to the lower right), and the second operation cell 110B is arranged outside the main diagonal line of the pulse array 11. Each operation cell 110 can be, for example, a circuit structure implemented according to a mealy state machine, which includes appropriate logic gates and flip-flops coupled to each other to implement its function, so that each time the operation cell 110 is triggered by a clock signal, it generates an output data value according to the received input data value and the data value stored in itself.

圖2B、2C分別為圖2A中的第二運算胞110B、第一運算胞110A的真值表示意圖。具體來說，第二運算胞110B相較於第一運算胞110A具有相對較簡單的邏輯功能，每個第二運算胞110B可被2位元的控制訊號OP操作具有四種功能之下，其分別為傳遞(PASS)、加法(ADD)、交換(SWAP)及啟動(INIT)的功能。具體來說，當操作在傳遞(PASS)功能下時，第二運算胞110B僅將輸入資料傳遞至輸出端，輸入資料並不影響第二運算胞110B內部所儲存的資料值。當操作在加法(ADD)功能下時，第二運算胞110B會將輸入資料值與儲存資料值加總後輸出，且輸入資料並不影響第二運算胞110B內部所儲存的資料值。當操作在交換(SWAP)功能下時，第二運算胞110B會將輸入資料值寫入內部，並將儲存資料值輸出。當操作在啟動(INIT)功能下時，第二運算胞110B會將輸入資料值寫入內部，並輸出資料值0。相較於在一些實施方式中，傳遞(PASS)、加法(ADD)及交換(SWAP)功能是以兩位元的控制訊號所指示的，而啟動(INIT)功能則是以額外的另一個位元所指示，第二運算胞110B的控制訊號以較為簡潔的方式進行編碼，因此不論是在訊號走線或是在硬體結構上的需求，皆可為高斯消去計算系統1帶來有效的面積改善效果。 FIG2B and FIG2C are respectively diagrams of the truth value representation of the second operation cell 110B and the first operation cell 110A in FIG2A. Specifically, the second operation cell 110B has a relatively simpler logic function than the first operation cell 110A. Each second operation cell 110B can be operated by a 2-bit control signal OP to have four functions, namely, pass (PASS), addition (ADD), swap (SWAP) and start (INIT). Specifically, when operating in the pass (PASS) function, the second operation cell 110B only passes the input data to the output end, and the input data does not affect the data value stored inside the second operation cell 110B. When operating in the addition (ADD) function, the second operation cell 110B will add up the input data value and the stored data value and output it, and the input data will not affect the data value stored in the second operation cell 110B. When operating in the swap (SWAP) function, the second operation cell 110B will write the input data value into the inside and output the stored data value. When operating in the startup (INIT) function, the second operation cell 110B will write the input data value into the inside and output the data value 0. Compared to some implementations, where the pass (PASS), addition (ADD) and swap (SWAP) functions are indicated by a two-bit control signal, and the start (INIT) function is indicated by an additional bit, the control signal of the second operation cell 110B is encoded in a more concise manner, so that no matter in terms of signal routing or hardware structure requirements, it can bring effective area improvement effects to the Gaussian elimination calculation system 1.

對於第一運算胞110A而言，第一運算胞110A可操作在分解運算模式之下或重現運算模式之下。如圖2C所示，當操作在分解運算模式之下時，第一運算胞110A會依據接收到的控制訊號op_in及外部致能訊號ext_en的控制，來產生輸出控制訊號OPA。當外部致能訊號ext_en例如是在邏輯值1的情況下，第一運算胞110A可被控制訊號OP控制在傳遞(PASS)、交換(SWAP)及啟動(INIT)的功能之下，並分別輸出對應於傳遞(PASS)、交換(SWAP)及啟動(INIT)的控制訊號op_out。另外，當外部致能訊號ext_en例如是在邏輯值0的情況下，第一運算胞110A可依據輸入資料值及本身所儲存的資料值操作在傳遞(PASS)、交換(SWAP)及加法(ADD)的功能之下。另外，當第一運算胞110A操作在重現運算模式之下時，則第一運算胞110A則可與第二運算胞110B具有相同的操作特性，因而可依據圖2B所示的真值表，依據接收外部所給入的控制訊號op_in來操作在傳遞(PASS)、加法(ADD)、交換(SWAP)及啟動(INIT)的功能之下。 For the first operation cell 110A, the first operation cell 110A can be operated in the decomposition operation mode or the reproduction operation mode. As shown in FIG2C, when operating in the decomposition operation mode, the first operation cell 110A generates an output control signal OPA according to the control of the received control signal op_in and the external enable signal ext_en. When the external enable signal ext_en is, for example, at a logical value of 1, the first operation cell 110A can be controlled by the control signal OP under the functions of passing (PASS), switching (SWAP) and starting (INIT), and outputs the control signal op_out corresponding to passing (PASS), switching (SWAP) and starting (INIT), respectively. In addition, when the external enable signal ext_en is, for example, at a logical value of 0, the first operation cell 110A can operate under the functions of pass (PASS), swap (SWAP) and addition (ADD) according to the input data value and the data value stored in itself. In addition, when the first operation cell 110A operates under the reproduction operation mode, the first operation cell 110A can have the same operating characteristics as the second operation cell 110B, and thus can operate under the functions of pass (PASS), addition (ADD), swap (SWAP) and start (INIT) according to the truth table shown in FIG. 2B and the control signal op_in given by the external.

回到圖2A，以脈動陣列11的整體操作來說，行群組矩陣中各列的值可由脈動陣列11的上方被分別輸入至脈動陣列11第一列的各個運算胞110。脈動陣列11的第一列則依據接收到的外部致能訊號ext_en[0]及控制訊號op_in[0]來操作在對應的狀態下，以進行運算並輸出資料值至第二列的脈動陣列11。同理，第二列的脈動陣列亦會依據外部致能訊號ext_en[1]及控制訊號op_in[1]來進行運算並輸出資料值至第三列的脈動陣列11，依此類推。 Returning to FIG. 2A , for the overall operation of the pulse array 11, the values of each row in the row group matrix can be input from the top of the pulse array 11 to each operation cell 110 in the first row of the pulse array 11. The first row of the pulse array 11 operates in a corresponding state according to the received external enable signal ext_en[0] and control signal op_in[0] to perform calculations and output data values to the second row of the pulse array 11. Similarly, the second row of the pulse array will also perform calculations according to the external enable signal ext_en[1] and control signal op_in[1] and output data values to the third row of the pulse array 11, and so on.

在圖2A中雖然繪示的是5×5大小的脈動陣列11，但脈動陣列11的尺寸不以此限，只要脈動陣列11的行數大於或等於行群組矩陣的行數即可。 Although FIG. 2A shows a 5×5 size pulse array 11, the size of the pulse array 11 is not limited to this, as long as the number of rows of the pulse array 11 is greater than or equal to the number of rows of the row group matrix.

圖2B、圖2C中繪示並說明了第一運算胞110A及第二運算胞110B的真值表，藉以說明第一運算胞110A及第二運算胞110B的個別操作。但如圖2A所示，所有的運算胞110除了行列之間的耦接關係之外，對角線上的第一運算胞110A還會互相串聯連接，這樣的串聯連接關係是用來配置第一運算胞110A的早期放棄功能。 FIG2B and FIG2C illustrate and explain the truth tables of the first operation cell 110A and the second operation cell 110B, so as to explain the individual operations of the first operation cell 110A and the second operation cell 110B. However, as shown in FIG2A, in addition to the coupling relationship between rows and columns, the first operation cells 110A on the diagonal are also connected in series. Such a series connection relationship is used to configure the early abandonment function of the first operation cell 110A.

圖2D中繪示了圖2A中第一運算胞110A中的偵錯電路1100的電路示意圖。在脈動陣列31中，每個第一運算胞110A的偵錯電路1100會連接到前一級的第一運算胞110A的偵錯電路1100，以接收前一級第一運算胞110A的偵錯判斷結果fail_in。接收的偵錯判斷結果fail_in會與反相後的內部儲存資料被或閘1101進行運算，並經過被控制訊號check_en所控制多工器1102的選擇後輸出。另外，在多工器1102的輸出端則會耦接有正反器1103，用以儲存偵錯判斷結果fail_out，並依據時脈訊號的驅動來輸出偵錯判斷結果fail_out至下一級的第一運算胞110A。因此，在實際操作上，所有的第一運算胞110A儲存資料會透過串聯連接的偵錯電路1100來被檢查，只要任一個第一運算胞110A內部所儲存的資料為0就會導致該級第一運算胞110A輸出的偵錯判斷結果fail_out被切換至邏輯值1，而邏輯值為1的偵錯判斷結果fail_out 會經過偵錯電路1100中的或閘1101不斷被傳遞下去，直到第一運算胞110A的串列尾端。 FIG2D shows a circuit diagram of the debug circuit 1100 in the first operation cell 110A in FIG2A. In the pulse array 31, the debug circuit 1100 of each first operation cell 110A is connected to the debug circuit 1100 of the first operation cell 110A of the previous stage to receive the debug judgment result fail_in of the first operation cell 110A of the previous stage. The received debug judgment result fail_in is operated by the OR gate 1101 with the inverted internal storage data, and is output after being selected by the multiplexer 1102 controlled by the control signal check_en. In addition, a flip-flop 1103 is coupled to the output end of the multiplexer 1102 to store the error detection result fail_out and output the error detection result fail_out to the first operation cell 110A of the next stage according to the driving of the clock signal. Therefore, in actual operation, all the data stored in the first operation cells 110A will be checked through the serially connected debugging circuit 1100. As long as the data stored in any first operation cell 110A is 0, the debugging result fail_out output by the first operation cell 110A at that level will be switched to the logical value 1, and the debugging result fail_out with a logical value of 1 will be continuously transmitted through the OR gate 1101 in the debugging circuit 1100 until the end of the series of the first operation cells 110A.

圖3A為本發明實施例一高斯消去計算系統3的方塊示意圖。高斯消去計算系統3包括控制電路30、脈動陣列31及記憶體32。高斯消去計算系統3相似於高斯消去計算系統1，因此關於脈動陣列31的結構及個別操作細節請參考圖2A~圖2D中關於脈動陣列11的說明內容，於此不另贅述。 FIG3A is a block diagram of a Gaussian elimination calculation system 3 of the first embodiment of the present invention. The Gaussian elimination calculation system 3 includes a control circuit 30, a pulse array 31, and a memory 32. The Gaussian elimination calculation system 3 is similar to the Gaussian elimination calculation system 1, so for the structure and individual operation details of the pulse array 31, please refer to the description of the pulse array 11 in FIG2A to FIG2D, which will not be elaborated here.

詳細而言，控制電路30中包含有第一控制電路300及第二控制電路301。第一控制電路300是用來產生外部致能訊號ext_en及外部控制訊號ext_op。第二控制電路301可依據脈動陣列31記錄脈動陣列31所輸出的控制訊號op_out並產生內部控制訊號int_op。控制電路30則可透過多工器MUX2的選擇，依據外部致能訊號ext_en來由外部控制訊號ext_op或內部控制訊號int_op進行選擇以產生控制訊號op_in。透過多工器MUX2所選擇出的控制訊號op_in會被輸入至脈動陣列31，藉以控制脈動陣列31中各列的操作。 In detail, the control circuit 30 includes a first control circuit 300 and a second control circuit 301. The first control circuit 300 is used to generate an external enable signal ext_en and an external control signal ext_op. The second control circuit 301 can record the control signal op_out output by the pulse array 31 and generate an internal control signal int_op according to the pulse array 31. The control circuit 30 can select the external control signal ext_op or the internal control signal int_op according to the external enable signal ext_en through the selection of the multiplexer MUX2 to generate the control signal op_in. The control signal op_in selected by the multiplexer MUX2 will be input to the pulse array 31 to control the operation of each row in the pulse array 31.

第二控制電路301包括排序電路302、記憶體303及多工器MUX1。第二控制電路301可利用排序電路302來儲存排序矩陣P，其記錄了脈動陣列31在對行群組矩陣進行矩陣分解運算時各列的順序關係。另外，第二控制電路301則可利用記憶體303及多工器MUX1來記錄脈動陣列31輸出的控制訊號op_out，並產生內部控制訊號int_op至多工器MUX2。在一些實施例中，第二控制電路301更新記憶體303的操作可以是依照時序而對記憶體303進行的部分更新。關於部分更新內容，將於後面篇幅搭配電路結構以及操作內容詳述。 The second control circuit 301 includes a sorting circuit 302, a memory 303 and a multiplexer MUX1. The second control circuit 301 can use the sorting circuit 302 to store the sorting matrix P, which records the order relationship of each column when the pulse array 31 performs matrix decomposition operation on the row group matrix. In addition, the second control circuit 301 can use the memory 303 and the multiplexer MUX1 to record the control signal op_out output by the pulse array 31, and generate an internal control signal int_op to the multiplexer MUX2. In some embodiments, the operation of the second control circuit 301 to update the memory 303 can be a partial update of the memory 303 according to the timing. The partial update content will be described in detail in the following section with the circuit structure and operation content.

在一實施例中，當脈動陣列31在進行例如是行群組矩陣的分解運算時，脈動陣列31可被操作在三角運作模式之下，也就是脈動陣列31的對角線下方的第二運算胞110B可被禁能或關閉，僅保留脈動陣列31對角線上的第一運算胞110A及對角線上方的第二運算胞110B開啟而進行運算。在此三角運作模式下，第一運算胞110A會被操作在分解運算模式之下，也就是第一運算胞110A會依據如圖2C所繪示的真值表進行操作，並且產生的控制訊號將會被用來控制同列中被開啟的第二運算胞110B。最後，脈動陣列31將由對角線上的第一運算胞110A輸出控制訊號op_out。 In one embodiment, when the pulse array 31 is performing a decomposition operation such as a row group matrix, the pulse array 31 may be operated in a triangular operation mode, that is, the second operation cell 110B below the diagonal of the pulse array 31 may be disabled or turned off, and only the first operation cell 110A on the diagonal of the pulse array 31 and the second operation cell 110B above the diagonal are left turned on for operation. In this triangular operation mode, the first operation cell 110A is operated in a decomposition operation mode, that is, the first operation cell 110A is operated according to the truth table shown in FIG. 2C, and the generated control signal will be used to control the turned-on second operation cell 110B in the same column. Finally, the pulse array 31 will output the control signal op_out from the first operation cell 110A on the diagonal line.

在一實施例中，當脈動陣列31在進行其他操作的時候，例如是進行重現操作時，脈動陣列31可被操作在方陣運作模式之下，也就是脈動陣列31中的所有運算胞110皆會被開啟而進行運算。在此方陣運作模式下，第一運算胞110A會被操作在重現運算模式之下，也就是第一運算胞110A會依據如圖2B所繪示的真值表進行操作，脈動陣列31中整列的運算胞110都會依據接收的控制訊號op_in來進行操作。最後脈動陣列31可由脈動陣列31上的最後一列運算胞110產生輸出資料訊號data_out。 In one embodiment, when the pulse array 31 is performing other operations, such as reproducing operations, the pulse array 31 can be operated in a matrix operation mode, that is, all the operation cells 110 in the pulse array 31 are turned on to perform operations. In this matrix operation mode, the first operation cell 110A is operated in a reproducing operation mode, that is, the first operation cell 110A is operated according to the truth table shown in FIG. 2B, and the operation cells 110 in the entire row of the pulse array 31 are operated according to the received control signal op_in. Finally, the pulse array 31 can generate an output data signal data_out by the last row of operation cells 110 on the pulse array 31.

多工器33耦接於脈動陣列31，用以接收輸出控制訊號op_out及輸出資料訊號data_out。多工器33依據脈動陣列31是操作在三角運作模式或方陣運作模式來選擇相對應的輸出，並將其傳遞至記憶體32。多工器34耦接於第二控制電路301。多工器34用以接收控制訊號op_out及記憶體32所讀出的輸入列data_in，並依據計算需求將脈動陣列31的運算結果或記憶體32所儲存的資料寫入至第二控制電路303的記憶體303中。 The multiplexer 33 is coupled to the pulse array 31 to receive the output control signal op_out and the output data signal data_out. The multiplexer 33 selects the corresponding output according to whether the pulse array 31 is operating in a triangle operation mode or a square operation mode, and transmits it to the memory 32. The multiplexer 34 is coupled to the second control circuit 301. The multiplexer 34 is used to receive the control signal op_out and the input column data_in read from the memory 32, and write the calculation result of the pulse array 31 or the data stored in the memory 32 into the memory 303 of the second control circuit 303 according to the calculation requirements.

脈動陣列31的輸出經過多工器MUX3選擇之後被寫入至記憶體32中。而多工器MUX3的結構與功能類似於多工器MUX1，其可以進行按位元的部分資料寫入。 The output of the pulse array 31 is written into the memory 32 after being selected by the multiplexer MUX3. The structure and function of the multiplexer MUX3 are similar to those of the multiplexer MUX1, and it can write partial data bit by bit.

圖3B中繪示了圖3A中的第一控制電路300的電路方塊圖。圖3C中繪示了圖3A中排序電路302的電路方塊圖。圖3D中繪示了圖3A中多工器MUX1、MUX3的電路方塊圖。關於圖3B、圖3C、圖3D的結構與操作將於後續篇幅中配合整體高斯消去計算系統3的操作進行說明。 FIG3B shows a circuit block diagram of the first control circuit 300 in FIG3A. FIG3C shows a circuit block diagram of the sorting circuit 302 in FIG3A. FIG3D shows a circuit block diagram of the multiplexers MUX1 and MUX3 in FIG3A. The structures and operations of FIG3B, FIG3C, and FIG3D will be described in the following sections in conjunction with the operation of the overall Gaussian elimination calculation system 3.

圖4A~圖4Z用來說明一脈動陣列31對第一行群組矩陣CB1進行矩陣分解運算的操作流程。在圖4A中，尺寸為15×25的輸入矩陣R首先被輸入至控制電路30並儲存至記憶體32中。為了進行矩陣分解運算並計算公鑰矩陣PK，控制電路30首先將輸入矩陣R分解為15×15的運算矩陣M及15×10的目標矩陣T。在圖4A中，為了圖示簡潔，輸入矩陣R中所有的矩陣單元都被以圓圈符號表示○，但在實際應用中，輸入矩陣R中的矩陣單元則可為相同或不同的數值。 Figures 4A to 4Z are used to illustrate the operation flow of a pulse array 31 performing matrix decomposition operation on the first row group matrix CB1. In Figure 4A, the input matrix R of size 15×25 is first input to the control circuit 30 and stored in the memory 32. In order to perform matrix decomposition operation and calculate the public key matrix PK, the control circuit 30 first decomposes the input matrix R into a 15×15 operation matrix M and a 15×10 target matrix T. In Figure 4A, for the sake of simplicity, all matrix cells in the input matrix R are represented by a circle symbol ○, but in actual applications, the matrix cells in the input matrix R can be the same or different values.

請參考圖3B，圖3B中繪示了圖3A中的第一控制電路 300的電路方塊圖，第一控制電路300可依據操作須求來產生外部致能訊號ext_en及外部控制訊號ext_op。在此實施例中，第一控制電路300可用來控制例如是5×5大小的脈動陣列31。在圖3B的左側，串聯的移位暫存器FF11~FF19將控制器ext_contr所提供的致能訊號延遲，使移位暫存器FF11~FF19在每級的輸出兼具有兩個時脈周期的延遲。另外，控制器ext_contr亦可透過或閘在適當的時脈週期中來將提供到脈動陣列31任一列的外部致能訊號切換為致能。 Please refer to FIG. 3B , which shows a circuit block diagram of the first control circuit 300 in FIG. 3A . The first control circuit 300 can generate an external enable signal ext_en and an external control signal ext_op according to the operation requirements. In this embodiment, the first control circuit 300 can be used to control a pulse array 31 of, for example, a size of 5×5. On the left side of FIG. 3B , the serially connected shift registers FF11~FF19 delay the enable signal provided by the controller ext_contr, so that the output of the shift registers FF11~FF19 at each stage has a delay of two clock cycles. In addition, the controller ext_contr can also switch the external enable signal provided to any column of the pulse array 31 to enable through an OR gate in an appropriate clock cycle.

另一方面，在圖3B的右側，也繪示了串聯連接的移位暫存器FF21~FF29，用來提供控制訊號ext_op至脈動陣列31。相似地，移位暫存器FF21~FF29將外部控制訊號ext_op所提供的控制訊號延遲，使移位暫存器FF21~FF29的每級輸出具有兩個時脈週期的延遲。 On the other hand, on the right side of FIG. 3B , serially connected shift registers FF21 to FF29 are also shown for providing a control signal ext_op to the pulse array 31. Similarly, the shift registers FF21 to FF29 delay the control signal provided by the external control signal ext_op, so that each output of the shift registers FF21 to FF29 has a delay of two clock cycles.

在圖4B中，控制電路30會將運算矩陣M等分成多個行群組矩陣。在本實施例中，15×15的運算矩陣M被等分為5×15的三個行群組矩陣，而第一行群組矩陣會首先被提供至脈動陣列31進行運算。更具體來說，第一行群組矩陣中各列的值會依時序被提供至脈動陣列31的第一列，使脈動陣列31的各列依序進行運算並將運算結果傳遞至脈動陣列31的下一列。另外，在進行矩陣分解運算的脈動陣列31則會被控制在三角運作模式下進行操作，也就是僅有對角線上的第一運算胞110A及對角線上方的第二運算胞110B被開啟，對角線下方的第二運算胞110B則被關閉。 In FIG4B , the control circuit 30 divides the operation matrix M into a plurality of row group matrices. In this embodiment, the 15×15 operation matrix M is divided into three 5×15 row group matrices, and the first row group matrix is first provided to the pulse array 31 for operation. More specifically, the values of each column in the first row group matrix are provided to the first column of the pulse array 31 in time sequence, so that each column of the pulse array 31 is operated in sequence and the operation result is transmitted to the next column of the pulse array 31. In addition, the pulse array 31 performing the matrix decomposition operation will be controlled to operate in the triangular operation mode, that is, only the first operation cell 110A on the diagonal and the second operation cell 110B above the diagonal are turned on, and the second operation cell 110B below the diagonal is turned off.

詳細而言，圖4B的左側繪示的是運算矩陣M中的第一行群組矩陣，也就是運算矩陣M的第一行至第五行。圖4B的中間繪示的是脈動陣列31各列的運算胞110所接收、儲存及輸出的值。在圖4B中，第一行群組矩陣的第一列的值[0 1 0 1 1]會首先被分別輸入至脈動陣列31的第一列的各個運算胞110中。此時，第一列運算胞110所對應的外部致能訊號ext_en會致能，使第一列運算胞110接收到外部所提供的控制訊號op_in。因此，第一列運算胞110會被外部控制訊號op_in控制在啟動功能之下。 In detail, the left side of FIG. 4B shows the first row group matrix in the operation matrix M, that is, the first to fifth rows of the operation matrix M. The middle of FIG. 4B shows the values received, stored and output by the operation cells 110 of each row of the pulse array 31. In FIG. 4B, the value [0 1 0 1 1] of the first row of the first row group matrix will first be input into each operation cell 110 of the first row of the pulse array 31. At this time, the external enable signal ext_en corresponding to the first row operation cell 110 will be enabled, so that the first row operation cell 110 receives the control signal op_in provided by the outside. Therefore, the first row operation cell 110 will be controlled by the external control signal op_in under the activation function.

由於第一列運算胞110被控制在啟動功能之下，參考圖2B、圖2C的真值表可知，當運算胞110A、110B被操作在啟動功能之下時，運算胞110A、110B會將輸入值儲存至其內部，並輸出邏輯值0。 Since the first row of operation cells 110 is controlled under the activation function, referring to the truth tables of FIG. 2B and FIG. 2C, when operation cells 110A and 110B are operated under the activation function, operation cells 110A and 110B will store the input value inside and output the logical value 0.

另外，圖4B的右側則是排序矩陣的記錄過程。在此實施例中，排序矩陣為1×15的矩陣，分別對應於運算矩陣M中第一列至第十五列的對角線單元的排序或交換結果。詳細而言，在進行矩陣分解運算時，例如是PLU分解運算，運算矩陣M的各列會被適當地交換，使運算矩陣M中對角線上為滿秩，也就是運算矩陣M中對角線上的單元皆為數值1。由於運算矩陣M的第一行群組矩陣中包含了運算矩陣M的第一列至第五列的對角線單元，因此，在進行第一行群組矩陣的運算時，僅會記錄運算矩陣M的第一列至第五列的排序結果，並將其記錄在排序矩陣中。在本實施例中，回應於邏輯值被切換為1的外部致能訊號ext_en，其代表著當第一列的運算胞110在第一個時脈週期中接收到輸入值進行運算時，排序矩陣在對應於第一列的單元內會預先被寫入地址為1，代表脈動陣列31中第一列的運算胞110正記錄著第一行群組矩陣的第一列的值。 In addition, the right side of FIG. 4B is the recording process of the sorting matrix. In this embodiment, the sorting matrix is a 1×15 matrix, which corresponds to the sorting or swapping results of the diagonal cells of the first to fifteenth columns in the operation matrix M. In detail, when performing a matrix decomposition operation, such as a PLU decomposition operation, each column of the operation matrix M will be properly swapped so that the diagonal of the operation matrix M is full rank, that is, the cells on the diagonal of the operation matrix M are all 1. Since the first row group matrix of the operation matrix M includes the diagonal cells of the first to fifth columns of the operation matrix M, when performing the operation of the first row group matrix, only the sorting results of the first to fifth columns of the operation matrix M are recorded and recorded in the sorting matrix. In this embodiment, in response to the external enable signal ext_en whose logic value is switched to 1, it means that when the operation cell 110 of the first column receives the input value for operation in the first clock cycle, the address 1 is pre-written in the cell corresponding to the first column of the sorting matrix, indicating that the operation cell 110 of the first column in the pulse array 31 is recording the value of the first column of the first row group matrix.

在圖4C中，提供至脈動陣列31第一列的外部致能訊號ext_en會被設定為邏輯值0，使脈動陣列31第一列的運算胞110依據輸入值及其儲存的值進行運算。更具體來說，在脈動陣列31中第一列的第一運算胞110A會依據輸入值及其儲存值來產生控制訊號op_out，而其產生的控制訊號則會被提供至第一列中其餘被開啟的第二運算胞110B。 In FIG. 4C , the external enable signal ext_en provided to the first row of the pulse array 31 is set to a logical value of 0, so that the operation cell 110 in the first row of the pulse array 31 performs operations according to the input value and its stored value. More specifically, the first operation cell 110A in the first row of the pulse array 31 generates a control signal op_out according to the input value and its stored value, and the generated control signal is provided to the remaining second operation cells 110B in the first row that are turned on.

在本實施例中，第一行群組中第二列的值[0 0 1 1 0]會被提供至脈動陣列31中第一列的運算胞110。第一列中的第一運算胞110A此時依據其接收到的邏輯值0及其儲存的邏輯值0判斷該列的運算胞110A需進行傳遞(PASS)功能，因而將接收到的邏輯值0作為對應於傳遞(PASS)功能的控制訊號op_out而輸出，並將對應於傳遞(PASS)功能的控制訊號OPA提供給同一列中的其餘第二運算胞110B。使第一列的運算胞110在下一個時脈週期中將接收到的值[0 0 1 1 0]直接輸入給第二列的運算胞110。 In this embodiment, the value [0 0 1 1 0] of the second column in the first row group is provided to the operation cell 110 of the first column in the pulse array 31. The first operation cell 110A in the first column determines that the operation cell 110A in the column needs to perform the pass (PASS) function based on the received logic value 0 and the stored logic value 0, and thus outputs the received logic value 0 as the control signal op_out corresponding to the pass (PASS) function, and provides the control signal OPA corresponding to the pass (PASS) function to the remaining second operation cells 110B in the same column. The operation cell 110 in the first column directly inputs the received value [0 0 1 1 0] to the operation cell 110 in the second column in the next clock cycle.

在圖4D中，第一行群組中第三列的值[1 1 1 1 1]會被提供至脈動陣列31中第一列的運算胞110，同時第二列的運算胞110則會接收到第一列運算胞110的輸出值(也就是第一行群組中第二列的值[0 0 1 1 0])。對脈動陣列31中第一列的第一運算胞110A而言，其儲存的值為0並接收到為1的值，因而產生對應於交換(SWAP)的控制訊號OPA，藉以控制同一列中的其他第二運算胞110B在下一個時脈週期中進行列交換。也就是說，在下一個時脈週期中，第一列的運算胞110將以接收到的第一行群組中第三列來交換其所儲存的第一行群組中第一列。因此，第一列的運算胞110在下一個時脈週期中將儲存第一行群組中第三列的值，並將原本儲存的第一行群組矩陣的第一列值輸出至第二列的運算胞110。 In FIG. 4D , the value [1 1 1 1 1] of the third column in the first row group is provided to the operation cell 110 in the first row of the pulse array 31, and the operation cell 110 in the second row receives the output value of the operation cell 110 in the first row (i.e., the value [0 0 1 1 0] of the second column in the first row group). For the first operation cell 110A in the first row of the pulse array 31, the value stored in it is 0 and the value received is 1, thus generating a control signal OPA corresponding to the swap (SWAP), thereby controlling the other second operation cells 110B in the same row to perform row-column swapping in the next clock cycle. That is, in the next clock cycle, the operation cell 110 in the first row will swap the first column in the first row group it stores with the third column in the first row group it receives. Therefore, the operation cell 110 in the first row will store the value of the third column in the first row group in the next clock cycle, and output the first column value of the first row group matrix originally stored to the operation cell 110 in the second row.

回應於第一列的運算胞110在這個週期所產生的交換決定，數值3會被儲存至排序矩陣的第一個單元中，代表第一行群組中第三列的值被儲存在第一列的運算胞110中。同時，由於第二列的運算胞110亦在此接收到邏輯值為1的外部致能訊號ext_en，此週期數3也會被記錄在排序矩陣對應的第二個單元中。 In response to the swap decision made by the operation cell 110 in the first row in this cycle, the value 3 is stored in the first cell of the sorting matrix, indicating that the value of the third column in the first row group is stored in the operation cell 110 in the first row. At the same time, since the operation cell 110 in the second row also receives the external enable signal ext_en with a logical value of 1, the cycle number 3 is also recorded in the corresponding second cell of the sorting matrix.

詳細而言，關於排序矩陣P的記錄過程請參考圖3C所繪示的排序電路302。排序電路302中包含了位址記錄器addr1~addr5以及位址記憶體304所串接成的環形結構。具體來說，位址記錄器addr1~addr5分別對應於脈動陣列31的五列。在脈動陣列31每次開始進行矩陣分解運算時，位址記憶體304可先將其中儲存的位址分別寫入位址記錄器addr1~addr5。接著在後續進行矩陣分解運算的過程中，當脈動陣列31任一列被啟動(INIT)或是進行交換(SWAP)時，都會導致該列的第一運算胞110A傳送致能的觸發訊號trig至對應的位址記錄器，並在致能訊號pivot_en 同樣為致能的情況下將目前地址寫入正反器FF中，完成各列位址的記錄。在一些實施例中，目前地址就是目前時脈週期的記數。多個比較器分別耦接於位址記錄器addr1~addr5，用以產生控制訊號perm_op[0]~perm_op[4]。比較器將比較目前地址與儲存在位址記錄器中的位址。當比較器判斷目前地址到達位址記錄器所記錄的位址時，比較器會輸出致能的控制訊號，以控制運算陣列31中對應列來進行交換(SWAP)。 For details, please refer to the sorting circuit 302 shown in FIG. 3C for the recording process of the sorting matrix P. The sorting circuit 302 includes a ring structure formed by serially connecting address recorders addr1 to addr5 and an address memory 304. Specifically, the address recorders addr1 to addr5 correspond to the five columns of the pulse array 31 respectively. When the pulse array 31 starts to perform matrix decomposition operation each time, the address memory 304 can first write the addresses stored therein into the address recorders addr1 to addr5 respectively. Then, in the subsequent process of matrix decomposition operation, when any row of the pulse array 31 is activated (INIT) or swapped (SWAP), the first operation cell 110A of the row will transmit the enabled trigger signal trig to the corresponding address recorder, and when the enable signal pivot_en is also enabled, the current address will be written into the flip-flop FF to complete the recording of each row address. In some embodiments, the current address is the count of the current clock cycle. A plurality of comparators are respectively coupled to the address recorders addr1~addr5 to generate control signals perm_op[0]~perm_op[4]. The comparator will compare the current address with the address stored in the address recorder. When the comparator determines that the current address reaches the address recorded by the address recorder, the comparator will output an enable control signal to control the corresponding row in the operation array 31 to perform a swap (SWAP).

最後，第一運算胞110A在各個週期中所運算輸出的值則會被記錄在控制電路30的內部記憶體中，並同時更新記憶體32。具體來說，在這個時脈週期中，控制電路30的內部記憶體及記憶體32的第一列將會被更新。由於在此時脈週期中僅有第一列的第一運算胞110A輸出了邏輯值0，因此控制電路30的內部記憶體及記憶體32中，兩者皆只有第一列的第一個單元被寫入邏輯值0。 Finally, the values calculated and output by the first operation cell 110A in each cycle will be recorded in the internal memory of the control circuit 30, and the memory 32 will be updated at the same time. Specifically, in this clock cycle, the internal memory of the control circuit 30 and the first row of the memory 32 will be updated. Since only the first operation cell 110A in the first row outputs the logic value 0 in this clock cycle, only the first cell in the first row of the internal memory of the control circuit 30 and the memory 32 is written with the logic value 0.

在圖4E中，在此時脈週期中，第一列中的第一運算胞110A會產生出對應於傳遞(PASS)的控制訊號op_out。而第二列的運算胞110在被啟動之後儲存了行群組矩陣中第二列的後面四個單元[0 1 1 0]，並同時接收到了行群組矩陣第一列的值[0 1 0 1 1]。因此，第二列中的第一運算胞110A會產生出對應於交換(SWAP)的控制訊號op_out，並據此更新排序矩陣。最後，第一列的第一運算胞110A所輸出的邏輯值0，也就是對應於傳遞(PASS)的控制訊號op_out會被記錄在控制電路30的內部記憶體及記憶體32中的第二列的第一個單元。 In FIG. 4E , in this clock cycle, the first operation cell 110A in the first row generates a control signal op_out corresponding to pass (PASS). After being activated, the operation cell 110 in the second row stores the last four cells [0 1 1 0] of the second row in the row group matrix, and simultaneously receives the value [0 1 0 1 1] of the first row of the row group matrix. Therefore, the first operation cell 110A in the second row generates a control signal op_out corresponding to swap (SWAP), and updates the sorting matrix accordingly. Finally, the logic value 0 output by the first operation cell 110A in the first row, that is, the control signal op_out corresponding to pass (PASS), will be recorded in the internal memory of the control circuit 30 and the first cell of the second row in the memory 32.

在圖4F中，脈動陣列31中第一列至第三列的第一運算胞110A分別判斷出要進行傳遞(PASS)、加法(ADD)及啟動(INIT)的功能。在這裡要說明的是脈動陣列31中第二列的第一運算胞110A操作，其產生出該列要進行加法(ADD)功能的判斷結果，因此在下一個時脈週期中，該列的運算胞110將把接收值與儲存值加總後輸出。但實際上，該列中的第一運算胞110A會將接收到的邏輯值1作為對應於加法(ADD)功能的控制訊號op_out輸出。 In FIG. 4F , the first operation cells 110A in the first to third rows of the pulse array 31 respectively determine the functions of passing (PASS), adding (ADD) and starting (INIT). Here, the first operation cell 110A in the second row of the pulse array 31 is explained, which generates the judgment result that the row is to perform the addition (ADD) function. Therefore, in the next clock cycle, the operation cell 110 in the row will add the received value and the stored value and output it. But in fact, the first operation cell 110A in the row will output the received logic value 1 as the control signal op_out corresponding to the addition (ADD) function.

整體來說，在接下來圖4F~4P中，行群組矩陣中第五列至最後一列的資料值將依時序輸入至脈動陣列31的第一列，而脈動陣列31的各列則在啟動之後依據接收資料來進行計算，脈動陣列31所輸出的值將依序寫入控制電路30的內部記憶體及記憶體32中的第三列至倒數第三列。 In general, in the following Figures 4F to 4P, the data values from the fifth to the last row in the row group matrix will be input into the first row of the pulse array 31 in time sequence, and each row of the pulse array 31 will be calculated according to the received data after activation, and the values output by the pulse array 31 will be written into the internal memory of the control circuit 30 and the third to the third to last row in the memory 32 in sequence.

另外，脈動陣列31的第三列至第五列的第一運算胞110A則是分別在圖4F、圖4H、圖4K的時脈週期中存入資料值1，因此排序矩陣在第三至第五個單元中分別記錄了5、7、10的值。 In addition, the first operation cells 110A in the third to fifth rows of the pulse array 31 store data value 1 in the clock cycles of Figure 4F, Figure 4H, and Figure 4K, respectively, so the sorting matrix records the values 5, 7, and 10 in the third to fifth cells, respectively.

在圖4Q中，在行群組矩陣的所有列資料值都被輸入至脈動陣列31之後，致能的外部致能訊號ext_en會被提供給脈動陣列31的第一列，且對應於交換(SWAP)功能的控制訊號op_in也會從外部給入，使脈動陣列31的第一列將其儲存的列資料值釋放並傳遞給脈動陣列31的下一列。並且，在接下來圖4R~圖4Z的週期中，透過提供適當的外部致能訊號ext_en及對應功能的控制訊號op_in，可以使各列的運算胞110逐列釋放其儲存的值。 In FIG. 4Q , after all row data values of the row group matrix are input into the pulse array 31, the external enable signal ext_en is provided to the first row of the pulse array 31, and the control signal op_in corresponding to the SWAP function is also input from the outside, so that the first row of the pulse array 31 releases the row data value stored therein and transmits it to the next row of the pulse array 31. Moreover, in the following cycles of FIG. 4R to FIG. 4Z , by providing appropriate external enable signals ext_en and corresponding control signals op_in, the operation cells 110 of each row can release their stored values row by row.

另外，在圖4Q~圖4U中，控制訊號check_en將會依序提供給各列中的第一運算胞110A，參考圖2D，依序致能的控制訊號check_en將會使各列的第一運算胞110A逐級檢查其儲存的值是否為1。在本實施例中，當一個第一運算胞110A所儲存的值為0的話，該第一運算胞110A將會產生一個終止訊號，終止該高斯消去計算系統3對於目前所接收的輸入矩陣R的計算，在運算實現早期放棄，並請求更新輸入矩陣R，以基於下一個輸入矩陣R來計算公鑰。因此，只有在所有的第一運算胞110A儲存的值皆為非0的情況下，高斯消去計算系統3的控制電路30才會進行下一行群組矩陣的計算，因而避免浪費無效的計算時間。 In addition, in FIG. 4Q to FIG. 4U , the control signal check_en will be sequentially provided to the first operation cell 110A in each row. Referring to FIG. 2D , the sequentially enabled control signal check_en will cause the first operation cell 110A in each row to check whether the stored value is 1 step by step. In this embodiment, when the value stored in a first operation cell 110A is 0, the first operation cell 110A will generate a termination signal to terminate the calculation of the input matrix R currently received by the Gaussian elimination calculation system 3, give up in the early stage of the calculation, and request to update the input matrix R to calculate the public key based on the next input matrix R. Therefore, only when the values stored in all the first operation cells 110A are non-zero, the control circuit 30 of the Gaussian elimination calculation system 3 will calculate the next row of the group matrix, thereby avoiding wasting invalid calculation time.

另一方面，請參考圖4R、圖4S，當脈動陣列31完成控制電路30的內部記憶體及記憶體32最後一列的寫入時，相較於配置更大的記憶體區塊來去繼續儲存脈動陣列31所產生的下一筆列資料，高斯消去計算系統3使用了控制電路30的內部記憶體及記憶體32上半部中未使用到的記憶體區塊，也就是控制電路30的內部記憶體及記憶體32儲存空間中的右上三角區塊來儲存接下來脈動陣列31的輸出值。換句話說，當脈動陣列31完成控制電路30的內部記憶體及記憶體32最後一列的寫入時，脈動陣列31會循環回到控制電路30的內部記憶體及記憶體32第一列，並對齊第一列的末端進行寫入，以將產生的運算結果寫入到控制電路30的內部記憶體及記憶體32儲存空間中未使用到的右上三角區塊。 On the other hand, please refer to FIG. 4R and FIG. 4S . When the pulse array 31 completes writing the last row of the internal memory of the control circuit 30 and the memory 32, compared with configuring a larger memory block to continue to store the next row of data generated by the pulse array 31, the Gaussian elimination calculation system 3 uses the unused memory blocks in the upper half of the internal memory of the control circuit 30 and the memory 32, that is, the upper right triangle block in the storage space of the internal memory of the control circuit 30 and the memory 32 to store the next output value of the pulse array 31. In other words, when the pulse array 31 completes writing the last row of the internal memory of the control circuit 30 and the memory 32, the pulse array 31 will loop back to the internal memory of the control circuit 30 and the first row of the memory 32, and align the end of the first row for writing, so as to write the generated operation result into the unused upper right triangle block in the storage space of the internal memory of the control circuit 30 and the memory 32.

請參考圖3A、圖3D。為了要將運算結果寫入到控制電路30的內部記憶體及記憶體32儲存空間中未使用到的右上三角區塊，脈動陣列31所產生的運算結果會透過多工器MUX1、MUX3分別寫入記憶體303及記憶體32。具體來說，多工器MUX1、MUX3可接收兩組輸入A、B，多工器MUX1、MUX3可透過選擇訊號S來進行逐位元的選擇。舉例而言，在圖4S中所繪示的時脈週期中，選擇訊號S即可選擇輸入A的第一位元通過且輸入B的第二至第五位元通過，因而組合成輸出C寫入至記憶體303及記憶體32。 Please refer to Figure 3A and Figure 3D. In order to write the calculation results into the unused upper right triangle block in the internal memory of the control circuit 30 and the memory 32 storage space, the calculation results generated by the pulse array 31 will be written into the memory 303 and the memory 32 respectively through the multiplexers MUX1 and MUX3. Specifically, the multiplexers MUX1 and MUX3 can receive two sets of inputs A and B, and the multiplexers MUX1 and MUX3 can select bit by bit through the selection signal S. For example, in the clock cycle shown in Figure 4S, the selection signal S can select the first bit of input A to pass and the second to fifth bits of input B to pass, so that the combined output C is written into the memory 303 and the memory 32.

最後，第一行群組矩陣的矩陣分解結果就會如圖4Z所示。另外，請參考圖5A~圖5C來理解脈動陣列31是如何將運算結果配置在記憶體32。圖5A~圖5C繪示了圖3A中的脈動陣列31是如何對記憶體32進行寫入的示意圖。詳細而言，如圖5A所示，由於脈動陣列31中的第一列至第五列的運算胞110A在進行矩陣分解運算中是隨著列順序被開啟以及關閉的，也就是說第一列的第一運算胞110A會最先被開啟且最早完成計算的，反之第五列的第一運算胞110A會最晚被開啟且最晚完成計算的。因此脈動陣列31中的第一運算胞110A輸出的各列的運算結果依照輸出時序進行排列會形成如圖5A所示的平行四邊形。詳細來說，圖5A~圖5C中的橫軸方向代表脈動陣列31第一列至第五列的各個第一運算胞110A的輸出值，縱軸方向則代表脈動陣列31的各個第一運算胞110A產生輸出值的時序。 Finally, the matrix decomposition result of the first row group matrix will be as shown in FIG4Z. In addition, please refer to FIG5A to FIG5C to understand how the pulse array 31 configures the operation result in the memory 32. FIG5A to FIG5C are schematic diagrams showing how the pulse array 31 in FIG3A writes to the memory 32. In detail, as shown in FIG5A, since the operation cells 110A in the first to fifth columns of the pulse array 31 are opened and closed in the order of columns during the matrix decomposition operation, that is, the first operation cell 110A in the first column will be opened first and complete the calculation earliest, whereas the first operation cell 110A in the fifth column will be opened last and complete the calculation last. Therefore, the calculation results of each column output by the first operation cell 110A in the pulse array 31 are arranged according to the output timing to form a parallelogram as shown in Figure 5A. In detail, the horizontal axis direction in Figures 5A to 5C represents the output value of each first operation cell 110A in the first to fifth columns of the pulse array 31, and the vertical axis direction represents the timing of each first operation cell 110A in the pulse array 31 generating the output value.

如圖5A所示，脈動陣列31所產生的運算結果包含資料區塊500、501。資料區塊500即為進行PLU矩陣分解中所產生的下三角矩陣，而資料區塊501即為進行PLU矩陣分解中所產生的上三角矩陣。在資料區塊500中，由於脈動陣列31是依照列順序來接收行群組矩陣的各列，因此在資料區塊500中所記錄的運算結果也是順著行群組矩陣的列順序來進行排列的。另外，在資料區塊501的寫入時序中(請參考圖4R~4Z)，以脈動陣列31第一列所儲存的列資料值為例，脈動陣列31所儲存的第一列資料值是第一個被輸出的，如圖4R、圖4S所繪示，當脈動陣列31的第二列運算胞110接收到脈動陣列31的第一列運算胞110所吐出的上三角矩陣U中第一列的單元值時，脈動陣列31的第二列運算胞110會先進行傳遞(PASS)後再藉由交換(SWAP)將上三角矩陣U中第二列的單元值吐出。因此，在資料區塊501中，也就是上三角矩陣U中所記錄的運算結果亦是順著行群組矩陣CB1的列順序來進行排列的。 As shown in FIG5A , the calculation results generated by the pulse array 31 include data blocks 500 and 501. Data block 500 is the lower triangular matrix generated by the PLU matrix decomposition, and data block 501 is the upper triangular matrix generated by the PLU matrix decomposition. In data block 500, since the pulse array 31 receives each column of the row group matrix in row order, the calculation results recorded in data block 500 are also arranged in the row order of the row group matrix. In addition, in the write timing of the data block 501 (please refer to FIGS. 4R to 4Z), taking the row data value stored in the first row of the pulse array 31 as an example, the first row data value stored in the pulse array 31 is the first to be output, as shown in FIGS. 4R and 4S. When the second row operation cell 110 of the pulse array 31 receives the unit value of the first row in the upper triangular matrix U outputted by the first row operation cell 110 of the pulse array 31, the second row operation cell 110 of the pulse array 31 will first pass (PASS) and then output the unit value of the second row in the upper triangular matrix U by swapping (SWAP). Therefore, in data block 501, that is, the calculation results recorded in the upper triangular matrix U are also arranged in the order of the columns of the row group matrix CB1.

若將脈動陣列31在每個時脈週期中所產生的運算結果分別儲存在單列的資料區塊中，則如圖5A所示，記憶體32需要運用比輸入的行群組矩陣更大的記憶體空間來去儲存，造成硬體上的負擔。如圖5B所示，在此實施例中，為了節省記憶體空間，記憶體32將會配置有與行群組矩陣相同大小的記憶體空間來去儲存脈動陣列31的運算結果。當脈動陣列31完成對記憶體32最末一列進行寫入之後，脈動陣列31將會重新循環回到記憶體32的第一列進行寫入。更確切來說，當循環回記憶體32的第一列進行寫入時，脈動陣列31會對齊記憶體32每一列的末端進行寫入，以將運算結果寫入至記憶體32的列末端的未使用空間中，因而有效節省高斯消去計算系統3所需的記憶體空間。從資料空間的配置上來看，資料區塊500將被區分成兩個資料區塊502、503，其中資料區塊502與行群組矩陣具有相同的列數。而資料區塊500的下半三角部分的資料區塊503與資料區塊501將填入到資料區塊502的上方，因而產生如圖5C所示的矩形形狀的運算結果配置結果。 If the calculation results generated by the pulse array 31 in each clock cycle are stored in a single row of data blocks, as shown in FIG5A , the memory 32 needs to use a larger memory space than the input row group matrix to store them, resulting in a burden on the hardware. As shown in FIG5B , in this embodiment, in order to save memory space, the memory 32 will be configured with a memory space of the same size as the row group matrix to store the calculation results of the pulse array 31. When the pulse array 31 completes writing to the last row of the memory 32, the pulse array 31 will loop back to the first row of the memory 32 for writing. More specifically, when looping back to the first row of memory 32 for writing, the pulse array 31 will align the end of each row of memory 32 for writing, so as to write the calculation result into the unused space at the end of the row of memory 32, thereby effectively saving the memory space required by the Gaussian elimination calculation system 3. From the perspective of data space configuration, data block 500 will be divided into two data blocks 502 and 503, wherein data block 502 has the same number of rows as the row group matrix. Data block 503 and data block 501 in the lower half triangle of data block 500 will be filled above data block 502, thereby generating a rectangular calculation result configuration result as shown in FIG5C.

圖6A~圖6D繪示了對運算矩陣進M進行矩陣分解運算的過程示意圖。在圖6A~圖6D中，繪示了記憶體32、控制電路30的排序電路302及記憶體303所儲存的資料。更具體來說，記憶體32儲存了分解中的運算矩陣M，排序電路302儲存了排序矩陣P。 Figures 6A to 6D show schematic diagrams of the process of performing matrix decomposition operations on the operation matrix M. In Figures 6A to 6D, the data stored in the memory 32, the sorting circuit 302 of the control circuit 30, and the memory 303 are shown. More specifically, the memory 32 stores the operation matrix M in decomposition, and the sorting circuit 302 stores the sorting matrix P.

在圖6A中，接續著圖4Z，排序矩陣M的第一行群組矩陣CB1已完成了矩陣分解運算，並被寫入至記憶體32及控制電路30的記憶體303。由於矩陣分解運算是以列為單位進行的操作行為，故高斯消去計算系統3接下來將把第一行群組矩陣的操作，重現在第二行群組及第三行群組上，再進行第二行群組的矩陣分解運算。 In FIG. 6A , following FIG. 4Z , the first row group matrix CB1 of the sorting matrix M has completed the matrix decomposition operation and is written to the memory 32 and the memory 303 of the control circuit 30 . Since the matrix decomposition operation is an operation performed in columns, the Gaussian elimination calculation system 3 will then reproduce the operation of the first row group matrix on the second row group and the third row group, and then perform the matrix decomposition operation on the second row group.

詳細而言，在進行重現(replay)運算時，脈動陣列31的所有運算胞110將被開啟，而第一運算胞110A亦被切換至重現運作模式之下。如此一來，所有的運算胞110都會依據外部給入的控制訊號op_in來進行操作。 Specifically, when performing a replay operation, all operation cells 110 of the pulse array 31 will be turned on, and the first operation cell 110A will also be switched to the replay operation mode. In this way, all operation cells 110 will operate according to the external control signal op_in.

第二行群組矩陣可輸入至脈動陣列31，而控制電路31的記憶體303儲存著第一行群組矩陣在進行矩陣分解運算時所產生的控制訊號op_out，其代表著第一行群組矩陣在進行矩陣分解運算時每一列的操作行為。因此，藉由將第二行群組矩陣各列的列資料依序輸入至脈動陣列31，且將記憶體303所儲存的每列控制訊號op_out分別提供至脈動陣列31的對應列，第一行群組矩陣在進行矩陣分解運算時的操作，即可重現在第二行群組上。更具體來說，在第一個時脈週期中，第二行群組矩陣的第一列可被輸入至脈動陣列31作為輸入列data_in，而記憶體303所儲存的控制訊號op_out的第一列第一個單元值也將被做為控制訊號op_in提供至脈動陣列31的第一列。在第二個時脈週期中，第二行群組矩陣的第二列可被輸入至脈動陣列31作為輸入列data_in，而記憶體303所儲存的控制訊號op_out的第二列第一個單元值也將被做為控制訊號op_in提供至脈動陣列31的第一列。而在第三個時脈週期時，第二行群組矩陣的第三列可被輸入至脈動陣列31作為輸入列data_in，而記憶體303所儲存的控制訊號op_out的第二列第一及第二個個單元值也將被做為控制訊號op_in提供至脈動陣列31的第一及第二列。如此一來，脈動陣列31將會接收到與第一行群組矩陣在進行矩陣分解運算時相同的控制訊號op_in，因而將第一行群組矩陣的操作重現在第二行群組矩陣上。依此類推，在接下來的時脈週期中，第二行群組矩陣的各列將依序被提供至脈動陣列31，而第二控制電路301的記憶體303將會以如圖5A所示的平行四邊形的形式，來提供控制訊號op_in至脈動陣列31的各列。 The second row group matrix can be input to the pulse array 31, and the memory 303 of the control circuit 31 stores the control signal op_out generated when the first row group matrix performs the matrix decomposition operation, which represents the operation behavior of each row of the first row group matrix when the matrix decomposition operation is performed. Therefore, by sequentially inputting the row data of each row of the second row group matrix to the pulse array 31, and providing each row control signal op_out stored in the memory 303 to the corresponding row of the pulse array 31, the operation of the first row group matrix when performing the matrix decomposition operation can be reproduced on the second row group. More specifically, in the first clock cycle, the first row of the second row group matrix may be input to the pulse array 31 as the input row data_in, and the first unit value of the first row of the control signal op_out stored in the memory 303 will also be provided to the first row of the pulse array 31 as the control signal op_in. In the second clock cycle, the second row of the second row group matrix may be input to the pulse array 31 as the input row data_in, and the first unit value of the second row of the control signal op_out stored in the memory 303 will also be provided to the first row of the pulse array 31 as the control signal op_in. In the third clock cycle, the third row of the second row group matrix can be input to the pulse array 31 as the input row data_in, and the first and second unit values of the second row of the control signal op_out stored in the memory 303 will also be provided as the control signal op_in to the first and second rows of the pulse array 31. In this way, the pulse array 31 will receive the same control signal op_in as the first row group matrix when performing the matrix decomposition operation, so that the operation of the first row group matrix is reproduced on the second row group matrix. Similarly, in the next clock cycle, each column of the second row group matrix will be provided to the pulse array 31 in sequence, and the memory 303 of the second control circuit 301 will provide the control signal op_in to each column of the pulse array 31 in the form of a parallelogram as shown in FIG. 5A .

一方面來說，控制訊號op_in的各個單元中記錄的是一位元的邏輯值，其邏輯值0對應的例如是傳遞(PASS)功能，而邏輯值1對應的例如是加法(ADD)功能。另一方面，排序電路302還可參考其儲存的排序矩陣P的前五個單元來指示進行交換(SWAP)操作。更具體來說，排序矩陣P中前五個單元所記錄的數值分別對應於脈動陣列31的第一至五列發生交換(SWAP)時的時脈序，排序電路302即可據此控制脈動陣列31中的各列在對應的時脈週期中進行交換操作。舉例來說，當排序矩陣P中前五個單元記錄了[3 4 5 7 10]的值，則排序電路302即可據此控制脈動陣列31中的第一至第五列運算胞110在第三、第四、第五、第七及第十個時脈週期中進行列交換。 On the one hand, each cell of the control signal op_in records a logic value of a bit, and its logic value 0 corresponds to, for example, a pass (PASS) function, and its logic value 1 corresponds to, for example, an addition (ADD) function. On the other hand, the sorting circuit 302 can also refer to the first five cells of the sorting matrix P stored therein to instruct a swap (SWAP) operation. More specifically, the values recorded in the first five cells of the sorting matrix P correspond to the timing sequence when the first to fifth columns of the pulse array 31 are swapped (SWAP), and the sorting circuit 302 can control each column in the pulse array 31 to perform a swap operation in the corresponding clock cycle accordingly. For example, when the first five cells in the sorting matrix P record the values [3 4 5 7 10], the sorting circuit 302 can control the first to fifth columns of operation cells 110 in the pulse array 31 to perform row-column swaps in the third, fourth, fifth, seventh and tenth clock cycles.

藉由觀察圖5A可知，由於提供到脈動陣列31的控制訊號op_in依據時間軸來排序會具有平行四邊形的特性，其代表了脈動陣列31在進行第一行群組CB1的運算的初期，一開始會由第一列的運算胞110啟動進行運算，而第二至第五列的運算胞110則會在後續的操作中依序被啟動。而在運算的末期，則是由第一列的運算胞110先結束運算，而第二至第五列的運算胞110則會在後續的操作中依序完成。在一些實施例中，第二行群組及第三行群組的重現操作可以是以行群組部分重疊的方式來輸入至脈動陣列31。舉例來說，當第二行群組CB2的重現操作進行到末段時，也就是脈動陣列31中的第一列運算胞110已完成運算但脈動陣列31中的第二至第五列運算胞110仍在進行運算時，第三行群組CB3第一列的值可接續著輸入至脈動陣列31的第一列運算胞110中，使脈動陣列31的第一列運算胞110可進行第三行群組CB3的重現操作。如此一來，脈動陣列31即可利用控制訊號op_in在進行重現操作的初期以及末期的互補特性，使脈動陣列31在部分時脈週期中，同時進行第二行群組末期的重現操作以及第三行群組初期的重現操作，因此脈動陣列31在運算上的排程可更為緊湊，不致產生運算資源及時間上的浪費。 By observing FIG. 5A , it can be seen that the control signal op_in provided to the pulse array 31 has a parallelogram characteristic when arranged according to the time axis, which represents that at the beginning of the operation of the first row group CB1 of the pulse array 31, the operation cells 110 in the first row will be activated to perform the operation, and the operation cells 110 in the second to fifth rows will be activated in sequence in the subsequent operation. At the end of the operation, the operation cells 110 in the first row will end the operation first, and the operation cells 110 in the second to fifth rows will be completed in sequence in the subsequent operation. In some embodiments, the repetition operation of the second row group and the third row group can be input to the pulse array 31 in a manner of partially overlapping the row groups. For example, when the reproduction operation of the second row group CB2 reaches the end, that is, the first row of operation cells 110 in the pulse array 31 has completed the operation but the second to fifth rows of operation cells 110 in the pulse array 31 are still operating, the value of the first row of the third row group CB3 can be continuously input into the first row of operation cells 110 in the pulse array 31, so that the first row of operation cells 110 in the pulse array 31 can perform the reproduction operation of the third row group CB3. In this way, the pulse array 31 can utilize the complementary characteristics of the control signal op_in at the beginning and end of the reproduction operation, so that the pulse array 31 can simultaneously perform the reproduction operation at the end of the second row group and the reproduction operation at the beginning of the third row group in part of the clock cycle. Therefore, the computational scheduling of the pulse array 31 can be more compact, so as not to waste computational resources and time.

因此，如圖6A所示，完成第二行群組及第三行群組的重現操作後，在第一行群組中被選出的軸列(pivot row)將被移動到第二行群組及第三行群組的最下方五列，且其餘列的運算結果將排在上方。接著，高斯消去計算系統3將針對第二行群組進行矩陣分解運算。 Therefore, as shown in FIG6A , after completing the reproduction operation of the second row group and the third row group, the pivot row selected in the first row group will be moved to the bottom five rows of the second row group and the third row group, and the calculation results of the remaining rows will be arranged at the top. Then, the Gaussian elimination calculation system 3 will perform matrix decomposition operation on the second row group.

圖6B繪示了第二行群組的矩陣分解結果。在第二行群組的矩陣分解過程中，第二行群組中軸列上方的列將依序被輸入至操作在三角運作模式之下的脈動陣列31，以產生第二行群組的分解結果，並且在排序矩陣的第六至第十個單元中記錄著第六至第十個軸列的排序。 FIG6B shows the matrix decomposition result of the second row group. In the matrix decomposition process of the second row group, the columns above the axis columns in the second row group will be sequentially input to the pulsating array 31 operating under the triangular operation mode to generate the decomposition result of the second row group, and the order of the sixth to tenth axis columns is recorded in the sixth to tenth cells of the ordering matrix.

在圖6C中，第三行群組矩陣將依據第二行群組矩陣的舉分解結果進行重現操作。更具體來說，第三行群組矩的前十列將被輸入至操作在方陣運作模式之下的脈動陣列31，以依據分解的第二行群組矩陣以及排序矩陣中第六至第十個單元來進行重現操作，因此，藉由分解第二行群組所選出的軸列將被移動到第六至第十列。 In FIG. 6C , the third row group matrix will be reproduced according to the decomposition result of the second row group matrix. More specifically, the first ten columns of the third row group matrix will be input to the pulse array 31 operating in the square matrix operation mode to perform the reproduction operation according to the decomposed second row group matrix and the sixth to tenth cells in the sorting matrix, so that the axis column selected by decomposing the second row group will be moved to the sixth to tenth columns.

最後，在圖6D中，第三行群組矩陣將進行矩陣分解運算，在第三行群組的矩陣分解過程中，第三行群組中軸列上方的第一至第五列將依序被輸入至操作在三角運作模式之下的脈動陣列31，以產生第三行群組的分解結果，並且在排序矩陣的第十一至第十五個單元中記錄著第十一至第十五個軸列的排序。 Finally, in FIG. 6D , the third row group matrix will be subjected to matrix decomposition operation. In the matrix decomposition process of the third row group, the first to fifth columns above the axis columns of the third row group will be sequentially input to the pulsating array 31 operating under the triangular operation mode to generate the decomposition result of the third row group, and the order of the eleventh to fifteenth axis columns is recorded in the eleventh to fifteenth cells of the ordering matrix.

在計算完排序矩陣P、下三角矩陣L及上三角矩陣U之後，接著要進行的是如前方公式(3)所示的，進行將下三角矩陣L的反矩陣L^-1及排序矩陣P乘上目標矩陣T的計算過程。 After calculating the sorting matrix P, the lower triangular matrix L, and the upper triangular matrix U, the next step is to multiply the inverse matrix L ^-1 of the lower triangular matrix L and the sorting matrix P by the target matrix T as shown in the above formula (3).

圖7A~圖7C繪示了如何將下三角矩陣L的反矩陣L^-1及排序矩陣P乘上目標矩陣T的計算過程的示意圖。 7A to 7C are schematic diagrams showing the calculation process of multiplying the inverse matrix L ^-1 of the lower triangular matrix L and the sorting matrix P by the target matrix T.

如圖7A所示，記憶體32所儲存的第一行群組矩陣CB1中的下三角矩陣部分將會先通過多工器34的選擇而被複製至記憶體303，並配合排序矩陣P的前五個單元，來對目標矩陣T進行重現操作。相似地，在圖7B中，第二行群組矩陣CB2及排序矩陣的第六至第十個單元將會被用來對目標矩陣T進行重現操作。直到圖7C，最後一個第三行群組矩陣CB3及排序矩陣的第十一至第十五個單元將會被用來對目標矩陣T進行重現操作。在完成圖7A~圖7C的操作之後，即完成了上述公式(3)中L^-1．P．T的計算過程。 As shown in FIG. 7A , the lower triangular matrix portion of the first row group matrix CB1 stored in the memory 32 will first be copied to the memory 303 through the selection of the multiplexer 34, and will be used to reproduce the target matrix T in conjunction with the first five cells of the sorting matrix P. Similarly, in FIG. 7B , the second row group matrix CB2 and the sixth to tenth cells of the sorting matrix will be used to reproduce the target matrix T. Until FIG. 7C , the last third row group matrix CB3 and the eleventh to fifteenth cells of the sorting matrix will be used to reproduce the target matrix T. After completing the operations of FIG. 7A to FIG. 7C , the calculation process of L ^-1 . P. T in the above formula (3) is completed.

接著，高斯消去計算系統3則要依據上三角矩陣U的反矩陣U^-1對目標矩陣T進行重現操作，藉以計算U^-1．(L^-1．P．T)的計算過程。詳細來說，關於上三角矩陣U與反矩陣U^-1的關係請參考下方公式(4)。 Next, the Gaussian elimination calculation system 3 performs a recurrence operation on the target matrix T according to the inverse matrix U ^-1 of the upper triangular matrix U, so as to calculate the calculation process of U ^-1 . (L ^-1 .P.T). For details, please refer to the following formula (4) for the relationship between the upper triangular matrix U and the inverse matrix U ^-1 .

依據上方公式(4)所示，在進行反矩陣U^-1的乘法時，其可以被等效為利用上三角矩陣U的單元值來進行，其中上三角矩陣U最後一行的單元值計算需先於倒數第二行，且倒數第二行的單元值計算需先於倒數第三行，依此類推。因此為了實現如上述公式(4)中所推導出的計算順序，控制電路30可透過圖8A~圖8C所繪示的方式及順序來將上三角矩陣U中的值寫入至第二控制電路301中的記憶體303，進而實現反矩陣U^-1的乘法。 According to the above formula (4), when performing the multiplication of the inverse matrix U ^-1 , it can be equivalent to using the unit values of the upper triangular matrix U, wherein the unit value calculation of the last row of the upper triangular matrix U must precede the second-to-last row, and the unit value calculation of the second-to-last row must precede the third-to-last row, and so on. Therefore, in order to realize the calculation sequence derived from the above formula (4), the control circuit 30 can write the values in the upper triangular matrix U into the memory 303 in the second control circuit 301 through the method and sequence shown in Figures 8A to 8C, thereby realizing the multiplication of the inverse matrix U ^-1 .

圖8A~圖8C繪示了如何依據上三角矩陣U的反矩陣U^-1來進行計算並獲得公鑰矩陣PK的示意圖。 FIG. 8A to FIG. 8C are schematic diagrams showing how to calculate and obtain the public key matrix PK based on the inverse matrix U ^-1 of the upper triangular matrix U.

如圖8A所示，上三角矩陣U的反轉過程將會如圖中所繪示，先對右上第三行群組矩陣CB3進行。在第三行群組矩陣CB3的第一至第五列中，標示數值1的單元即為分解後所產生的對角線單元，故在上三角矩陣U的反轉過程中，僅對單元U12~U15、 U23~U25、U34~U35、U45進行反轉，並依照圖8A所繪示的位置對應關係所示儲存至記憶體303。並且，在第三行群組矩陣下方第六至第十五列的部分，會在不調整排列順序的情況下接續在上方調整過順序的單元U12~U15、U23~U25、U34~U35、U45之後，因而產生如圖8A右方所示的結果。具體來說，第三行群組矩陣下方第六至第十五列的部分可被輸入至操作在三角運作模式下的脈動陣列31，且透過第一控制電路300提供適當的外部致能訊號ext_en及控制訊號ext_op至脈動陣列31，使脈動陣列31操作在移位緩衝器模式下，並產生如圖8A所繪示的排列關係。相較於另外使用額外的移位暫存器(shift register)來進行第三行群組矩陣下方第六至第十五列的排列，透過將操作在三角運作模式下的脈動陣列31操作在移位緩衝器模式可帶來相同的效果，因而有效降低高斯消去計算系統1的硬體需求及成本。最後，記憶體303中所儲存的上三角矩陣U的反矩陣U^-1會被用來進行重現操作。進一步，在圖8B及圖8C中、第二行群組矩陣及第一行群組矩陣都將依序進行反轉後被用來進行重現操作。在完成圖8A~圖8C的操作之後，即完成了上述公式(3)中整體U^-1．(L^-1．P．T)的計算過程，因而計算出公鑰矩陣PK。 As shown in FIG8A, the inversion process of the upper triangular matrix U will be performed on the third row group matrix CB3 on the upper right first. In the first to fifth columns of the third row group matrix CB3, the units marked with a value of 1 are the diagonal units generated after decomposition, so in the inversion process of the upper triangular matrix U, only the units U12~U15, U23~U25, U34~U35, and U45 are inverted and stored in the memory 303 according to the position correspondence relationship shown in FIG8A. Furthermore, the sixth to fifteenth columns below the third row group matrix will continue to the cells U12-U15, U23-U25, U34-U35, and U45 whose order has been adjusted above without adjusting the arrangement order, thereby generating the result shown on the right side of FIG. 8A. Specifically, the sixth to fifteenth columns below the third row group matrix can be input to the pulse array 31 operating in the triangle operation mode, and the first control circuit 300 provides a suitable external enable signal ext_en and a control signal ext_op to the pulse array 31, so that the pulse array 31 operates in the shift buffer mode and generates the arrangement relationship shown in FIG. 8A. Compared to using an additional shift register to arrange the sixth to fifteenth columns below the third row group matrix, the same effect can be achieved by operating the pulse array 31 in the triangular operation mode in the shift buffer mode, thereby effectively reducing the hardware requirements and cost of the Gaussian elimination calculation system 1. Finally, the inverse matrix U ^-1 of the upper triangular matrix U stored in the memory 303 will be used for the reproduction operation. Furthermore, in Figures 8B and 8C, the second row group matrix and the first row group matrix will be inverted in sequence and used for the reproduction operation. After completing the operations of Figures 8A to 8C, the overall U ^-1 in the above formula (3) is completed. (L ^-1 .P.T) is calculated, and the public key matrix PK is calculated.

圖9繪示了本發明實施例一高斯消去計算方法的流程圖。高斯消去計算方法包括步驟S90~S92。高斯消去計算方法可應用在如圖1的高斯消去計算系統1及/或圖3的高斯消去計算系統3中。在步驟S90中，運算矩陣會被輸入至高斯消去計算系統。在步驟S91中，運算矩陣將被輸入至高斯消去計算系統中的脈動陣列來進行矩陣分解運算，以將運算矩陣分解為下三角矩陣及上三角矩陣。在步驟S92中，於高斯消去計算系統中的記憶體中配置運算矩陣相同的大小的運算資料區塊，並將分解後的下三角矩陣及上三角矩陣儲存於運算資料區塊。關於高斯消去計算方法的詳細操作內容，請參考上方關於高斯消去計算系統1、3的操作說明。 FIG9 shows a flow chart of the Gaussian elimination calculation method of the first embodiment of the present invention. The Gaussian elimination calculation method includes steps S90 to S92. The Gaussian elimination calculation method can be applied to the Gaussian elimination calculation system 1 of FIG1 and/or the Gaussian elimination calculation system 3 of FIG3. In step S90, the operation matrix is input to the Gaussian elimination calculation system. In step S91, the operation matrix is input to the pulse array in the Gaussian elimination calculation system to perform matrix decomposition operation to decompose the operation matrix into a lower triangular matrix and an upper triangular matrix. In step S92, a computing data block of the same size as the computing matrix is configured in the memory of the Gaussian elimination computing system, and the decomposed lower triangular matrix and upper triangular matrix are stored in the computing data block. For detailed operation contents of the Gaussian elimination computing method, please refer to the above operation instructions for Gaussian elimination computing systems 1 and 3.

綜上所述，本發明的高斯消去計算方法及高斯消去計算系統可以有效率地利用記憶體空間，並且節省無效的運算時間，有效改善公鑰計算的效率以及硬體需求。 In summary, the Gaussian elimination calculation method and Gaussian elimination calculation system of the present invention can efficiently utilize memory space and save ineffective calculation time, effectively improving the efficiency and hardware requirements of public key calculation.

1:高斯消去計算系統 1: Gaussian elimination calculation system

10:控制電路 10: Control circuit

11:脈動陣列 11: Pulsating array

12:記憶體 12: Memory

R:輸入矩陣 R: Input matrix

Claims

A Gaussian elimination computing system includes: a control circuit receiving an operation matrix; a systolic array including a square matrix formed by a plurality of operation cells, the systolic array being configured to perform a matrix decomposition operation on the operation matrix to decompose the operation matrix into a lower triangular matrix and an upper triangular matrix; and a memory configured with an operation data block having the same size as the operation matrix, for storing the decomposed lower triangular matrix and the upper triangular matrix when performing the matrix decomposition operation.

As described in claim 1, the Gaussian elimination calculation system, wherein the control circuit divides the operation matrix into a plurality of row group matrices, each of which has the same number of columns as the operation matrix, and the control circuit inputs each of the row group matrices into the pulse array to perform the matrix decomposition operation.

A Gaussian elimination calculation system as described in claim 2, wherein the control circuit provides multiple columns of each row group matrix to a first column of multiple columns in the pulse array in column order, and each column of the pulse array receives an input column matrix from the previous column or from the control circuit, generates an output column matrix after operation, and transmits the output column matrix to the next column of each column.

A Gaussian elimination computing system as described in claim 3, wherein the computing cells include: a plurality of first computing cells arranged on a diagonal line in the pulse array; and a plurality of second computing cells arranged outside the diagonal line in the pulse array, wherein when the pulse array operates in a triangular operation mode, the first computing cells are operated in a decomposition operation mode, and the second computing cells arranged below the diagonal line are closed, Each of the first operation cells generates a control signal according to a received first data value and a stored second data value to control the second operation cells in the same row to perform the same operation as the first operation cell; when the pulse array operates in an array operation mode, all of the operation cells are turned on and the first operation cells are operated in a recurrence operation mode, and the operation cells in each row receive a control signal together to perform the same operation.

A Gaussian elimination computing system as described in claim 4, wherein the operations performed by the operation cells include one of shift, addition, exchange, and activation.

A Gaussian elimination computing system as described in claim 4, wherein after completing the operation on each row group matrix, each first computing cell checks whether each second data value stored therein is zero.

A Gaussian elimination calculation system as described in claim 6, wherein when it is detected that any of the second data values is zero, the pulse array terminates the operation of the operation matrix and requests to update the operation matrix.

As described in claim 2, the Gaussian elimination calculation system, wherein when the pulse array writes the calculation results of each row group matrix into the memory, the pulse array writes from the first row of the memory in a manner aligned with the head of the row in the memory, and when the pulse array completes writing the last row of the memory in row order, the pulse array loops back to the first row of the memory and writes thereafter in a manner aligned with the tail of the row in the memory.

The Gaussian elimination calculation system as described in claim 8 further includes a multiplexer coupled between the pulse array and the memory, and the multiplexer is used to receive a selection signal to selectively write the calculation results of each row group matrix into the memory according to the bit.

A Gaussian elimination calculation system as described in claim 1, wherein the control circuit receives an input matrix and obtains the operation matrix and a target matrix by distinguishing the input matrix, wherein the input matrix, the operation matrix and the target matrix have the same number of columns, wherein the control circuit also records a sorting matrix to record the sorting of all columns of the operation matrix in the matrix decomposition operation, and the control circuit controls the pulse array to calculate the target matrix according to the sorting matrix, the upper triangular matrix and the lower triangular matrix to obtain a common key matrix.

A Gaussian elimination calculation method includes: receiving a calculation matrix; inputting the calculation matrix into a systolic array to perform a matrix decomposition operation, and decomposing the calculation matrix into a lower triangular matrix and an upper triangular matrix; and configuring a calculation data block of the same size as the calculation matrix in a memory, and storing the decomposed lower triangular matrix and the upper triangular matrix in the calculation data block when performing the matrix decomposition operation.

As described in claim 11, the Gaussian elimination calculation method, wherein the step of inputting the operation matrix into the pulse array to perform the matrix decomposition operation includes: dividing the operation matrix into a plurality of row group matrixes, each of which has the same number of columns as the operation matrix, and inputting each of the row group matrixes into the pulse array to perform the matrix decomposition operation.

As described in claim 12, the step of inputting each row group matrix into the pulse array to perform the matrix decomposition operation includes: providing multiple columns of each row group matrix to a first column of multiple columns in the pulse array in column order, and each column of the pulse array receives an input column matrix from the previous column or from the control circuit, generates an output column matrix after operation, and transmits the output column matrix to the next column of each column.

The Gaussian elimination calculation method as described in claim 13, wherein the pulse array includes a plurality of operation cells, the operation cells include: a plurality of first operation cells, arranged on a diagonal line in the pulse array; and a plurality of second operation cells, arranged at positions outside the diagonal line in the pulse array, the Gaussian elimination calculation method includes: when the pulse array operates in a triangular operation mode, the first operation cells operate in a decomposition operation mode, The second operation cells arranged below the diagonal line are turned off, and each of the first operation cells generates a control signal according to a received first data value and a stored second data value to control the second operation cells in the same row to perform the same operation as the first operation cell; and when the pulse array operates in an array operation mode, all of the operation cells are turned on, and the operation cells in each row receive a control signal together to perform the same operation.

The Gaussian elimination calculation method as described in claim 14, wherein the operations performed by the operation cells include one of shift, addition, exchange, and activation.

The Gaussian elimination calculation method as described in claim 14 also includes, after completing the operation on each row group matrix, each first operation cell checks whether each second data value stored therein is zero.

The Gaussian elimination calculation method as described in claim 16 also includes terminating the operation of the operation matrix and requesting to update the operation matrix when any of the second data values is detected to be zero.

The Gaussian elimination calculation method as described in claim 12 further includes: when the pulse array writes the operation results of each row group matrix into the memory, the pulse array writes from the first row of the memory in a manner aligned with the head of the row in the memory, and when the pulse array completes writing the last row of the memory in row order, the pulse array loops back to the first row of the memory and writes thereafter in a manner aligned with the tail of the row in the memory.

The Gaussian elimination calculation method as described in claim 18 further includes receiving a selection signal by coupling a multiplexer between the pulse array and the memory, so as to selectively write the operation results of each row group matrix into the memory according to the bit.

As described in claim 11, the step of receiving the operation matrix includes: receiving an input matrix, and obtaining the operation matrix and a target matrix by distinguishing the input matrix, the input matrix, the operation matrix and the target matrix have the same number of columns, and the Gaussian elimination calculation method further includes: recording a sorting matrix to record the sorting of all columns of the operation matrix in the matrix decomposition operation, and calculating the target matrix according to the sorting matrix, the upper triangular matrix and the lower triangular matrix by the pulse array to obtain a public key matrix.