CN113919489A

CN113919489A - Method and device for improving resource utilization rate of on-chip multiplier-adder of FPGA

Info

Publication number: CN113919489A
Application number: CN202010653627.2A
Authority: CN
Inventors: 李嘉树; 张�浩; 卢冕
Original assignee: 4Paradigm Beijing Technology Co Ltd
Current assignee: 4Paradigm Beijing Technology Co Ltd
Priority date: 2020-07-08
Filing date: 2020-07-08
Publication date: 2022-01-11

Abstract

A method and apparatus for improving resource utilization of on-chip multiplier-adder of FPGA is provided. For each multiply-add circuit in a matrix multiplier composed of a predetermined number of multiply-add circuits of the FPGA, the input end of the multiply-add circuit is connected with a multiplexer, and the output end of the multiply-add circuit is connected with a demultiplexer, and the following operations are executed: sending k inputs to the multiplexer; transmitting the k paths of input to the multiply-add circuit through the multiplexer in k clock cycles, wherein one path of input is selected to be transmitted in each clock cycle; at each of the k clock cycles, sending a respective one of the outputs to the demultiplexer via a multiply-add circuit; outputting, by the demultiplexer, a corresponding k outputs in the k clock cycles, wherein the k outputs are sent to a multiplexer to which a subsequent multiply-add circuit is connected, wherein k is a multiplexing parameter of the multiply-add circuit, wherein each multiply-add circuit includes one multiply-adder.

Description

Method and device for improving resource utilization rate of on-chip multiplier-adder of FPGA

Technical Field

The present invention relates to the technical field of hardware structure optimization, and more particularly, to a method and apparatus for improving resource utilization of an on-chip multiplier-adder of an FPGA by optimizing a hardware structure of a matrix calculator based on characteristics of an FPGA chip.

Background

In implementations requiring regular matrix multiplication computations, such as deep neural network hardware accelerator implementations, the amount of multiplier-adder resource usage tends to directly determine the performance goodness of the accelerator. However, since the resources of the on-chip available multiplier-adder of the FPGA are fixed and limited, how to improve the utilization efficiency of the resources of the on-chip multiplier-adder is a very critical optimization hotspot. Due to the limitation of the regular shape of the matrix multiplier, existing solutions often fail to efficiently utilize the multiplier-adder resources available on-chip of the FPGA, or require the introduction of a large amount of extra computational overhead at the software level.

Disclosure of Invention

Exemplary embodiments of the present invention aim to overcome the above-mentioned disadvantages of not efficiently utilizing the multiplier-adder resources available on-chip of an FPGA or having a large amount of additional software computation overhead.

According to an aspect of the present invention, there is provided a method for improving resource utilization rate of an on-chip multiplier-adder of an FPGA, wherein for each multiplier-adder circuit in a matrix multiplier composed of a predetermined number of multiplier-adder circuits of the FPGA, an input terminal thereof is connected to a multiplexer, and an output terminal thereof is connected to a demultiplexer, the method comprising: performing the following for each of a predetermined number of multiply-add circuits of the FPGA: sending k inputs to the multiplexer; transmitting the k paths of input to the multiply-add circuit through the multiplexer in k clock cycles, wherein one path of input is selected to be transmitted in each clock cycle; at each of the k clock cycles, sending a respective one of the outputs to the demultiplexer via a multiply-add circuit; outputting, by the demultiplexer, a corresponding k outputs in the k clock cycles, wherein the k outputs are sent to a multiplexer to which a subsequent multiply-add circuit is connected, wherein k is a multiplexing parameter of the multiply-add circuit, wherein each multiply-add circuit includes one multiply-adder.

Optionally, the FPGA may be used to implement a deep neural network hardware accelerator.

Optionally, the on-chip matrix multiplier of the FPGA may include (2)^M×2^M) And k multiply-add circuits, wherein M is the bit width of an input channel of the matrix multiplier.

Optionally, the method may further comprise: acquiring the number A of on-chip available multiply-add devices of the FPGA; and determining the optimal values of the input bit width M and the multiplexing parameter k according to the number A of the available multipliers and adders.

Alternatively, the step of determining the optimal values of the input bit width M and the multiplexing parameter k according to the number a of available multipliers and adders may comprise: setting initial values of M and k; acquiring a performance test result of the FPGA based on initial values of M and k; obtaining performance test results for the FPGA based on the increased M and/or k values by incrementally setting the M and/or k values; and determining the M value and the k value when the performance of the FPGA is improved for the last time as the optimal values of M and k.

Alternatively, the step of setting the initial values of M and k may include: setting an initial value of M to floor (sqrt (A)); the initial value of k is set to 1.

Optionally, the step of obtaining performance test results of the FPGA based on the increased M-value and/or k-value by incrementally setting the M-value and/or k-value may comprise: circularly executing the following operations until the performance of the FPGA is not improved: (1) setting M to M + 1; (2) judgment 4^MWhether/k is less than A; (3) when 4 is present^MWhen the/k is not less than A, setting k as k +1, and jumping to the step (2); (4) when 4 is present^MWhen the/k is smaller than A, obtaining a performance test result of the FPGA based on the current M value and the k value; (5) and determining whether the performance of the FPGA is improved or not based on the comparison between the currently acquired performance test result of the FPGA and the last acquired performance measurement result of the FPGA.

According to another aspect of the present invention, there is provided an apparatus for increasing resource utilization of an on-chip multiplier-adder of an FPGA, comprising: a predetermined number of multiplexers; the predetermined number of demultiplexers, wherein each of the matrix multipliers for the FPGA, which are made up of the predetermined number of multiply-add circuits, has an input connected to one multiplexer and an output connected to one demultiplexer, wherein the multiplexers are configured to receive k inputs and send the k inputs to the multiply-add circuits for k clock cycles, wherein one input send is selected at each clock cycle, wherein the multiply-add circuits send a corresponding one output to the demultiplexers for each of the k clock cycles, and wherein the demultiplexers are configured to output the corresponding k outputs through the demultiplexers for the k clock cycles, wherein the k outputs are sent to the multiplexers to which subsequent multiply-add circuits are connected, and k is a multiplexing parameter of the multiply-add circuit, wherein each multiply-add circuit comprises a multiply-add device.

Optionally, the apparatus may further comprise: and the parameter determiner is configured to acquire the number A of the on-chip available multipliers and adders of the FPGA, and determine the input bit width M and the optimal value of the multiplexing parameter k according to the number A of the available multipliers and adders.

Optionally, the parameter determiner may be configured to: setting initial values of M and k; acquiring a performance test result of the FPGA based on initial values of M and k; obtaining performance test results for the FPGA based on the increased M and/or k values by incrementally setting the M and/or k values; and determining the M value and the k value when the performance of the FPGA is improved for the last time as the optimal values of M and k.

Optionally, the parameter determiner may be configured to: setting an initial value of M to floor (sqrt (A)); the initial value of k is set to 1.

Optionally, the parameter determiner may be configured to:circularly executing the following operations until the performance of the FPGA is not improved: (1) setting M to M + 1; (2) judgment 4^MWhether/k is less than A; (3) when 4 is present^MWhen the/k is not less than A, setting k as k +1, and jumping to the step (2); (4) when 4 is present^MWhen the/k is smaller than A, obtaining a performance test result of the FPGA based on the current M value and the k value; (5) and determining whether the performance of the FPGA is improved or not based on the comparison between the currently acquired performance test result of the FPGA and the last acquired performance measurement result of the FPGA.

According to another aspect of the present invention, there is provided a system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform operations performed by a parameter determiner.

According to another aspect of the present invention, there is provided a computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform operations performed by a parameter determiner.

According to the method and the device for improving the resource utilization rate of the on-chip multiplier and adder of the FPGA, the hardware structure is optimized through time division multiplexing of the on-chip multiplier and adder of the FPGA, so that balance can be freely obtained among resource utilization and performance, the utilization efficiency of the on-chip multiplier and adder of the FPGA is improved while extra cost of software is not increased, and the operation performance of a matrix multiplier (such as a deep neural network accelerator) can be improved.

In addition, according to the method and the apparatus for improving resource utilization rate of the on-chip multiplier and adder of the FPGA of the present invention, parameters for optimizing performance of a matrix multiplier (e.g., a deep neural network accelerator), for example, bit width M of an input channel of the matrix multiplier and multiplexing parameter k of a multiplier and adder circuit, may be configured through a parameter exploration process, thereby effectively improving utilization efficiency of the on-chip multiplier and adder of the FPGA.

Drawings

These and/or other aspects and advantages of the present invention will become more apparent and more readily appreciated from the following detailed description of the embodiments of the invention, taken in conjunction with the accompanying drawings of which:

fig. 1 is a schematic diagram of a systolic array showing a conventional matrix multiplier.

Fig. 2a, 2b and 2c are schematic diagrams illustrating an apparatus for increasing resource utilization of an on-chip multiplier-adder of an FPGA according to an exemplary embodiment of the present invention.

Fig. 3 is a flowchart illustrating a method of determining optimal values of parameters M and k according to an exemplary embodiment of the present invention.

Fig. 4 is a flow chart illustrating a method of increasing resource usage of on-chip multiplier-adders of an FPGA according to an illustrative embodiment of the invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, exemplary embodiments thereof will be described in further detail below with reference to the accompanying drawings and detailed description.

In recent years, artificial intelligence techniques represented by neural networks have been rapidly developed and widely and deeply applied to image, voice, video processing and the like. However, as the deep neural network is rapidly popularized, the model depth of various neural networks is continuously increasing, and the model size thereof is also gradually increasing. Obviously, a larger neural network model means that more parameters need to participate in the calculation, and the increase of the calculation not only affects the training time of the model, but also affects the response speed of the model in actual deployment.

In the pure software algorithm implementation of the deep neural network, the training and reasoning performance of the model is often directly limited by the computing power of the central processing unit. However, since the cpu is a general-purpose hardware, it is often not suitable for increasing a specific computational power. Different from the solution, the solution of the special hardware based on ASIC and FPGA can be used for optimizing and improving the calculation ability of the bottleneck portion in calculation in a targeted manner, and is increasingly adopted in engineering practice. Because the current popular deep neural networks often include convolution layers or full-link layers with considerable quantity, the main operations in the deep neural networks are usually decomposed into matrix multiplication operations. Depending on the neural network architecture, the time taken for these matrix multiplications can often reach more than 90% of the total operation time. Many hardware acceleration schemes are based on the use of dedicated hardware to improve the performance of matrix multiplication.

From the mathematical definition of matrix multiplication it follows that: the multiplier-adder is the basic computational unit that implements the matrix multiplier. In the existing hardware acceleration scheme, the amount of the multiplier-adder usage directly determines the number of parameters that can be simultaneously multiplied and added in a single clock cycle, and also often directly determines the performance of the hardware accelerator. In an FPGA-based implementation, since the number of on-chip multiplier resources of an FPGA is an inherent property of an FPGA device, that is, for a selected piece of the FPGA device, the total number of multiplier resources is limited and fixed. This means that for FPGA-based hardware optimization schemes, the efficiency of the use of on-chip multiplier-adder resources often directly determines the acceleration performance of the FPGA deep neural network accelerator.

Since the main operations in the deep neural network can be decomposed into matrix multiplication operations, the systolic array is a common engineering method for realizing an efficient matrix multiplication accelerator. Therefore, in engineering practice, the deep neural network accelerator based on FPGA will often implement a 2 on FPGA^M×2^MA systolic array of sizes implemented matrix multiplier, where M is the bit width of the matrix multiplier input channel. Fig. 1 is a schematic diagram of a systolic array showing a conventional matrix multiplier. As shown in fig. 1, M is 2 in the exemplary systolic array, and 4 in the accelerator circuit^MThe multiplication and addition circuits are arranged according to a systolic array, and each multiplication and addition circuit needs to comprise a hardware multiplication and addition device. Each group of multiply-add circuits is connected with the multiply-add circuits which are adjacent up, down, left and right. In each clock cycle, 4 in the circuit^MAll the multiply-add circuits operate simultaneously to carry out multiply-add operation, and the operand and/or operation result are transmitted to the right or lower multiply-add circuit according to a set rule before the clock period is ended. A matrix multiplication accelerator of this size may be at 2^MTwo at a time per clock cycle 2^M×2^MMatrix multiplication between size matrices. Since each clock cycle matrix multiplier needs to perform 4^MThe sub-multiply-add operation, obviously, implements such a 2^M×2^MThe size of the matrix multiplier needs to use 4^MA multiplier-adder.

To improve the performance of the matrix multiplier described above, the most common engineering practice is to increase the size of M. That is, every time M is increased by 1, the operation capacity of the matrix multiplier per clock cycle will reach 4 times of the original capacity, but at the same time, the multiplier-adder resources required by the matrix multiplier will also become 4 times of the original capacity. That is, when M is {1, 2, 3, 4, 5, 6, 7, 8. }, {4, 16, 64, 256, 1024, 4096, 16384, 65536. } multiplier resources are needed to implement the matrix multipliers described above on the FPGA, respectively.

Since the number of on-chip multiplier resources of an FPGA is fixed and limited, we often cannot effectively utilize these multiplier resources in this usage mode. For example:

the 10GX1150 FPGA device has 3036 on-chip multiply-adders, but we can only realize M-5 matrix multipliers at most according to the matrix multiplication implementation scheme described above. That is, we can only utilize the multiplier-adder resource of 1024/3036-33.7%, which greatly limits the performance of the deep neural network accelerator that can be achieved by FPGAs.

There are generally two solutions to this problem in the existing solutions, but they all have their own significant drawbacks and disadvantages.

The existing scheme 1: irregular shaped matrix multipliers are implemented on the FPGA.

The scheme achieves the purpose of changing the use amount of multiplier-adder resources by changing the shape of the matrix multiplier. For example, a matrix multiplier with a size of 48 × 60 is implemented on an FPGA, and it can be seen from the above that the number of multipliers and adders used by such multiplier and adder is 2880, and it can be seen that the utilization rate of on-chip multiplier and adder resources of the FPGA can be greatly improved by such a scheme. However, since the matrix shape is closely related to the network structures such as the size of the filter/the number of channels, the difficulty and complexity of segmentation during software layer allocation and scheduling can be increased by using an irregular matrix multiplier, and a large amount of extra calculation overhead is often introduced into irregular segmentation, so that the performance of the deep neural network accelerator is further reduced.

Existing scheme 2: a plurality of smaller and independent matrix multipliers are implemented on an FPGA.

This scheme improves the efficiency of use of the on-chip multiplier-adder resources of the FPGA by arranging multiple smaller regularly shaped matrix multipliers in parallel. For example, 2-in-2 implementation on an FPGA⁵×2⁵The number of multipliers and adders required for such multipliers and adders is 2048. However, since these matrix multipliers are independent of each other and are subject to data transmission between the matrix multipliers, this scheme is often applied to batch applications, i.e., multiple matrix multipliers are scheduled to compute independent sets of inputs. In this application mode, although the processing throughput of the deep neural network accelerator is increased, the delay of the input processing cannot be effectively reduced.

According to the method and the device, in order to improve the resource utilization rate of the on-chip multiplier-adder of the FPGA and avoid the defects and drawbacks, the hardware structure of the matrix calculator is optimally designed according to the characteristics of FPGA devices. In particular, by flexibly applying the multiplexer and the demultiplexer, time division multiplexing of the multiplier-adder is realized in the matrix multiplier, so that the new matrix multiplier can more freely balance resource usage and performance. That is, a new multiplexing parameter k of the multiply-add circuit is introduced in the design flow of the accelerator, and the multiplexer and the demultiplexer are applied to the input and the output ends of the multiply-add circuit, so that a 2^M×2^MThe size of the matrix multiplier can be in 2^MTwo at a time in x k clock cycles 2^M×2^MThe matrix multiplication between large and small matrixes freely realizes the k-path time division multiplexing of multiplier-adder resources in a matrix multiplier, and the number of the multiplier-adders required by the matrix multiplier to be used is only 1/k multiplied by 4^MSo that the resource consumption of the multiplier-adder is 1/k of that of the original matrix multiplier, therebyThe resource utilization rate of the multiplier-adder can be greatly improved. In addition, a parameter exploration process aiming at the efficient utilization of the multiplier-adder is introduced, and the set parameters M and k can enable the accelerator to achieve the best performance by designing and adjusting the bit width M of an input channel of the matrix multiplier and the size of a multiplexing parameter k of the multiplier-adder circuit, so that the multiplier-adder resource inherent on the chip of the FPGA is efficiently utilized.

Hereinafter, an apparatus and method for improving resource utilization of an on-chip multiplier-adder of an FPGA according to an exemplary embodiment of the present invention will be described in detail with reference to fig. 2 to 4.

Fig. 2a, 2b and 2c are schematic diagrams illustrating an apparatus 200 for increasing resource usage of on-chip multipliers of an FPGA according to an exemplary embodiment of the present invention.

For the merging part of the multiply-add circuits in the systolic array shown in fig. 1, k groups of multiply-add circuits in the original circuit can be combined into a new group of circuits, k-1 multiply-add circuits are reduced, k paths of inputs in the original k groups of circuits are connected to the multiplexer of the new circuit, and outputs in the original k groups of circuits are connected to the multi-path branching unit in the new circuit.

As shown in fig. 2a, assuming that M is 2 and k is 2, the original circuit shown in fig. 1 requires 16 multiply-add circuits, and the new circuit reduces the original circuit to only 8 multiply-add circuits by performing horizontal compression.

As shown in fig. 2b, it is assumed that in the case where M is 2 and k is 4, the original circuit shown in fig. 1 needs 16 multiply-add circuits, and the new circuit is reduced to only 4 multiply-add circuits by performing horizontal and vertical compression on the original circuit.

For clarity, the multiplexers at the inputs and demultiplexers to which the outputs of each of the multiply-add circuits in the new circuit are connected are not shown in fig. 2a and 2b, while a schematic diagram of the connections of one multiply-add circuit to the multiplexers and demultiplexers is shown in fig. 2c, each of the multiply-add circuits in the new circuit shown in fig. 2a and 2b having the structure shown in fig. 2 c.

As shown in fig. 2c, a multiplexer is connected to the input of each multiply-add circuit and a demultiplexer is connected to the output of each multiply-add circuit, and thus, the apparatus 200 may include the predetermined number of multiplexers 201 and the predetermined number of demultiplexers 202 in a matrix multiplier including the predetermined number of multiply-add circuits. By using a multiplexer and demultiplexer, the k inputs and outputs can share the original summation circuit at different clock cycles. As shown in fig. 2c, the solid line input of the multiplexer and the solid line output of the demultiplexer represent the input and output of the current clock cycle, and the dotted line input and dotted line output represent the input and output of other clock cycles.

Specifically, multiplexer 201 may receive k inputs. Where k is a multiplexing parameter of the multiply-add circuit 222, and k can be determined by a user according to a requirement, and an exemplary preferred determination method of the parameter k will be described in detail later.

Multiplexer 201 sends the k inputs to multiply-add circuit 222 for k clock cycles, with one input selected for transmission each clock cycle. Multiply-add circuit 222 sends a respective one of the outputs to demultiplexer 202 for each of k clock cycles. Demultiplexer 202 outputs the corresponding k outputs through demultiplexer 202 for k clock cycles, where the k outputs may be sent to the multiplexers to which subsequent multiply-add circuits are connected.

For example, in clock cycle 1, the multiplexer 201 selects the 1 st input to send to the multiply-add circuit 222, the multiply-add circuit 222 sends the 1 st output corresponding to the 1 st input to the demultiplexer 202, and the demultiplexer 202 may select the received output as the 1 st output. In clock cycle 2, multiplexer 201 selects the 2 nd input to send to multiply-add circuit 222, multiply-add circuit 222 sends the 2 nd output corresponding to the 2 nd input to demultiplexer 202, and demultiplexer 202 may select the received output as the 2 nd output. By analogy, multiplexer 201 selects the kth input to send to multiply-add circuit 222, multiply-add circuit 222 sends the kth output corresponding to the kth input to demultiplexer 202, and demultiplexer 202 may select the received output as the kth output. Demultiplexer 202 may send the received k outputs to multiplexers connected to subsequent multiply-add circuits associated with the multiply-add circuits (e.g., arranged to the right and/or below multiply-add circuit 222).

In the deep neural network hardware accelerator, the bit width M of the input channel of the optimal matrix calculator is closely related to the network structure such as the size of a filter, the number of input and output channels and the like. Meanwhile, the bit width M of the input channel is also related to the memory bandwidth and the dominant frequency that the hardware accelerator can achieve. By means of the characteristic that the FPGA can be repeatedly programmed, the optimal values of the parameters M and k can be determined through a parameter exploration process, so that the utilization rate of resources of the on-chip multiplier-adder is improved.

According to an exemplary embodiment of the present invention, the apparatus 200 may further include a parameter determiner (not shown). The parameter determiner may obtain the number a of on-chip available multipliers and adders of the FPGA, and determine the optimal values of M and k according to a.

Specifically, the parameter determiner may set initial values of M and k. For example, the initial value of M may be set to floor (sqrt (A)), and the initial value of k may be set to 1. The performance (e.g., computation speed, etc.) of an FPGA (e.g., deep neural network hardware accelerator) may then be tested based on the M and k initial values. Accordingly, the parameter determiner may obtain performance test results based on the initial values of M and k. Subsequently, the parameter determiner may incrementally set the M and/or k values for testing performance of the FPGA to obtain performance test results based on the increased M and/or k values. Subsequently, a parameter determiner may determine the value of M and the value of k at the last time the performance of the FPGA was improved as optimal values of M and k.

Hereinafter, a method of the parameter determiner determining the optimal values of the parameters M and k according to an exemplary embodiment of the present invention will be described in detail with reference to fig. 3. Fig. 3 is a flowchart illustrating a method of determining optimal values of parameters M and k according to an exemplary embodiment of the present invention.

Referring to fig. 3, in step 301, the parameter determiner sets initial values of M and k.

In step 302, the parameter determiner obtains performance test results for the FPGA based on the initial values of M and k.

In step 303, the parameter determiner may set M to M + 1.

In step 304, the parameter determiner may determine 4^MWhether/k is less than A.

If the determination at step 304 is negative, then at step 305, the parameter determiner may set k to k +1 and jump to step 304 and execute step 304 again.

If the determination at step 304 is yes, then at step 306, the parameter determiner may obtain the performance test results for the FPGA based on the current M and k values.

In step 307, it is determined whether the performance of the FPGA is improved based on a comparison of the performance test result of the FPGA obtained in step 306 and the performance measurement result of the FPGA obtained in the last test.

If the determination at step 307 is yes, then a jump is made to step 303 and step 303 is performed again.

If the determination at step 307 is negative, the parameter determiner determines the corresponding M and k values (i.e., the M and k values at the last time the performance was improved) as optimal values at step 308.

Of course, the parameter exploration process described above is merely exemplary, and may be performed according to other feasible methods or adjustments of steps.

For each multiply-add circuit in a matrix multiplier composed of a predetermined number of multiply-add circuits of the FPGA, one multiplexer is connected to an input terminal thereof, and one demultiplexer is connected to an output terminal thereof, and the operation as shown in fig. 4 is performed.

Referring to FIG. 4, in step 401, k inputs are sent to a multiplexer.

At each of k clock cycles, the multiplexer sends one of the k inputs to the multiply-add circuit at step 402; in step 403, the multiply-add circuit sends a single output corresponding to the single input to the demultiplexer, and in step 404, the demultiplexer selects the received output as a single output.

Upon expiration of k clock cycles, the demultiplexer outputs a corresponding k outputs, respectively, at step 405. For example, the demultiplexers may send the k outputs to multiplexers to which subsequent multiply-add circuits (e.g., right-side multiply-add circuits and/or lower multiply-add circuits) are connected, respectively. In addition, the demultiplexer may send one output thereof when receiving the output of the multiply-add circuit in a single clock cycle, or may temporarily store the output when receiving the output of the multiply-add circuit in a single clock cycle, and output the output after a number of clock cycles or k clock cycles expire, as required. The invention is not limited in this regard.

Furthermore, the method comprises a method of determining the parameters M and k. According to an exemplary embodiment of the present invention, the number a of on-chip available multipliers and adders of the FPGA may be obtained; and the optimal values of M and k can be determined from a.

According to an exemplary embodiment of the present invention, initial values of M and k may be set first. For example, the initial value of M may be set to floor (sqrt (A)), and the initial value of k may be set to 1. Subsequently, performance (e.g., computational speed, etc.) of an FPGA (e.g., a deep neural network hardware accelerator) may be tested based on the M and k initial values, such that performance test results based on the M and k initial values may be obtained. Subsequently, the M and/or k values may be incrementally set for testing the performance of the FPGA so that performance test results based on the increased M and/or k values may be obtained. Subsequently, the M and k values at the last time the performance of the FPGA was upgraded can be determined as the optimal values of M and k.

According to an exemplary embodiment of the present invention, the step of obtaining the performance test result based on the increased M value and/or k value may include: circularly executing the following operations until the performance of the FPGA is not improved: (1) setting M to M + 1; (2) judgment 4^MWhether/k is less than A; (3) when 4 is present^MWhen the/k is not less than A, setting k as k +1, and jumping to the step (2); (4) when 4 is present^MWhen the/k is smaller than A, obtaining a performance test result of the FPGA based on the current M value and the k value; (5) based on the performance test result of the FPGA obtained currently and the performance of the FPGA obtained last timeComparison of the results can be measured to determine whether the performance of the FPGA is improving.

In addition, the method and the device for improving the resource utilization rate of the on-chip multiplier-adder of the FPGA can be applied to deep neural network hardware accelerators and can also be applied to the realization of any other matrix multiplication calculation needing rules.

A method and apparatus for increasing resource usage of on-chip multiplier-adders of an FPGA according to exemplary embodiments of the present invention has been described above with reference to fig. 2 through 4.

The parameter determiner in the apparatus for increasing resource usage of an on-chip multiplier-adder of an FPGA according to the present invention may be configured as software, hardware, firmware, or any combination thereof, that performs a specific function. For example, the parameter determiner may correspond to a dedicated integrated circuit, may correspond to pure software code, and may correspond to a module combining software and hardware. Further, one or more functions implemented by the security monitoring module may also be performed collectively by components in a physical entity device (e.g., a processor, a client or server, etc.).

Further, the operations performed by the parameter determiner described with reference to fig. 3 may be implemented by a program (or instructions) recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present invention, a computer-readable storage medium may be provided that stores instructions, which when executed by at least one computing device, cause the at least one computing device to perform operations performed by a parameter determiner.

The computer program in the computer-readable storage medium may be executed in an environment deployed in a computer device such as a client, a host, a proxy device, a server, and the like, and it should be noted that the computer program may also be used to perform additional steps other than the above steps or perform more specific processing when the above steps are performed, and the content of the additional steps and the further processing is already mentioned in the description of the related method with reference to fig. 3, and therefore will not be described again here in order to avoid repetition.

It should be noted that the parameter determiner according to the exemplary embodiment of the present invention may fully rely on the execution of the computer program to implement the corresponding function, i.e., the respective units correspond to the respective steps in the functional architecture of the computer program, so that the entire system is called by a special software package (e.g., lib library) to implement the corresponding function.

In another aspect, the parameter determiner may be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments.

For example, the parameter determiner of the exemplary embodiments of the present invention may also be implemented as a computing device including a storage component having stored therein a set of computer-executable instructions that, when executed by a processor, perform operations performed by the parameter determiner according to the exemplary embodiments of the present invention.

In particular, computing devices may be deployed in servers or clients, as well as on node devices in a distributed network environment. Further, the computing device may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the set of instructions.

The computing device need not be a single computing device, but can be any device or collection of circuits capable of executing the instructions (or sets of instructions) described above, individually or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote (e.g., via wireless transmission).

In a computing device, a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.

Some of the operations described in the operations performed by the parameter determiner according to the exemplary embodiments of the present invention may be implemented by software, some of the operations may be implemented by hardware, and further, the operations may be implemented by a combination of hardware and software.

The processor may execute instructions or code stored in one of the memory components, which may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.

The memory component may be integral to the processor, e.g., having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the storage component.

In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.

Operations performed by the parameter determiner according to exemplary embodiments of the present invention may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may be equally integrated into a single logic device or operated on by non-exact boundaries.

Accordingly, the operations described with reference to fig. 3 as performed by the parameter determiner may be implemented by a system including at least one computing device and at least one storage device storing instructions.

According to an exemplary embodiment of the present invention, the at least one computing device is a computing device for performing operations performed by the parameter determiner according to an exemplary embodiment of the present invention, the storage device having stored therein a set of computer-executable instructions that, when executed by the at least one computing device, perform the steps of the operations performed by the parameter determiner described with reference to fig. 3.

While exemplary embodiments of the invention have been described above, it should be understood that the above description is illustrative only and not exhaustive, and that the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention should be subject to the scope of the claims.

Claims

1. A method for increasing resource utilization of an on-chip multiply-add circuit of an FPGA, wherein for each multiply-add circuit of a matrix multiplier comprising a predetermined number of multiply-add circuits of the FPGA, an input thereof is connected to a multiplexer and an output thereof is connected to a demultiplexer, the method comprising:

performing the following for each of a predetermined number of multiply-add circuits of the FPGA:

sending k inputs to the multiplexer;

transmitting the k paths of input to the multiply-add circuit through the multiplexer in k clock cycles, wherein one path of input is selected to be transmitted in each clock cycle;

at each of the k clock cycles, sending a respective one of the outputs to the demultiplexer via a multiply-add circuit;

outputting, by the demultiplexer, a corresponding k outputs over the k clock cycles, wherein the k outputs are sent to a multiplexer to which a subsequent multiply-add circuit is connected,

wherein k is a multiplexing parameter of the multiply-add circuit,

wherein each multiply-add circuit comprises a multiply-add device.

2. An apparatus for increasing resource usage of an on-chip multiplier-adder of an FPGA, comprising:

a predetermined number of multiplexers;

the predetermined number of demultiplexers is selected from a group consisting of,

wherein, for each multiply-add circuit in the matrix multiplier composed of the predetermined number of multiply-add circuits of the FPGA, the input end is connected with a multiplexer, and the output end is connected with a demultiplexer,

wherein the multiplexer is configured to receive k inputs and to send the k inputs to the multiply-add circuit over k clock cycles, wherein one input is selected to be sent each clock cycle, wherein the multiply-add circuit sends a respective one output to the demultiplexer each clock cycle of the k clock cycles,

the demultiplexer is configured to output, over the k clock cycles, respective k outputs through the demultiplexer, wherein the k outputs are sent to a multiplexer to which a subsequent multiply-add circuit is connected,

wherein k is a multiplexing parameter of the multiply-add circuit,

wherein each multiply-add circuit comprises a multiply-add device.

3. The apparatus of claim 2, wherein the FPGA is to implement a deep neural network hardware accelerator.

4. The apparatus of claim 2, wherein the on-chip matrix multiplier of the FPGA comprises (2)^M×2^M) And k multiply-add circuits, wherein M is the bit width of an input channel of the matrix multiplier.

5. The apparatus of claim 4, further comprising:

and the parameter determiner is configured to acquire the number A of the on-chip available multipliers and adders of the FPGA, and determine the input bit width M and the optimal value of the multiplexing parameter k according to the number A of the available multipliers and adders.

6. The apparatus of claim 5, wherein the parameter determiner is configured to:

setting initial values of M and k;

acquiring a performance test result of the FPGA based on initial values of M and k;

obtaining performance test results for the FPGA based on the increased M and/or k values by incrementally setting the M and/or k values;

and determining the M value and the k value when the performance of the FPGA is improved for the last time as the optimal values of M and k.

7. The apparatus of claim 6, wherein the parameter determiner is configured to:

setting an initial value of M to floor (sqrt (A));

the initial value of k is set to 1.

8. The apparatus of claim 6, wherein the parameter determiner is configured to:

circularly executing the following operations until the performance of the FPGA is not improved:

(1) setting M to M + 1;

(2) judgment 4^MWhether/k is less than A;

(3) when 4 is present^MWhen the/k is not less than A, setting k as k +1, and jumping to the step (2);

(4) when 4 is present^MWhen the/k is smaller than A, obtaining a performance test result of the FPGA based on the current M value and the k value;

(5) and determining whether the performance of the FPGA is improved or not based on the comparison between the currently acquired performance test result of the FPGA and the last acquired performance measurement result of the FPGA.

9. A system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform operations performed by a parameter determiner of any of claims 5 to 8.

10. A computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform operations performed by a parameter determiner of any of claims 5 to 8.