
CN116362314A - Integrated storage and calculation device and calculation method - Google Patents

Integrated storage and calculation device and calculation method

Info

Publication number
CN116362314A
Authority
CN
China
Prior art keywords
calculation
bit
data
column
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111599630.1A
Other languages
Chinese (zh)
Inventor
华幸成
曾重
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202111599630.1A priority Critical patent/CN116362314A/en
Priority to PCT/CN2022/141634 priority patent/WO2023116923A1/en
Publication of CN116362314A publication Critical patent/CN116362314A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491Computations with decimal numbers radix 12 or 20.
    • G06F7/498Computations with decimal numbers radix 12 or 20. using counter-type accumulators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065Analogue means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The embodiment of the application provides a storage and calculation integrated device and a calculation method, relates to the technical field of chips, and is used for reducing calculation overhead and improving calculation efficiency in neural network calculation. The method comprises the following steps: a bit width calculation module calculates a plurality of input data to obtain a plurality of effective data and inputs the plurality of effective data into a calculation module; the calculation module obtains a calculation result of each column in a calculation array according to the plurality of effective data and the bits of the weight data stored by each storage calculation unit, and inputs the calculation result of each column into a result processing module; finally, the result processing module performs weighted calculation on the calculation result of each column to obtain a final result. The embodiment of the application is used in the process of calculation performed by the storage and calculation integrated device.

Description

Integrated storage and calculation device and calculation method
Technical Field
The embodiment of the application relates to the technical field of chips, in particular to a memory and calculation integrated device and a calculation method.
Background
In recent years, neural networks (NNs) have developed rapidly and are widely used in fields such as robotics, speech recognition, image recognition, natural language processing and expert systems. The core computation of a neural network is matrix-vector multiplication, which is both computationally intensive and memory intensive. When a general-purpose chip is used for neural network calculation, it has obvious shortcomings in power consumption, performance and size; therefore, to improve the calculation efficiency of the neural network, a dedicated chip (a neural network accelerator) needs to be customized for neural network calculation.
A storage and calculation integrated device not only retains the storage and read-write functions of the memory circuit, but also supports multiply-add operations in parallel, which reduces the amount of data movement and improves energy efficiency, providing an efficient solution for the design of neural network accelerators. However, when performing calculation, the storage and calculation integrated device generally needs to expand multi-bit data into single-bit/low-bit (e.g., 2-bit or 4-bit) data according to the data bit width, calculate, and then combine the calculation results, so the number of expansion calculations is large, resulting in high overhead.
Disclosure of Invention
The embodiment of the application provides a storage and calculation integrated device and a calculation method applied to the storage and calculation integrated device, which can reduce the overhead and improve the calculation efficiency when neural network calculation is performed.
In order to achieve the above purpose, the embodiment of the application adopts the following technical scheme:
in a first aspect, an embodiment of the present application provides a unified device for memory and calculation, where the unified device for memory and calculation includes a bit width calculation module, a calculation module, and a result processing module. The computing module comprises a computing array, and the computing array comprises a plurality of storage computing units, wherein the storage computing units are used for storing weight data. The bit width calculation module is used for calculating a plurality of input data to obtain a plurality of effective data, the plurality of effective data are input into the calculation module, the plurality of input data are in one-to-one correspondence with the plurality of effective data, first input data in the plurality of input data are corresponding to first effective data in the plurality of effective data, and the bit width of the first input data is larger than that of the first effective data. The computing module is used for obtaining the computing result of each column in the computing array according to the bits of the plurality of effective data and the weight data, and inputting the computing result of each column into the result processing module, wherein one column of computing result is the sum of the products computed by the same bit of the plurality of effective data and one column of storage computing unit. The result processing module is used for carrying out weighted calculation on the calculation result of each column to obtain a final result.
Therefore, compared with the prior-art calculation method in which multi-bit input data is expanded into a plurality of single-bit/low-bit input data according to the data bit width for input and calculation, which causes the expansion calculation to be performed too many times and generates a large overhead, the device provided by the embodiment of the application calculates on effective data whose bit width is smaller than that of the input data, so the number of expansion calculations and the calculation overhead are reduced.
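For illustration only, the following minimal Python sketch outlines this data flow. It is not part of the patent: all function and variable names are assumptions, the logical-OR mask follows the example given later in the description, and the MSB-first binary column weighting is an assumption of the sketch.

```python
# Minimal sketch of the first-aspect data flow, under the assumptions stated above.

def effective_data(inputs, width):
    """Bit width calculation module: keep only the bit positions where at least
    one input has a 1 (mask = bitwise OR of all inputs), high bit first."""
    mask = 0
    for x in inputs:
        mask |= x
    positions = [p for p in range(width - 1, -1, -1) if (mask >> p) & 1]
    return positions, [[(x >> p) & 1 for p in positions] for x in inputs]

def column_sums(bit_slice, weight_bits):
    """Calculation module: for one bit-slice of the effective data, each column
    returns the sum of products with the weight bit it stores."""
    rows, cols = len(weight_bits), len(weight_bits[0])
    return [sum(bit_slice[r] * weight_bits[r][c] for r in range(rows))
            for c in range(cols)]

def final_result(inputs, weight_bits, width):
    """Result processing module: weight each column sum by its (assumed binary,
    MSB-first) column weight and each bit-slice by the bit weight of its valid bit."""
    positions, eff = effective_data(inputs, width)
    cols = len(weight_bits[0])
    col_w = [1 << (cols - 1 - c) for c in range(cols)]
    total = 0
    for pos, bit_slice in zip(positions, zip(*eff)):
        sums = column_sums(list(bit_slice), weight_bits)
        total += (1 << pos) * sum(s * w for s, w in zip(sums, col_w))
    return total

# Check against a direct multiply-accumulate for one 4-bit weight column group.
inputs = [0b00001101, 0b00010100, 0b00001001, 0b00000001]
weights = [[0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]   # 6, 11, 2, 13
reference = sum(x * int("".join(map(str, w)), 2) for x, w in zip(inputs, weights))
assert final_result(inputs, weights, 8) == reference
```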
In one possible design, the bit width calculation module is specifically configured to perform mask calculation on a plurality of input data to obtain a mask value, determine a plurality of valid data according to valid bits of the mask value, and input the plurality of valid data into the calculation module bit by bit, so that the calculation module calculates the plurality of valid data bit by bit. Therefore, according to the calculation method, the bit width calculation module calculates the effective data of the input data through the mask, and the effective data is input into the calculation module bit by bit, so that the calculation times of the calculation array can be greatly reduced.
In one possible design, when the computing array receives the nth bit corresponding to each of the plurality of valid data, where N is an integer greater than or equal to 0, the computing array is configured to compute a product of the nth bit corresponding to each of the plurality of valid data and the bit of the weight data; the computing module also comprises an accumulating circuit, wherein the accumulating circuit is used for adding the products calculated by the storage computing units in the same column in the computing array to obtain the sum of the products calculated by the storage computing units in each column in the computing array. Therefore, according to the calculation method provided by the application, the calculation module calculates the Nth bit corresponding to each of the plurality of effective data, the calculation times of the calculation module correspond to the bit width of the effective data, and the number of times of calculation of the calculation array can be effectively reduced because the bit width of the effective data is smaller than the bit width of the input data.
In one possible design, the weight data includes multiple kinds of weight data, and the storage and calculation integrated device further includes a weight bit width configuration module; the weight bit width configuration module is used for storing bit width information of the multiple kinds of weight data, where the bit width information comprises the bit width of each weight data and an identification of the starting column of each weight data in the calculation array, and the bit widths of at least two of the multiple kinds of weight data are different. Therefore, compared with the prior art, in which the bit width of the weight data is fixed and mixed-precision calculation of the weight data is therefore impossible, resulting in low calculation efficiency, the calculation method provided by the application supports weight data of different bit widths through the bit width information, which improves calculation efficiency.
In one possible design, the storage and calculation integrated device further includes a control module for writing the plurality of weight data into the plurality of storage calculation units according to the bit width information. Therefore, according to the calculation method provided by the application, the control module can deploy the weight data into each storage calculation unit in the calculation array according to the bit width information, so that weight data of various bit widths can coexist in a single calculation array, mixed-precision calculation of the weight data is realized, and the calculation efficiency of the storage and calculation integrated device is improved.
In one possible design, the control module is further configured to determine the valid bits of the mask value bit by bit, and to generate a first control signal and a second control signal when any bit of the mask value is determined to be valid. The first control signal is used for instructing the calculation module to calculate the sum of products of each column of storage calculation units in the calculation array; the second control signal is used for instructing the result processing module to perform weighted calculation, according to the bit width information, on the sums of products of the multiple columns of storage calculation units corresponding to each weight data in the calculation array, so as to obtain a plurality of weighted results of the Nth bit corresponding to each of the plurality of effective data, where each of the plurality of weighted results corresponds to one weight data. Therefore, according to the calculation method provided by the application, the control module can generate the control signals according to the valid bits of the mask value to control the calculation module and the result processing module. Because the number of valid bits of the mask value equals the bit width of the effective data, which is generally smaller than the bit width of the input data, generating the control signals according to the valid bits of the mask value reduces the number of calculations of the calculation module and reduces the calculation overhead.
In one possible design, the control module is further configured to generate a third control signal when it is determined that the bit width of the mask value is equal to the bit width of the input data. The third control signal is used for instructing the result processing module to perform weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the plurality of weighted results of each bit of the plurality of effective data, so as to obtain a final result, where the final result comprises the weighted result of each kind of weight data. Therefore, according to the calculation method provided by the application, after the calculation of the calculation module is finished, the result processing module performs weighted calculation according to the bit width information and the bit weights of the valid bits of the mask value, so that the results of multiple calculations of single-bit effective data with multi-bit weight data can be accurately converted into the calculation result of the multi-bit input data with the multi-bit weight data. On the premise that the calculation precision remains unchanged, the number of calculations and the overhead are effectively reduced.
In a second aspect, an embodiment of the present application provides a calculation method applied to a storage and calculation integrated device, where the device includes a calculation array comprising a plurality of storage calculation units for storing weight data. The method comprises the following steps: calculating a plurality of input data to obtain a plurality of effective data, where the plurality of input data correspond one-to-one to the plurality of effective data, first input data in the plurality of input data corresponds to first effective data in the plurality of effective data, and the bit width of the first input data is larger than that of the first effective data; obtaining a calculation result of each column in the calculation array according to the plurality of effective data and the bits of the weight data, where the calculation result of one column is the sum of the products calculated by the same bit of the plurality of effective data and one column of storage calculation units; and performing weighted calculation on the calculation result of each column to obtain a final result. For the advantages achieved by the second aspect, reference may be made to the advantages of the first aspect.
In one possible design, calculating the plurality of input data to obtain the plurality of valid data includes: performing mask calculation on the plurality of input data to obtain a mask value, and determining the plurality of valid data according to the valid bits of the mask value. Obtaining the calculation result of each column in the calculation array according to the plurality of valid data and the bits of the weight data includes: calculating the plurality of valid data bit by bit with the bits of the weight data to obtain the calculation result of each column in the calculation array.
In one possible design, obtaining the calculation result of each column in the calculation array according to the bits of the plurality of valid data and the weight data includes: when the computing array receives N-th bit corresponding to each of the plurality of effective data, wherein N is an integer greater than or equal to 0, calculating the product of the N-th bit corresponding to each of the plurality of effective data and the bit of the weight data, and adding the products calculated by the same column of storage computing units in the computing array to obtain the sum of the products calculated by each column of storage computing units in the computing array.
In one possible design, the method further comprises: and storing bit width information of multiple weight data, wherein the bit width information comprises the bit width of each weight data and the identification of a starting column corresponding to each weight data in a computing array, and the bit widths of at least two weight data in the multiple weight data are different.
In one possible design, the weight data includes a plurality of weight data, the method further comprising: the plurality of weight data is written into the plurality of storage computing units according to the bit width information.
In one possible design, the method further comprises: determining the valid bits of the mask value bit by bit, and generating a first control signal and a second control signal when any bit of the mask value is determined to be valid. The first control signal is used for calculating and obtaining the sum of products of each column of storage calculation units in the calculation array; the second control signal is used for performing weighted calculation, according to the bit width information, on the sums of products of the multiple columns of storage calculation units corresponding to each weight data in the calculation array, so as to obtain a plurality of weighted results of the Nth bit corresponding to each of the plurality of effective data, where each of the plurality of weighted results corresponds to one weight data.
In one possible design, the method further comprises: when the bit width of the mask value is equal to the bit width of the input data, generating a third control signal, where the third control signal is used for performing weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the plurality of weighted results of each bit of the plurality of effective data, so as to obtain a final result, and the final result comprises the weighted result of each kind of weight data.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing computer instructions that, when executed on an electronic device, cause the electronic device to perform the method of any one of the second aspect and the possible designs of the second aspect.
In a fourth aspect, an embodiment of the present application provides a computer program product which, when run on a computer, causes an electronic device to carry out the method of any one of the second aspect and the possible designs of the second aspect.
For the corresponding advantageous effects of the other aspects mentioned above, reference may be made to the description of the advantageous effects of the first aspect; they are not repeated here.
Drawings
FIG. 1 is a schematic diagram of a simulated computing array;
FIG. 2 is a schematic diagram of a digital computing array;
FIG. 3 is a schematic structural diagram of an integrated storage device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a computing array according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of a calculation method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of calculating effective data according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a computing module according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a control module according to an embodiment of the present disclosure;
FIG. 9 is a flowchart of a calculation method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a storage-calculation integrated device according to an embodiment of the present application.
Detailed Description
For ease of understanding, descriptions of some concepts related to the embodiments of the present application are given below by way of example for reference:
artificial neural network (artificial neural network, ANN): a neural network or neural-like network for short is a mathematical or computational model that mimics the structure and function of a biological neural network (central nervous system, such as the brain) for estimating or approximating a function. A neural network is made up of a large number of nodes (neurons) interconnected, each node representing a specific output function, called an excitation function or activation function (activation function), and each link between two nodes representing a weight for a signal passing through the connection, called weight data.
Neural network accelerator: an application specific integrated circuit (application specific integrated circuit, ASIC) chip suitable for artificial neural network reasoning or training is used for performing the calculation of the neural network and improving the calculation efficiency of the neural network.
Storage and calculation integration (compute in memory): operations are moved from the central processing unit (CPU) of the computer into the memory and are performed in storage calculation units (cells), which can greatly reduce the data exchange time and the data access energy consumption during calculation.
There are two implementations of a storage and calculation integrated device: constructing the computing array with analog devices (e.g., resistive random-access memory (ReRAM)), or constructing the computing array with digital devices (e.g., static random-access memory (SRAM)).
Fig. 1 is a schematic diagram of an analog computing array constructed by analog devices, where the analog devices can be understood as storage computing units arranged in an array form: the analog devices in the same row share a word line, and the analog devices in the same column share a bit line. The conductance of an analog device can be understood as weight data, the voltage can be understood as input data, and the input voltages on the same word line are the same. The current output by each bit line represents the sum of the products of conductance and voltage over the analog devices (located in the same column) sharing that bit line, i.e., the sum of the products of the column's weight data and the input data. For example, in a 4×4 analog computing array, the conductances in the first column are G1, G2, G3 and G4, i.e., the weight data in the first column are G1, G2, G3 and G4, and the input voltages of the rows are V1, V2, V3 and V4, i.e., the input data are V1, V2, V3 and V4; the input data are input in parallel, so the current I1 = G1×V1 + G2×V2 + G3×V3 + G4×V4 output by the first column represents the sum of the products of the weight data and the input data in the first column.
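As a simple illustration (not taken from the patent; the numerical values are placeholders), the sum-of-products behaviour of one bit line can be sketched as follows:

```python
# Sketch of the bit-line behaviour described above: the current on a bit line is
# the sum of conductance x voltage over the analog devices in that column.
conductances = [1.0, 2.0, 3.0, 4.0]   # G1..G4, one device per row of the column
voltages = [0.5, 0.0, 0.5, 1.0]       # V1..V4, input voltages on the word lines

i1 = sum(g * v for g, v in zip(conductances, voltages))   # I1 = G1*V1 + G2*V2 + G3*V3 + G4*V4
print(i1)
```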
Fig. 2 is a schematic diagram of a digital computing array constructed by using digital devices, in which, when performing neural network computation, a weight data is stored in each storage computing unit, an input unit inputs input data to each storage computing unit in the digital computing array, the input data of the storage computing units located in the same row are the same, multiplication computation of the weight data and the input data is performed on the storage computing units, and the multiplication computation results on the same column are accumulated by a peripheral accumulation circuit, so as to obtain the sum of the products of the weight data and a plurality of input data of each column.
Both implementations can input multiple input data in parallel on a row and perform multiple multiply-accumulate calculations in parallel on a column.
Data bit width: bit width for short, measured in bits, represents the number of binary digits transferred at one time by the bus. A bit is the smallest unit of data storage within a computer; for example, 11010100 is an 8-bit binary number, i.e., its bit width is 8, and it may be referred to as 8-bit data.
Compute array (crossbar, XB), in this application, refers to a compute array constructed from memory compute units, each compute array containing several rows and several columns.
Bit weight: the unit value corresponding to each fixed position in a number is called the bit weight. For a multi-digit number, the value represented by a "1" at a certain position is called the bit weight of that position. For example, for a decimal number, the bit weight of the 2nd digit from the right is 10 and the bit weight of the 3rd digit is 100; for a binary number, the bit weight of the 2nd digit from the right is 2 and the bit weight of the 3rd digit is 4. In general, for an N-ary number, the bit weight of the i-th digit of the integer part (counting from the right) is N^(i-1), and the bit weight of the j-th digit of the fractional part (counting from the left) is N^(-j).
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the embodiments of the present application, "/" means "or" unless otherwise indicated; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, in the description of the embodiments, unless otherwise specified, "a plurality" means two or more.
The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present embodiment, unless otherwise indicated, the meaning of "coupled" means that two or more circuit elements are directly connected or indirectly connected, for example, a and B coupling may mean that a is directly connected to B, or that a is connected to B through C.
Currently, when a neural network accelerator performs computation using a storage and calculation integrated device whose computing array is an analog computing array constructed from analog devices, the computing array is generally limited to low-bit computation, because the analog computing array is constrained by the precision of the analog devices and by the cost of devices such as the analog-to-digital converter (ADC) and digital-to-analog converter (DAC). For example, suppose the input data and the weight data are both 16 bits and the storage computing units use 2 bits, that is, each storage computing unit stores 2 bits of data; then 16-bit weight data needs to be stored by 8 storage computing units, which can be understood as 8 columns of storage computing units representing one column of weight data. When neural network calculation is performed, the 16-bit input data is expressed as a 0/1 voltage sequence of length 16, and 1 bit of the input data is input in parallel per clock cycle, starting from the low bit; that is, the storage calculation unit calculates once per clock cycle, each time computing the product of 1-bit input data and 2-bit weight data, so 16 clock cycles are needed to complete the calculation of 16-bit input data with 16-bit weight data. After each clock cycle, each column of storage computing units obtains one sum of products (the sum of products obtained from the same single bit of the input data being input in parallel), and after 16 clock cycles each column of storage computing units has output 16 sums of products, one per computation. The sum of products of one column of weight data with the plurality of input data is then obtained by combining, through shift addition, the 8 sums output by 8 consecutive columns of storage calculation units, which can be understood as I1 in fig. 1.
When the computing array is a digital computing array built with digital devices, multi-bit computation also needs to be implemented by multiple single-bit/low-bit computations, since digital computing arrays likewise tend to support only single-bit/low-bit computation. For example, suppose the input data and the weight data are both 4 bits and the storage computing unit is a single-bit multiplier, that is, each storage computing unit stores 1 bit of data; then 4-bit weight data needs to be stored by 4 storage computing units, which can be understood as 4 columns of storage computing units representing one column of weight data. When neural network calculation is performed, the input data is input bit by bit into the storage computing units located in the same row. In each calculation, a single bit of the input data is multiplied by all bits of the weight data, that is, by the 4 storage computing units that store one weight data; each storage computing unit calculates the product of 1 bit of input data and 1 bit of weight data, the combined result being the product of the single bit of input data and the 4-bit weight data, and the product results are output to a peripheral accumulation circuit. After each calculation is finished, the peripheral accumulation circuit adds the product results obtained from the same single bit of the plurality of input data, input in parallel, for the same column of weight data, so as to obtain 4 product accumulation results corresponding to the 4 bits of the plurality of input data. Finally, the peripheral accumulation circuit performs the corresponding shift summation on the 4 product accumulation results to obtain the sum of products of one column of weight data with the plurality of input data.
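The expansion-and-recombination pattern described in the two examples above can be sketched as follows (an illustrative assumption about the prior-art flow, not code from the patent); note that the number of expansions always equals the input bit width, regardless of the input value.

```python
def bit_serial_multiply(x, w, x_bits):
    """Multiply x by the weight w by expanding x into x_bits single-bit inputs
    (low bit first) and recombining the partial products by shift addition."""
    acc = 0
    for n in range(x_bits):          # one calculation per clock cycle
        x_bit = (x >> n) & 1         # single-bit input of this cycle
        partial = x_bit * w          # product of 1-bit input and the multi-bit weight
        acc += partial << n          # shift addition according to the bit weight
    return acc

# 8 expansion calculations are performed even though 00001010 has only two 1 bits.
assert bit_serial_multiply(0b00001010, 13, 8) == 0b00001010 * 13
```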
It can be seen that, when performing calculation with a storage and calculation integrated device, it is generally necessary to expand multi-bit input data into a plurality of single-bit/low-bit input data according to the data bit width for input and calculation. Since the bit width of the multi-bit input data is fixed, the number of expansion calculations is also fixed: the same number of expansion calculations is performed regardless of whether the value of the input data is large or small. For example, if the 8-bit input data is 00001010 and the multiplication is performed by expanding it into a plurality of single-bit input data, 8 expansions are performed and the multiplication is carried out 8 times, once for each single bit 0, 0, 0, 0, 1, 0, 1 and 0. Since the partial result contributed by any bit whose single-bit value is 0 is 0, the calculations on the single-bit value 0 can be understood as invalid among these 8 calculations. In practice, the values of most multi-bit data are small and do not require so many expansion calculations, so expanding multi-bit input data into a plurality of single-bit/low-bit input data strictly according to the data bit width introduces redundant calculation and generates high overhead. In addition, when calculation is performed with a storage and calculation integrated device, the bit width of the weight data is fixed, that is, the number of storage calculation units that must be deployed in the computing array is the same regardless of whether the value of the weight data is large or small, which leads to low calculation efficiency.
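A small illustrative count (not from the patent) of how many of those expansion calculations actually contribute to the result:

```python
x = 0b00001010
total_expansions = 8                        # fixed by the 8-bit data width
effective_expansions = bin(x).count("1")    # expansions whose input bit is 1
print(total_expansions, effective_expansions)   # 8 vs. 2: six calculations contribute 0
```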
Accordingly, the present application proposes a storage and calculation integrated device, which in the present application may be understood as a chip, such as a neural network accelerator. It addresses the problems in the prior art that, when a storage and calculation integrated device is used for neural network calculation, multi-bit input data is expanded into a plurality of single-bit/low-bit input data according to the data bit width for input and calculation, and the input data bit width and the weight data bit width are fixed, which results in large calculation overhead and low calculation efficiency. By calculating the effective data of the input data and configuring the bit width of the weight data, the number of expansion calculations of the computing array is effectively reduced, the calculation overhead is reduced, and the calculation efficiency is improved.
The storage and calculation integrated device provided by the embodiment of the application can be applied to calculation scenarios, for example, neural network calculation. When neural network calculation is performed, the storage and calculation integrated device calculates a plurality of weight data of the neural network with a plurality of input data.
As shown in fig. 3, a schematic structural diagram of a storage and calculation integrated device, which may be a chip, is illustrated in fig. 3 as chip 300. The chip 300 includes a data processing unit (processing element, PE) 301, a data exchange module (switch) 302, an input-output module (TxRx) 303, and the like.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the chip 300. In other embodiments of the present application, chip 300 may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The data processing unit 301 may include one or more data processing units, one data processing unit including a plurality of compute engines. A part of calculation engines are used for completing multiply-add calculation of the neural network, and in the embodiment of the present application, the calculation engines used for completing multiply-add calculation of the neural network include a bit width calculation module 3011, a calculation module 3012, a weight bit width configuration module 3013, a control module 3014 and a result processing module 3015. Another part of the calculation engine is used for completing calculation such as activation, point multiplication, point addition and division in the neural network.
The bit width calculation module 3011 may be configured to calculate valid data of the input data, for example, to perform a logical OR calculation on a plurality of input data to obtain a mask value, determine a plurality of valid data of the plurality of input data according to the mask value, and input the calculated plurality of valid data to the calculation module.
The computation module 3012 includes a computation array and an accumulation circuit. The computing array includes a plurality of storage computing units arranged in an array, each of which may be used to store bits of weight data, for example, 1 bit, 2 bits or 4 bits of multi-bit weight data. Referring to the 8×8 computing array shown in fig. 4, the computing array includes 8 columns of storage computing units, and each column includes 8 storage computing units. Taking the case where each storage computing unit stores 1 bit of data and the weight data uses 4 bits, one 4-bit weight data needs to be stored in 4 storage computing units; it can be understood that 4 columns of storage computing units represent one column of weight data, and one column of weight data comprises 8 pieces of 4-bit weight data. The calculation array may be used to calculate a plurality of valid data with a plurality of weight data, for example, to multiply the same bit (single bit/low bit) of the plurality of valid data with the bit of the weight data stored in each storage calculation unit, obtaining a plurality of product results (in one calculation, as many product results are obtained as there are storage calculation units participating in the calculation), and to input the plurality of product results to the accumulation circuit.
The accumulation circuit may be configured to accumulate a plurality of product results output by the computing array, for example, accumulate a plurality of product results obtained by the same column of storage computing units, obtain a sum of products of each column of storage computing units, and input the obtained sum of the plurality of products to the result processing module 3015.
The weight bit width configuration module 3013 may be used to store bit width information of various weight data; since one column of weight data constitutes one weight data, it can be understood that the weight bit width configuration module 3013 stores bit width information of multiple columns of weight data. The bit widths of the weight data in the same column are the same, while the bit widths of the weight data in different columns may be the same or different. The bit width information includes the bit width of each weight data and the identification of the starting column of each weight data in the computing array, which can be understood as including the bit width of each column of weight data and the identification of the starting column of each column of weight data in the computing array. Taking the 8×8 computing array shown in fig. 4 as an example, the columns of the computing array, from left to right, are the 0th-column storage computing units, the 1st-column storage computing units, ..., and the 7th-column storage computing units. If, in the bit width information stored in the weight bit width configuration module 3013, the bit width of the 0th-column weight data is 4 bits and the identification of its starting column in the computing array is the 0th-column storage computing unit, then the 0th-column weight data occupies the 0th- to 3rd-column storage computing units, as shown in fig. 4.
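As an illustration of what the stored bit width information might look like, the configuration described above could be recorded as follows. This is a hypothetical representation, not a structure defined by the patent; only the first entry corresponds to the fig. 4 example, and the 2-bit entries are invented to show mixed precision.

```python
# Hypothetical representation of the bit width information held by the weight
# bit width configuration module: one entry per column of weight data.
bit_width_info = [
    {"weight_id": 0, "bit_width": 4, "start_col": 0},   # 0th-column weight data: columns 0-3 (fig. 4)
    {"weight_id": 1, "bit_width": 2, "start_col": 4},   # an assumed 2-bit weight data: columns 4-5
    {"weight_id": 2, "bit_width": 2, "start_col": 6},   # another assumed 2-bit weight data: columns 6-7
]

def columns_of(entry):
    """Columns of the computing array occupied by one column of weight data."""
    return list(range(entry["start_col"], entry["start_col"] + entry["bit_width"]))
```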
The control module 3014 may be configured to write various weight data stored in the memory to a plurality of storage computing units according to the bit width information in the weight bit width configuration module 3013. The control module 3014 may also be used to generate control signals to control the calculation module 3012 and the result processing module 3015. For example, when the control module 3014 determines that any one bit of the mask value obtained by the bit width calculation module 3011 is valid, it generates a first control signal and a second control signal. The first control signal is used for instructing the calculation module 3012 to multiply the same bit (single bit/low bit) of the plurality of valid data with the bit of the weight data stored in each storage calculation unit and to input the obtained sums of products to the result processing module 3015. The second control signal is used for instructing the result processing module 3015 to perform weighted calculation, according to the bit width information, on the sums of products of the multiple columns of storage calculation units corresponding to each weight data in the calculation array, so as to obtain a plurality of weighted results of the Nth bit corresponding to each of the plurality of effective data, where the lowest bit is the 0th bit and N is an integer greater than or equal to 0. The control module 3014 may further generate a third control signal when determining that the bit width of the mask value is equal to the bit width of the input data, where the third control signal is used to instruct the result processing module 3015 to perform weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the plurality of weighted results of each bit of the plurality of valid data, so as to obtain the weighted result of each weight data.
The result processing module 3015 may be configured to perform, after receiving a control signal sent by the control module 3014, the corresponding action according to that control signal. For example, when the second control signal is received, the sums of products of the multiple columns of storage computing units corresponding to each weight data in the computing array are weighted according to the bit width information, so as to obtain a plurality of weighted results of the Nth bit corresponding to each of the plurality of effective data. When the third control signal is received, weighted calculation is performed according to the bit weights corresponding to the valid bits of the mask value and the plurality of weighted results of each bit of the plurality of valid data, so as to obtain the weighted result of each weight data.
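A minimal sketch of the two weighting steps performed by the result processing module is given below. It reuses the hypothetical bit_width_info entries sketched above and assumes binary column weights with the most significant weight bit stored in each group's starting column; neither assumption is stated explicitly in the patent.

```python
def weight_columns(col_sums, bit_width_info):
    """Second control signal: for the Nth bit of the effective data, combine the
    per-column sums of each weight data's column group into one weighted result."""
    results = {}
    for entry in bit_width_info:
        w, start = entry["bit_width"], entry["start_col"]
        group = col_sums[start:start + w]                   # columns of this weight data
        results[entry["weight_id"]] = sum(s << (w - 1 - k)  # assumed MSB-first columns
                                          for k, s in enumerate(group))
    return results

def weight_bits(per_bit_results, valid_positions):
    """Third control signal: weight each bit's results by the bit weight (2**position)
    of the corresponding valid bit of the mask value to obtain the final result."""
    final = {}
    for pos, results in zip(valid_positions, per_bit_results):
        for wid, value in results.items():
            final[wid] = final.get(wid, 0) + (value << pos)
    return final
```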
The data exchange module 302 may be used to implement data exchange between various units within the chip, for example, between the input/output module 303 and the plurality of data processing units 301.
The input-output module 303 may be used to receive input data and weight data, and may also be used to output the resulting final result in the data processing unit 301. For example, the input/output module 303 may interact with an off-chip memory (in which input data and weight data are stored), receive the input data and weight data, and input the input data and weight data into the data processing unit 301 through the data exchange module 302. The resulting end result in the data processing unit 301 may also be output to off-chip memory or on-chip cache (not shown in fig. 3), without limitation.
Based on the storage and calculation integrated device provided by the application, the calculation method provided for this device is described below, taking the device as a chip as an example. In the process of the chip performing neural network calculation, the process of calculating the effective data of each of the plurality of input data and the process of calculating the plurality of effective data with the plurality of weight data are introduced.
As shown in fig. 5, the embodiment of the present application provides a calculation method, which is applied to a memory integrated device, taking the memory integrated device as a chip 300 as an example, the chip 300 includes a bit width calculation module 3011, a calculation module 3012, and a result processing module 3015. Wherein the calculation module comprises a calculation array comprising a plurality of storage calculation units, each of the plurality of storage calculation units being for storing bits of weight data, see the description of the calculation array shown in fig. 4. The method comprises the following steps:
step 501, calculating a plurality of input data to obtain a plurality of valid data.
When multiplication is performed by expanding the input data, the result of the multiplication on a bit of the input data that is 0 is 0, and such a calculation can be considered invalid; multiplication on a bit of the input data that is 1 can be understood to be valid. The valid data of the input data can therefore be understood as data composed of the valid bits (bits that are 1) of the input data. The plurality of input data correspond one-to-one to the plurality of valid data; first input data in the plurality of input data corresponds to first valid data in the plurality of valid data, and the bit width of the first input data is larger than the bit width of the first valid data. The first input data may be any one of the plurality of input data. Because the result obtained by performing the neural network calculation on the valid data of the input data is the same as that obtained by performing it on the input data, the accuracy of the calculation result can be ensured. In addition, because the bit width of the first input data is larger than that of the first valid data, the number of multiplications performed on the valid data is smaller than the number of multiplications performed on the input data, so the number of calculations of the calculation module can be effectively reduced and the overhead can be reduced.
Specifically, step 501 is that the bit width calculation module 3011 calculates a plurality of input data to obtain a plurality of valid data, and inputs the plurality of valid data to the calculation module 3012.
Illustratively, the bit width calculation module 3011 is capable of obtaining a plurality of input data from the input/output module 303, the bit width calculation module 3011 calculates valid data of each of the plurality of input data, and the calculated plurality of valid data is input to the calculation module 3012 for calculation.
In some alternative embodiments, step 501 comprises: and performing mask calculation on the plurality of input data to obtain a mask value, and determining a plurality of effective data according to the effective bits of the mask value. Specifically, the bit width calculation module 3011 performs mask calculation on a plurality of input data to obtain a mask value, and determines a plurality of valid data according to valid bits of the mask value.
The valid data of the plurality of input data is determined according to the plurality of input data, and one method for calculating the valid data of the plurality of input data is to perform mask calculation on the plurality of input data. Taking the mask calculation as a logical OR calculation as an example, the plurality of input data are subjected to a logical OR calculation bit by bit, that is, the same bits of the plurality of input data are logically ORed in order from the highest bit to the lowest bit, thereby obtaining a mask calculation result, namely the mask value; the valid data of each of the plurality of input data is then determined according to the valid bits (bits that are 1) of the mask value.
As illustrated in fig. 6, the plurality of input data is exemplified by 4 pieces of 8-bit input data: 00001101, 00010100, 00001001 and 00000001. The same bits of the 4 8-bit input data are logically ORed in order from the most significant bit to the least significant bit; for example, the most significant bits (7th bits) of the 4 8-bit input data are all 0, so the logical OR result is 0, while the least significant bits (0th bits) are 1, 0, 1 and 1 respectively, so the logical OR result is 1. After performing the logical OR calculation on the 4 8-bit input data bit by bit, the mask value is 00011101. The valid bits of the mask value are the 4th bit, the 3rd bit, the 2nd bit and the 0th bit, and the digits at the 4th, 3rd, 2nd and 0th bits of each input data are extracted as the valid data of that input data. The resulting valid data of the 4 8-bit input data are 0111, 1010, 0101 and 0001, respectively.
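The example of fig. 6 can be reproduced with a short sketch (illustrative code, not part of the patent):

```python
inputs = [0b00001101, 0b00010100, 0b00001001, 0b00000001]   # the 4 pieces of 8-bit input data

mask = 0
for x in inputs:                       # logical OR of the same bits of all inputs
    mask |= x
print(format(mask, "08b"))             # 00011101

valid_positions = [p for p in range(7, -1, -1) if (mask >> p) & 1]   # [4, 3, 2, 0]
valid_data = ["".join(str((x >> p) & 1) for p in valid_positions) for x in inputs]
print(valid_data)                      # ['0111', '1010', '0101', '0001']
```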
After obtaining the valid data of each of the plurality of input data, the bit width calculation module 3011 inputs the valid data bit by bit to the calculation module 3012, so that the calculation module 3012 calculates the valid data bit by bit with the bit of the weight data stored in each storage calculation unit and obtains the calculation result of each column in the calculation array. The result obtained by the calculation module 3012 from the plurality of valid data is consistent with the result that would be obtained from the plurality of input data. Illustratively, taking the plurality of valid data 0111, 1010, 0101 and 0001 shown in fig. 6 as an example, the plurality of valid data is input to the calculation module 3012 bit by bit in parallel, in order from the high bit to the low bit. For example, the most significant bits 0, 1, 0 and 0 of the plurality of valid data are first input into the calculation module 3012 in parallel, and then the remaining bits are input into the calculation module 3012 in parallel in sequence, so that the calculation module 3012 calculates the plurality of valid data bit by bit.
In some embodiments, the bit width calculation module 3011 may also determine the valid bits of the multiple input data bit by bit (i.e., calculate the mask value of the 4 input data bit by bit), and when any bit is determined to be valid, input that bit of the multiple input data to the calculation module 3012 for calculation. Illustratively, taking the 4 input data 00001101, 00010100, 00001001 and 00000001 as an example, the bit width calculation module 3011 determines the valid bits of the 4 input data bit by bit; when the 4th bit is examined, it determines that the 4th bit is valid and inputs the 4th bits of the 4 input data to the calculation module 3012 for calculation, and so on, while bits determined to be invalid are not input to the calculation module 3012.
It will be appreciated that the bit width calculation module 3011 obtains multiple input data from the input/output module 303, obtains multiple input data each time, calculates valid data of the obtained multiple input data each time, and inputs the calculated multiple valid data to the calculation module 3012. The bit width of the valid data is related to the plurality of input data acquired each time, and the bit widths of the plurality of valid data calculated each time may be the same or different, so the bit width calculation module 3011 can dynamically calculate the valid data of the plurality of input data.
In some alternative embodiments, the mask calculation may also be other calculation manners, for example, by determining a maximum value of a plurality of input data, directly determining whether high-order data of the mask is zero, and the like, which is not limited in the application. In addition, when there is only one input data, the mask value is the input data, and the bit width calculation module 3011 may directly determine the valid data of the input data according to whether each bit of the input data is 1.
In some alternative embodiments, depending on the device and circuit implementation, the bit width calculation module 3011 may also expand the calculated valid data into other low-bit forms for input into the calculation module 3012, for example, expand the valid data into 2-bit form for input into the calculation module 3012, which is not limited in this application.
Step 502, obtaining a calculation result of each column in the calculation array according to the bits of the plurality of effective data and the weight data.
Wherein, a column of calculation results in the calculation results of each column is the sum of the same bit of a plurality of valid data and the products calculated by a column of storage calculation units. Specifically, step 502 is that the calculation module 3012 calculates, according to the plurality of valid data and the bits of the weight data stored in each storage calculation unit, a calculation result of each column in the calculation array, and inputs the calculation result of each column to the result processing module 3015. The calculation module 3012 includes a calculation array including a plurality of storage calculation units into which one weight data is spread out to be stored as a plurality of single bit/low bit weight data, and a bit of the weight data stored by each storage calculation unit may be understood as a partial bit of one weight data stored by each storage calculation unit, which may be a single bit or a plurality of bits. The calculation module 3012 multiplies the bits of the weight data stored in each storage calculation unit by the plurality of valid data input by the bit width calculation module 3011, and specifically, each valid data in the plurality of valid data is input into a different row in the calculation array, that is, each valid data corresponds to a row of storage calculation units, and each valid data multiplies the bit of the weight data stored in each corresponding storage calculation unit. After the calculation is finished, each column in the calculation array corresponds to a calculation result, the calculation result of each column is the sum of the products of the plurality of valid data and the column, and the calculation module 3012 inputs the calculation result of each column into the result processing module 3015.
In some alternative embodiments, step 502 includes: when the computing array receives the N-th bit corresponding to the plurality of effective data respectively, the computing array computes the product of the N-th bit corresponding to the plurality of effective data respectively and the bit of the weight data.
Wherein N is an integer greater than or equal to 0. The "each valid data performs multiplication calculation with the bit of the weight data stored in each corresponding storage calculation unit" is specifically that a plurality of times of calculation are performed, and each time of calculation, a single bit of each valid data performs multiplication calculation with the bit of the weight data stored in each corresponding storage calculation unit, and the number of times of calculation is determined according to the bit width of the valid data. For example, 4 bits of valid data are calculated 4 times, and each time a single bit of the valid data is calculated. And in each calculation, the same single bit of the plurality of effective data is calculated in parallel, that is, the N-th bit corresponding to the plurality of effective data is calculated in parallel, which can be understood as that when the calculation array receives the N-th bit corresponding to the plurality of effective data, the calculation array performs one calculation.
For example, fig. 7 illustrates a computing module 700, which includes a 4×8 computing array 701. The valid data of the input data and the weight data are both 4 bits, and each storage computing unit stores 1 bit of weight data and multiplies it by 1 bit of input data. The plurality of valid data are a1b1c1d1, a2b2c2d2, a3b3c3d3 and a4b4c4d4, and one column of weight data in the computing array 701 is stored row by row as A1B1C1D1, A2B2C2D2, A3B3C3D3 and A4B4C4D4. The 3rd bits (the most significant bits) of the valid data are a1, a2, a3 and a4; when the computing array 701 receives a1, a2, a3 and a4, it feeds them into different rows of the computing array 701, specifically into each storage computing unit on the corresponding row. Taking a1 as an example, a1 is multiplied by the weight-data bit stored in each storage computing unit on its row, yielding product results a1×A1, a1×B1, a1×C1, a1×D1 and so on. Similarly, a2, a3 and a4 are multiplied to obtain their product results. When the calculation of a1, a2, a3 and a4 ends, the computing array 701 has completed one calculation. It can be appreciated that 4-bit valid data requires 4 such calculations to finish the entire valid data: after the calculation of a1, a2, a3 and a4 ends, the calculations for b1, b2, b3 and b4, for c1, c2, c3 and c4, and for d1, d2, d3 and d4 are performed, i.e. 3 further calculations.
In some alternative embodiments, the computing module further includes an accumulation circuit that sums the products calculated by the same column of storage computing units in the computing array to obtain a sum of the products calculated by each column of storage computing units in the computing array.
After each calculation of the calculation array ends, the accumulation circuit accumulates the results obtained by the calculation array; specifically, it accumulates the product results calculated by the same column of storage calculation units in the calculation array to obtain the calculation result of each column in the calculation array, i.e., the sum of the products calculated by each column of storage calculation units, and inputs the sum of products of each column of storage calculation units to the result processing module 3015.
For example, as shown in fig. 7, after the calculation array 701 finishes the calculation of a1, a2, a3 and a4, the accumulation circuit 702 accumulates the product results calculated by each column of storage calculation units of the calculation array 701. The accumulation circuit 702 accumulates the product results calculated by the 0th column storage calculation units to obtain the sum S3 = a1×A1 + a2×A2 + a3×A3 + a4×A4, accumulates the product results calculated by the 1st column storage calculation units to obtain the sum S2 = a1×B1 + a2×B2 + a3×B3 + a4×B4, accumulates the product results calculated by the 2nd column storage calculation units to obtain the sum S1 = a1×C1 + a2×C2 + a3×C3 + a4×C4, accumulates the product results calculated by the 3rd column storage calculation units to obtain the sum S0 = a1×D1 + a2×D2 + a3×D3 + a4×D4, and inputs the sums S3, S2, S1 and S0 to the result processing module 3015, and so on for the remaining columns of the array. It can be understood that each time the calculation array 701 receives the N-th bits corresponding to the plurality of valid data and performs one calculation, the accumulation circuit 702 obtains the sum of products of each column of storage calculation units and inputs it to the result processing module 3015; for 4-bit valid data, the accumulation circuit 702 inputs 4 such sets of calculation results to the result processing module 3015.
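As a concrete illustration of one such cycle, the following Python sketch (not part of the patent; the weight bit values are hypothetical, since fig. 7 does not fix them) computes the per-column sums for one set of input bits: each row's input bit is multiplied by the weight bits stored in that row, and the column-wise accumulation yields S3, S2, S1 and S0.

```python
# Minimal sketch of one bit-serial cycle over the first weight data
# (columns 0-3) of the computing array in fig. 7; the weight bit values
# below are hypothetical.

# Row i stores the 4 bits of one weight, most significant bit in column 0
# (A_i B_i C_i D_i).
weight_bits = [
    [1, 0, 1, 1],   # A1 B1 C1 D1 (hypothetical values)
    [0, 1, 1, 0],   # A2 B2 C2 D2
    [1, 1, 0, 1],   # A3 B3 C3 D3
    [0, 0, 1, 1],   # A4 B4 C4 D4
]

# N-th bit of each valid data (a1, a2, a3, a4), fed in parallel, one per row;
# here the most significant bits 0, 1, 0, 0 from the example of fig. 7/8.
input_bits = [0, 1, 0, 0]

# Each storage computing unit multiplies its stored weight bit by its row's
# input bit; the accumulation circuit then sums each column over all rows.
num_rows, num_cols = len(weight_bits), len(weight_bits[0])
column_sums = [
    sum(input_bits[r] * weight_bits[r][c] for r in range(num_rows))
    for c in range(num_cols)
]
print(column_sums)   # [S3, S2, S1, S0] for this cycle -> [0, 1, 1, 0]
```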
In some alternative embodiments, the integrated memory device further includes a weight bit width configuration module that stores bit width information for a plurality of weight data.
Here, the weight data includes a plurality of kinds of weight data, and the weight bit width configuration module may be the weight bit width configuration module 3013 in fig. 3. The bit width information includes the bit width of each weight data and an identification of the starting column of each weight data in the compute array. One weight data may be understood as one column of weight data; for example, in the 4×8 computing array 701 of fig. 7, the 0th column storage computing unit to the 3rd column storage computing unit represent one column of weight data (one weight data), and the computing array 701 may include multiple columns of weight data (multiple weight data), among which at least two weight data have different bit widths. In the computing array 701, the 0th column storage computing unit to the 3rd column storage computing unit constitute the 0th column of weight data of the computing array 701; its bit width is 4 bits, and the identification of its starting column is the 0th column storage computing unit.
Illustratively, the bit width information is shown in table 1 below; corresponding to the calculation array 701 shown in fig. 7, table 1 includes 3 kinds of weight data, namely the column 0 weight data, the column 1 weight data and the column 2 weight data. The bit width of the first kind of weight data (column 0 weight data) is 4 bits and its start column is identified as the column 0 storage computing unit, that is, the column 0 storage computing unit to the column 3 storage computing unit represent the first kind of weight data (column 0 weight data). The bit width of the second kind of weight data (column 1 weight data) is 2 bits and its start column is identified as the column 4 storage computing unit, that is, the column 4 storage computing unit and the column 5 storage computing unit represent the second kind of weight data (column 1 weight data). The bit width of the third kind of weight data (column 2 weight data) is 2 bits and its start column is identified as the column 6 storage computing unit, that is, the column 6 storage computing unit and the column 7 storage computing unit represent the third kind of weight data (column 2 weight data).
TABLE 1

Weight data identification | Bit width | Start column identification
Column 0 weight data | 4 bits | Column 0 storage computing unit
Column 1 weight data | 2 bits | Column 4 storage computing unit
Column 2 weight data | 2 bits | Column 6 storage computing unit
It can be seen that the weight bit width configuration module 3013 of the present application can store bit width information of multiple weight data, among which at least two weight data have different bit widths; that is, a single calculation array of the present application can contain weight data of multiple bit widths, so mixed-precision calculation of the weight data is supported, thereby effectively improving the calculation efficiency of the integrated storage and calculation device.
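As a simple illustration (purely a sketch; the patent does not prescribe any software data layout, and the field names are invented here), the bit width information of table 1 can be modelled as a list of records from which the columns belonging to each weight data are derived:

```python
# Hypothetical software model of the bit width information in table 1.
bit_width_info = [
    {"weight_id": 0, "bit_width": 4, "start_column": 0},
    {"weight_id": 1, "bit_width": 2, "start_column": 4},
    {"weight_id": 2, "bit_width": 2, "start_column": 6},
]

# Columns occupied by each weight data in the 4x8 computing array of fig. 7.
for entry in bit_width_info:
    cols = list(range(entry["start_column"],
                      entry["start_column"] + entry["bit_width"]))
    print(f"weight data {entry['weight_id']}: columns {cols}")
# weight data 0: columns [0, 1, 2, 3]
# weight data 1: columns [4, 5]
# weight data 2: columns [6, 7]
```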
In some alternative embodiments, the integrated memory device further includes a control module that writes the plurality of weight data to the plurality of memory computing units based on the bit width information.
The control module may be the control module 3014 in fig. 3. The control module 3014 can write various weight data stored in the memory 302 to a plurality of storage calculation units according to the bit width information in the weight bit width configuration module 3013.
Illustratively, taking the bit width information shown in table 1 and the computing array 701 shown in fig. 7 as an example, the control module 3014 writes each bit of the column 0 weight data (A1B1C1D1, A2B2C2D2, A3B3C3D3 and A4B4C4D4) stored in the memory 302 into the column 0 storage computing unit to the column 3 storage computing unit according to the bit width and the start column identification of the column 0 weight data shown in table 1, and so on, until all of the plurality of weight data in the memory 302 are written into the storage computing units of the computing array 701 according to the bit width information shown in table 1.
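A minimal sketch of this write step is shown below, under the assumptions that each storage computing unit holds one bit and that the start column holds the most significant bit of its weight data; the function name, the weight values and the memory layout are illustrative, not taken from the patent.

```python
# Sketch: write multi-bit weights into a bit-per-cell computing array
# according to the bit width information; purely illustrative.

def write_weights(array, weights_per_row, bit_width_info):
    """weights_per_row[r][k] is the integer value of the k-th weight data
    stored in row r; bits are written MSB-first from the start column."""
    for entry in bit_width_info:
        w, start, k = entry["bit_width"], entry["start_column"], entry["weight_id"]
        for r, row_weights in enumerate(weights_per_row):
            value = row_weights[k]
            for b in range(w):
                # bit (w-1-b) of the value goes to column start+b (MSB first)
                array[r][start + b] = (value >> (w - 1 - b)) & 1
    return array

bit_width_info = [
    {"weight_id": 0, "bit_width": 4, "start_column": 0},
    {"weight_id": 1, "bit_width": 2, "start_column": 4},
    {"weight_id": 2, "bit_width": 2, "start_column": 6},
]
array = [[0] * 8 for _ in range(4)]
weights_per_row = [(0b1011, 0b10, 0b01),
                   (0b0110, 0b11, 0b10),
                   (0b1101, 0b01, 0b11),
                   (0b0011, 0b10, 0b00)]
write_weights(array, weights_per_row, bit_width_info)
print(array[0])   # [1, 0, 1, 1, 1, 0, 0, 1]
```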
In some alternative embodiments, the control module determines the valid bits of the mask value bit by bit, and generates the first control signal and the second control signal when any bit of the mask value is determined to be valid.
The control module 3014 can generate a control signal according to the mask value calculated by the bit width calculation module 3011 to control the calculation module 3012 and the result processing module 3015. Specifically, the bit width calculation module 3011 inputs the mask value into the control module 3014 bit by bit, the control module 3014 determines whether each bit of the mask value is valid (i.e., 1) bit by bit, and when any bit of the mask value is determined to be valid, the control module 3014 generates the first control signal and the second control signal. It will be appreciated that there are several valid bits in the mask value and that the control module 3014 generates the first control signal and the second control signal several times.
The first control signal is used to instruct the calculation module 3012 to calculate the sum of products of each column of storage calculation units in the calculation array, which can be understood as instructing the calculation module 3012 to perform one calculation on the nth bit corresponding to each of the plurality of valid data shown in fig. 7, and obtain the sum of products of each column of storage calculation units in the calculation array.
The second control signal is used for instructing the result processing module 3015 to perform weighted calculation, according to the bit width information, on the sums of products of the plurality of columns of storage calculation units corresponding to each weight data in the calculation array, so as to obtain a plurality of weighted results of the N-th bits corresponding to the plurality of valid data. Since the bit width information indicates which columns of storage calculation units of the calculation array one weight data (one column of weight data) corresponds to, the result processing module 3015 is able to determine, from the bit width information, the plurality of columns of storage calculation units corresponding to each weight data. The result processing module 3015 performs the weighted calculation on the sums of products of the columns of storage calculation units corresponding to each weight data according to the bit weights of the weight-data bits. For example, for the column of storage calculation units corresponding to the least significant bit (bit 0) of the weight data, the sum of products of that column is multiplied by 2^0 and then accumulated; for the column of storage calculation units corresponding to bit 2 of the weight data, the sum of products of that column is multiplied by 2^2 and then accumulated; the multiplications by powers of 2 can be realized by shifting. It can be understood that the calculation array includes several weight data (several columns of weight data), so the result processing module 3015 obtains several weighted results from one weighted calculation. When the calculation array includes multiple weight data, the result processing module 3015, upon receiving the second control signal, obtains a plurality of weighted results of the N-th bits corresponding to the plurality of valid data, and each of the plurality of weighted results corresponds to one weight data.
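The following sketch illustrates this weighting for one weight data (assuming, as in fig. 7, that the start column stores the most significant weight bit; the helper and its example values are illustrative, not the patent's circuit): the sum of products of each column is shifted by the bit position of the weight bit stored in that column and accumulated.

```python
# Sketch: weight the per-column sums of one weight data by the bit weights
# of the weight bits (most significant bit in the start column), e.g.
# sum0 = S3*2^3 + S2*2^2 + S1*2^1 + S0*2^0 for a 4-bit weight data.

def weight_columns(column_sums, bit_width, start_column):
    total = 0
    for b in range(bit_width):
        shift = bit_width - 1 - b          # column start+b stores weight bit (bit_width-1-b)
        total += column_sums[start_column + b] << shift   # shift instead of multiply by 2^k
    return total

# Per-column sums produced by one array calculation over an 8-column array
# (illustrative values): columns 0-3 belong to a 4-bit weight data,
# columns 4-5 to a 2-bit weight data.
column_sums = [0, 1, 1, 2, 1, 0, 2, 1]
print(weight_columns(column_sums, bit_width=4, start_column=0))  # 0*8 + 1*4 + 1*2 + 2*1 = 8
print(weight_columns(column_sums, bit_width=2, start_column=4))  # 1*2 + 0*1 = 2
```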
For example, as shown in the control module 800 of fig. 8, the control module 800 includes a first comparator for comparing whether the bit inputted into the control module 800 is the same as 1, and if so, generating a first control signal and a second control signal, and if not, generating no first control signal and no second control signal.
Take the mask value 00011101 shown in fig. 6 as an example. The bit width calculation module 3011 inputs the mask value into the control module 800 bit by bit, from the most significant bit to the least significant bit. First, the bit width calculation module 3011 inputs the highest bit (7th bit), 0, of the mask value into the control module 800; the first comparator in the control module 800 finds that 0 differs from 1, i.e., determines that the bit is not a valid bit, and does not generate the first control signal and the second control signal. Similarly, when the bit width calculation module 3011 inputs the 4th bit, 1, of the mask value into the control module 800, the first comparator in the control module 800 finds that 1 equals 1, determines that the bit is a valid bit, and generates the first control signal and the second control signal.
The first control signal generated by the control module 800 is input to the calculation module 3012, and is used for instructing the calculation module 3012 to calculate the nth bit corresponding to each of the plurality of valid data. Accordingly, the first control signal generated according to the 4 th bit of the mask value instructs the calculation module 3012 to perform one calculation on the most significant bits (3 rd bit) 0, 1, 0 and 0 of the plurality of valid data, so as to obtain the sum of products of the storage calculation units in each column of the calculation array, i.e. S3, S2, S1 and S0 shown in fig. 7.
The second control signal generated by the control module 800 is input to the result processing module 3015 and is used for instructing the result processing module 3015 to perform weighted calculation on the sums of products generated by one calculation of the calculation module 3012. Taking the bit width information shown in table 1 and the calculation module 700 shown in fig. 7 as an example, the result processing module 3015 determines, from the bit width of the column 0 weight data in table 1 being 4 bits and its start column being identified as the column 0 storage calculation unit, that the column 0 storage calculation unit to the column 3 storage calculation unit in the calculation array 701 represent the first kind of weight data (column 0 weight data). The sums of products S3, S2, S1 and S0 corresponding to the column 0 storage calculation unit to the column 3 storage calculation unit obtained by the calculation module 700 are weighted with the bit weights of the weight-data bits to obtain a weighted result sum0, where sum0 corresponds to one weight data (column 0 weight data) and sum0 = S3×2^3 + S2×2^2 + S1×2^1 + S0×2^0, and so on. It can be understood that the calculation module 700 obtains 3 weighted results, corresponding to the column 0 weight data, the column 1 weight data and the column 2 weight data, respectively.
In some alternative embodiments, the control module generates the third control signal when it determines that the bit width of the mask value is equal to the bit width of the input data.
As can be seen from the description shown in fig. 6, the bit width of the input data is the same as the bit width of the mask value. Because the mask value is input into the control module 3014 bit by bit, when the control module 3014 determines that the bit width of the mask value is the same as the bit width of the input data, it can determine that the mask value is input completely, so as to generate a third control signal. It is understood that the control module 3014 outputs the third control signal after outputting the first control signal and the second control signal.
The third control signal is used for instructing the result processing module 3015 to perform weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the plurality of weighted results obtained for each bit of the plurality of valid data, so as to obtain a final result, where the final result includes a weighted result for each kind of weight data.
Illustratively, as shown in fig. 8, the control module 800 further includes a counter and a second comparator. Each time one bit of the mask value is input, the counter increments by 1, thereby recording the number of mask-value bits received. The second comparator is used for comparing whether the bit width of the mask value recorded in the counter is the same as the bit width of the input data; if so, the third control signal is generated, and if not, the third control signal is not generated.
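A behavioural sketch of this control flow is given below (the function, signal labels and generator style are illustrative assumptions, not the patent's circuit): the first comparator fires the first and second control signals for every mask bit equal to 1, the counter counts the mask bits received, and the second comparator fires the third control signal once the count equals the input-data bit width.

```python
# Behavioural sketch of the control module of fig. 8; purely illustrative.
def control_signals(mask_bits_msb_first, input_bit_width):
    """Yield the control signals produced as mask-value bits arrive MSB-first."""
    counter = 0
    for bit in mask_bits_msb_first:
        counter += 1                       # counter: one increment per mask bit
        if bit == 1:                       # first comparator: bit equals 1 -> valid bit
            yield "first control signal"
            yield "second control signal"
        if counter == input_bit_width:     # second comparator: all mask bits received
            yield "third control signal"

mask = [0, 0, 0, 1, 1, 1, 0, 1]            # mask value 00011101, MSB first
print(list(control_signals(mask, input_bit_width=8)))
# 4 first/second pairs (one per valid bit), then one third control signal
```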
Taking the 8-bit input data shown in fig. 6 as an example, the mask value is 00011101. The valid bits of the mask value are the 4th bit, the 3rd bit, the 2nd bit and the 0th bit, and the bit weights corresponding to the valid bits are 2^4, 2^3, 2^2 and 2^0, respectively. The bit width calculation module 3011 inputs the highest bit (7th bit), 0, of the mask value into the control module 800; the first comparator in the control module 800 finds that 0 differs from 1, i.e., determines that the bit is not a valid bit, and does not generate the first control signal and the second control signal. The counter records that 1 bit of the mask value has been received; the second comparator compares this recorded bit width (1) with the bit width of the input data (8), and since they differ, does not generate the third control signal. Similarly, when the bit width calculation module 3011 inputs the lowest bit (0th bit), 1, of the mask value into the control module 800, the first comparator in the control module 800 finds that 1 equals 1, determines that the bit is a valid bit, and generates the first control signal and the second control signal. The second comparator compares the bit width recorded by the counter (8) with the bit width of the input data (8), and since they are equal, generates the third control signal.
The third control signal generated by the control module 800 is input to the result processing module 3015. At this point the result processing module 3015 has received the second control signal 4 times, that is, it has performed 4 weighted calculations on the sums of products from the calculation module 3012, and each weighted calculation yields a plurality of weighted results (e.g., the 1st weighted calculation yields sum0, among others). The third control signal is used for instructing the result processing module 3015 to perform a further weighted calculation, according to the bit weights corresponding to the valid bits of the mask value, on the weighted results obtained from the multiple weighted calculations, so as to obtain the final result. Taking the column 0 weight data as an example, the result processing module 3015 obtains sum0 from the 1st weighted calculation, sum1 from the 2nd, sum2 from the 3rd and sum3 from the 4th; the weighted results sum0, sum1, sum2 and sum3 correspond to the valid-bit weights 2^4, 2^3, 2^2 and 2^0 of the mask value, respectively. In other words, the bit weight of the mask valid bit corresponding to a weighted result is the bit weight of the valid bit whose second control signal produced that weighted result. The result processing module 3015 performs the weighted calculation again to obtain the final result of the column 0 weight data, out0 = sum0×2^4 + sum1×2^3 + sum2×2^2 + sum3×2^0, and so on. The final result includes a weighted result for each weight data; it can be understood that the calculation module 700 obtains 3 final results, corresponding to the column 0 weight data, the column 1 weight data and the column 2 weight data, respectively.
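The final weighting can be sketched as follows (an illustrative helper under the assumption that the i-th weighted result was produced for the i-th valid mask bit counted from the most significant end): each weighted result is shifted by the position of the mask valid bit that triggered it and the shifted values are accumulated.

```python
# Sketch: out0 = sum0*2^4 + sum1*2^3 + sum2*2^2 + sum3*2^0 for mask 00011101.

def final_weighting(weighted_results, mask_value, mask_bit_width):
    # Positions of the valid (set) bits of the mask, most significant first.
    valid_positions = [p for p in range(mask_bit_width - 1, -1, -1)
                       if (mask_value >> p) & 1]
    assert len(valid_positions) == len(weighted_results)
    return sum(r << p for r, p in zip(weighted_results, valid_positions))

sums = [3, 7, 1, 5]                        # illustrative sum0..sum3, one per valid mask bit
print(final_weighting(sums, mask_value=0b00011101, mask_bit_width=8))
# 3*2**4 + 7*2**3 + 1*2**2 + 5*2**0 = 113
```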
And step 503, performing weighted calculation on the calculation result of each column to obtain a final result.
In step 503, the result processing module 3015 performs weighted calculation on the calculation result of each column to obtain the final result. The calculation result of each column is the sum of the products of that column of storage calculation units, which can be understood as the results S3, S2, S1, S0 and the like in step 502 above. Specifically, the result processing module 3015 first performs weighted calculation according to the bit weights of the weight-data bits to obtain a plurality of sum values, and then performs weighted calculation according to the bit weights corresponding to the valid bits of the mask value to obtain a plurality of out values, thereby obtaining the final result. Reference may be made to the description of the control module 3014 (control module 800) above, which is not repeated here.
In some alternative embodiments, the input data and the weight data include an unsigned number and a signed number, where a calculation method of the unsigned number may refer to an example in the embodiment of the present application, and the signed number may be implemented by a calculation method such as complement calculation and differential calculation, which is not limited in the present application.
Therefore, the calculation method provided in the embodiments of the application can be applied to an integrated storage and calculation device, for example a chip. When neural network calculation is performed, the bit width calculation module calculates a plurality of input data, the plurality of valid data obtained from that calculation are input to the calculation module, the calculation module obtains the calculation result of each column in the calculation array according to the plurality of valid data and the bits of the weight data stored by each storage calculation unit and inputs the calculation result of each column to the result processing module, and finally the result processing module performs weighted calculation on the calculation result of each column to obtain the final result. In the prior art, multi-bit input data is unfolded into a plurality of single-bit/low-bit input data according to the data bit width before being input and calculated, which leads to too many unfolded calculations and a large overhead; by calculating only the valid data, the present method reduces the number of calculations. In addition, the prior art cannot achieve mixed-precision calculation of the weight data, so its calculation efficiency is low, whereas the present application can use the bit width information of the multiple kinds of weight data stored by the weight bit width configuration module to deploy and calculate weight data of multiple bit widths in a single calculation array, supporting mixed-precision calculation of the weight data and effectively improving the calculation efficiency of the integrated storage and calculation device.
Corresponding to the calculation method provided in fig. 5 and based on the structure of the integrated storage and calculation device shown in fig. 3, fig. 9 shows a flow chart of a calculation method provided by an embodiment of the application. In this method, the bit width calculation module is the bit width calculation module 3011, the calculation module is the calculation module 3012, the weight bit width configuration module is the weight bit width configuration module 3013, the control module is the control module 3014, and the result processing module is the result processing module 3015; the plurality of input data are 00011, 00101 and 00010, the calculation array is a 3×3 calculation array, each storage calculation unit stores 1 bit, and only one kind of weight data is stored in the weight bit width configuration module 3013. The calculation flow comprises the following steps:
step 1, the bit width calculation module 3011 calculates a plurality of input data, obtains mask values of the plurality of input data and valid data corresponding to each input data, and inputs the plurality of valid data to the calculation module 3012.
Here, the plurality of input data are 00011, 00101 and 00010. Mask calculation (taking a logical OR as an example) is performed on the plurality of input data, and the calculated mask value is determined to be 00111; the plurality of valid data are therefore 011, 101 and 010, and 011, 101 and 010 are input into the calculation module bit by bit. See the description of step 501 above, which is not repeated here.
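The mask and valid-data step of this example can be sketched as follows (illustrative code, assuming the logical-OR mask described above; the valid data keep only the bit positions where the mask is 1):

```python
# Sketch: mask calculation by bitwise OR, then extraction of the valid data
# by keeping only the bit positions where the mask is 1.
inputs = [0b00011, 0b00101, 0b00010]           # the three 5-bit input data
input_bit_width = 5

mask = 0
for x in inputs:
    mask |= x                                  # logical OR over all input data
print(format(mask, f"0{input_bit_width}b"))    # 00111

valid_positions = [p for p in range(input_bit_width - 1, -1, -1)
                   if (mask >> p) & 1]         # [2, 1, 0]
valid_data = ["".join(str((x >> p) & 1) for p in valid_positions) for x in inputs]
print(valid_data)                              # ['011', '101', '010']
```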
Step 2, the control module 3014 writes various weight data into a plurality of storage computing units according to the weight bit width configuration module 3013.
Here, the weight bit width configuration module 3013 stores the bit width information of only one kind of weight data: the bit width is 3 bits and the start column is identified as the 0th column storage computing unit. Taking the plurality of weight data 101, 011 and 111 as an example, the control module 3014 writes the plurality of weight data into the plurality of storage computing units according to the bit width information in the weight bit width configuration module 3013; see the computing array shown in fig. 9. Reference is made to the description of the control module above, which is not repeated here.
Step 3, the control module 3014 generates a first control signal and a second control signal according to the valid bit of the mask value calculated by the bit width calculation module 3011.
Here, the most significant bit (4th bit) of the mask value, 0, is input to the control module 3014; the control module 3014 determines that it is not a valid bit and does not generate the first control signal and the second control signal. Then, the 3rd bit of the mask value, 0, is input to the control module 3014; the control module 3014 determines that the 3rd bit is not a valid bit and does not generate the first control signal and the second control signal. Then, the 2nd bit of the mask value, 1, is input to the control module 3014; the control module 3014 determines that the 2nd bit of the mask value is a valid bit and generates the first control signal and the second control signal. The first control signal generated according to the 2nd bit of the mask value controls the calculation module 3012 to calculate the most significant bits (2nd bits) 0, 1 and 0 of the plurality of valid data, yielding S0 = 0×1 + 1×0 + 0×1 = 0, S1 = 0×0 + 1×1 + 0×1 = 1 and S2 = 0×1 + 1×1 + 0×1 = 1, corresponding to the 0th, 1st and 2nd column storage calculation units, respectively, which are input to the result processing module 3015. The second control signal generated according to the 2nd bit of the mask value controls the result processing module 3015 to perform weighted calculation on S0, S1 and S2 according to the bit weights of the weight-data bits, obtaining sum = 0×2^2 + 1×2^1 + 1×2^0 = 3. Then, the 1st bit of the mask value, 1, is input to the control module 3014; the control module 3014 determines that the 1st bit of the mask value is a valid bit and generates the first control signal and the second control signal. The first control signal generated according to the 1st bit of the mask value controls the calculation module 3012 to calculate the 1st bits 1, 0 and 1 of the plurality of valid data, yielding S0' = 1×1 + 0×0 + 1×1 = 2, S1' = 1×0 + 0×1 + 1×1 = 1 and S2' = 1×1 + 0×1 + 1×1 = 2, corresponding to the 0th, 1st and 2nd column storage calculation units, respectively, which are input to the result processing module 3015. The second control signal generated according to the 1st bit of the mask value controls the result processing module 3015 to perform weighted calculation on S0', S1' and S2' according to the bit weights of the weight-data bits, obtaining sum' = 2×2^2 + 1×2^1 + 2×2^0 = 12. Finally, the 0th bit of the mask value, 1, is input to the control module 3014; the control module 3014 determines that the 0th bit of the mask value is a valid bit and generates the first control signal and the second control signal. The first control signal generated according to the 0th bit of the mask value controls the calculation module 3012 to calculate the 0th bits 1, 1 and 0 of the plurality of valid data, yielding S0'' = 1×1 + 1×0 + 0×1 = 1, S1'' = 1×0 + 1×1 + 0×1 = 1 and S2'' = 1×1 + 1×1 + 0×1 = 2, corresponding to the 0th, 1st and 2nd column storage calculation units, respectively, which are input to the result processing module 3015. The second control signal generated according to the 0th bit of the mask value controls the result processing module 3015 to perform weighted calculation on S0'', S1'' and S2'' according to the bit weights of the weight-data bits, obtaining sum'' = 1×2^2 + 1×2^1 + 2×2^0 = 8.
See in particular the description of the control module 3014 above, which is not repeated here.
And step 4, when the control module 3014 determines that the bit width of the mask value is equal to the bit width of the input data, a third control signal is generated.
Here, when the control module 3014 determines that the bit width of the mask value received is 5 bits, it generates the third control signal. The third control signal controls the result processing module 3015 to perform weighted calculation on sum, sum' and sum'' calculated in step 3 according to the bit weights of the valid bits of the mask value, obtaining out = sum×2^2 + sum'×2^1 + sum''×2^0 = 44, which is the final result. See the description of the control module 3014 above, which is not repeated here.
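As a sanity check of the arithmetic in steps 1-4, the following self-contained sketch (illustrative code, not the patent's circuit; variable names are invented) reproduces the whole flow for the inputs 00011, 00101 and 00010 and the weights 101, 011 and 111, and compares the result with an ordinary integer dot product:

```python
# End-to-end sketch of the fig. 9 example; names are illustrative.
inputs  = [0b00011, 0b00101, 0b00010]     # input data (3, 5, 2)
weights = [0b101, 0b011, 0b111]           # weight data (5, 3, 7), 3 bits each
in_width, w_width = 5, 3

# Step 1: mask by logical OR; valid data = bits at the valid mask positions.
mask = 0
for x in inputs:
    mask |= x                                               # 00111
valid_positions = [p for p in range(in_width - 1, -1, -1) if (mask >> p) & 1]

# Step 2: weight bits as stored in the 3x3 array, MSB in column 0.
array = [[(w >> (w_width - 1 - c)) & 1 for c in range(w_width)] for w in weights]

# Steps 3-4: one array calculation per valid mask bit, then two weightings.
out = 0
for p in valid_positions:                                   # bits 2, 1, 0 of the mask
    bits = [(x >> p) & 1 for x in inputs]                   # N-th bit of each valid data
    col_sums = [sum(bits[r] * array[r][c] for r in range(len(inputs)))
                for c in range(w_width)]                    # S0, S1, S2
    partial = sum(s << (w_width - 1 - c) for c, s in enumerate(col_sums))  # sum, sum', sum''
    out += partial << p                                     # weight by the mask bit position
print(out)                                                  # 44
print(sum(x * w for x, w in zip([3, 5, 2], [5, 3, 7])))     # 44 (plain dot product)
```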
Thus far, the integrated storage and calculation device has completed the calculation of the plurality of input data and the plurality of weight data. It can be appreciated that steps 1-4 above merely exemplify a computing array containing one kind of weight data (one column of weight data); in practice multiple kinds of weight data are possible. According to this calculation method, the valid data of the plurality of input data are computed dynamically and only the valid bits of the input data are calculated, which effectively reduces the number of calculations of the calculation module and lowers the overhead, while also supporting mixed-precision calculation of the weight data and improving the calculation efficiency of the integrated storage and calculation device.
The results of calculation with the target detection yolov3-tiny model using the calculation method provided by the embodiments of the application are shown in table 2 (the data set is the COCO2017val data set). Taking the bit operand count and array calculation count of the prior art 8-bit model (8-bit weight data) as 100%, when the 8-bit model is calculated with the integrated storage and calculation device, the bit operand count is reduced to 81.38% of the prior art and the array calculation count to 78.31% of the prior art. When the 4/8-bit mixed model (weight data of 4 bits and 8 bits) is calculated with the integrated storage and calculation device, while the calculation accuracy is maintained, the bit operand count is reduced to 69.14% of the prior art and the array calculation count to 72.23% of the prior art. It can be seen that the method provided by the embodiments of the application effectively reduces the number of calculations, and with mixed-precision weight data the number of calculations is reduced even further, thereby effectively lowering the calculation overhead and improving the calculation efficiency.
TABLE 2
Model | Bit operands (relative to prior art) | Array calculation times (relative to prior art)
8-bit model, prior art | 100% | 100%
8-bit model, integrated storage and calculation device | 81.38% | 78.31%
4/8-bit mixed model, integrated storage and calculation device | 69.14% | 72.23%
It will be appreciated that the above-described integrated device, in order to implement the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation is not to be considered as beyond the scope of the embodiments of the present application.
The embodiment of the present application may divide the functional modules of the integrated storage device according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
In the case of an integrated unit, as shown in fig. 10, an embodiment of the present application discloses an integrated storage and calculation device 1000, and the integrated storage and calculation device 1000 may be the chip 300 in the above embodiment. The integrated storage and calculation device 1000 may include a processing module, a storage module and a communication module. The processing module may be configured to control and manage the operations of the integrated storage and calculation device 1000, for example, to support the integrated storage and calculation device 1000 in performing the steps performed by the bit width calculation module 3011, the calculation module 3012, the weight bit width configuration module 3013, the control module 3014 and the result processing module 3015. The storage module may be used to support the integrated storage and calculation device 1000 in storing program code and data, for example input data and weight data. The communication module may be used to support communication between the integrated storage and calculation device 1000 and other devices, for example, to input a plurality of input data and weight data from an external device, or to output the final result obtained by the result processing module 3015 to an external device.
Of course, the unit modules in the above-described integrated storage device 1000 include, but are not limited to, the above-described processing module, memory module, and communication module.
The processing module may be a processor or a controller, which may implement or execute the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor may also be a combination that performs computing functions, for example a combination comprising one or more microprocessors, a neural network processor (neural network processing unit, NPU), or a combination of a digital signal processor (digital signal processing, DSP) and a microprocessor. The storage module may be a memory. The communication module may specifically be a device that interacts with other external devices.
For example, the processing module is a processor 1001, the storage module may be a memory 1002, and the communication module may be referred to as a communication interface 1003. The integrated device 1000 provided in the embodiments of the present application may be the chip 300 shown in fig. 3. The processor 1001, the memory 1002, the communication interface 1003, and the like may be connected together, for example, by a bus.
Embodiments of the present application also provide an electronic device including one or more processors and one or more memories. The one or more memories are coupled to the one or more processors, the one or more memories being configured to store computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the relevant method steps described above to implement the computing methods of the embodiments described above.
The embodiment of the application also provides an electronic device, which comprises one or more communication interfaces and one or more processors, wherein the communication interfaces and the processors are interconnected through lines, and the processors receive and execute computer instructions from a memory of the electronic device through the communication interfaces, so that the electronic device executes the relevant method steps to realize the computing method in the embodiment.
The present application also provides a computer-readable storage medium having stored therein computer program code which, when executed on a computer or a processor, causes the computer or the processor to perform the calculation method in the above embodiments.
Embodiments of the present application also provide a computer program product, where the computer program product includes computer instructions, which when executed on a computer or a processor, cause the computer or the processor to perform the above-mentioned related steps, so as to implement the computing method performed by the electronic device in the above-mentioned embodiments.
The integrated storage device, the electronic device, the computer storage medium, the computer program product, or the chip provided in this embodiment are used to execute the corresponding method provided above, so that the beneficial effects achieved by the integrated storage device, the electronic device, the computer storage medium, the computer program product, or the chip can refer to the beneficial effects in the corresponding method provided above, and are not repeated herein.
It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, indirect coupling or communication connection of devices or units, electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. The integrated memory and calculation device is characterized by comprising a bit width calculation module, a calculation module and a result processing module; the computing module comprises a computing array, wherein the computing array comprises a plurality of storage computing units, and the storage computing units are used for storing weight data;
the bit width calculation module is used for calculating a plurality of input data to obtain a plurality of effective data, and inputting the plurality of effective data into the calculation module; the plurality of input data and the plurality of effective data are in one-to-one correspondence, first input data in the plurality of input data and first effective data in the plurality of effective data are in correspondence, and the bit width of the first input data is larger than that of the first effective data;
The computing module is used for obtaining a computing result of each column in the computing array according to the bits of the effective data and the weight data, and inputting the computing result of each column into the result processing module; wherein, a column of calculation results is the sum of the products calculated by a column of storage calculation units and the same bit of the plurality of effective data;
and the result processing module is used for carrying out weighted calculation on the calculation result of each column to obtain a final result.
2. The integrated device according to claim 1, wherein,
the bit width calculation module is specifically configured to perform mask calculation on the plurality of input data to obtain a mask value, and determine the plurality of valid data according to valid bits of the mask value;
the plurality of valid data is input to the calculation module bit by bit, so that the calculation module calculates the plurality of valid data bit by bit.
3. The integrated device of claim 2, wherein when the computing array receives an nth bit corresponding to each of the plurality of valid data, where N is an integer greater than or equal to 0,
The computing array is used for computing the product of the N bit corresponding to each of the plurality of effective data and the bit of the weight data;
the computing module further comprises an accumulation circuit;
and the accumulation circuit is used for adding the products calculated by the storage calculation units in the same column in the calculation array to obtain the sum of the products calculated by the storage calculation units in each column in the calculation array.
4. The integrated device of claim 2 or 3, wherein the weight data comprises a plurality of weight data, the integrated device further comprising a weight bit width configuration module;
the weight bit width configuration module is configured to store bit width information of the plurality of weight data, where the bit width information includes a bit width of each weight data and an identifier of a start column of each weight data corresponding to the calculation array, where the bit widths of at least two weight data in the plurality of weight data are different.
5. The integrated storage and calculation device of claim 4, characterized in that it further comprises a control module;
the control module is used for writing the multiple weight data into the multiple storage computing units according to the bit width information.
6. The integrated device according to claim 5, wherein,
the control module is further used for determining the valid bit of the mask value bit by bit, and generating a first control signal and a second control signal when any bit of the mask value is determined to be valid;
the first control signal is used for indicating the calculation module to calculate the sum of products calculated by each row of storage calculation units in the calculation array; the second control signal is used for indicating the result processing module to perform weighted calculation on the sum of products of a plurality of columns of storage calculation units corresponding to each weight data in the calculation array according to the bit width information, so as to obtain a plurality of weighted results of the Nth bit corresponding to each effective data, wherein each weighted result in the plurality of weighted results corresponds to one weight data.
7. The integrated storage device of claim 6, wherein,
the control module is further configured to generate a third control signal when determining that the bit width of the mask value is equal to the bit width of the input data; the third control signal is configured to instruct the result processing module to perform weighted calculation according to the bit weight corresponding to the valid bit of the mask value and a plurality of weighted results of each bit of the plurality of valid data, so as to obtain the final result, where the final result includes the weighted result of each weight data.
8. A computing method, characterized in that the method is applied to a computationally integrated device comprising a computing array comprising a plurality of storage computing units for storing weight data; the method comprises the following steps:
calculating a plurality of input data to obtain a plurality of effective data; the plurality of input data and the plurality of effective data are in one-to-one correspondence, first input data in the plurality of input data and first effective data in the plurality of effective data are in correspondence, and the bit width of the first input data is larger than that of the first effective data;
obtaining a calculation result of each column in the calculation array according to the bits of the effective data and the weight data; wherein, a column of calculation results is the sum of the products calculated by a column of storage calculation units and the same bit of the plurality of effective data;
and carrying out weighted calculation on the calculation result of each column to obtain a final result.
9. The method of claim 8, wherein computing the plurality of input data to obtain the plurality of valid data comprises:
Performing mask calculation on the plurality of input data to obtain a mask value, and determining the plurality of effective data according to the effective bits of the mask value;
the obtaining the calculation result of each column in the calculation array according to the bits of the plurality of valid data and the weight data includes:
and calculating the bits of the effective data and the bits of the weight data bit by bit to obtain a calculation result of each column in the calculation array.
10. The method of claim 9, wherein obtaining the computation result for each column in the computation array based on the plurality of valid data and the weight data bits comprises: when the computing array receives the N bit corresponding to each of the plurality of valid data, wherein N is an integer greater than or equal to 0,
calculating the product of the N bit corresponding to each of the effective data and the bit of the weight data;
and adding the products calculated by the storage calculation units in the same column in the calculation array to obtain the sum of the products calculated by the storage calculation units in each column in the calculation array.
11. The method of claim 9 or 10, wherein the weight data comprises a plurality of weight data, the method further comprising:
And storing bit width information of the plurality of weight data, wherein the bit width information comprises the bit width of each weight data and the identification of a starting column of each weight data in the computing array, and the bit widths of at least two weight data in the plurality of weight data are different.
12. The method of claim 11, wherein the method further comprises:
and writing the multiple weight data into the multiple storage computing units according to the bit width information.
13. The method according to claim 12, wherein the method further comprises:
determining the valid bit of the mask value bit by bit, and generating a first control signal and a second control signal when any bit of the mask value is determined to be valid;
the first control signal is used for calculating the sum of products calculated by each row of storage calculation units in the calculation array; the second control signal is used for carrying out weighted calculation on the sum of products of a plurality of columns of storage calculation units corresponding to each weight data in the calculation array according to the bit width information to obtain a plurality of weighted results of the Nth bit corresponding to the plurality of effective data respectively, and each weighted result in the plurality of weighted results corresponds to one weight data.
14. The method of claim 13, wherein the method further comprises:
generating a third control signal when the bit width of the mask value is determined to be equal to the bit width of the input data; and the third control signal is used for carrying out weighted calculation according to the bit weight corresponding to the valid bit of the mask value and a plurality of weighted results of each bit of the plurality of valid data to obtain the final result, wherein the final result comprises the weighted result of each weight data.
15. A computer readable storage medium, characterized in that computer instructions are stored which, when run on an electronic device, cause the electronic device to perform the method of any of the preceding claims 8-14.
CN202111599630.1A 2021-12-24 2021-12-24 Integrated storage and calculation device and calculation method Pending CN116362314A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111599630.1A CN116362314A (en) 2021-12-24 2021-12-24 Integrated storage and calculation device and calculation method
PCT/CN2022/141634 WO2023116923A1 (en) 2021-12-24 2022-12-23 Storage and calculation integrated device and calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111599630.1A CN116362314A (en) 2021-12-24 2021-12-24 Integrated storage and calculation device and calculation method

Publications (1)

Publication Number Publication Date
CN116362314A true CN116362314A (en) 2023-06-30

Family

ID=86901378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111599630.1A Pending CN116362314A (en) 2021-12-24 2021-12-24 Integrated storage and calculation device and calculation method

Country Status (2)

Country Link
CN (1) CN116362314A (en)
WO (1) WO2023116923A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821047B (en) * 2023-08-31 2023-10-31 北京犀灵视觉科技有限公司 Sensing and storing integrated circuit, system and method
CN117331512B (en) * 2023-12-01 2024-04-12 芯动微电子科技(武汉)有限公司 Data compression and processing method for executing write operation on GPU (graphics processing unit) nuclear memory

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423816B (en) * 2017-03-24 2021-10-12 中国科学院计算技术研究所 Multi-calculation-precision neural network processing method and system
CN110990060B (en) * 2019-12-06 2022-03-22 北京瀚诺半导体科技有限公司 Embedded processor, instruction set and data processing method of storage and computation integrated chip
CN113255875A (en) * 2020-02-07 2021-08-13 华为技术有限公司 Neural network circuit and neural network system
US12050888B2 (en) * 2020-04-15 2024-07-30 Macronix International Co., Ltd. In-memory computing method and apparatus
CN214225915U (en) * 2020-11-23 2021-09-17 格科微电子(上海)有限公司 Multimedia chip architecture and multimedia processing system applied to portable mobile terminal

Also Published As

Publication number Publication date
WO2023116923A1 (en) 2023-06-29

Similar Documents

Publication Publication Date Title
CN111837145B (en) System and method for mapping matrix calculations to matrix multiplication accelerators
CN108154228B (en) Artificial neural network computing device and method
US20210264273A1 (en) Neural network processor
AU2020274862B2 (en) Training of artificial neural networks
US11341400B1 (en) Systems and methods for high-throughput computations in a deep neural network
US11537879B2 (en) Neural network weight discretizing method, system, device, and readable storage medium
CN116362314A (en) Integrated storage and calculation device and calculation method
CN110543939A (en) hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
TWI737228B (en) Quantization method based on hardware of in-memory computing and system thereof
CN109284824A (en) A kind of device for being used to accelerate the operation of convolution sum pond based on Reconfiguration Technologies
CN110580519A (en) Convolution operation structure and method thereof
Mao et al. Energy-efficient machine learning accelerator for binary neural networks
CN115145536A (en) Adder tree unit with low bit width input and low bit width output and approximate multiply-add method
KR20200020117A (en) Deep learning apparatus for ANN with pipeline architecture
US20230068941A1 (en) Quantized neural network training and inference
Niknia et al. Nanoscale Accelerators for Artificial Neural Networks
Niknia et al. Nanoscale design of multi-layer perceptrons using floating-point arithmetic units
JP7036224B2 (en) Arithmetic processing unit and control method of arithmetic processing unit
Kim et al. An Asynchronous Inter-Processor Communication Based, Input Recycling Parallel Architecture for Large Scale Neural Network Simulation
CN113554162B (en) Axon input extension method, device, equipment and storage medium
US20240028900A1 (en) Energy Efficient Computations Using Bit-Sparse Data Representations
CN112446472A (en) Method, apparatus and related product for processing data
US20240111525A1 (en) Multiplication hardware block with adaptive fidelity control system
CN114168888B (en) Memory simulation type linear equation set solver, solving system and solving method
CN111198714A (en) Retraining method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination