CN118094069B

CN118094069B - Channel-by-channel convolution device

Info

Publication number: CN118094069B
Application number: CN202410464951.8A
Authority: CN
Inventors: Name withheld at request
Original assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Current assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Priority date: 2024-04-18
Filing date: 2024-04-18
Publication date: 2024-08-09
Anticipated expiration: 2044-04-18
Also published as: CN118094069A

Abstract

The present disclosure provides a channel-by-channel convolution apparatus comprising a channel-by-channel convolution control unit, a convolution kernel input control unit, a feature map input control unit, and a floating point arithmetic unit. The channel-by-channel convolution control unit executes the channel-by-channel convolution instruction transmitted by an instruction scheduling unit. Under the control of the channel-by-channel convolution control unit, the convolution kernel input control unit fetches the convolution kernel and delivers it to the floating point arithmetic unit, and the feature map input control unit fetches feature map data and delivers it to the floating point arithmetic unit. The floating point arithmetic unit calculates result elements in the channel-by-channel convolution result matrix.

Description

Channel-by-channel convolution device

Technical Field

The present disclosure relates to the field of artificial intelligence, more particularly to an electronic device that performs depth separable convolution (depthwise separable convolution), and in particular to a channel-by-channel convolution device that performs the channel-by-channel convolution (depthwise convolution) stage of a depth separable convolution.

Background

Depth separable convolution (depthwise separable convolution) is a lightweight convolution operation. It consists of two stages: a channel-by-channel convolution (depthwise convolution) and a point-by-point convolution (pointwise convolution). Depth separable convolution effectively reduces the amount of computation and the number of parameters, and improves the inference speed and operating efficiency of a model. Assume that the size of the input feature map is H×W, the number of input channels is N, the size of the convolution kernel is K×K, and the number of output channels is M. The operation count of a standard convolution is then H×W×N×K²×M, and its parameter count is N×K²×M. Under the same assumptions, for depth separable convolution the parameter count of the first stage (channel-by-channel convolution) is K²×N, and the parameter count of the second stage (point-by-point convolution, with a 1×1 convolution kernel) is N×M. The total parameter count of the depth separable convolution is therefore K²×N + N×M = N×(K² + M). Typically, N×(K² + M) is much smaller than N×K²×M.
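The parameter counts above can be checked numerically. The following is a minimal sketch; the function names and the sample values (N = 64, K = 3, M = 128, 56×56 feature map) are illustrative assumptions and do not come from the disclosure.

```python
def standard_conv_cost(h, w, n, k, m):
    """Return (operation count, parameter count) of a standard convolution."""
    operations = h * w * n * k * k * m
    parameters = n * k * k * m
    return operations, parameters


def depth_separable_params(n, k, m):
    """Parameter count of a depth separable convolution.

    K*K*N for the channel-by-channel stage plus N*M for the
    point-by-point stage (1x1 kernel).
    """
    return k * k * n + n * m


# Illustrative values only (assumed, not taken from the disclosure).
n, k, m = 64, 3, 128
_, std_params = standard_conv_cost(56, 56, n, k, m)
sep_params = depth_separable_params(n, k, m)
print(std_params, sep_params, round(std_params / sep_params, 2))
```

With these sample values the standard convolution needs roughly eight times as many parameters as the depth separable one, which illustrates the "much smaller" claim for typical channel counts.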

Like standard convolution, depth separable convolution may be performed by general matrix multiplication (GEMM) hardware, by CPU hardware, or by the vector unit hardware of a GPU. There is currently no hardware dedicated to depth separable convolution.

Disclosure of Invention

The present disclosure provides a channel-by-channel convolution apparatus that performs the first stage, the channel-by-channel convolution (depthwise convolution), of a depth separable convolution (depthwise separable convolution) operation.

In an embodiment according to the present disclosure, the channel-by-channel convolution apparatus includes a channel-by-channel convolution control unit, a convolution kernel input control unit, a feature map input control unit, and at least one floating point arithmetic unit. The channel-by-channel convolution control unit executes the channel-by-channel convolution instruction transmitted by an instruction scheduling unit. The convolution kernel input control unit is coupled to the channel-by-channel convolution control unit, which controls it based on the channel-by-channel convolution instruction; under that control, the convolution kernel input control unit fetches a convolution kernel. The feature map input control unit is likewise coupled to the channel-by-channel convolution control unit, which controls it based on the channel-by-channel convolution instruction; under that control, the feature map input control unit fetches the feature map data. The at least one floating point arithmetic unit is coupled to the convolution kernel input control unit and the feature map input control unit. The convolution kernel input control unit broadcasts the convolution kernel to the at least one floating point arithmetic unit. The feature map input control unit delivers corresponding partial matrix elements of the feature map data to the corresponding floating point arithmetic unit among the at least one floating point arithmetic unit. Each of the at least one floating point arithmetic unit computes at least one result element in a channel-by-channel convolution result matrix.

Based on the above, the channel-by-channel convolution device may be a piece of hardware dedicated to "speeding up the channel-by-channel convolution stage in the depth-separable convolution operation".

Drawings

Fig. 1 is a circuit block diagram of a channel-by-channel convolution apparatus according to an embodiment of the present disclosure.

FIG. 2 is a schematic diagram illustrating a channel-by-channel convolution, according to one embodiment of the present disclosure.

Fig. 3 is a circuit block diagram of a floating point arithmetic unit according to an embodiment of the present disclosure.

Fig. 4 is a circuit block diagram of a floating point arithmetic unit according to another embodiment of the present disclosure.

Fig. 5 is a circuit block diagram of a channel-by-channel convolution control unit according to an embodiment of the present disclosure.

Description of the reference numerals

11: Instruction scheduling unit

12: Memory (or register set)

100: Channel-by-channel convolution device

110: Channel-by-channel convolution control unit

111: Channel-by-channel convolution controller

120: Feature map input control unit

130: Convolution kernel input control unit

140_1, 140_2, 140_N: floating point arithmetic unit

150: Constant cache

160: Register set

210: Feature map data

220: Convolution kernel

230: Channel-by-channel convolution result matrix

310, 410, 420, 430: Multiplier

320, 440, 450: Adder

330, 460: Normalizer

T0 to T31: matrix element

U0 to U31: result element

w0 to w8: weight element

ACC3, ACC4: accumulated value

Detailed Description

Reference will now be made in detail to the exemplary embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.

The term "coupled" as used throughout the specification (including the claims) may refer to any direct or indirect connection. For example, if a first device is coupled (or connected) to a second device, the connection may be direct, or indirect via other devices and connections. Terms such as "first" and "second" in the description (including the claims) are used to name components or to distinguish between different embodiments or ranges; they do not limit the upper or lower bound of the number of components or the order of the components. Components, elements, or steps that use the same reference numerals or the same terminology in different embodiments may be cross-referenced for their related descriptions.

Fig. 1 is a schematic block diagram of a channel-by-channel convolution (depthwise convolution) apparatus 100 according to one embodiment of the present disclosure. A depth separable convolution (depthwise separable convolution) includes two stages: a channel-by-channel convolution and a point-by-point convolution (pointwise convolution). The channel-by-channel convolution apparatus 100 may serve as hardware dedicated to accelerating the channel-by-channel convolution stage of a depth separable convolution operation. The convolution directions supported by the channel-by-channel convolution apparatus 100 include forward convolution and backward convolution. The supported kernel sizes include 1×1, 2×2, 3×3, 4×4, 5×5, or other sizes. The supported x-direction padding amounts include 0, +1, -1, +2, -2, or other amounts, and the supported y-direction padding amounts likewise include 0, +1, -1, +2, -2, or other amounts. The supported strides include 1, 2, or other strides. The supported dilations include 1, 2, or other dilations. The supported data types include fp32, fp16, fp8, int16, int8, int4, or other data types. 
Based on the channel-by-channel convolution instruction transmitted from the instruction scheduling unit 11, the channel-by-channel convolution apparatus 100 may perform the channel-by-channel convolution in the depth-separable convolution, and then store the channel-by-channel convolution result matrix in the memory (or register set) 12. Based on the actual design, in some embodiments, memory 12 may include a main memory or other memory. In other embodiments, memory 12 may include register set 160 or other registers. The channel-by-channel convolution result matrix in the memory (or register set) 12 may be used for a second stage "point-by-point convolution" of the depth separable convolution operation.

In the embodiment shown in Fig. 1, the channel-by-channel convolution apparatus 100 includes a channel-by-channel convolution control unit 110, a feature map input control unit 120, a convolution kernel input control unit 130, a constant cache 150, and a register set 160. Depending on the design, in some embodiments at least one of the instruction scheduling unit 11, the channel-by-channel convolution control unit 110, the feature map input control unit 120, and the convolution kernel input control unit 130 may be implemented as a hardware circuit. In other embodiments, at least one of these units may be implemented as a combination of hardware, firmware, and software (i.e., programs).

In terms of hardware, at least one of the instruction scheduling unit 11, the channel-by-channel convolution control unit 110, the feature map input control unit 120, and the convolution kernel input control unit 130 may be implemented as logic circuits on an integrated circuit. For example, the relevant functions of at least one of these units may be implemented in various logic blocks, modules, and circuits in one or more controllers, hardware controllers, microcontrollers, hardware processors, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), central processing units (CPUs), or other processing units. The relevant functions of at least one of these units may be implemented as hardware circuits, such as various logic blocks, modules, and circuits in an integrated circuit, using a hardware description language (such as Verilog HDL or VHDL) or another suitable programming language.

In terms of software or firmware running on hardware, the relevant functions of at least one of the instruction scheduling unit 11, the channel-by-channel convolution control unit 110, the feature map input control unit 120, and the convolution kernel input control unit 130 may be implemented as program code. For example, at least one of these units may be implemented using a general-purpose programming language (such as C, C++, or assembly language) or another suitable programming language. The program code may be recorded or stored on a non-transitory machine-readable storage medium. In some embodiments, the non-transitory machine-readable storage medium includes, for example, a semiconductor memory and/or a storage device. The semiconductor memory includes a memory card, a read-only memory (ROM), a flash memory, a programmable logic circuit, or other semiconductor memory. The storage device includes a hard disk drive (HDD), a solid-state drive (SSD), or other storage device. An electronic device (e.g., a CPU, hardware controller, microcontroller, hardware processor, or microprocessor) may read the program code from the non-transitory machine-readable storage medium and execute it to perform the relevant functions of at least one of the instruction scheduling unit 11, the channel-by-channel convolution control unit 110, the feature map input control unit 120, and the convolution kernel input control unit 130.

The constant cache 150 and the register set 160 shown in Fig. 1 may be internal components of the channel-by-channel convolution apparatus 100. In other embodiments, however, at least one of the constant cache 150 and the register set 160 may be an external component of the channel-by-channel convolution apparatus 100. The present embodiment does not limit the implementation of the constant cache 150 and the register set 160. For example, at least one of the constant cache 150 and the register set 160 may be a register, a cache, a main memory, or other memory. The constant cache 150 stores the convolution kernel and is coupled to the convolution kernel input control unit 130 to provide the convolution kernel. The register set 160 stores the feature map data and is coupled to the feature map input control unit 120 to provide the feature map data.

The channel-by-channel convolution apparatus 100 further includes floating point arithmetic units 140_1, 140_2, …, 140_N. The number N of floating point arithmetic units 140_1 to 140_N can be chosen according to the actual design. For example, assuming that the feature map data is an x×y matrix, N may be any integer from 1 to x×y in one embodiment. If N is x×y and the stride is 1, the channel-by-channel convolution apparatus 100 completes the channel-by-channel convolution of one piece of feature map data in a single iteration. If N is 1 and the stride is 1, the channel-by-channel convolution apparatus 100 needs x×y iterations to complete the channel-by-channel convolution of one piece of feature map data.

The feature map input control unit 120 and the convolution kernel input control unit 130 are coupled to the channel-by-channel convolution control unit 110. The channel-by-channel convolution control unit 110 executes the channel-by-channel convolution instruction transmitted by the instruction scheduling unit 11, and controls the feature map input control unit 120 and the convolution kernel input control unit 130 based on that instruction. Under the control of the channel-by-channel convolution control unit 110, the convolution kernel input control unit 130 fetches the convolution kernel from the constant cache 150 and delivers it to the floating point arithmetic units 140_1 to 140_N, and the feature map input control unit 120 fetches the feature map data from the register set 160 and delivers it to the floating point arithmetic units 140_1 to 140_N.

The floating point arithmetic units 140_1 to 140_N are coupled to the convolution kernel input control unit 130 and the feature map input control unit 120. The convolution kernel input control unit 130 broadcasts the convolution kernel to each of the floating point arithmetic units 140_1 to 140_N. The feature map input control unit 120 delivers the corresponding partial matrix elements of the feature map data to the corresponding one of the floating point arithmetic units 140_1 to 140_N. Each of the floating point arithmetic units 140_1 to 140_N calculates at least one result element in the channel-by-channel convolution result matrix.

Fig. 2 is a schematic diagram illustrating a channel-by-channel convolution according to one embodiment of the present disclosure. In the example shown in Fig. 2, the feature map data 210 is assumed to be a 4×8 matrix, the convolution kernel 220 a 3×3 matrix, and the channel-by-channel convolution result matrix 230 a 4×8 matrix, and the number N of floating point arithmetic units 140_1 to 140_N is assumed to be 4×8 = 32. Referring to Fig. 1 and Fig. 2, under the control of the channel-by-channel convolution control unit 110, the convolution kernel input control unit 130 broadcasts the convolution kernel 220 to each of the floating point arithmetic units 140_1 to 140_N. The feature map input control unit 120 delivers the corresponding partial matrix elements T0, T1, T2, T8, T9, T10, T16, T17, and T18 of the feature map data 210 to the corresponding floating point arithmetic unit 140_1. The floating point arithmetic unit 140_1 uses these partial matrix elements and the convolution kernel 220 to calculate the result element U0 in the channel-by-channel convolution result matrix 230, where U0 = T0×w0 + T1×w1 + T2×w2 + T8×w3 + T9×w4 + T10×w5 + T16×w6 + T17×w7 + T18×w8, and w0 to w8 are the weight elements of the convolution kernel 220. The feature map input control unit 120 delivers the corresponding partial matrix elements T1, T2, T3, T9, T10, T11, T17, T18, and T19 of the feature map data 210 to the corresponding floating point arithmetic unit 140_2. The floating point arithmetic unit 140_2 uses these partial matrix elements and the convolution kernel 220 to calculate the result element U1 in the channel-by-channel convolution result matrix 230, where U1 = T1×w0 + T2×w1 + T3×w2 + T9×w3 + T10×w4 + T11×w5 + T17×w6 + T18×w7 + T19×w8. Similarly, the floating point arithmetic unit 140_32 (that is, unit 140_N of Fig. 1 when N is 32) calculates the result element U31 in the channel-by-channel convolution result matrix 230 using the corresponding partial matrix elements T31, T24_1, T25_1, T7_2, T0_3, T1_3, T15_2, T8_3, and T9_3 of the feature map data 210. Here T24_1 and T25_1 are elements of the right-neighbor 4×8 matrix (not depicted) of the 4×8 matrix shown in Fig. 2, T7_2 and T15_2 are elements of the lower-neighbor 4×8 matrix (not depicted), and T0_3, T1_3, T8_3, and T9_3 are elements of the lower-right-neighbor 4×8 matrix (not depicted), all within the feature map data 210, and U31 = T31×w0 + T24_1×w1 + T25_1×w2 + T7_2×w3 + T0_3×w4 + T1_3×w5 + T15_2×w6 + T8_3×w7 + T9_3×w8.
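The element selection described for Fig. 2 can be reproduced with simple index arithmetic. The sketch below is illustrative only: the function name `window_labels` is hypothetical, and it assumes the kernel is no larger than the tile, so a window reaches at most one neighboring tile in each direction. It lists, in w0 to w8 order, the labels of the feature map elements that feed a given result element.

```python
ROWS, COLS, K = 4, 8, 3  # tile and kernel sizes of the Fig. 2 example


def window_labels(u, rows=ROWS, cols=COLS, k=K):
    """List the k*k feature map element labels that feed result element U<u>.

    Elements beyond the right or bottom edge of the tile come from the
    neighboring tiles, tagged _1 (right), _2 (lower), _3 (lower right).
    """
    suffixes = {(False, False): "", (False, True): "_1",
                (True, False): "_2", (True, True): "_3"}
    r0, c0 = divmod(u, cols)  # position of U<u> in the result matrix
    labels = []
    for kr in range(k):
        for kc in range(k):
            r, c = r0 + kr, c0 + kc
            index = (r % rows) * cols + (c % cols)  # index inside its own tile
            labels.append(f"T{index}{suffixes[(r >= rows, c >= cols)]}")
    return labels


print(window_labels(0))
print(window_labels(1))
print(window_labels(31))
```

The three printed lists match the element sets given in the text for U0, U1, and U31.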

When the number N of floating point arithmetic units 140_1 to 140_N is smaller than the number of elements of the channel-by-channel convolution result matrix 230, the floating point arithmetic units 140_1 to 140_N may calculate the channel-by-channel convolution result matrix 230 in batches. For example, if the feature map data 210, the convolution kernel 220, and the channel-by-channel convolution result matrix 230 are as shown in Fig. 2 and N is 16, the channel-by-channel convolution may be divided into two iterations. In the first iteration, the floating point arithmetic units 140_1 to 140_16 (that is, unit 140_N of Fig. 1 when N is 16) may calculate the result elements U0 to U15 in the channel-by-channel convolution result matrix 230. In the second iteration, the floating point arithmetic units 140_1 to 140_16 may calculate the result elements U16 to U31.
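The batching rule can be sketched as follows. This is a minimal illustration, assuming a stride of 1 and result elements assigned to iterations in index order; the function name `schedule_batches` is hypothetical.

```python
import math


def schedule_batches(num_results, num_units):
    """Split result element indices into iterations of at most num_units each."""
    iterations = math.ceil(num_results / num_units)
    return [
        list(range(i * num_units, min((i + 1) * num_units, num_results)))
        for i in range(iterations)
    ]


# Fig. 2 example: 32 result elements computed by 16 units.
for iteration, batch in enumerate(schedule_batches(32, 16), start=1):
    print(f"iteration {iteration}: U{batch[0]} to U{batch[-1]}")
```

The same rule covers the two extremes mentioned earlier: 32 units finish in one iteration, and a single unit needs 32 iterations.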

Fig. 3 is a circuit block diagram of a floating point arithmetic unit 140_1 according to an embodiment of the present disclosure. The floating point arithmetic unit 140_1 shown in Fig. 3 may serve as one of many embodiments of the floating point arithmetic unit 140_1 shown in Fig. 1. The other floating point arithmetic units 140_2 to 140_N shown in Fig. 1 can be understood by analogy with the description of the floating point arithmetic unit 140_1 and are not described again. Referring to Fig. 1, Fig. 2, and Fig. 3, the floating point arithmetic unit 140_1 is coupled to the convolution kernel input control unit 130 and the feature map input control unit 120. The convolution kernel input control unit 130 broadcasts the convolution kernel 220 (i.e., the weight elements w0 to w8) to the floating point arithmetic unit 140_1. The feature map input control unit 120 delivers the partial matrix (i.e., the matrix elements T0 to T2, T8 to T10, and T16 to T18) of the feature map data 210 to the floating point arithmetic unit 140_1. The floating point arithmetic unit 140_1 uses the convolution kernel 220 and the partial matrix of the feature map data 210 to calculate the result element U0 in the channel-by-channel convolution result matrix 230.

In the embodiment shown in Fig. 3, the floating point arithmetic unit 140_1 includes a multiplier 310, an adder 320, and a normalizer 330. The multiplier 310 is coupled to the convolution kernel input control unit 130 to receive the weight elements w0 to w8 of the convolution kernel 220, and to the feature map input control unit 120 to receive the matrix elements T0 to T2, T8 to T10, and T16 to T18 of the feature map data 210. The floating point arithmetic unit 140_1 runs several cycles to obtain the result element U0 in the channel-by-channel convolution result matrix 230. In the first cycle, the multiplier 310 uses the weight element w0 of the convolution kernel 220 and the matrix element T0 of the feature map data 210 to calculate the product T0×w0. The adder 320 is coupled to the multiplier 310 to receive the product T0×w0. The adder 320 adds the product T0×w0 to the first result element (the current result element U0, which is 0 at this time) to generate an accumulated value ACC3. The normalizer 330 is coupled to the adder 320 to receive the accumulated value ACC3. The normalizer 330 normalizes the accumulated value ACC3 to generate the first result element (T0×w0 at this time) and feeds the first result element (the normalized result element U0) back to the adder 320. The normalizer 330 may adjust values to different standard floating point formats, such as fp16, fp32, or other floating point formats, based on the operation command.

In the second cycle, the multiplier 310 uses the weight element w1 of the convolution kernel 220 and the matrix element T1 of the feature map data 210 to calculate the product T1×w1 and passes it to the adder 320. The adder 320 adds the product T1×w1 to the current result element U0 to generate the accumulated value ACC3 = T0×w0 + T1×w1. The normalizer 330 normalizes the accumulated value ACC3 to generate the first result element (T0×w0 + T1×w1 at this time) and feeds the current result element U0 back to the adder 320. Continuing in this way, in the ninth cycle the multiplier 310 uses the weight element w8 and the matrix element T18 to calculate the product T18×w8 for the adder 320. The adder 320 adds the product T18×w8 to the current result element U0 to generate the accumulated value ACC3. The normalizer 330 normalizes the accumulated value ACC3 to generate the result element U0 = T0×w0 + T1×w1 + T2×w2 + T8×w3 + T9×w4 + T10×w5 + T16×w6 + T17×w7 + T18×w8. The normalizer 330 stores the result element U0 in the memory (or register set) 12 for use in the second stage, the point-by-point convolution, of the depth separable convolution operation.
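The nine-cycle multiply, accumulate, and normalize loop of Fig. 3 can be modeled in software. The sketch below is a behavioral illustration, not the circuit itself: the names `normalize` and `serial_mac` are hypothetical, and rounding through Python's `struct` module to fp32 or fp16 stands in for whatever the normalizer 330 actually does in hardware.

```python
import struct


def normalize(value, fmt="fp32"):
    """Round a value to fp32 or fp16 (software stand-in for normalizer 330)."""
    code = {"fp32": "f", "fp16": "e"}[fmt]
    return struct.unpack(code, struct.pack(code, value))[0]


def serial_mac(features, weights, fmt="fp32"):
    """One multiplier, one adder, one normalizer: one product per cycle."""
    u = 0.0  # current result element, 0 before the first cycle
    for t, w in zip(features, weights):
        product = t * w          # multiplier 310
        acc = u + product        # adder 320 produces ACC3
        u = normalize(acc, fmt)  # normalizer 330 feeds U0 back to the adder
    return u


# Stand-in values for T0..T2, T8..T10, T16..T18 and w0..w8.
features = [1, 2, 3, 4, 5, 6, 7, 8, 9]
weights = [0.5] * 9
print(serial_mac(features, weights, fmt="fp16"))
```

Because the running value is rounded after every cycle, results in a narrow format such as fp16 can differ slightly from an exact sum when the operands are not exactly representable.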

Fig. 4 is a circuit block diagram of a floating point arithmetic unit 140_1 according to another embodiment of the present disclosure. The floating point arithmetic unit 140_1 shown in Fig. 4 may serve as one of many embodiments of the floating point arithmetic unit 140_1 shown in Fig. 1. The other floating point arithmetic units 140_2 to 140_N shown in Fig. 1 can be understood by analogy with the description of the floating point arithmetic unit 140_1 and are not described again. Referring to Fig. 1, Fig. 2, and Fig. 4, the floating point arithmetic unit 140_1 is coupled to the convolution kernel input control unit 130 and the feature map input control unit 120. The convolution kernel input control unit 130 broadcasts the convolution kernel 220 (i.e., the weight elements w0 to w8) to the floating point arithmetic unit 140_1. In the embodiment shown in Fig. 4, the convolution kernel input control unit 130 broadcasts the convolution kernel 220 to the floating point arithmetic unit 140_1 in three batches (for example, the first batch is the weight elements w0 to w2, the second batch is the weight elements w3 to w5, and the third batch is the weight elements w6 to w8). The feature map input control unit 120 delivers the partial matrix (i.e., the matrix elements T0 to T2, T8 to T10, and T16 to T18) of the feature map data 210 to the floating point arithmetic unit 140_1. In the embodiment shown in Fig. 4, the feature map input control unit 120 delivers the partial matrix elements T0 to T2, T8 to T10, and T16 to T18 of the feature map data 210 to the floating point arithmetic unit 140_1 in three batches (for example, the first batch is the matrix elements T0 to T2, the second batch is the matrix elements T8 to T10, and the third batch is the matrix elements T16 to T18). The floating point arithmetic unit 140_1 uses the convolution kernel 220 and the partial matrix of the feature map data 210 to calculate the result element U0 in the channel-by-channel convolution result matrix 230.

In the embodiment shown in Fig. 4, the floating point arithmetic unit 140_1 includes a multiplier 410, a multiplier 420, a multiplier 430, an adder 440, an adder 450, and a normalizer 460. Multiplier 410, multiplier 420, multiplier 430 and adder 440 are coupled to the convolution kernel input control unit 130 and the feature map input control unit 120. The floating point arithmetic unit 140_1 runs several cycles to obtain the result element U0 in the channel-by-channel convolution result matrix 230. In the first cycle, the multiplier 410 uses the weight element w0 of the convolution kernel 220 and the matrix element T0 of the feature map data 210 to calculate the product T0×w0, the multiplier 420 uses the weight element w1 and the matrix element T1 to calculate the product T1×w1, and the multiplier 430 uses the weight element w2 and the matrix element T2 to calculate the product T2×w2. The adder 440 is coupled to the multipliers 410, 420, and 430 to receive the products T0×w0, T1×w1, and T2×w2, and calculates their sum. The adder 450 is coupled to the adder 440 to receive the sum T0×w0 + T1×w1 + T2×w2. The adder 450 adds this sum to the first result element (the current result element U0, which is 0 at this time) to generate an accumulated value ACC4. The normalizer 460 is coupled to the adder 450 to receive the accumulated value ACC4. The normalizer 460 normalizes the accumulated value ACC4 to generate the first result element (T0×w0 + T1×w1 + T2×w2 at this time) and feeds the first result element (the normalized result element U0) back to the adder 450. The normalizer 460 may adjust values to different standard floating point formats, such as fp16, fp32, or other floating point formats, according to the operation command.

In the second cycle, the multiplier 410 uses the weight element w3 and the matrix element T8 to calculate the product T8×w3, the multiplier 420 uses the weight element w4 and the matrix element T9 to calculate the product T9×w4, and the multiplier 430 uses the weight element w5 and the matrix element T10 to calculate the product T10×w5. The adder 440 calculates the sum of the products T8×w3, T9×w4, and T10×w5. The adder 450 adds the sum T8×w3 + T9×w4 + T10×w5 to the current result element U0 (T0×w0 + T1×w1 + T2×w2 at this time) to generate the accumulated value ACC4. The normalizer 460 normalizes the accumulated value ACC4 to generate the first result element (T0×w0 + T1×w1 + T2×w2 + T8×w3 + T9×w4 + T10×w5 at this time) and feeds the current result element U0 back to the adder 450. In the third cycle, the multiplier 410 uses the weight element w6 and the matrix element T16 to calculate the product T16×w6, the multiplier 420 uses the weight element w7 and the matrix element T17 to calculate the product T17×w7, and the multiplier 430 uses the weight element w8 and the matrix element T18 to calculate the product T18×w8. The adder 440 calculates the sum of the products T16×w6, T17×w7, and T18×w8. The adder 450 adds the sum T16×w6 + T17×w7 + T18×w8 to the current result element U0 to generate the accumulated value ACC4. The normalizer 460 normalizes the accumulated value ACC4 to generate the result element U0 = T0×w0 + T1×w1 + T2×w2 + T8×w3 + T9×w4 + T10×w5 + T16×w6 + T17×w7 + T18×w8. The normalizer 460 stores the result element U0 in the memory (or register set) 12 for use in the second stage, the point-by-point convolution, of the depth separable convolution operation.
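The Fig. 4 data path processes one kernel row per cycle, so a 3×3 kernel takes three cycles instead of nine. The following behavioral sketch illustrates this; the function name `three_lane_mac` is hypothetical, and the normalization step is deliberately omitted to keep the example short.

```python
def three_lane_mac(features, weights, lanes=3):
    """Multiply `lanes` element pairs per cycle, sum them, then accumulate.

    Returns the result element and the number of cycles used.
    Normalization (normalizer 460) is omitted in this sketch.
    """
    u = 0.0
    cycles = 0
    for start in range(0, len(features), lanes):
        batch_t = features[start:start + lanes]
        batch_w = weights[start:start + lanes]
        products = [t * w for t, w in zip(batch_t, batch_w)]  # multipliers 410, 420, 430
        partial = sum(products)                                # adder 440
        u = u + partial                                        # adder 450 produces ACC4
        cycles += 1
    return u, cycles


# Stand-in values for T0..T2, T8..T10, T16..T18 and w0..w8.
features = [1, 2, 3, 4, 5, 6, 7, 8, 9]
weights = [1, 0, -1, 1, 0, -1, 1, 0, -1]
print(three_lane_mac(features, weights))
```

The trade-off is area for latency: three multipliers and an extra adder cut the cycle count of a 3×3 window to one third of the single-multiplier arrangement.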

Fig. 5 is a circuit block diagram of a channel-by-channel convolution control unit 110 according to an embodiment of the present disclosure. The channel-by-channel convolution control unit 110 shown in Fig. 5 may serve as one of many embodiments of the channel-by-channel convolution control unit 110 shown in Fig. 1. In the embodiment shown in Fig. 5, the channel-by-channel convolution control unit 110 includes a channel-by-channel convolution controller 111. The channel-by-channel convolution controller 111 is coupled to the convolution kernel input control unit 130 and the feature map input control unit 120. After being scheduled by the instruction scheduling unit 11, the channel-by-channel convolution instruction is sent to the channel-by-channel convolution controller 111, where it waits to be issued. The channel-by-channel convolution controller 111 executes the channel-by-channel convolution instruction to control the convolution kernel input control unit 130 and the feature map input control unit 120. The channel-by-channel convolution controller 111 determines the number of execution cycles of the current instruction from the kernel size, stride, dilation, and other parameters in the instruction, and executes the instruction in a loop until the current channel-by-channel convolution is complete.

Under the control of the channel-by-channel convolution controller 111, the convolution kernel input control unit 130 delivers the convolution kernel 220 to the floating point arithmetic units 140_1 to 140_N. For example, the convolution kernel input control unit 130 selects the weight element of the convolution kernel 220 that corresponds to the sequence number of the current cycle and broadcasts it to all floating point arithmetic units 140_1 to 140_N. Under the control of the channel-by-channel convolution controller 111, the feature map input control unit 120 delivers the corresponding partial matrix elements of the feature map data 210 to the corresponding floating point arithmetic units among the floating point arithmetic units 140_1 to 140_N. For example, the feature map input control unit 120 selects the corresponding matrix elements (feature values) of the feature map data 210 for the corresponding floating point arithmetic units according to the sequence number of the current cycle in combination with the padding, dilation, stride, and other parameters.
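The cycle-driven selection described above can be sketched as a software model for one channel. This is an illustration under stated assumptions, not the disclosed circuit: the function name `depthwise_by_cycle` is hypothetical, one weight is broadcast per cycle (as in the Fig. 3 style of unit), there is one accumulator per result element, and reads outside the feature map return zero (zero padding), which the disclosure does not specify.

```python
def depthwise_by_cycle(fmap, kernel, stride=1, dilation=1, pad=0):
    """Compute one channel's result matrix, one broadcast weight per cycle.

    fmap and kernel are lists of rows. Out-of-range reads yield 0.0,
    which is an assumption of this sketch (zero padding).
    """
    rows, cols, k = len(fmap), len(fmap[0]), len(kernel)
    span = dilation * (k - 1) + 1
    out_rows = (rows + 2 * pad - span) // stride + 1
    out_cols = (cols + 2 * pad - span) // stride + 1
    acc = [[0.0] * out_cols for _ in range(out_rows)]  # one accumulator per unit
    for cycle in range(k * k):           # sequence number of the current cycle
        kr, kc = divmod(cycle, k)
        weight = kernel[kr][kc]          # weight broadcast to every unit
        for r in range(out_rows):
            for c in range(out_cols):
                # Feature element chosen from cycle number, stride, dilation, padding.
                fr = r * stride + kr * dilation - pad
                fc = c * stride + kc * dilation - pad
                inside = 0 <= fr < rows and 0 <= fc < cols
                acc[r][c] += (fmap[fr][fc] if inside else 0.0) * weight
    return acc


fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(depthwise_by_cycle(fmap, ones, pad=1))
```

The two inner loops stand for work that the hardware units would perform in parallel within a single cycle; only the outer loop corresponds to the cycle count determined by the controller.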

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.

Claims

1. A channel-by-channel convolution apparatus comprising:

a channel-by-channel convolution control unit for executing a channel-by-channel convolution instruction issued by an instruction scheduling unit;

a convolution kernel input control unit coupled to the channel-by-channel convolution control unit, wherein the channel-by-channel convolution control unit controls the convolution kernel input control unit based on the channel-by-channel convolution instruction, and the convolution kernel input control unit fetches a convolution kernel under the control of the channel-by-channel convolution control unit;

a feature map input control unit coupled to the channel-by-channel convolution control unit, wherein the channel-by-channel convolution control unit controls the feature map input control unit based on the channel-by-channel convolution instruction, and the feature map input control unit fetches feature map data under the control of the channel-by-channel convolution control unit; and

at least one floating-point operation unit coupled to the convolution kernel input control unit and the feature map input control unit, wherein the convolution kernel input control unit broadcasts the convolution kernel to the at least one floating-point operation unit, the feature map input control unit supplies corresponding partial matrix elements from the feature map data to the corresponding floating-point operation unit among the at least one floating-point operation unit, and each of the at least one floating-point operation unit calculates at least one result element in a channel-by-channel convolution result matrix;

wherein the channel-by-channel convolution control unit comprises a channel-by-channel convolution controller, the channel-by-channel convolution controller determines an execution period of the channel-by-channel convolution instruction according to parameters in the channel-by-channel convolution instruction and executes the channel-by-channel convolution instruction in a loop until the channel-by-channel convolution is completed, and the parameters comprise a kernel size, a stride amount, and a dilation amount.

2. The channel-by-channel convolution apparatus of claim 1, further comprising:

a constant buffer for storing the convolution kernel, wherein the constant buffer is coupled to the convolution kernel input control unit to provide the convolution kernel.

3. The channel-by-channel convolution apparatus of claim 1, further comprising:

a register set for storing the feature map data, wherein the register set is coupled to the feature map input control unit to provide the feature map data.

4. The channel-by-channel convolution apparatus of claim 1, wherein the at least one floating-point operation unit comprises a first floating-point operation unit coupled to the convolution kernel input control unit and the feature map input control unit, the convolution kernel input control unit broadcasts the convolution kernel to the first floating-point operation unit, the feature map input control unit supplies a first partial matrix in the feature map data to the first floating-point operation unit, and the first floating-point operation unit calculates a first result element in the channel-by-channel convolution result matrix using the convolution kernel and the first partial matrix.

5. The channel-by-channel convolution apparatus of claim 4, wherein the first floating-point operation unit comprises:

a multiplier coupled to the convolution kernel input control unit and the feature map input control unit, wherein the multiplier calculates a first product using a first matrix element in the convolution kernel and a second matrix element in the first partial matrix;

an adder coupled to the multiplier to receive the first product, wherein the adder adds the first product to the first result element to generate an accumulated value; and

a normalizer coupled to the adder to receive the accumulated value, wherein the normalizer normalizes the accumulated value to generate the first result element, and the normalizer feeds the first result element back to the adder.

6. The channel-by-channel convolution apparatus of claim 4, wherein the first floating-point operation unit comprises:

a first multiplier coupled to the convolution kernel input control unit and the feature map input control unit, wherein the first multiplier computes a first product using a first matrix element in the convolution kernel and a second matrix element in the first partial matrix;

a second multiplier coupled to the convolution kernel input control unit and the feature map input control unit, wherein the second multiplier calculates a second product using a third matrix element in the convolution kernel and a fourth matrix element in the first partial matrix;

a third multiplier coupled to the convolution kernel input control unit and the feature map input control unit, wherein the third multiplier calculates a third product using a fifth matrix element in the convolution kernel and a sixth matrix element in the first partial matrix;

a first adder coupled to the first multiplier, the second multiplier, and the third multiplier to receive the first product, the second product, and the third product, wherein the first adder calculates a sum value of the first product, the second product, and the third product;

a second adder coupled to the first adder to receive the sum value, wherein the second adder adds the sum value to the first result element to generate an accumulated value; and

a normalizer coupled to the second adder to receive the accumulated value, wherein the normalizer normalizes the accumulated value to generate the first result element, and the normalizer feeds the first result element back to the second adder.

7. The channel-by-channel convolution apparatus according to claim 1, wherein the channel-by-channel convolution control unit comprises:

a channel-by-channel convolution controller coupled to the convolution kernel input control unit and the feature map input control unit, wherein the channel-by-channel convolution controller executes the channel-by-channel convolution instruction to control the convolution kernel input control unit and the feature map input control unit, the convolution kernel input control unit supplies the convolution kernel to the at least one floating-point operation unit under the control of the channel-by-channel convolution controller, and the feature map input control unit supplies corresponding partial matrix elements from the feature map data to the corresponding floating-point operation unit among the at least one floating-point operation unit under the control of the channel-by-channel convolution controller.

8. The channel-by-channel convolution apparatus of claim 1, wherein the channel-by-channel convolution apparatus is to perform a channel-by-channel convolution in a depth-separable convolution.

9. The channel-by-channel convolution apparatus according to claim 8, wherein a direction of the channel-by-channel convolution supported by the channel-by-channel convolution apparatus comprises at least one of a forward convolution and a reverse convolution.

10. The channel-by-channel convolution apparatus of claim 8, wherein a kernel size of the channel-by-channel convolution supported by the channel-by-channel convolution apparatus comprises at least one of 1*1, 2*2, 3*3, 4*4, and 5*5.

11. The channel-by-channel convolution apparatus according to claim 8, wherein an x-direction padding amount of the channel-by-channel convolution supported by the channel-by-channel convolution apparatus comprises at least one of 0, +1, -1, +2, and -2.

12. The channel-by-channel convolution apparatus of claim 8, wherein a y-direction padding amount of the channel-by-channel convolution supported by the channel-by-channel convolution apparatus comprises at least one of 0, +1, -1, +2, and -2.

13. The channel-by-channel convolution apparatus of claim 8, wherein a stride amount of the channel-by-channel convolution supported by the channel-by-channel convolution apparatus comprises at least one of 1 and 2.

14. The channel-by-channel convolution apparatus according to claim 8, wherein a dilation amount of the channel-by-channel convolution supported by the channel-by-channel convolution apparatus comprises at least one of 1 and 2.

15. The channel-by-channel convolution apparatus of claim 8, wherein a data type of the channel-by-channel convolution supported by the channel-by-channel convolution apparatus comprises at least one of fp32, fp16, fp8, int16, int8, and int4.