WO2021237513A1

WO2021237513A1 - Data compression storage system and method, processor, and computer storage medium

Info

Publication number: WO2021237513A1
Application number: PCT/CN2020/092627
Authority: WO
Inventors: 李鹏; 王耀杰; 阮肇夏
Original assignee: 深圳市大疆创新科技有限公司
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2021-12-02

Abstract

A data compression storage system and method, a processor, and a computer storage medium, used for compressing a feature map in an on-chip memory and then storing it in an external memory. The system comprises a compression command generating module, at least two compression paths, a read arbitration module, and a write arbitration module. The compression command generating module distributes compression commands to each of the at least two compression paths; each compression path reads a corresponding original feature map from the on-chip memory on the basis of the compression command and compresses it; the read arbitration module arbitrates the read feature map commands of the at least two compression paths; and the write arbitration module arbitrates the write requests of the at least two compression paths. The compressed data thus occupies less storage space, which reduces the space occupied in the external memory, reduces the bandwidth consumed during reads and writes, and saves power. In addition, the data compression is performed in parallel by at least two compression paths, which increases compression efficiency.

Description

System, method, processor and computer storage medium for data compression storage

Technical field

The embodiments of the present invention relate to the field of data processing, and more specifically, to a system, method, processor, and computer storage medium for data compression storage.

Background art

In an increasing number of scenarios, a large amount of data needs to be stored. To make full use of the memory's storage space and store more data, the data is generally compressed before being stored.

However, current data compression methods achieve only a limited reduction in data size; that is, even the compressed data occupies a large storage space. Moreover, larger compressed data consumes more bandwidth resources when it is read from and written to the memory.

Summary of the invention

The embodiments of the present invention provide a data compression storage system, method, processor, and computer storage medium, which can compress a feature map in an on-chip memory and then store it in an external memory, reducing both the storage space occupied and the bandwidth resources consumed during reading and writing.

In a first aspect, a system for data compression storage is provided. The system is used to compress a feature map in an on-chip memory and then store it in an external memory. The system includes a compression instruction generation module, a read arbitration module, at least two compression paths, and a write arbitration module:

The compression instruction generating module is configured to distribute the compression instruction to each of the at least two compression paths;

Each of the at least two compression paths is configured to read the corresponding original feature map from the on-chip memory according to the compression instruction received from the compression instruction generation module, and to compress the original feature map that has been read;

The read arbitration module is configured to arbitrate the read feature map commands that the at least two compression paths issue for the original feature maps in the on-chip memory;

The write arbitration module is configured to arbitrate the write requests of the at least two compression paths to write compressed data into the external memory.

In a second aspect, a method for data compression storage is provided. The method is used to compress a feature map in an on-chip memory and then store it in an external memory. The method includes:

Distributing the compression instruction to each of the at least two compression paths;

Each compression path reads the corresponding original feature map from the on-chip memory according to the received compression instruction, and compresses the read original feature map;

Storing the compressed feature map in the external memory;

Wherein, when the at least two compression paths read the original feature maps in the on-chip memory, the read feature map commands of the at least two compression paths are arbitrated;

Wherein, when the at least two compression paths write the compressed feature map into the external memory, the write requests of the at least two compression paths are arbitrated.

In the third aspect, a processor is provided, including:

On-chip memory, and

The system for data compression storage described in the first aspect above.

In a fourth aspect, a computer storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the method described in the second aspect are implemented.

It can be seen that the system for data compression storage of the embodiments of the present invention can compress a feature map in the on-chip memory and then store it in the external memory, so that the compressed data occupies less storage space. On the one hand, this reduces the space occupied in the external memory; on the other hand, it also reduces the bandwidth resources consumed during reading and writing and saves power. In addition, the data compression in the embodiments of the present invention is performed in parallel by at least two compression paths, which also improves compression efficiency.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic diagram of data storage according to an embodiment of the present invention.

Fig. 2 is a schematic block diagram of a system for data compression storage according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of various modules of a system for data compression storage according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of a flow of compression performed by the system for data compression storage according to an embodiment of the present invention.

FIG. 5 is another schematic diagram of a process of performing compression by the system for data compression storage according to an embodiment of the present invention.

Fig. 6 is a schematic diagram of a state machine of the system for data compression storage according to an embodiment of the present invention.

FIG. 7 is a schematic diagram of each module of a compression path of a system for data compression storage according to an embodiment of the present invention.

FIG. 8 is a schematic flowchart of a data storage method according to an embodiment of the present invention.

FIG. 9 is a schematic diagram of a minimum access unit storing a feature map according to an embodiment of the present invention.

FIG. 10 is another schematic diagram of a minimum access unit storing a feature map according to an embodiment of the present invention.

FIG. 11 is a schematic diagram of calculating multiple differences for one data unit according to an embodiment of the present invention.

Fig. 12 is a schematic structural diagram of a scan coding module according to an embodiment of the present invention.

Fig. 13 is a schematic diagram of fetching a data unit from a minimum access unit according to an embodiment of the present invention.

FIG. 14 is a schematic structural diagram of a difference algorithm compression module according to an embodiment of the present invention.

FIG. 15 is a schematic diagram of several situations of compressed data of a data unit according to an embodiment of the present invention.

FIG. 16 is a schematic flowchart of a compression process performed by a compression path according to an embodiment of the present invention.

FIG. 17 is a schematic block diagram of an apparatus for data storage according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are part of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

With the development of artificial intelligence technology, algorithms such as deep learning are being applied in more and more fields. One of the cores of deep learning is the neural network, such as the Convolutional Neural Network (CNN). During the computation of a convolutional neural network, a large amount of feature map data is generated. When this feature map data is written to the external memory of the processor, data compression is usually applied, which reduces the space occupied in the external memory and the bandwidth consumed during reading and writing. The external memory may be, for example, a Double Data Rate Synchronous Dynamic Random Access Memory, or DDR for short.

However, a convolutional neural network generally includes a large number of convolutional layers, and each convolutional layer generates a large amount of feature map data. Reading and writing this feature map data from and to DDR consumes valuable external memory bandwidth, so that other modules with high bandwidth requirements (such as the CNN or other modules) cannot access DDR quickly and their computing performance suffers. Moreover, the increased access to DDR further leads to higher power consumption.

In order to further reduce the amount of compressed data, further reduce the bandwidth consumed when reading and writing DDR, and thereby reduce power consumption, an embodiment of the present invention provides a system for data compression storage. Specifically, the feature map data calculated by the convolutional neural network may be located in an on-chip memory. The embodiment of the present invention aims to compress the feature map data in the on-chip memory and store the compressed data in the external memory. The process can be similar to that shown in FIG. 1, where the on-chip memory can be assumed to be on-chip SRAM and the external memory can be assumed to be DDR. After the compression system reads the feature map data from the on-chip memory, the feature map data is compressed, and the compressed feature map data is stored in the external memory.

In the embodiment of the present invention, a system for data compression storage includes at least: a compression instruction generation module, a read arbitration module, at least two compression paths, and a write arbitration module, as shown in FIG. 2.

It should be understood that the number of compression paths in the embodiment of the present invention is at least two, for example, it can be 3 or more, and can be specifically configured according to the performance of the processor and the size of the feature map data to be processed. In this way, the embodiment of the present invention can flexibly configure the number of compression paths according to the output rate of the feature map data, so as to flexibly meet the performance requirements of different processing tasks and improve the compression performance.

To simplify the illustration, only two compression paths are shown in FIG. 2, namely compression path 1 and compression path 2.

The compression instruction generation module can be used to distribute the compression instructions to various compression paths. The compression path can read the corresponding feature map data from the on-chip memory according to the compression instruction, and then compress the read feature map data. The read arbitration module can arbitrate the read feature map commands of at least two compression paths for the feature map data in the on-chip memory. The write arbitration module can arbitrate the write requests of at least two compression paths to write the compressed data into the external memory.

The compression instruction generation module can be expressed as the ENC_INSTR_PROC module, which can receive the compression instruction, parse it, and then distribute the parsed compression instruction to each compression path. Specifically, when feature map data is stored in the on-chip memory and needs to be compressed and stored, the processor may send a compression instruction to the compression instruction generation module. After the compression instruction generation module receives the compression instruction, it can distribute the corresponding compression instruction to each compression path, so that each compression path reads the feature map data from the on-chip memory and compresses it. The compression instruction distributed to a given compression path may include: the number of feature maps to be compressed by that compression path; the width of the feature maps; the height of the feature maps; the base address of these feature maps in the on-chip memory; the storage interval between feature maps in the on-chip memory; the base address in the external memory to which these feature maps are output after compression; the storage interval between feature maps output to the external memory after compression; the base address in the external memory to which the compression header information of these feature maps is output; and the storage interval of that compression header information in the external memory.

The read arbitration module can be expressed as the FM_RD_ARB module. Referring to FIG. 3, the system can also include a read command buffer module (which can be expressed as the RD_CMD_FIFO module) and a read data path identification buffer module (which can be expressed as the RDATA_ID_FIFO module); both are connected to the read arbitration module. The read arbitration module can obtain the read feature map commands issued by each compression path; that is, the read feature map commands issued by all compression paths converge here. The read feature map commands of each compression path can be buffered in that path's own read command buffer module. The read arbitration module can arbitrate among the read feature map commands in the read command buffer modules according to the arbitration rules to obtain an arbitration result. If the arbitration result indicates that the first compression path wins, the read feature map command of the first compression path is sent to the on-chip memory first, and the path identification (ID) of the first compression path is stored in the read data path identification buffer module. Afterwards, when the on-chip memory returns feature map data, the returned data is sent to the compression path corresponding to the path identification (ID) stored in the read data path identification buffer module. Optionally, the arbitration rules may be a priority mechanism or a fair polling mechanism configured according to the compression instruction, or may be another arbitration mechanism not listed here. It can be understood that if the arbitration rule is a priority mechanism, the compression path with the higher priority is served first, which ensures the task performance of that high-priority compression path.
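As a rough illustration of the read arbitration described above, the following Python sketch models one read command FIFO per compression path, a priority or round-robin arbiter, and a path ID FIFO used to route returned data. The class name `ReadArbiter`, its methods, the convention that a lower number means a higher priority, and the representation of commands as plain values are all assumptions made for illustration; the text names only the FM_RD_ARB, RD_CMD_FIFO and RDATA_ID_FIFO modules and does not specify their interfaces.

```python
from collections import deque


class ReadArbiter:
    """Toy model of FM_RD_ARB with per-path RD_CMD_FIFOs and one RDATA_ID_FIFO."""

    def __init__(self, num_paths, policy="round_robin", priorities=None):
        self.cmd_fifos = [deque() for _ in range(num_paths)]  # one RD_CMD_FIFO per path
        self.id_fifo = deque()  # RDATA_ID_FIFO: path IDs of reads already issued
        self.policy = policy
        # Assumed convention: a lower number means a higher priority.
        self.priorities = priorities or list(range(num_paths))
        self.last_winner = num_paths - 1  # so that path 0 is considered first

    def push_command(self, path_id, cmd):
        """Buffer a read feature map command issued by one compression path."""
        self.cmd_fifos[path_id].append(cmd)

    def arbitrate(self):
        """Pick one pending command; return (path_id, cmd), or None if nothing is pending."""
        pending = [p for p, fifo in enumerate(self.cmd_fifos) if fifo]
        if not pending:
            return None
        if self.policy == "priority":
            winner = min(pending, key=lambda p: self.priorities[p])
        else:
            n = len(self.cmd_fifos)
            # Round robin: the first pending path after the previous winner.
            winner = min(pending, key=lambda p: (p - self.last_winner - 1) % n)
        self.last_winner = winner
        self.id_fifo.append(winner)  # remember which path the returned data belongs to
        return winner, self.cmd_fifos[winner].popleft()

    def route_read_data(self, data):
        """Pair data returned by the on-chip memory with the oldest stored path ID."""
        return self.id_fifo.popleft(), data
```

The sketch relies on the on-chip memory returning data in the order in which commands were issued, which is what allows a simple FIFO of path IDs to identify the destination path.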

The write arbitration module, which can be expressed as the FM_WR_ARB module, can obtain the write requests issued by each compression path; that is, the write requests issued by all compression paths converge here. The write requests can be arbitrated according to the arbitration rules to obtain an arbitration result. If the arbitration result indicates that the second compression path wins, the write request of the second compression path is sent to the external memory first; that is, the compressed data obtained by the second compression path is stored in the external memory. Optionally, the arbitration rules may be a priority mechanism or a fair polling mechanism configured according to the compression instruction, or may be another arbitration mechanism not listed here.

The compression path module can be expressed as the ENC_PATH module. Referring to FIG. 3, each compression path module can include a feature map reading module, a feature map cache module, a data compression module, a data packing module, a compression header generation module, a length alignment module, a compression header cache module, a compressed feature map cache module, a compression header write module, and a compressed feature map write module.

The feature map reading module, which can be expressed as the RD_FM module, can send a read feature map command for the original feature map in the on-chip memory according to the compression instruction received from the compression instruction generation module. The read feature map command may include the width and height of the original feature map to be read, its base address in the on-chip memory, and so on. Optionally, when the bypass operation needs to be performed, the feature map reading module is also used to re-read from the on-chip memory the original feature map that was compressed this time.

The feature map cache module, which can be expressed as the SRC_FM_FIFO module, can be used to store the original feature map read back from the on-chip memory.

The data compression module can be used to divide the original feature map in the feature map cache module into multiple data units, and perform differential compression for each of the multiple data units. Exemplarily, the data compression module may include: a scan coding module and a difference algorithm compression module. Among them, the scan coding module can be expressed as the SCAN_DPCM module, and the difference algorithm compression module can be expressed as the RES_ENC module. Specifically, the data compression module will be described in more detail below in conjunction with FIG. 7 to FIG. 17.

The data packing module can be expressed as a DATA_PACK module, which is used to splice the data compressed by the data compression module into complete compressed data. Specifically, the fragmented data compressed by the data compression module is spliced into complete data, for example, into data with a unit of 16 bytes.
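To make the splicing step concrete, here is a minimal Python sketch of a packer that accumulates variable-length fragments and emits 16-byte units. The class name `DataPacker`, the bit-level fragment interface, the most-significant-bit-first ordering, and the zero padding in `flush` are assumptions for illustration only; the text states only that fragmented compressed data is spliced into complete data, for example in units of 16 bytes.

```python
class DataPacker:
    """Splice variable-length fragments into fixed-size words (16 bytes by default)."""

    def __init__(self, word_bytes=16):
        self.word_bytes = word_bytes
        self.word_bits = word_bytes * 8
        self.acc = 0    # pending bits, oldest fragment in the most significant position
        self.nbits = 0  # number of pending bits held in acc

    def push(self, value, nbits):
        """Append an nbits-wide fragment and return the list of words completed by it."""
        self.acc = (self.acc << nbits) | (value & ((1 << nbits) - 1))
        self.nbits += nbits
        words = []
        while self.nbits >= self.word_bits:
            shift = self.nbits - self.word_bits
            words.append((self.acc >> shift).to_bytes(self.word_bytes, "big"))
            self.acc &= (1 << shift) - 1  # keep only the bits not yet emitted
            self.nbits = shift
        return words

    def flush(self):
        """Emit any remaining bits as one final word, zero-padded at the end."""
        if self.nbits == 0:
            return []
        pad = self.word_bits - self.nbits
        word = (self.acc << pad).to_bytes(self.word_bytes, "big")
        self.acc = 0
        self.nbits = 0
        return [word]
```

A caller would invoke `push` once per compressed fragment and `flush` at the end of a data unit, forwarding every returned word to the next stage.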

The length alignment module can be expressed as the LEN_ALIGN module, which is used to pad the compressed data spliced by the data packing module to a specific length. Exemplarily, when a bypass operation needs to be performed, it is also used to pad the original feature map to the specific length. In other words, the data to be output is padded to a certain length. The specific length is related to the performance of the external memory chip; that is, it may be preset according to the performance of the external memory chip. Exemplarily, at the end of the current data unit, the compressed data can be padded with invalid data to the specific length. For example, at the end of the current data unit, the length of the compressed data is padded from N×16B to ceil(N/4)×64B, where ceil means rounding up and N is a positive integer. It can be understood that some external memory chips (such as DDR) work efficiently only when the written data meets a certain length; by providing the length alignment module, the embodiment of the present invention ensures that the external memory works more efficiently and improves the performance of the entire system.
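The padding rule from N×16B to ceil(N/4)×64B can be checked with a short Python sketch. The function names and the use of zero bytes as the invalid fill data are assumptions; the text specifies only the example lengths.

```python
def aligned_length_bytes(n_units, unit_bytes=16, burst_bytes=64):
    """Return ceil(N / 4) * 64 for N units of 16 bytes, using the default sizes."""
    units_per_burst = burst_bytes // unit_bytes  # 4 with the default sizes
    bursts = -(-n_units // units_per_burst)      # ceiling division on integers
    return bursts * burst_bytes


def pad_to_alignment(data, burst_bytes=64, fill=b"\x00"):
    """Append fill bytes (assumed invalid data) until the length is a multiple of burst_bytes."""
    remainder = len(data) % burst_bytes
    if remainder == 0:
        return data
    return data + fill * (burst_bytes - remainder)
```

For instance, N = 5 units (80 bytes) are padded to 128 bytes, while N = 4 units (64 bytes) need no padding.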

The compression header generation module can be expressed as the ENC_HDR_GEN module, which can generate compression header information corresponding to the compressed data obtained by the data compression module according to the compression instruction received from the compression instruction generation module. Specifically, the compression header information can be generated according to the address information in the compression instruction, the feature map size information, the length of the compression result in the current clock cycle, whether it is the end of the current data unit, and so on. On the one hand, the generated compression header information can be used to determine whether the current data unit needs to be bypassed, and on the other hand, the generated compression header information can be used to decompress compressed data in the future. It is understandable that the process of judging whether the bypass is needed based on the compressed header information is optional, but not necessary, that is, the compressed data and compressed header information can be stored without judging whether the bypass is needed.

The compression header cache module, which can be expressed as the ENC_HDR_FIFO module, is used to buffer the compression header information to be output that is generated by the compression header generation module.

The compressed feature map cache module, which can be expressed as the ENC_FM_FIFO module, is used to cache the data to be output. The cached data may be the length-padded compressed data, or, during a bypass operation, the length-padded original feature map read back from the on-chip memory.

The compression header writing module can be expressed as the ENC_HDR_WR module, which performs the writing operation of the compression header information.

The compressed feature map write module can be expressed as the ENC_FM_WR module, which performs the data storage operation; specifically, it writes to the external memory the compressed data in the compressed feature map cache module, or the original feature map read back from the on-chip memory during the bypass operation.

If the compressed data is larger than the original data, the storage space occupied by the compressed data is greater than that occupied by the original data. In this case it is unreasonable to store the compressed data, so the bypass operation is performed and the original data is stored instead. Specifically, the workflow of the bypass operation can be briefly described as follows:

a. Record the bypass information of this bypass operation to the compression header information for use when decompressing. It can be understood that the compression header information for the compressed data and the compression header information for the original data when the bypass operation is performed may have different compression identifiers. For example, the first compression identifier represents compressed data, and the second compression identifier represents original data.

b. The compressed feature map write module, namely the ENC_FM_WR module, records the base address of this write operation, that is, the address to be written in the external memory.

c. Reset the currently working modules of the compression path. For example, the currently working modules may include the data compression module and so on.

d. The feature map reading module, that is, the RD_FM module, re-sends the read feature map command, thereby restarting the reading of the original feature map of the current compression unit. In other words, the original feature map is read from the on-chip memory again, and the read original feature map can be stored in the feature map cache module.

e. After the original feature map is read under the bypass mechanism, it does not pass through the data compression module and the data packing module, but goes directly from the feature map cache module to the length alignment module, reusing that module for output length alignment.

f. The compressed feature map write module overwrites the previously written compressed data with the original feature map, that is, it outputs the original feature map to the external memory.

Therefore, the embodiment of the present invention can ensure that the storage space occupied by the external memory is smaller by setting the bypass mechanism.
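The bypass decision itself reduces to a length comparison, sketched below in Python. The function name `select_output`, the flag values, and the `compress` callable passed in as a parameter are hypothetical; the text specifies only that the original data is stored when the compressed data is larger, and that the compression header carries a different identifier in each case.

```python
COMPRESSED_FLAG = 0  # hypothetical value for the "first compression identifier"
BYPASS_FLAG = 1      # hypothetical value for the "second compression identifier"


def select_output(original, compress):
    """Return (identifier, payload): the original bytes if compression would enlarge them."""
    compressed = compress(original)
    if len(compressed) > len(original):
        # Bypass: store the original data and record that fact in the header.
        return BYPASS_FLAG, original
    return COMPRESSED_FLAG, compressed
```

In the hardware flow described above the original feature map is re-read from the on-chip memory rather than kept alongside the compressed result; the sketch keeps both in memory only for brevity.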

Through the system for data compression storage in the embodiment of the present invention, the feature map data obtained through the convolutional neural network in the processor can be compressed and stored in the external memory.

Exemplarily, the system shown in FIG. 3 in the embodiment of the present invention may execute a data compression storage method. A schematic flowchart of the method may be shown in FIG. 4 and includes:

S101: Distribute the compression instruction to each of the at least two compression paths;

S102, each compression path reads a corresponding original feature map from the on-chip memory according to the received compression instruction, and compresses the read original feature map;

S103: Store the compressed feature map in the external memory;

Wherein, when at least two compressed paths read the original feature maps in the on-chip memory, arbitration is performed on the read feature map commands of the at least two compressed paths. Wherein, when the at least two compression paths write the compressed feature map into the external memory, the write requests of the at least two compression paths are arbitrated.

Alternatively, the process of performing compression by the system shown in FIG. 3 can be shown in more detail in FIG. 5.

Exemplarily, the compression instruction generation module may receive the compression instruction, parse it, and then distribute compression instructions to each compression path (PATH). The received compression instruction may include information describing the compression task to be performed by each compression path and the priority of each compression path. After the compression instruction is parsed, the task and the priority of each compression path can be configured according to the parsing result. After that, each compression path can perform compression in accordance with the compression instruction it receives. Specifically, the feature map data is read from the on-chip memory and compressed, and compression information (such as compression header information including the length) is calculated. Whether to perform the bypass operation is determined according to the compression information; if the compressed data is longer than the original feature map data, the original feature map data is read again. After the data to be output for storage is determined (the compressed data, or the original feature map data when a bypass operation is performed), length alignment is performed and the compression result is written. The written compression result includes not only the length-padded compressed data, or the original feature map data when the bypass operation is performed, but also the compression header information. If every compression path has completed the compression storage process, the flow for this compression instruction ends; otherwise, the system waits for the unfinished compression paths to continue executing.

It can be seen that the system for data compression storage of the embodiment of the present invention can realize compression instruction reception, processing, and distribution, and can monitor and feedback completion. The workflow shown in FIG. 5 is clear, and can realize the compression and storage of feature map data.

In addition, the system for data compression storage in the embodiment of the present invention may have multiple different states, including but not limited to: idle state, receiving instruction state, parsing instruction state, waiting for completion state, and the like. Exemplarily, the state switching can be implemented according to the state machine shown in FIG. 6.

The idle state can be expressed as the IDLE state. When the system is in this state, it waits for the compression instruction start signal, and after receiving that signal, it switches to the receiving instruction state. The compression instruction start signal can be expressed as instr_strt.

The receiving instruction state can be expressed as the RCV_INSTR state. When the system is in this state, the compression instruction is received until reception is complete. After reception is complete, the instruction ready signal can be output, and the system switches to the parsing instruction state at the same time as, or after, the instruction ready signal is output. The instruction ready signal can be expressed as instr_rdy.

The parsing instruction state can be expressed as the PROC_INSTR state. When the system is in this state, the compression instruction received in the receiving instruction state is parsed, and compression instructions are distributed to each compression path according to the parsing result. The compression instructions distributed to each compression path can be expressed as instr_isu. Specifically, the compression instruction distributed to a given compression path may include: the number of feature maps to be compressed by that compression path; the width of the feature maps; the height of the feature maps; the base address of these feature maps in the on-chip memory; the storage interval between feature maps in the on-chip memory; the base address in the external memory to which these feature maps are output after compression; the storage interval between feature maps output to the external memory after compression; the base address in the external memory to which the compression header information of these feature maps is output; and the storage interval of that compression header information in the external memory.

Taking the compression instruction distributed to compression path 1 as an example, the instruction information it includes may comprise: (1) FM_NUM, which indicates the number of feature maps to be compressed by compression path 1; (2) FM_WIDTH, which indicates the width of the feature maps to be compressed by compression path 1; (3) FM_HIGHT, which indicates the height of the feature maps to be compressed by compression path 1; (4) FM_SRAM_BADDR, which indicates the base address in the on-chip memory of the feature maps to be compressed by compression path 1; (5) FM_SRAM_LEN, which indicates the storage interval in the on-chip memory between the feature maps to be compressed by compression path 1; (6) FM_DDR_BADDR, which indicates the base address in the external memory to which the feature maps compressed by compression path 1 are output; (7) FM_DDR_LEN, which indicates the storage interval in the external memory between the feature maps compressed by compression path 1; (8) FM_HDR_BADDR, which indicates the base address in the external memory to which the compression header information corresponding to the feature maps compressed by compression path 1 is output; (9) FM_HDR_LEN, which indicates the storage interval in the external memory of the compression header information corresponding to the feature maps compressed by compression path 1.
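The nine fields listed above can be gathered into a simple record, as in the following Python sketch. The class name `PathInstruction` and the `addresses` helper are hypothetical, and the address formula (base address plus index times storage interval) is an assumption about how the storage intervals are used; the text defines the fields but does not give the address calculation. The field name FM_HIGHT is spelled as in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathInstruction:
    """Fields of the compression instruction distributed to one compression path."""

    FM_NUM: int         # number of feature maps to compress
    FM_WIDTH: int       # feature map width
    FM_HIGHT: int       # feature map height (identifier spelled as in the text)
    FM_SRAM_BADDR: int  # base address of the feature maps in the on-chip memory
    FM_SRAM_LEN: int    # storage interval between feature maps in the on-chip memory
    FM_DDR_BADDR: int   # base address of the compressed output in the external memory
    FM_DDR_LEN: int     # storage interval between compressed feature maps
    FM_HDR_BADDR: int   # base address of the compression header information
    FM_HDR_LEN: int     # storage interval of the compression header information

    def addresses(self, index):
        """Return (sram, ddr, header) start addresses of feature map `index`.

        Assumes each address is base + index * interval, which the text does not state.
        """
        if not 0 <= index < self.FM_NUM:
            raise IndexError("feature map index out of range")
        return (
            self.FM_SRAM_BADDR + index * self.FM_SRAM_LEN,
            self.FM_DDR_BADDR + index * self.FM_DDR_LEN,
            self.FM_HDR_BADDR + index * self.FM_HDR_LEN,
        )
```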

The waiting state can be expressed as the WAIT_DONE state. When the system is in this state, the completion signal of each compression path is monitored, and once the completion of all compression paths has been detected, the system can switch to the idle state. Exemplarily, after the compression path is completed, an instruction completion signal may be output to the upper-level module. The instruction completion signal can be expressed as instr_done.

It can be seen that, by setting the state machine of the system for data compression storage, the embodiment of the present invention can ensure the normal operation of the system and ensure the safe and orderly storage of the feature map data.

For the system shown in Figure 3, the following will describe how each compression path performs data compression in conjunction with Figures 7 to 17. It can be understood that, since the process of compressing the feature map data by each compression path is similar, the following compression process may be performed for any compression path.

Figure 7 shows a schematic diagram of the modules of a compression path. The function of each module is as described above in conjunction with FIG. 3. In FIG. 7, the dashed box shows the data compression module, which includes a scan coding module and a difference algorithm compression module. The scan coding module can be expressed as the SCAN_DPCM module, and the difference algorithm compression module can be expressed as the RES_ENC module. The scan coding module (SCAN_DPCM module) is the difference calculation module of the difference (Residual, RES) compression method; it scans (SCAN) the data to be compressed according to the compression performance and takes out a certain amount of data for compression. It can be understood that the amount of data that can be compressed in each clock cycle reflects the compression performance of the compression path. The difference algorithm compression module (RES_ENC module) is the data compression module of the difference compression method. It compresses the difference values output by the scan coding module (SCAN_DPCM module) according to the difference compression algorithm, and at the same time outputs the length of the compression result of the current cycle and whether the current cycle is the end of the current data unit.

In the scenario where the feature map data is the output of a convolutional layer of a convolutional neural network, the values of two adjacent pixels of the feature map output by the convolutional layer are very close or even equal. This characteristic can be fully exploited by compressing the direct differences between adjacent pixels.

FIG. 8 is a schematic flowchart of a data storage method according to an embodiment of the present invention. The method shown in FIG. 8 includes:

S110: Receive feature map data to be stored;

S120: Divide the feature map data into multiple data units;

S130: For each data unit of the multiple data units, determine whether the data in the data unit is all zeros, and compress according to the result of the determination;

S140: Store the compressed feature map data.

Exemplarily, before S110, the method may further include: receiving a compression instruction; and, according to the received compression instruction, sending a read feature map command, so that the feature map data corresponding to the read feature map command is obtained from the on-chip memory in S110. Specifically, after a compression path receives the compression instruction, it sends a read feature map command to the read arbitration module according to the instruction information in the compression instruction. Optionally, the read feature map command includes the width and height of the feature map to be read, its base address in the on-chip memory, and so on. In one embodiment, the compression instruction generation module reads the compression instruction from the on-chip memory.

Exemplarily, the feature map read at one time through the read feature map command may correspond to the smallest access unit of the memory. Optionally, it can be understood that the feature map data corresponding to the smallest access unit of the memory is received in S110. As described below, the size of the received feature map data may be equal to or smaller than the minimum access unit of the memory. As an example, the feature map data corresponding to the smallest access unit of the memory can be referred to as a compression unit. Correspondingly, in S120, the feature map data corresponding to the smallest access unit is divided into multiple data units.

In this way, aligning the width and the number of rows of the feature map according to the minimum access unit of the memory can facilitate the access of the feature map on the one hand, and can efficiently use the bandwidth of the read-write memory on the other hand.

Exemplarily, if the storage space required by a row of data of the feature map data is greater than the minimum access unit, the data located in the same minimum access unit belongs to the same row of the feature map data. If the storage space required for one row of feature map data is less than the minimum access unit, the data belonging to the same row of feature map data is located in the same minimum access unit. Among them, the storage space required for one row of feature map data is determined according to the width of the feature map and the data bit width of each pixel.

Specifically, it is assumed that the minimum access unit of the memory is 32 Bytes (32B for short), and the data bit width of each pixel is 8 bits. Then the total storage length of each feature map is aligned to 32B. Assuming that the width of the feature map is fm_w, then, as shown in Figure 9: (1) if fm_w>=17, each row is aligned to 32B, and each 32B stores at most 1 row of the feature map (multiple 32Bs may also be required to store 1 row), and the remaining invalid data can be filled with 0; (2) if fm_w<=16, each row is aligned to 16B, and each 32B stores at most 2 rows of the feature map, and the remaining invalid data is filled with 0. Exemplarily, for ease of understanding, a feature map with fm_w>=17 can be defined as a large image, and a feature map with fm_w<=16 can be defined as a small image.

Specifically, it is assumed that the minimum access unit of the memory is 64 Bytes (64B for short), and the data bit width of each pixel is 8 bits. Then the total storage length of each feature map is aligned to 64B. Assuming that the width of the feature map is fm_w, then, as shown in Figure 10: (1) if fm_w>=33, each row is aligned to 64B, and each 64B stores at most 1 row of the feature map (multiple 64Bs may also be required to store 1 row), and the remaining invalid data can be filled with 0; (2) if fm_w is in [17,32], each row is aligned to 32B, and each 64B stores at most 2 rows of the feature map, and the remaining invalid data can be filled with 0; (3) if fm_w<=16, each row is aligned to 16B, and each 64B stores at most 4 rows of the feature map, and the remaining invalid data is filled with 0. Exemplarily, for ease of understanding, a feature map with fm_w>=33 can be defined as a large image, a feature map with fm_w in [17,32] can be defined as a middle image, and a feature map with fm_w<=16 can be defined as a small image.
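The two alignment examples above (32B and 64B minimum access units with 8-bit pixels) follow one pattern, which the following Python sketch tries to capture. The function names row_alignment, row_stride, and rows_per_unit are hypothetical, and the sketch assumes that 16B is the smallest row alignment and that alignments double up to the minimum access unit, which matches both examples but is not stated as a general rule in the embodiment.

```python
def row_alignment(fm_w: int, min_unit: int) -> int:
    """Row alignment in bytes for 8-bit pixels (assumed: 16B floor, doubling)."""
    if fm_w < 1:
        raise ValueError("fm_w must be positive")
    align = 16
    while align < min_unit and fm_w > align:
        align *= 2
    return align


def row_stride(fm_w: int, min_unit: int) -> int:
    """Bytes occupied by one row after zero padding to the row alignment."""
    align = row_alignment(fm_w, min_unit)
    return -(-fm_w // align) * align  # round fm_w up to a multiple of align


def rows_per_unit(fm_w: int, min_unit: int) -> int:
    """Maximum number of rows held by one minimum access unit (at least 1)."""
    return max(1, min_unit // row_stride(fm_w, min_unit))
```

For instance, with a 64B minimum access unit a feature map of width 20 is aligned to 32B per row, so each 64B holds at most 2 rows, in line with case (2) above.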

Those skilled in the art should understand that the minimum access unit of the memory can also be 16 Bytes or another size, and the data bit width of each pixel can also be 4 bits, 16 bits, or another size; the storage form of the feature map can be determined similarly in those cases and is not enumerated one by one in the embodiments of the present invention.

Subsequently, the feature map data to be stored corresponding to the read feature map command can be received in S110, and temporarily stored in the feature map cache module. It can be understood that what is received in S110 is the original feature map data before compression.

Exemplarily, S120 and S130 in FIG. 8 may be executed by the data compression module. In S120, the current compression unit can be set, such as a row of the feature map data or all of the feature map data. Subsequently, the current compression unit is divided into multiple data units. As an example, one data unit may include 8 pixels. In this way, a compression path can compress a data unit of 8 pixels at a time. When the system shown in Figure 3 performs parallel compression with at least two compression paths, each compression path can compress a data unit of 8 pixels at a time, which increases the degree of parallelism. On the one hand, this improves the efficiency and speed of compression; on the other hand, it prevents the compression from becoming the performance bottleneck of the system.
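As a minimal sketch of the division in S120, the following Python function splits a compression unit into data units of 8 pixels. The function name split_into_data_units is hypothetical, and the sketch assumes the compression unit has already been zero padded so that its length is a multiple of the data unit size.

```python
def split_into_data_units(compression_unit, unit_size=8):
    """Split a padded compression unit into consecutive data units."""
    if len(compression_unit) % unit_size != 0:
        raise ValueError("compression unit must be padded to the data unit size")
    return [
        list(compression_unit[i:i + unit_size])
        for i in range(0, len(compression_unit), unit_size)
    ]
```

A 32-pixel compression unit would thus yield 4 data units, each of which a compression path can process in one pass.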

Exemplarily, in S130, one data unit may be compressed through the following process: divide the data unit into one or more groups; if the data of a first group of the multiple groups is all zeros, the compressed data is 0; if the data of a second group of the multiple groups is not all zeros, determine multiple differences between the data in the second group, and compress based on the multiple differences.

Among them, the data of the first group is all zeros means: all the data of the first group are zeros. If the data in the second group is not all zeros, it means that at least one data in the second group is not zero.

Exemplarily, if the data in the second group is not all zeros, the multiple differences between the data in the second group refer to the differences between every two adjacent pixels.

If the data in the second group is not all zeros, determining the multiple differences between the data in the second group may include: determining the difference between the first data in the second group and the last data before the second group, and determining the difference between each data in the second group other than the first data and the first data. It is understandable that if the second group includes n0 data, then n0 differences will be obtained. It should also be noted that the multiple differences are signed numbers, that is, each has a sign bit.

With reference to Figure 7, in the embodiment of the present invention the scan coding module may: divide a data unit into one or more groups; determine whether the data in each group is all zeros; and, when the data in a certain group is not all zeros, calculate the multiple differences between the data in that non-all-zero group.

Assuming that one data unit has 8 pixels, the data unit can be divided into two groups, that is, each group includes 4 pixels. If the 8 pixels of a data unit are represented as {p1,p2,p3,p4,p5,p6,p7,p8}, then the two groups after division are {p1,p2,p3,p4} and {p5,p6,p7,p8}. Subsequently, for the first group {p1,p2,p3,p4}, it is determined whether the pixel values of these four pixels are all zeros. If they are all zeros, this can be represented by an all-zero indicator, for example a 1-bit "0". If the pixel values of these four pixels are not all zeros, that is, at least one pixel is non-zero, this can be represented by a non-all-zero indicator, for example a 1-bit "1". For the second group {p5,p6,p7,p8}, a similar judgment can be performed, and an all-zero indicator or a non-all-zero indicator can be obtained.

In this way, for a data unit {p1, p2, p3, p4, p5, p6, p7, p8}, the indicator can be obtained by judging whether the two groups are all zeros, as shown in Table 1 below.

Table 1

Indicator	Meaning
00	Both groups of pixels are all zeros
10	The second group of pixels is all zeros; the first group of pixels is not all zeros
01	The first group of pixels is all zeros; the second group of pixels is not all zeros
11	Neither group of pixels is all zeros
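The mapping in Table 1 can be illustrated with a short Python sketch. The function name all_zero_indicator is hypothetical; the sketch assumes a data unit of 8 pixels split into two groups of 4, with "0" for an all-zero group and "1" for a non-all-zero group, as described above.

```python
def all_zero_indicator(unit):
    """Return the 2-bit indicator of Table 1 for a data unit of 8 pixels."""
    if len(unit) != 8:
        raise ValueError("a data unit is assumed to hold 8 pixels")
    flag1 = 0 if all(p == 0 for p in unit[:4]) else 1  # first group p1..p4
    flag2 = 0 if all(p == 0 for p in unit[4:]) else 1  # second group p5..p8
    return f"{flag1}{flag2}"
```

The first character of the result describes the first group and the second character describes the second group.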

Further, if the indicator is "10", the difference values D1, D2, D3, and D4 can be calculated. If the indicator is "01", the difference values D5, D6, D7, and D8 can be calculated. If the indicator is "11", the difference values D1, D2, D3, D4, D5, D6, D7, and D8 can be calculated.

Specifically, D1=p1–p0; D2=p2–p1; D3=p3–p1; D4=p4–p1; and D5=p5–p4; D6=p6–p5; D7=p7–p5; D8=p8–p5.

Among them, p0 represents the last pixel of the previous data unit located before the data unit, as shown in FIG. 11. It should be noted that if the data unit is the starting position of the current compression unit, that is, there is no previous data unit in the data unit, p0=0 can be defined.
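The difference formulas above can be sketched in Python as follows. The function name group_differences is hypothetical; the reference pixel for each difference follows the formulas exactly as written (p0 for D1, p1 for D2 to D4, p4 for D5, p5 for D6 to D8), and returning None for an all-zero group is a choice made for this example to reflect that no differences are needed in that case.

```python
def group_differences(unit, p0=0):
    """Compute (D1..D4, D5..D8) for a data unit of 8 pixels.

    p0 is the last pixel of the previous data unit, or 0 at the start of
    the current compression unit. An all-zero group yields None.
    """
    if len(unit) != 8:
        raise ValueError("a data unit is assumed to hold 8 pixels")
    p = [p0] + list(unit)  # p[1]..p[8] are the pixels, p[0] is p0
    first = [p[1] - p[0], p[2] - p[1], p[3] - p[1], p[4] - p[1]]
    second = [p[5] - p[4], p[6] - p[5], p[7] - p[5], p[8] - p[5]]
    if all(v == 0 for v in unit[:4]):
        first = None
    if all(v == 0 for v in unit[4:]):
        second = None
    return first, second
```

For example, with p0=1 and the unit {3,4,4,5,0,0,0,0}, only the first group is processed and its differences are 2, 1, 1, and 2.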

It should be noted that the multiple difference values obtained are signed numbers. For example, assuming that each pixel in a data unit is an 8-bit signed number, each difference obtained is a 9-bit signed number, where the first bit of the 9-bit signed number is its sign bit. For example, a sign bit of 0 indicates a non-negative number, and a sign bit of 1 indicates a negative number.

Exemplarily, a schematic structural diagram of the scan coding module (SCAN_DPCM module) in the embodiment of the present invention may be as shown in FIG. 12.

The register, which can be expressed as SRORAGE_MIN_UNIT, temporarily stores the data of one minimum access unit of the memory, which can include multiple data units.

Specifically, the scan coding module may divide the feature map data in the minimum access unit into multiple data units, that is, all data in one data unit are located in the same minimum access unit.

With reference to the foregoing description of the minimum access unit in conjunction with Figure 9 and Figure 10, if the storage space required for one row of feature map data is greater than the minimum access unit, the data located in the same minimum access unit belongs to the same row of the feature map data. If the storage space required for one row of feature map data is less than the minimum access unit, the data belonging to the same row of feature map data is located in the same minimum access unit.

SCAN_MUX can select a data unit from the register (SRORAGE_MIN_UNIT), and the data unit is then divided into one or more groups for compression. Specifically, SCAN_MUX fetches one data unit at a time from the minimum access unit until the minimum access unit has been fully traversed. In order to avoid invalid compression operations, a data unit must contain at least one valid pixel. If all the data contained in a data unit is invalid padding data, the data unit is an invalid data unit; in that case, the data unit can be skipped and processing continues with the next data unit.

For example, assuming that the minimum access unit of the memory is 32B and the data bit width of each pixel of the feature map is 8 bits, the following describes how to avoid invalid compression operations in conjunction with Figure 13: (1) when the width of the feature map (fm_w) takes a value in [1,8], the upper 8B of every 16B is discarded; (2) when the width of the feature map (fm_w) takes a value in [17,24], the upper 8B of every 32B is discarded.
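The skipping rule above can be sketched in Python as a mask over the data units of one minimum access unit. The function name valid_unit_mask is hypothetical. The sketch assumes 8-bit pixels, data units of 8B, a row that fits within one minimum access unit, a smallest row alignment of 16B, and that every row slot in the access unit actually holds a row; none of these is claimed to be the exact circuit behavior.

```python
def valid_unit_mask(fm_w, min_unit=32, unit_size=8):
    """Return one bool per data unit: True if it holds >= 1 valid pixel."""
    if not 1 <= fm_w <= min_unit:
        raise ValueError("sketch only covers rows that fit in one access unit")
    # Assumed row alignment: 16B at minimum, doubling up to min_unit.
    align = 16
    while align < min_unit and fm_w > align:
        align *= 2
    mask = []
    for offset in range(0, min_unit, unit_size):
        column = offset % align      # byte offset inside the row slot
        mask.append(column < fm_w)   # padding starts at column fm_w
    return mask
```

With a 32B minimum access unit, fm_w=8 gives [True, False, True, False] (the upper 8B of every 16B is skipped), and fm_w=20 gives [True, True, True, False] (the upper 8B of the 32B is skipped).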

Returning now to Figure 12, suppose a data unit that has been taken out contains 8 pixels {p1,p2,p3,p4,p5,p6,p7,p8}, and every 4 pixels form a group, so that the data unit is divided into two groups, namely {p1,p2,p3,p4} and {p5,p6,p7,p8}. It can then be judged whether the 4 pixels of the first group are all zeros, which is indicated by the all-zero/non-all-zero indicator ALL0_FLAG1, and whether the 4 pixels of the second group are all zeros, which is indicated by the all-zero/non-all-zero indicator ALL0_FLAG2.

If the 4 elements of the first group are all zeros, then ALL0_FLAG1=0, otherwise ALL0_FLAG1=1. If the 4 elements of the second group are all zeros, then ALL0_FLAG2=0, otherwise ALL0_FLAG2=1. Refer to Table 1 above, which indicates the all-zero/non-all-zero situation of the pixels in the data unit.

Further, if ALL0_FLAG1=1, the difference (CALC_RES) D1, D2, D3, D4 can be calculated. If ALL0_FLAG2=1, the difference (CALC_RES) D5, D6, D7, D8 can be calculated. Among them, you can use registers {D1, D2, D3, D4, D5, D6, D7, D8} to store the difference value in a pipeline, and the difference value satisfies:

D1=p1–p0;

D2=p2–p1;

D3=p3–p1;

D4=p4–p1;

D5=p5–p4;

D6=p6–p5;

D7=p7–p5;

D8=p8–p5;

Wherein, referring to FIG. 12, p0 is 0 or the last pixel of the previous data unit. Specifically, p0 is the last pixel of the previous data unit (ie PRE_P8 in the figure), but if there is no previous data unit, that is, the current data unit is the start of a row of the feature map (row_start), then p0=0.

In addition, it is understandable that if each pixel of {p1,p2,p3,p4,p5,p6,p7,p8} is an 8-bit signed number, then the above differences {D1,D2,D3,D4,D5,D6,D7,D8} are 9-bit signed numbers.

Further, compressing the multiple differences in S130 may include: determining the number of storage bits according to multiple non-negative numbers corresponding to the multiple differences, and compressing the multiple differences according to the sign bits of the multiple differences and the determined number of storage bits.

In the embodiment of the present invention, the multiple differences may take both large and small values. When the multiple differences are all small, fewer bits can be used to represent them, so that less storage space is occupied after compression. The above-mentioned number of storage bits can be expressed as len, which represents the minimum number of bits needed for each of the multiple differences after compression.

This process can be executed by the difference algorithm compression module. Specifically, the number of storage bits can be determined according to the multiple non-negative numbers corresponding to the multiple differences, and the multiple differences can be compressed according to their sign bits and the number of storage bits.

Exemplarily, the process may include: determining multiple non-negative numbers in one-to-one correspondence with the multiple differences; determining the number of bits required for storage according to the position of the highest non-zero bit in the multiple non-negative numbers; and compressing the multiple differences according to the sign bits of the multiple differences and the number of bits, where the storage length of each compressed difference, excluding its sign bit, is the number of bits.

The non-negative number corresponding to a difference is derived from the binary representation of the difference. Exemplarily, if the sign bit of a first difference indicates that the first difference is a non-negative number, the non-negative number corresponding to the first difference is the number obtained by removing the sign bit of the first difference. If the sign bit of a second difference indicates that the second difference is a negative number, the non-negative number corresponding to the second difference is the number obtained by removing the sign bit of the second difference and inverting the remaining bits.
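The derivation of the non-negative number can be sketched in Python as follows. The function name sign_and_non_negative is hypothetical, and the sketch assumes that each difference is held as a 9-bit two's complement number (1 sign bit plus 8 remaining bits), which is consistent with 8-bit signed pixels.

```python
def sign_and_non_negative(diff):
    """Split a 9-bit signed difference into (sign bit, non-negative number)."""
    if not -256 <= diff <= 255:
        raise ValueError("difference does not fit in 9 signed bits")
    raw = diff & 0x1FF         # 9-bit two's complement bit pattern
    sign = raw >> 8            # highest bit is the sign bit
    rest = raw & 0xFF          # remaining 8 bits after removing the sign bit
    if sign:
        rest = ~rest & 0xFF    # negative: invert the remaining bits
    return sign, rest
```

For a negative difference the result is one less than its absolute value; for example, -3 maps to sign bit 1 and non-negative number 2.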

Wherein, the position of the highest non-zero value in the multiple non-negative numbers can be determined by performing a "bitwise OR" operation on multiple non-negative numbers, and then the number of bits required to store multiple differences can be determined.

When the multiple differences are compressed, only the lowest len bits of each corresponding non-negative number are retained and the leading zero bits are deleted, and the sign bit is then added in front of the retained part. Moreover, when a non-all-zero group is compressed and stored, the stored compressed data may include: a non-all-zero indicator, a bit number indicator, and the multiple compressed differences, where the bit number indicator indicates the length of each compressed difference after its sign bit is removed. That is to say, each compressed difference consists of a sign bit followed by data of len bits.

The following describes the difference compression process in conjunction with the aforementioned data units {p1, p2, p3, p4, p5, p6, p7, p8}.

Assuming that the first group is non-all zeros and the second group is non-all zeros, the scan coding module obtains 8 difference values D1 to D8. The following describes, with reference to Figure 14, an example process by which the difference algorithm compression module compresses the 8 difference values D1 to D8. Referring to FIG. 14, the sign bits of the eight differences D1 to D8 can be extracted, and the corresponding non-negative numbers can then be determined according to the sign bits. Taking D1 as an example, F1 represents the highest bit of D1, that is, the sign bit; D1' represents the binary number remaining after the sign bit of D1 is removed; and d1' represents the non-negative number corresponding to the difference D1. Specifically, if D1 itself is a non-negative number, for example F1 is 0, then D1' is the non-negative number corresponding to D1, that is, d1' is determined to be D1'. In the other case, if D1 itself is a negative number, for example F1 is 1, then (~D1') is the non-negative number corresponding to D1, that is, d1' is determined to be (~D1'), where ~ represents bitwise inversion. It should be noted that if D1 itself is a negative number (F1 is 1), then the absolute value of the negative number represented by D1 is ~D1'+1. For example, a sign bit followed by an 8-bit binary number can represent decimal values from -256 to 255: the binary number "11111111" represents the decimal number 255, and with the sign bit "1" for negative numbers added in front of it, it represents the decimal number -256. That is to say, when the sign bit is "1", the absolute value of the corresponding negative number is the decimal value of the bits after the sign bit plus 1. In this way, through a similar process, the 8 non-negative numbers corresponding to the 8 differences can be obtained: d1', d2', d3', d4', d5', d6', d7', d8'.

Subsequently, for the first group, d_max1 is obtained by performing a bitwise OR operation on d1', d2', d3', and d4', in order to determine how many bits are needed to represent the 4 difference values D1, D2, D3, and D4 of the first group: the position of the highest bit of d_max1 that is 1 is detected, which gives len1. Similarly, for the second group, d_max2 is obtained by performing a bitwise OR operation on d5', d6', d7', and d8', in order to determine how many bits are needed to represent the 4 difference values D5, D6, D7, and D8 of the second group: the position of the highest bit of d_max2 that is 1 is detected, which gives len2.
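The bitwise OR step can be sketched in Python as follows. The function name storage_bits is hypothetical. The sketch assumes that the number of bits is at least 1 even when all non-negative numbers are 0, which matches the range [1,8] given in this description for 8-bit pixels.

```python
def storage_bits(non_negatives):
    """Return len: position of the highest 1 bit of the OR of all inputs."""
    d_max = 0
    for value in non_negatives:
        if value < 0:
            raise ValueError("inputs must be non-negative")
        d_max |= value                    # bitwise OR accumulates set bits
    return max(1, d_max.bit_length())     # assumed lower bound of 1 bit
```

For the non-negative numbers 5, 1, 0, and 2, the OR is binary 111, so 3 bits are enough for every one of them.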

After this, the lowest len1 bits of d1', d2', d3', and d4' can be retained, and the corresponding sign bits F1, F2, F3, and F4 can then be added in front to obtain the difference-compressed results d1, d2, d3, and d4. Likewise, the lowest len2 bits of d5', d6', d7', and d8' can be retained, and the corresponding sign bits F5, F6, F7, and F8 can then be added in front to obtain the difference-compressed results d5, d6, d7, and d8.
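The truncation and sign prefixing described above can be sketched in Python, with each compressed difference shown as a string of bits for readability. The function name pack_difference is hypothetical, and the bit string form is only an illustration, not the hardware representation.

```python
def pack_difference(sign, non_negative, length):
    """Return the sign bit followed by the lowest `length` bits."""
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if length < 1:
        raise ValueError("length must be at least 1")
    kept = non_negative & ((1 << length) - 1)   # keep the lowest bits only
    return f"{sign}" + format(kept, f"0{length}b")
```

With a length of 3, the difference 5 (sign 0, non-negative number 5) becomes "0101", and the difference -3 (sign 1, non-negative number 2) becomes "1010".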

It should be understood that although the process of compressing the eight differences is described in conjunction with FIG. 14, the present invention is not limited thereto. For example, if the first group is all zeros, there is no need to calculate D1, D2, D3, and D4, and there is no need to compress to get d1, d2, d3, and d4. Similarly, if the second group is all zeros, there is no need to calculate D5, D6, D7, and D8, and there is no need to compress to get d5, d6, d7, and d8.

As described above, in the embodiment of the present invention, it is assumed that the data bit width of each pixel is 8 bits, so that each of the multiple difference values is a 9-bit difference value (including a 1-bit sign bit). Therefore, the number of storage bits occupied by each difference is at most 8, so that len1 and len2 only need 3 bits. It is understandable that a 3-bit binary number can represent [0,7], and in this example, the number of bits corresponding to the compressed difference is [1,8]. For example, suppose len1 is “010”, which means that the number of bits after the difference is compressed is 3; suppose len1 is “111”, which means that the number of bits after the difference is compressed is 8. In addition, it can be understood that since each compressed difference value d1 to d8 also includes its own sign bit, the actual number of bits occupied by each compressed difference value is [2,9].
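The 3-bit bit number indicator described above stores a value in [0,7] for a bit count in [1,8]. A minimal Python sketch of this mapping, with hypothetical function names encode_len and decode_len, is:

```python
def encode_len(length):
    """Map a bit count in [1, 8] to its 3-bit indicator string."""
    if not 1 <= length <= 8:
        raise ValueError("bit count must be in [1, 8]")
    return format(length - 1, "03b")


def decode_len(indicator):
    """Map a 3-bit indicator string back to the bit count."""
    if len(indicator) != 3 or set(indicator) - {"0", "1"}:
        raise ValueError("indicator must be 3 binary digits")
    return int(indicator, 2) + 1
```

This reproduces the examples in the text: a bit count of 3 is stored as "010" and a bit count of 8 is stored as "111".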

After the compressed differences d1 to d8 are obtained, the compressed data can be further obtained based on them. Specifically, the compressed data for a group in the data unit includes: an all-zero/non-all-zero indicator, a bit number indicator (if the group is not all zeros), and the compressed differences (if the group is not all zeros). Exemplarily, for the data unit shown in FIG. 11, the compressed data may be as shown in FIG. 15, and there may be three cases.

In case 1, if both groups of data units are all zeros, only two all-zero indicators are needed, occupying 2 bits. That is, ALL0_FLAG1=0, ALL0_FLAG2=0 and compressed data (ENC_RESULT)=ALL0_FLAG1, ALL0_FLAG2.

In case 2, if one of the two groups of the data unit is all zeros and the other is non-all zeros, an all-zero indicator and a non-all-zero indicator are required, occupying 2 bits; a bit number indicator is also required, occupying 3 bits; and 4 compressed differences are required, whose number of occupied bits depends on the specific value of the number of bits. That is, ALL0_FLAG1=1, ALL0_FLAG2=0 and compressed data (ENC_RESULT)=ALL0_FLAG1, ALL0_FLAG2, len1, d1, d2, d3, d4; or ALL0_FLAG1=0, ALL0_FLAG2=1 and compressed data (ENC_RESULT)=ALL0_FLAG1, ALL0_FLAG2, len2, d5, d6, d7, d8.

In case 3, if both groups of data units are non-all zeros, then two non-all zero indicators occupy 2 bits; two bit number indicators are also required, occupying 6 bits; and 8 compression differences are required, The number of occupied bits is related to the specific value of the number of bits. That is, ALL0_FLAG1=1, ALL0_FLAG2=1 and compressed data (ENC_RESULT)=ALL0_FLAG1, ALL0_FLAG2, len1, len2, d1, d2, d3, d4, d5, d6, d7, d8.
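The three cases can be tied together in one Python sketch that produces ENC_RESULT as a string of bits for a single data unit. The function name encode_data_unit is hypothetical. The sketch assumes 8-bit signed pixels, 9-bit two's complement differences, a bit number indicator that stores the bit count minus 1 in 3 bits, and the field order ALL0_FLAG1, ALL0_FLAG2, len1, len2, d1 to d8 given above; it illustrates the described format and is not the circuit of FIG. 14.

```python
def encode_data_unit(unit, p0=0):
    """Encode 8 pixels into a bit string: flags, lengths, differences."""
    if len(unit) != 8 or any(not -128 <= v <= 127 for v in unit):
        raise ValueError("expected 8 pixels, each an 8-bit signed value")
    p = [p0] + list(unit)
    # Index of the pixel subtracted for D1..D8, following the text's formulas.
    ref = [0, 1, 1, 1, 4, 5, 5, 5]
    diffs = [p[i + 1] - p[ref[i]] for i in range(8)]
    flags, lens, body = "", "", ""
    for g in range(2):
        if all(v == 0 for v in unit[4 * g:4 * g + 4]):
            flags += "0"                      # all-zero group: flag only
            continue
        flags += "1"
        pairs, d_max = [], 0
        for diff in diffs[4 * g:4 * g + 4]:
            raw = diff & 0x1FF                # 9-bit two's complement
            sign, rest = raw >> 8, raw & 0xFF
            if sign:
                rest = ~rest & 0xFF           # invert bits of negatives
            d_max |= rest
            pairs.append((sign, rest))
        length = max(1, d_max.bit_length())   # bits kept per difference
        lens += format(length - 1, "03b")
        body += "".join(f"{s}" + format(r, f"0{length}b") for s, r in pairs)
    return flags + lens + body
```

For the data unit {0,0,0,0,2,1,2,5}, only the second group is encoded: the differences are 2, -1, 0, and 3, the bit count is 2, and the result is "01", then "001", then "010", "100", "000", and "011".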

Exemplarily, the compression length of the compressed data can also be calculated, and the compression length can represent all the bits occupied by the compressed data, for example, refer to the sum of the bits of each data contained in the compressed data shown in FIG. 15.
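As a sketch of this calculation, the following Python function sums the fields of the three cases: 2 bits of indicators, plus, for each non-all-zero group, a 3-bit bit number indicator and 4 compressed differences of len+1 bits each. The function name compressed_length is hypothetical, and the 3-bit indicator width assumes 8-bit pixels as in the example above.

```python
def compressed_length(flag1, len1, flag2, len2):
    """Total bits of ENC_RESULT for one data unit of two 4-pixel groups."""
    total = 2                                  # ALL0_FLAG1 and ALL0_FLAG2
    for flag, length in ((flag1, len1), (flag2, len2)):
        if flag:                               # non-all-zero group
            if not 1 <= length <= 8:
                raise ValueError("bit count must be in [1, 8]")
            total += 3 + 4 * (length + 1)      # indicator + 4 signed values
    return total
```

Under these assumptions the length ranges from 2 bits (both groups all zeros) to 80 bits (both groups need 8 bits per difference), compared with 64 bits for the 8 uncompressed 8-bit pixels.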

After that, the obtained compressed data can be length-aligned and then cached in the compressed feature map cache module for the subsequent output process. Exemplarily, as described in conjunction with the aforementioned FIG. 3, if the compression header generation module determines that a bypass operation is not needed, the compressed data is written to the external memory; otherwise, the original feature map is read again and replaces the compressed data cached in the compressed feature map cache module, and the original feature map after replacement is written into the external memory.

And, it should be understood that, in S130, after performing compression on the first data unit of the multiple data units, the second data unit located after the first data unit is read immediately, and a similar compression operation is performed. The first data unit and the second data unit may be two adjacent data units that are sequentially compressed in time by the compression path.

Exemplarily, after the first data unit is compressed, the compression process for the second data unit is started while the compressed first data unit is being stored. Starting the compression process for the second data unit includes: judging whether the data of the second data unit is all zeros. That is to say, while the compressed first data unit is being written into the external memory, the compression process for the second data unit is started by determining whether its data is all zeros. In this way, pipelined compression processing of multiple data units can be realized, and the resource utilization rate can be improved.

Exemplarily, after the scan coding module determines whether the two groups in the first data unit are all zeros, and calculates the difference (if there is a non-zero group), the difference algorithm compression module compresses the difference. And when the difference algorithm compression module compresses the difference of the data in the first data unit, the scan coding module starts to determine whether the two groups in the second data unit are all zeros. In other words, at the same time, different modules may be performing compression processing for different data units. In this way, resource utilization can be further improved, and the efficiency of data unit compression by the compression path can be improved.

It can be understood that the first data unit and the second data unit herein may be two adjacent data units to be processed in the register (as shown in FIG. 12, the register temporarily stores data of the minimum access unit size of the memory).

It can be seen that the data compression module in the embodiment of the present invention may include a scan coding module and a difference algorithm compression module. This is a multi-stage compression pipeline design that reduces the combinational logic of each stage, so that the generated circuit can support a higher clock frequency and chip performance is improved. In addition, the scan coding module and the difference algorithm compression module have the circuit design structures shown in FIG. 12 and FIG. 14, respectively, so that one compression path can compress 8 pixels of data at a time.

Specifically, the process of performing compression by each compression path may be as shown in FIG. 16. Exemplarily, the steps after reading the feature map data in FIG. 16 are performed by the difference compression module. As mentioned above, the feature map data is read according to the minimum access unit of the memory, that is, feature map data of the minimum access unit size is read at a time. After the compression of that feature map data of the minimum access unit size is completed, the feature map data of the next minimum access unit size is read, until all the feature map data indicated by the compression instruction has been read.

The scan coding module can read a data unit of the feature map data with the smallest access unit size, and judge whether the two groups included in the data unit are all zeros, and if there are non-all zero groups, the original difference is calculated, where the original difference is The value represents the difference between two adjacent pixels in the data unit, such as the above D1 to D8.

The difference algorithm compression module can compress the original difference values (such as the above D1 to D8) to obtain the compressed difference values (such as the above d1 to d8), and calculate the compression length. Optionally, a flag indicating whether the current data unit is the end of the current compression unit can also be output.

After the compression of one data unit is completed, the next data unit can be read from the feature map data of the smallest access unit size until the compression process of all pixels of the feature map data of the smallest access unit size is completed.

It can be seen that the method for data compression and storage in the embodiment of the present invention fully takes into account both the occurrence of zeros in the feature map and the fact that the values of adjacent pixels of the feature map data are close, and uses the difference method for compression, so that the compressed data occupies less storage space. On the one hand, this reduces the space occupied in the external memory; on the other hand, it reduces the bandwidth resources consumed during reading and writing and saves power.

As another aspect of the embodiments of the present invention, another device for data compression and storage is also provided. As shown in FIG. 17, the device may include: a receiving device 210, a dividing device 220, a compression device 230, and a storage device 240.

The receiving device 210 is configured to receive feature map data to be stored;

The dividing device 220 is configured to divide the feature map data into multiple data units;

The compression device 230 is configured to, for each data unit of the multiple data units, determine whether the data in the data unit is all zeros, and perform compression according to the result of the determination;

The storage device 240 is configured to store the compressed feature map data.

In one implementation, the compression device 230 compresses a data unit through the following process: divide the data unit into one or more groups; if the data in a first group among the groups is all zeros, the compressed data of the first group is 0; if the data in a second group among the groups is not all zeros, then determine multiple differences between the data in the second group, and compress according to the multiple differences.

In one implementation, the compression device 230 is configured to: determine the difference between the first data in the second group and the last data before the second group, and determine the difference between each other data in the second group and the first data.

In an implementation manner, the compression device 230 is configured to: determine a plurality of first non-negative numbers in one-to-one correspondence with the plurality of differences; determine, according to the plurality of first non-negative numbers, the number of bits required for storage; and compress the plurality of differences according to the sign bits of the plurality of differences and the number of bits, wherein the compressed length of each difference is the number of bits plus one.

In an implementation manner, the compression device 230 is configured to: determine the number of bits required for storage by performing a bitwise OR operation on a plurality of first non-negative numbers.

In one implementation, the compression device 230 is configured to:

If the sign bit of the first difference value indicates that the first difference value is a second non-negative number, the first non-negative number corresponding to the first difference value is the number obtained by removing the sign bit of the first difference value;

If the sign bit of the first difference value indicates that the first difference value is a negative number, the first non-negative number corresponding to the first difference value is the number obtained by removing the sign bit of the first difference value and inverting the remaining bits.
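
Exemplarily, the determination of the number of bits can be sketched as follows. The sketch assumes that the differences are held in two's complement form, so that removing the sign bit of a negative difference `d` and inverting the remaining bits gives `-d - 1`; the function names are hypothetical and are not taken from the embodiment.

```python
from typing import List


def to_non_negative(diff: int) -> int:
    """Map a difference to its corresponding non-negative number.

    Assumption: two's complement representation. For diff >= 0 the
    value is kept; for diff < 0, dropping the sign bit and inverting
    the remaining bits equals ~diff, that is, -diff - 1.
    """
    return diff if diff >= 0 else ~diff


def storage_bits(diffs: List[int]) -> int:
    """Number of bits needed for the widest difference, without sign.

    The bitwise OR of all non-negative numbers has its highest set bit
    at the position of the widest one, so one bit_length() call suffices.
    """
    combined = 0
    for d in diffs:
        combined |= to_non_negative(d)
    return combined.bit_length()


def compressed_length(diffs: List[int]) -> int:
    """Length of each compressed difference: number of bits plus one sign bit."""
    return storage_bits(diffs) + 1


bits = storage_bits([10, 2, -1, 0])
```

For the differences 10, 2, -1 and 0, the corresponding non-negative numbers are 10, 2, 0 and 0. Their bitwise OR is 10, which needs 4 bits, so each compressed difference occupies 5 bits.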

In an implementation manner, if the data of the second group is not all zeros, the data stored after compressing the second group includes: a non-all-zero indicator, a bit number indicator, and the multiple compressed differences. The bit number indicator represents the length of each compressed difference excluding the sign bit.

In one implementation, the non-all zero indicator is 1.
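
Exemplarily, the stored form of one group can be sketched as a bit string. The embodiment names the fields but, in this passage, not their widths or order, so the following are assumptions made only for illustration: the bit number indicator is 4 bits wide (`BITS_FIELD_WIDTH`), the fields are written in the order listed above, and each compressed difference is the two's complement value truncated to the number of bits plus one. The function name `pack_group` is hypothetical.

```python
from typing import List, Optional

BITS_FIELD_WIDTH = 4  # assumed width of the bit number indicator


def _storage_bits(diffs: List[int]) -> int:
    """Bits needed without the sign bit (bitwise OR of non-negative forms)."""
    combined = 0
    for d in diffs:
        combined |= d if d >= 0 else ~d
    return combined.bit_length()


def pack_group(diffs: Optional[List[int]]) -> str:
    """Return the stored bits of one group as a string of '0' and '1'.

    diffs is None for an all-zero group, which is stored as a single 0.
    """
    if diffs is None:
        return "0"
    bits = _storage_bits(diffs)
    if bits >= (1 << BITS_FIELD_WIDTH):
        raise ValueError("difference too wide for the assumed indicator width")
    width = bits + 1            # one extra bit for the sign
    mask = (1 << width) - 1     # keep the low `width` bits (two's complement)
    out = "1" + format(bits, f"0{BITS_FIELD_WIDTH}b")
    for d in diffs:
        out += format(d & mask, f"0{width}b")
    return out


packed = pack_group([10, 2, -1, 0])
```

Under these assumptions, the differences 10, 2, -1 and 0 are stored in 25 bits: the indicator `1`, the bit number `0100`, and four 5-bit values `01010`, `00010`, `11111` and `00000`.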

In an implementation manner, the compression device 230 is further configured to generate compression header information corresponding to the compressed data unit, and the storage device 240 is configured to store the compressed data unit together with the corresponding compression header information in an external memory.

In an implementation manner, the compression device 230 is further configured to: determine whether a bypass operation needs to be performed according to the compression header information; if it is determined that the bypass operation needs to be performed, generate bypass compression header information corresponding to the bypass operation. The storage device 240 is configured to store uncompressed feature map data and bypass compression header information in an external memory.
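
Exemplarily, the bypass decision can be sketched as follows. This passage states only that the decision is made according to the compression header information. The sketch therefore assumes that the header records the compressed size and that bypass is chosen whenever compression does not make the data smaller than the original. The class `CompressionHeader`, its fields, and the function `make_header` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CompressionHeader:
    """Hypothetical header layout used only for this illustration."""
    bypass: bool        # True: the payload is the uncompressed feature map
    payload_bytes: int  # size of the data actually written to external memory


def make_header(original_bytes: int, compressed_bytes: int) -> CompressionHeader:
    """Choose bypass when compression brings no size reduction (assumed rule)."""
    if compressed_bytes >= original_bytes:
        # Store the original data and describe it with a bypass header.
        return CompressionHeader(bypass=True, payload_bytes=original_bytes)
    return CompressionHeader(bypass=False, payload_bytes=compressed_bytes)


header_small = make_header(original_bytes=64, compressed_bytes=20)
header_large = make_header(original_bytes=64, compressed_bytes=70)
```

With these assumed sizes, a 64-byte block that compresses to 20 bytes is stored compressed, whereas a block that would grow to 70 bytes is stored uncompressed with a bypass header.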

In an implementation manner, the device further includes a reading device configured to: receive a compression instruction; and send a read feature map command according to the compression instruction, so as to obtain the feature map data corresponding to the read feature map command from the on-chip memory.

In one implementation, the read feature map command includes the width and height of the feature map data, and the base address of the on-chip memory.

In one implementation, the receiving device 210 is configured to receive feature map data consistent with the size of the minimum access unit.

In one implementation, the dividing device 220 is configured to divide the feature map data into multiple data units according to the minimum access unit of the memory, wherein all data in one data unit are located in the same minimum access unit.

In an implementation manner, if the storage space required for a row of data of the feature map data is greater than the minimum access unit, the data located in the same minimum access unit belongs to the same row of the feature map data. If the storage space required for one row of feature map data is less than the minimum access unit, the data belonging to the same row of feature map data is located in the same minimum access unit.

In one implementation, the feature map data to be stored is the output of the convolutional layer in the neural network.

In one implementation, the compression device 230 is configured to: while the compressed first data unit is being stored, start the compression process for the second data unit. Here, the first data unit and the second data unit are data units that are compressed one after the other in time.

In one implementation, the compression device 230 is configured to start the compression process of the second data unit by determining whether the data of the second data unit is all zeros.
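
Exemplarily, the overlap between storing one data unit and compressing the next can be illustrated in software as follows. The embodiment describes hardware behaviour; this sketch only mimics the ordering with a single background writer thread. The function name `compress_and_store` and its callable parameters `compress` and `store` are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Iterable, Optional


def compress_and_store(units: Iterable[Any],
                       compress: Callable[[Any], Any],
                       store: Callable[[Any], Any]) -> None:
    """Compress unit k+1 while the compressed unit k is still being stored."""
    with ThreadPoolExecutor(max_workers=1) as writer:
        pending: Optional[Future] = None
        for unit in units:
            # Runs while the previous store (if any) is still in flight.
            compressed = compress(unit)
            if pending is not None:
                pending.result()  # wait for the previous store to finish
            pending = writer.submit(store, compressed)
        if pending is not None:
            pending.result()


stored = []
compress_and_store([[0, 0], [1, 2], [3]], lambda u: sum(u), stored.append)
```

A single writer thread keeps the stored units in their original order, which matches the requirement that the data units are compressed sequentially in time.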

Exemplarily, the device shown in FIG. 17 can be used to implement the data storage method shown in FIG. 8. To avoid repetition, the details are not described here again.

In addition, in conjunction with FIG. 3, the device shown in FIG. 17 can be any one of the at least two compression paths. It is understandable that the device shown in FIG. 17 is only schematic, and it can also be implemented as the various modules shown in FIG. 7.

It should be understood that the system for data compression storage in the embodiment of the present invention can be implemented on a processor, for example, a processor of various devices such as a computer, a server, a workstation, a mobile terminal, and a pan/tilt. In addition, the original feature map may be received or obtained by the processor from other devices, or generated by the processor in the process of executing other operations or algorithms. For example, the processor may generate the original feature map in the process of executing a convolutional neural network.

Exemplarily, an embodiment of the present invention also provides a processor. The processor may include an on-chip memory and the system as shown in FIG. 3. Alternatively, the processor may include on-chip memory and the device as shown in FIG. 17.

In the embodiment of the present invention, the processor may include a central processing unit (CPU) or other forms of processing units with data processing capabilities and/or instruction execution capabilities, such as a field-programmable gate array (FPGA) or an Advanced RISC (Reduced Instruction Set Computer) Machine (ARM), and the processor may include other components to perform various desired functions.

It should be understood that, unless otherwise indicated, the terms "feature map", "feature map data", and "original feature map" in the embodiment of the present invention refer to the data before compression by the system of the embodiment of the present invention. Such data can have two dimensions, width and height, or alternatively three dimensions, width, height, and channel.

A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered as going beyond the scope of the present invention.

In addition, the embodiment of the present invention also provides a computer storage medium on which a computer program is stored. When the computer program is executed by the processor, the steps of the data storage method shown above can be realized. For example, the computer storage medium is a computer-readable storage medium. For example, when the computer program instructions are executed by the computer or the processor, the computer or the processor executes the steps of the method shown in FIG. 4 or FIG. 8.

In one embodiment, when the computer program instructions are executed by the computer or the processor, the computer or the processor executes the following steps: receiving the feature map data to be stored; dividing the feature map data into multiple data units; for each data unit of the multiple data units, judging whether the data in the data unit is all zeros, and compressing the data according to the judgment result; and storing the compressed feature map data.

The computer storage medium may include, for example, the memory card of a smart phone, the storage component of a tablet computer, the hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disk read-only memory (CD-ROM), USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.

In addition, an embodiment of the present invention also provides a computer program product, which contains instructions, which when executed by a computer, cause the computer to execute the steps of the data storage method shown in FIG. 4 or FIG. 8.

In one embodiment, when the instructions are executed by the computer, the computer is caused to: receive the feature map data to be stored; divide the feature map data into a plurality of data units; for each data unit of the plurality of data units, judge whether the data in the data unit is all zeros, and compress according to the judgment result; and store the compressed feature map data.

In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, it can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)), etc.

A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered as going beyond the scope of this application.

In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present application may be integrated into one processor, or each unit may exist alone physically, or two or more units may be integrated into one unit.

The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

A system for data compression storage, characterized in that the system is used to compress a feature map in an on-chip memory and then store it in an external memory, and the system includes a compression instruction generation module, a read arbitration module, at least two compression paths, and a write arbitration module, wherein:

The compression instruction generating module is configured to distribute the compression instruction to each of the at least two compression paths;

Each of the at least two compression paths is configured to read the corresponding original feature map from the on-chip memory according to the compression instruction received from the compression instruction generation module, and to compress the read original feature map;

The read arbitration module is configured to arbitrate the read feature map commands of the at least two compression paths for the original feature map in the on-chip memory;

The write arbitration module is configured to arbitrate the write requests of the at least two compression paths to write compressed data into the external memory.
The system according to claim 1, wherein each of the at least two compression paths comprises: a feature map reading module, a feature map caching module, a data compression module, a data packing module, and a compression header generation module, wherein:

The feature map reading module is configured to send a read feature map command for the original feature map in the on-chip memory according to the compression instruction received from the compression instruction generation module;

The feature map cache module is configured to store the original feature map read back from the on-chip memory;

The data compression module is configured to divide the original feature map in the feature map cache module into multiple data units, and perform differential compression for each data unit of the multiple data units;

The data packing module is used to splice the data compressed by the data compression module into complete compressed data;

The compression header generation module is configured to generate compression header information corresponding to the compressed data obtained by the data compression module according to the compression instruction received from the compression instruction generation module.
The system according to claim 2, wherein each of the at least two compression paths further comprises a length alignment module for:

Padding the length of the compressed data spliced by the data packing module to a specific length.
The system according to claim 2 or 3, wherein:

The compression header generation module is further configured to determine whether bypass is needed, and to generate compression header information corresponding to the original feature map when it is determined that bypass is needed;

The feature map reading module is further configured to re-read the original feature map from the on-chip memory when the compression header generation module determines that bypass is needed.
The system according to claim 4, wherein each of the at least two compression paths further comprises a length alignment module for:

Padding the length of the re-read original feature map to a specific length.
The system according to claim 3 or 5, wherein the specific length is preset according to the performance of the external memory chip.
The system according to any one of claims 2 to 6, wherein the data compression module comprises: a scan coding module and a difference algorithm compression module,

The scan coding module is configured to divide the original feature map into multiple data units and, for each data unit: divide the data unit into one or more groups, determine whether the data in each group is all zeros, and, when a group is determined to be non-all-zero, calculate multiple differences between the data in the non-all-zero group;

The difference algorithm compression module is configured to determine the number of storage bits according to the plurality of non-negative numbers corresponding to the plurality of differences, and to compress the plurality of differences according to the sign bits of the plurality of differences and the number of bits.
The system according to any one of claims 2 to 7, wherein each compression path of the at least two compression paths further comprises: a compression header cache module, a compressed feature map cache module, a compression header writing module, and a compressed feature map writing module, wherein:

The compression header cache module is configured to cache the compression header information to be output that is generated by the compression header generation module;

The compressed feature map cache module is configured to cache data to be output, wherein the data to be output is the length-padded compressed data or, when bypass is needed, the length-padded original feature map;

The compression header writing module is configured to perform a write operation of the compression header information in the compression header cache module;

The compressed feature map writing module is configured to perform a write operation of the data to be output in the compressed feature map cache module.
The system according to any one of claims 1 to 8, wherein the read feature map command includes the width and height of the original feature map to be read, and the base address of the on-chip memory.
The system according to any one of claims 1 to 9, wherein the system further comprises a read command cache module and a read data path identification cache module, both of which are connected to the read arbitration module,

The read arbitration module is used for:

Obtain the read feature map commands issued by each compression path, where the read feature map commands of each compression path can be cached in the respective read command cache module;

Arbitrate the read feature map commands in each read command cache module according to the arbitration rule, and obtain the arbitration result;

Send the read feature map command of the compression path that won the arbitration to the on-chip memory first, and store the path identifier of the compression path that won the arbitration in the read data path identification cache module.
The system according to any one of claims 1 to 10, wherein the write arbitration module is specifically configured to:

Get the write request issued by each compression path;

Arbitrate each write request according to the arbitration rules, and get the arbitration result;

The write request of the compression path that wins the arbitration is sent to the external memory first.
The system according to claim 10 or 11, wherein the arbitration rule is a priority mechanism or a fair polling mechanism configured according to the compression instruction.
The system according to any one of claims 1 to 12, wherein the compression instruction generation module is specifically configured to:

Receive a compression instruction, parse the received compression instruction, and distribute the parsed compression instruction to each of the at least two compression paths.
The system according to any one of claims 1 to 13, wherein the compression instructions distributed to each compression path include:

The number of feature maps to be compressed by the compression path,

The feature map width,

The feature map height,

The base address of these feature maps in the on-chip memory,

The inter-map storage interval of these feature maps in the on-chip memory,

The base address in the external memory to which these feature maps are output after being compressed by the compression path,

The storage interval in the external memory at which these feature maps are output after being compressed by the compression path,

The base address in the external memory to which the compression header information of these feature maps, as compressed by the compression path, is output,

The header information storage interval in the external memory at which the compression header information of these feature maps, as compressed by the compression path, is output.
The system according to any one of claims 1 to 14, wherein the system switches between the following states: idle state, receiving instruction state, parsing instruction state, and waiting for completion state, wherein,

When the system is in the idle state, it waits for a compression instruction start signal and, after receiving the compression instruction start signal, switches to the receiving instruction state;

When the system is in the receiving instruction state, it receives a compression instruction and, after the reception is completed, outputs an instruction ready signal and switches to the parsing instruction state;

When the system is in the parsing instruction state, it parses the compression instruction received in the receiving instruction state and distributes the compression instruction to each compression path;

When the system is in the waiting for completion state, it monitors the completion signal of each compression path and can switch to the idle state after detecting that all the compression paths have completed.
The system according to any one of claims 1 to 15, wherein the feature map is the output of a convolutional layer in a neural network.
A method for data compression storage, characterized in that the method is used for compressing the feature map in the on-chip memory and then storing it in the external memory, and the method includes:

Distributing the compression instruction to each of the at least two compression paths;

Each compression path reads the corresponding original feature map from the on-chip memory according to the received compression instruction, and compresses the read original feature map;

Storing the compressed feature map in the external memory;

Wherein, when the at least two compression paths read the original feature map in the on-chip memory, the read feature map commands of the at least two compression paths are arbitrated;

Wherein, when the at least two compression paths write the compressed feature map into the external memory, the write requests of the at least two compression paths are arbitrated.
The method according to claim 17, wherein compressing the read original feature map comprises:

Dividing the original feature map into multiple data units;

Perform difference compression for each data unit of the plurality of data units.
The method according to claim 18, wherein compressing the read original feature map further comprises:

Splicing the difference-compressed feature map into complete compressed data.
The method of claim 19, further comprising:

Padding the length of the spliced compressed data to a specific length.
The method according to any one of claims 18 to 20, wherein performing difference compression for each data unit of the plurality of data units comprises:

For a data unit:

Divide the data unit into one or more groups;

Determine whether all the data in each group are all zeros, and when it is determined to be non-all zeros, calculate multiple differences between the data in the non-all zero groups;

Determine the number of storage bits according to the multiple non-negative numbers corresponding to the multiple differences;

Compressing the plurality of differences according to the sign bits of the plurality of differences and the number of bits.
The method according to any one of claims 18 to 21, wherein compressing the read original feature map further comprises:

Generating compression header information corresponding to the difference-compressed feature map.
The method according to claim 22, further comprising:

Judging whether a bypass operation needs to be performed according to the compressed header information;

When it is determined that the bypass operation needs to be performed, the original feature map is re-read from the on-chip memory, and bypass compression header information corresponding to the original feature map is generated.
The method according to claim 23, further comprising:

The length of the re-read original feature map is padded to a specific length, and the feature map after compression of the difference is discarded.
The method according to claim 20 or 24, wherein the specific length is preset according to the performance of the chip of the external memory.
The method according to claim 20 or 24 or 25, wherein padding the length to a specific length comprises:

Padding the length to the specific length by adding invalid data.
The method according to claim 23, wherein determining whether a bypass operation needs to be performed comprises:

Determining whether a bypass operation needs to be performed by comparing the size of the difference-compressed feature map with the size of the original feature map.
The method according to any one of claims 17 to 27, wherein reading the corresponding original feature map from the on-chip memory comprises:

The original feature map consistent with the minimum access unit size is read from the on-chip memory.
The method according to any one of claims 17 to 28, wherein the read feature map command includes the width and height of the original feature map to be read, and the base address of the on-chip memory.
The method according to any one of claims 17 to 29, wherein arbitrating the read feature map commands of the at least two compression paths comprises:

Arbitrating the at least two read feature map commands from the at least two compression paths according to the arbitration rule to obtain an arbitration result;

Sending the read feature map command of the compression path that won the arbitration to the on-chip memory first, and storing the path identifier of the compression path that won the arbitration.
The method according to any one of claims 17 to 30, wherein arbitrating the write requests of the at least two compression paths comprises:

Arbitrate the at least two write requests from the at least two compression paths according to the arbitration rule, and obtain an arbitration result;

The write request of the compression path that wins the arbitration is sent to the external memory first.
The method according to claim 30 or 31, wherein the arbitration rule is a priority mechanism or a fair polling mechanism configured according to the compression instruction.
The method according to any one of claims 17 to 32, wherein the compression instructions distributed to each compression path comprise:

The number of feature maps to be compressed by the compression path,

The feature map width,

The feature map height,

The base address of these feature maps in the on-chip memory,

The inter-map storage interval of these feature maps in the on-chip memory,

The base address in the external memory to which these feature maps are output after being compressed by the compression path,

The storage interval in the external memory at which these feature maps are output after being compressed by the compression path,

The base address in the external memory to which the compression header information of these feature maps, as compressed by the compression path, is output,

The header information storage interval in the external memory at which the compression header information of these feature maps, as compressed by the compression path, is output.
The method according to any one of claims 17 to 33, further comprising:

A state machine is preset to perform state switching during the process of data compression and storage, where the state machine includes the following states: idle state, receiving instruction state, parsing instruction state, and waiting for completion state, wherein,

When in the idle state, waiting for a compression instruction start signal and, after receiving the compression instruction start signal, switching to the receiving instruction state;

When in the receiving instruction state, receiving the compression instruction and, after the reception is completed, outputting an instruction ready signal and switching to the parsing instruction state;

When in the parsing instruction state, parsing the compression instruction received in the receiving instruction state and distributing the compression instruction to each compression path;

When in the waiting for completion state, monitoring the completion signal of each compression path and, after detecting that all the compression paths have completed, being able to switch to the idle state.
The method according to any one of claims 17 to 34, wherein the feature map is the output of a convolutional layer in a neural network.
A processor, characterized in that it comprises:

On-chip memory, and

The system according to any one of claims 1 to 16.
A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 17 to 35.