
CN112199343A - Data compression/decompression acceleration method for small data block scene - Google Patents

Data compression/decompression acceleration method for small data block scene

Info

Publication number
CN112199343A
CN112199343A
Authority
CN
China
Prior art keywords
data
compression
decompression
file
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011391958.XA
Other languages
Chinese (zh)
Inventor
汪洋
凌阳
吴纹
汤鲲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fiberhome Telecommunication Technologies Co ltd
Original Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fiberhome Telecommunication Technologies Co ltd filed Critical Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority to CN202011391958.XA priority Critical patent/CN112199343A/en
Publication of CN112199343A publication Critical patent/CN112199343A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1744Redundancy elimination performed by the file system using compression, e.g. sparse files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a data compression/decompression acceleration method for small-data-block scenes, which comprises the following steps: performing data merging at the host end and transmitting the processed data from the host end to the FPGA DRAM through the PCIE bus; grouping the batch data, compressing each group of data in parallel, and converting the 512-bit data width into 8-bit stream data for each file; during compression, applying LZ77 compression followed by LZ4 compression; during decompression, converting the LZ4-format data into LZ77-format data and parsing the LZ77-format data back into the original data; and transmitting the compressed/decompressed data from the FPGA DRAM to the host end, where the host end splits the data to obtain the compressed/decompressed files. By adopting the LZ4 algorithm on the FPGA to process small files in batches, the invention increases the overlap of data processing and transmission, improves internal processing parallelism, optimizes the algorithm, and improves overall compression performance.

Description

Data compression/decompression acceleration method for small data block scene
Technical Field
The invention discloses a data compression/decompression acceleration method for a small data block scene, and relates to the technical field of big data processing.
Background
In the big data era, ever more data is transmitted and stored. Under limited resources, data compression technology is important for satisfying users' data-access needs: compression reduces storage consumption and the volume of data transmitted, thereby reducing network latency.
However, existing compression and decompression algorithms are implemented on server architectures. Limited by the hardware architecture, a compression algorithm running on a CPU delivers bounded performance while occupying a large amount of CPU resources; when a user performs a query operation, the machine load becomes too high and other services on the machine are affected. Using FPGA technology, the compression work can be offloaded from the CPU to an FPGA, thereby reducing the load on the CPU.
At present, the compression algorithms used in the FPGA field achieve good throughput when the input files are large; in scenes with a large number of small files, however, the limits of data-transfer bandwidth and parallelism keep the compression performance below that of a CPU.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to address the defects of the prior art, a 16KB small-file compression system and method based on an FPGA accelerator card are provided. The LZ4 algorithm is adopted on the FPGA to process small files in batches, the overlap of data processing and transmission is increased, internal processing parallelism is improved, the algorithm is optimized, and the overall compression performance is improved.
The invention adopts the following technical scheme for solving the technical problems:
a method for accelerating data compression/decompression of a small data block scene, the method comprising the steps of:
step one, data merging is performed at the host end: after each file is aligned and spliced on 64-byte boundaries, the processed data is transmitted from the host end to the FPGA DRAM through the PCIE bus;
step two, the batch data is grouped, with every 8 files forming one group and the remaining files (fewer than 8) forming a final group on their own; each group of data is compressed in parallel, and for each file the 512-bit data width is converted into 8-bit stream data;
the data compression acceleration method comprises the following steps:
201. performing a first round of compression on the data obtained in step two using the LZ77 algorithm; the LZ77 algorithm compresses in a dictionary manner;
202. performing a second round of compression on the data obtained in step 201 using the LZ4 algorithm; through the LZ4 algorithm, all character sequences and all {match length, match position, literal sequence length} sequences of the LZ77-format data are finally obtained;
203. for the data obtained in step 202, converting the data from 8 bits back to 512 bits and storing it in a space the same size as the original file; if the file obtained after compression is larger than the source file, outputting the original file without compression;
the data decompression acceleration method comprises the following steps:
211. during decompression, the currently input data is in LZ4 format; the LZ4-format data is converted into LZ77-format data;
212. the LZ77-format data is then parsed to recover the original data;
and step three, the compressed/decompressed data is transmitted from the FPGA DRAM back to the host end, and the host end splits the data to obtain the compressed/decompressed files.
As a further preferred embodiment of the invention, the FPGA accelerator card is a U50 accelerator card with the Xilinx UltraScale+ architecture and integrated ultra-high-bandwidth HBM2 memory technology.
Compared with the prior art, the invention adopting the above technical scheme has the following technical effects: by adopting the LZ4 algorithm on the FPGA to process small files in batches, the overlap of data processing and transmission is increased, internal processing parallelism is improved, the algorithm is optimized, and the overall compression performance is improved.
Drawings
FIG. 1 is a schematic diagram of the present invention, in which the host side performs byte alignment on each piece of data in the batch data, and then transfers the aligned batch data block to the U50 card at one time.
FIG. 2 is a flow chart illustrating the compressing operation performed on each file according to the present invention.
FIG. 3 is a flow chart illustrating the calculation of the matching length of a character string according to the present invention.
Fig. 4 is a schematic diagram of the data structure of the LZ4 standard.
Fig. 5 is a flow chart showing a process of merging and outputting all character sequences and { matching length, matching position, literal sequence length } sequences in LZ77 in the present invention.
Fig. 6 is a schematic diagram of a decompression process in the present invention.
FIG. 7 is a schematic diagram of a process of repeatedly parsing and outputting original data according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
The technical scheme of the invention is further explained in detail below with reference to the accompanying drawings:
In an embodiment of the invention, the FPGA accelerator card is a U50 accelerator card with the Xilinx UltraScale+ architecture and integrated ultra-high-bandwidth HBM2 memory technology. The flow of the invention is described below following the method steps and with reference to the examples.
First, to transmit the data from the host side to the U50 card at maximum performance, the data must be merged at the host side and transferred to the U50 card in bulk. Since the transfer bus is 512 bits wide, the host side must byte-align each piece of data in the batch and then transfer the aligned batch data block to the U50 card in one transaction, as shown in FIG. 1.
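The host-side merge can be sketched in software. The Python below is a minimal illustration of 64-byte alignment and splicing before a single bulk transfer; the function name, zero padding, and returned offset list are illustrative assumptions and not part of the patent.

```python
def merge_batch(files):
    """Pad each file to a 64-byte boundary (512-bit bus words) and splice
    the padded files into one buffer, mirroring the host-side merge that
    precedes the single PCIe transfer. Returns the merged buffer plus the
    start offset of each file so the results can be split apart later."""
    ALIGN = 64  # 512 bits = 64 bytes
    merged = bytearray()
    offsets = []
    for data in files:
        offsets.append(len(merged))
        merged += data
        merged += b"\x00" * ((-len(data)) % ALIGN)  # zero pad to boundary
    return bytes(merged), offsets
```

The offset list plays the role of the per-file bookkeeping the host needs when it later splits the returned batch back into individual files.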
After the data is transmitted to the U50 card, the batch data is grouped, 8 files per group, with any remainder of fewer than 8 forming the last group. Xilinx Vitis supports the UNROLL directive, a loop-unrolling instruction that executes the operations in a loop in parallel; applying UNROLL to each group compresses its 8 files in parallel.
Next, each file is compressed separately, as shown in FIG. 2. Since each word of data to be compressed is 512 bits wide, the 512-bit words must first be converted into an 8-bit byte stream. LZ77 compression follows. The LZ77 algorithm uses dictionary compression: character runs in the data that can be organized into phrases are added to a dictionary, and subsequent occurrences of those phrases are replaced with tokens referencing the dictionary.
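The 512-bit-to-byte-stream conversion amounts to slicing each 512-bit word into 64 bytes. A minimal Python sketch follows; the little-endian byte order is an assumption for illustration, since the description does not specify the bit ordering used on the FPGA.

```python
def to_byte_stream(words):
    """Split 512-bit words (Python ints) into the 8-bit byte stream the
    compressor consumes. On the FPGA this is pure bit slicing; here we
    assume little-endian order within each 64-byte word."""
    stream = bytearray()
    for w in words:
        stream += w.to_bytes(64, "little")  # 512 bits = 64 bytes
    return bytes(stream)
```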
First, the dictionary length is set to 4096 entries, each entry being 6 × 72 bits in size: 6 slots guard against hash collisions using the separate-chaining method, and each 72-bit slot stores a 6-byte matched element {48 bits} plus its current position {24 bits}. The current matched element is 6 bytes; a hash value is computed as {(x[0] << 4) ^ (x[1] << 3) ^ (x[2] << 3) ^ (x[3])} and looked up in the dictionary. The dictionary is first updated with the new value, and then string matching is performed between the 6 elements stored in the dictionary and the current 6 bytes to find the maximum match length. If the match length is greater than 3 and the distance between the two positions does not exceed 65536, the match succeeds, and the first character and the match position are stored.
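The hash-and-chain dictionary can be sketched in software as follows. This Python sketch uses the hash expression exactly as given in the description and models each 6 × 72-bit hardware bucket as a list of up to 6 positions; the eviction policy (drop the oldest entry) and all names are illustrative assumptions, not the patent's actual implementation.

```python
DICT_SIZE = 4096
CHAIN = 6  # candidates kept per bucket to mitigate hash collisions

def hash6(x):
    """Hash of a 6-byte element, as written in the description:
    {(x[0] << 4) ^ (x[1] << 3) ^ (x[2] << 3) ^ (x[3])}, reduced to the
    4096-entry dictionary (the reduction step is our assumption)."""
    return ((x[0] << 4) ^ (x[1] << 3) ^ (x[2] << 3) ^ x[3]) % DICT_SIZE

def dict_insert(dictionary, data, pos):
    """Insert position `pos` into the bucket for the 6 bytes at data[pos:],
    keeping at most CHAIN entries per bucket (newest first, oldest evicted)
    as a software stand-in for the 6-slot hardware bucket."""
    bucket = dictionary.setdefault(hash6(data[pos:pos + 6]), [])
    bucket.insert(0, pos)
    del bucket[CHAIN:]
```

In the real flow, the compressor would string-match the current 6 bytes against every position in the bucket and accept the longest match that exceeds 3 bytes within a 65536-byte distance.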
The dictionary lookup only serves to quickly find where a string appeared earlier; it cannot compute the longest match length. Next, the match length of the string is calculated, as shown in FIG. 3. The sliding-window size is set to 16KB. The dictionary-match output is traversed entry by entry, and whenever match_length is non-zero, the current character is matched against the character at position match_offset until the match fails or the match length exceeds 255. At that point, LZ77 compression ends.
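The match-length extension just described is a simple byte-compare loop. The Python below is an illustrative sketch only: it compares the bytes at the current position against the bytes `offset` positions back until a mismatch or the 255-byte cap; the 16KB sliding-window bookkeeping is omitted.

```python
def match_length(data, cur, offset, limit=255):
    """Extend a candidate match: compare data[cur + k] with
    data[cur - offset + k] for k = 0, 1, ... until the bytes differ,
    the input ends, or the length reaches the 255-byte cap described
    for the LZ77 stage."""
    length = 0
    src = cur - offset
    while (cur + length < len(data)
           and length < limit
           and data[src + length] == data[cur + length]):
        length += 1
    return length
```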
LZ4 conversion follows the LZ77 compression: the LZ77 compression result is converted into the LZ4 standard, whose structure is shown in FIG. 4. Each LZ77 result consists of the current character, the match length, and the match position. The LZ77 output is traversed: while the match length is 0, the literal-sequence length is accumulated; when the match length is non-zero, the {match length, match position, literal sequence length} triple is stored. Characters with match length 0 are then saved again and their lengths counted, and the process repeats. Finally, all character sequences and all {match length, match position, literal sequence length} sequences of the LZ77 stream are obtained.
The two results are then merged and output. The specific process, shown in FIG. 5, is as follows:
The {match length, match position, literal sequence length} triples are traversed. If the literal-sequence length is less than or equal to 15, the literal sequence, match position, and match length are filled in directly. If it is greater than 15, the literal-sequence length continues in extension bytes: each length field is one byte with a maximum of 255, so if the previously calculated literal-sequence length is greater than 255, the first byte is 255 and the second byte holds the literal-sequence length minus 255; if that still exceeds 255, the second field is again 255, and so on, until the remainder is below 255. The literal sequence, match position, and match length are then filled in.
At this point the compression process ends; the data is converted from 8 bits back to 512 bits and stored in a space the same size as the original file (this is possible because the original file is byte-aligned). If the compressed file is larger than the source file, the original file is output without compression. The 8 files of a group are compressed in parallel as described above, the compression results are stored in an output queue, and the next group is then compressed by the same process. Finally, after all groups are compressed, the data is stored in the DRAM, transmitted from the U50 card to the host, and the host fetches and splits the data to obtain the individually compressed files.
Next, the decompression flow is introduced, as shown in FIG. 6. Data organization and transmission at the host end are consistent with the compression flow; the only difference is how the output space is sized. Compression only needs space as large as the original file, whereas decompression must first obtain the original file size and allocate space according to the file information. The data is then transmitted, 64-byte aligned, to the U50 card's DRAM, grouped again, converted per group from a 512-bit stream to 8 bits, and decompressed in parallel. The decompression algorithm is described in detail below.
The currently input data is in LZ4 format, and LZ4 is first converted into LZ77 format; that is, the output is an LZ77-format string carried in a 32-bit stream. This part parses the string in a loop and can use a compile-time pipeline directive. First the high 4 bits of the token are parsed; if they equal 15, the literal length continues in the following bytes: while a byte reads 255, reading continues until a value below 255 is found, and all values read are summed. The literal characters are then read and placed into the low eight bits of the output stream. Next, two bytes of offset are read; the low 4 bits of the first token byte parsed earlier give the match length, and if they equal 15 the following bytes are parsed in the same way. The offset and match_len are placed into the output stream, with the offset in the low 16 bits and match_len in the high 16 bits. The next matching block is then parsed in the same manner.
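The token-parsing loop just described can be sketched in Python as follows. It reads one LZ4 sequence — token byte, literal-length extension bytes, literals, 2-byte offset, match-length extension bytes — and returns the parsed fields. Packing the result into the 32-bit stream used on the FPGA is omitted, and all names are illustrative.

```python
def parse_token(buf, i):
    """Parse one LZ4 sequence header starting at buf[i], as in the
    decompression flow: high nibble = literal length (15 means read
    255-continuation bytes), then the literals, a 2-byte little-endian
    offset, and the low nibble = match length (same extension rule).
    Returns (literals, offset, match_len, next_index)."""
    token = buf[i]; i += 1
    lit_len = token >> 4
    if lit_len == 15:                    # literal-length extension bytes
        while buf[i] == 255:
            lit_len += 255; i += 1
        lit_len += buf[i]; i += 1
    literals = buf[i:i + lit_len]; i += lit_len
    offset = int.from_bytes(buf[i:i + 2], "little"); i += 2
    match_len = token & 0x0F
    if match_len == 15:                  # match-length extension bytes
        while buf[i] == 255:
            match_len += 255; i += 1
        match_len += buf[i]; i += 1
    return literals, offset, match_len, i
```

Looping this function over the buffer and emitting `(match_len << 16) | offset` words would reproduce the 32-bit LZ77 stream described in the text.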
The parsed string is in LZ77 format. The output results are read in sequence: if the high 16 bits are 0, the entry is an original character, which is output directly while the matching-window local_memory is updated; otherwise, a match field has been encountered, so offset and match_len are parsed and the match_len characters located at offset within local_memory are copied into the output. The parsing process is then repeated, as shown in FIG. 7, until all parsing is complete and the original data is obtained.
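The final LZ77 replay can be sketched as below. A `(0, byte)` entry is an original character; a `(match_len, offset)` entry copies `match_len` bytes starting `offset` positions back. Copying byte by byte makes overlapping matches (offset smaller than match length) repeat correctly, as any LZ77 decoder requires. The tuple encoding is an illustrative stand-in for the 32-bit stream.

```python
def lz77_expand(tokens):
    """Replay an LZ77 token stream into the original bytes. A (0, byte)
    token emits a literal; a (match_len, offset) token copies match_len
    bytes starting offset positions back in the output — byte by byte,
    so overlapping copies (offset < match_len) repeat correctly."""
    out = bytearray()
    for match_len, value in tokens:
        if match_len == 0:
            out.append(value)            # literal (original character)
        else:
            start = len(out) - value     # value is the back-offset here
            for k in range(match_len):
                out.append(out[start + k])
    return bytes(out)
```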
Similar to compression, the decompressed data is placed into an output array; after all groups are decompressed, all the data is stored in the U50's DRAM and then transmitted to the host end through PCIE.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention. Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A method for accelerating data compression/decompression of small data block scenes, the method comprising the steps of:
step one, data merging is performed at the host end: after each file is aligned and spliced on 64-byte boundaries, the processed data is transmitted from the host end to the FPGA DRAM through the PCIE bus;
step two, the batch data is grouped, with every 8 files forming one group and the remaining files (fewer than 8) forming a final group on their own; each group of data is compressed in parallel, and for each file the 512-bit data width is converted into 8-bit stream data;
the data compression acceleration method comprises the following steps:
201. performing a first round of compression on the data obtained in step two using the LZ77 algorithm;
202. performing a second round of compression on the data obtained in step 201 using the LZ4 algorithm;
203. for the data obtained in step 202, converting the data from 8 bits back to 512 bits and storing it in a space the same size as the original file;
the data decompression acceleration method comprises the following steps:
211. during decompression, the currently input data is in LZ4 format; the LZ4-format data is converted into LZ77-format data;
212. the LZ77-format data is then parsed to recover the original data;
and step three, the compressed/decompressed data is transmitted from the FPGA DRAM back to the host end, and the host end splits the data to obtain the compressed/decompressed files.
2. A method for accelerating data compression/decompression of small data block scenes according to claim 1, characterized in that: in step 201, the LZ77 compression algorithm compresses in a dictionary manner.
3. A method for accelerating data compression/decompression of small data block scenes according to claim 1, characterized in that: in step 202, through the LZ4 algorithm, all character sequences and { matching length, matching position, literal sequence length } sequences in LZ77 format data are finally obtained.
4. A method for accelerating data compression/decompression of small data block scenes according to claim 1, characterized in that: in step 203, if the file obtained after compression is larger than the source file, the original file is output without compression.
5. A method for accelerating data compression/decompression of small data block scenes according to claim 1, characterized in that: the FPGA accelerator card is a U50 accelerator card with the Xilinx UltraScale+ architecture and integrated ultra-high-bandwidth HBM2 memory technology.
CN202011391958.XA 2020-12-03 2020-12-03 Data compression/decompression acceleration method for small data block scene Pending CN112199343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011391958.XA CN112199343A (en) 2020-12-03 2020-12-03 Data compression/decompression acceleration method for small data block scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011391958.XA CN112199343A (en) 2020-12-03 2020-12-03 Data compression/decompression acceleration method for small data block scene

Publications (1)

Publication Number Publication Date
CN112199343A (en) 2021-01-08

Family

ID=74033849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011391958.XA Pending CN112199343A (en) 2020-12-03 2020-12-03 Data compression/decompression acceleration method for small data block scene

Country Status (1)

Country Link
CN (1) CN112199343A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023509A (en) * 2012-11-14 2013-04-03 无锡芯响电子科技有限公司 Hardware LZ77 compression implementation system and implementation method thereof
CN105183557A (en) * 2015-08-26 2015-12-23 东南大学 Configurable data compression system based on hardware
CN107565972A (en) * 2017-09-19 2018-01-09 郑州云海信息技术有限公司 A kind of compression method, device, equipment and the storage medium of LZ codings


Similar Documents

Publication Publication Date Title
US9768803B2 (en) Hardware data compressor using dynamic hash algorithm based on input block type
US5281967A (en) Data compression/decompression method and apparatus
US9515678B1 (en) Hardware data compressor that directly huffman encodes output tokens from LZ77 engine
US8698657B2 (en) Methods and systems for compressing and decompressing data
US7924183B2 (en) Method and system for reducing required storage during decompression of a compressed file
US7538695B2 (en) System and method for deflate processing within a compression engine
US9509336B1 (en) Hardware data compressor that pre-huffman encodes to decide whether to huffman encode a matched string or a back pointer thereto
US20160336959A1 (en) Hardware data compressor that constructs and uses dynamic-prime huffman code tables
CN108287877B (en) FPGA (field programmable Gate array) compression/decompression system and hardware decompression method for RIB (run in Box) rendering compressed file
US9503122B1 (en) Hardware data compressor that sorts hash chains based on node string match probabilities
US9628111B2 (en) Hardware data compressor with multiple string match search hash tables each based on different hash size
CN111723059B (en) Data compression method and device, terminal equipment and storage medium
EP4030628A1 (en) Near-storage acceleration of dictionary decoding
US20190052284A1 (en) Data compression apparatus, data decompression apparatus, data compression program, data decompression program, data compression method, and data decompression method
US9137336B1 (en) Data compression techniques
CN116015311A (en) Lz4 text compression method based on sliding dictionary implementation
US10027346B2 (en) Hardware data compressor that maintains sorted symbol list concurrently with input block scanning
CN117493386B (en) Database access method and device, storage medium and electronic equipment
US9197243B2 (en) Compression ratio for a compression engine
CN112199343A (en) Data compression/decompression acceleration method for small data block scene
US10491241B1 (en) Data compression scheme utilizing a repetitive value within the data stream
CN118100955B (en) Method for preprocessing compressed data by parallel decompression
CN112597082B (en) Bus data transmission method and electronic equipment
Wang et al. Lightweight Lossless Compression Algorithm for Fast Decompression Application
Höglund Lightweight Real-Time Lossless Software Compression of Trace Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210108