
CN112199343A - Data compression/decompression acceleration method for small data block scene - Google Patents

Data compression/decompression acceleration method for small data block scene

Info

Publication number
CN112199343A
CN112199343A
Authority
CN
China
Prior art keywords
data
compression
decompression
file
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011391958.XA
Other languages
Chinese (zh)
Inventor
汪洋
凌阳
吴纹
汤鲲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fiberhome Telecommunication Technologies Co ltd
Original Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fiberhome Telecommunication Technologies Co ltd filed Critical Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority to CN202011391958.XA priority Critical patent/CN112199343A/en
Publication of CN112199343A publication Critical patent/CN112199343A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1744Redundancy elimination performed by the file system using compression, e.g. sparse files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a data compression/decompression acceleration method for small-data-block scenes, which comprises the following steps: performing data merging at the host end and transmitting the processed data from the host end to the FPGA DRAM through the PCIE bus; grouping the batch data, compressing each group of data in parallel, and converting the 512-bit data width into 8-bit stream data for each file; during compression, applying LZ77 compression followed by LZ4 compression; during decompression, converting the LZ4-format data into LZ77-format data and parsing the LZ77-format data back into the original data; and transmitting the compressed/decompressed data from the FPGA DRAM to the host end, where the host end splits the data to obtain the compressed/decompressed files. By adopting the LZ4 algorithm on the FPGA to process small files in batches, the invention increases the overlap of data processing and transmission, improves internal processing parallelism, optimizes the algorithm, and improves overall compression performance.

Description

Data compression/decompression acceleration method for small data block scene
Technical Field
The invention discloses a data compression/decompression acceleration method for a small data block scene, and relates to the technical field of big data processing.
Background
In the big data era, ever more data is transmitted and stored. Under limited resources, data compression technology is important for satisfying users' data-access needs: compression reduces storage consumption and the volume of data transmitted, thereby reducing network latency.
However, existing compression and decompression algorithms are implemented on server architectures. Limited by the hardware architecture, a compression algorithm running on a CPU delivers bounded performance while occupying a large amount of CPU resources; when a user performs a query operation, the machine load becomes too high and other services on the machine are affected. Using FPGA technology, the compression work can be offloaded from the CPU to an FPGA, thereby reducing the load on the CPU.
At present, the compression algorithms used in the FPGA field achieve good throughput when the input files are large; in scenes with a large number of small files, however, the limits of data-transfer bandwidth and parallelism keep the compression performance below that of a CPU.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to address the defects of the prior art, a 16KB small-file compression system and method based on an FPGA accelerator card are provided. The LZ4 algorithm is adopted on the FPGA to process small files in batches, the overlap of data processing and transmission is increased, internal processing parallelism is improved, the algorithm is optimized, and the overall compression performance is improved.
The invention adopts the following technical scheme for solving the technical problems:
a method for accelerating data compression/decompression of a small data block scene, the method comprising the steps of:
step one, data merging is performed at the host end: after each file is aligned and spliced on 64-byte boundaries, the processed data is transmitted from the host end to the FPGA DRAM through the PCIE bus;
step two, the batch data is grouped, with every 8 files forming one group and the remaining files (fewer than 8) forming a final group on their own; each group of data is compressed in parallel, and for each file the 512-bit data width is converted into 8-bit stream data;
the data compression acceleration method comprises the following steps:
201. performing a first round of compression on the data obtained in step two using the LZ77 algorithm; the LZ77 algorithm compresses in a dictionary manner;
202. performing a second round of compression on the data obtained in step 201 using the LZ4 algorithm; through the LZ4 algorithm, all character sequences and all {match length, match position, literal sequence length} sequences of the LZ77-format data are finally obtained;
203. for the data obtained in step 202, converting the data from 8 bits back to 512 bits and storing it in a space the same size as the original file; if the file obtained after compression is larger than the source file, outputting the original file without compression;
the data decompression acceleration method comprises the following steps:
211. during decompression, the currently input data is in LZ4 format; the LZ4-format data is converted into LZ77-format data;
212. the LZ77-format data is then parsed to recover the original data;
and step three, the compressed/decompressed data is transmitted from the FPGA DRAM back to the host end, and the host end splits the data to obtain the compressed/decompressed files.
As a further preferred embodiment of the invention, the FPGA accelerator card is a U50 accelerator card with the Xilinx UltraScale+ architecture and integrated ultra-high-bandwidth HBM2 memory technology.
Compared with the prior art, the invention adopting the above technical scheme has the following technical effects: by adopting the LZ4 algorithm on the FPGA to process small files in batches, the overlap of data processing and transmission is increased, internal processing parallelism is improved, the algorithm is optimized, and the overall compression performance is improved.
Drawings
FIG. 1 is a schematic diagram of the present invention, in which the host side performs byte alignment on each piece of data in the batch data, and then transfers the aligned batch data block to the U50 card at one time.
FIG. 2 is a flow chart illustrating the compressing operation performed on each file according to the present invention.
FIG. 3 is a flow chart illustrating the calculation of the matching length of a character string according to the present invention.
Fig. 4 is a schematic diagram of the data structure of the LZ4 standard.
Fig. 5 is a flow chart showing a process of merging and outputting all character sequences and { matching length, matching position, literal sequence length } sequences in LZ77 in the present invention.
Fig. 6 is a schematic diagram of a decompression process in the present invention.
FIG. 7 is a schematic diagram of a process of repeatedly parsing and outputting original data according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
The technical scheme of the invention is further explained in detail below with reference to the accompanying drawings:
In an embodiment of the invention, the FPGA accelerator card is a U50 accelerator card with the Xilinx UltraScale+ architecture and integrated ultra-high-bandwidth HBM2 memory technology. The flow of the invention is described below following the method steps and with reference to the examples.
First, to transmit the data from the host side to the U50 card at maximum performance, the data must be merged at the host side and transferred to the U50 card in bulk. Since the transfer bus is 512 bits wide, the host side must byte-align each piece of data in the batch and then transfer the aligned batch data block to the U50 card in one transaction, as shown in FIG. 1.
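The host-side merge can be sketched in software. The Python below is a minimal illustration of 64-byte alignment and splicing before a single bulk transfer; the function name, zero padding, and returned offset list are illustrative assumptions and not part of the patent.

```python
def merge_batch(files):
    """Pad each file to a 64-byte boundary (512-bit bus words) and splice
    the padded files into one buffer, mirroring the host-side merge that
    precedes the single PCIe transfer. Returns the merged buffer plus the
    start offset of each file so the results can be split apart later."""
    ALIGN = 64  # 512 bits = 64 bytes
    merged = bytearray()
    offsets = []
    for data in files:
        offsets.append(len(merged))
        merged += data
        merged += b"\x00" * ((-len(data)) % ALIGN)  # zero pad to boundary
    return bytes(merged), offsets
```

The offset list plays the role of the per-file bookkeeping the host needs when it later splits the returned batch back into individual files.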
After the data is transmitted to the U50 card, the batch data is grouped, 8 files per group, with any remainder of fewer than 8 forming the last group. Xilinx Vitis supports the UNROLL directive, a loop-unrolling instruction that executes the operations in a loop in parallel; applying UNROLL to each group compresses its 8 files in parallel.
Next, each file is compressed separately, as shown in FIG. 2. Since each word of data to be compressed is 512 bits wide, the 512-bit words must first be converted into an 8-bit byte stream. LZ77 compression follows. The LZ77 algorithm uses dictionary compression: character runs in the data that can be organized into phrases are added to a dictionary, and subsequent occurrences of those phrases are replaced with tokens referencing the dictionary.
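The 512-bit-to-byte-stream conversion amounts to slicing each 512-bit word into 64 bytes. A minimal Python sketch follows; the little-endian byte order is an assumption for illustration, since the description does not specify the bit ordering used on the FPGA.

```python
def to_byte_stream(words):
    """Split 512-bit words (Python ints) into the 8-bit byte stream the
    compressor consumes. On the FPGA this is pure bit slicing; here we
    assume little-endian order within each 64-byte word."""
    stream = bytearray()
    for w in words:
        stream += w.to_bytes(64, "little")  # 512 bits = 64 bytes
    return bytes(stream)
```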
First, the dictionary length is set to 4096 entries, each entry being 6 × 72 bits in size: 6 slots guard against hash collisions using the separate-chaining method, and each 72-bit slot stores a 6-byte matched element {48 bits} plus its current position {24 bits}. The current matched element is 6 bytes; a hash value is computed as {(x[0] << 4) ^ (x[1] << 3) ^ (x[2] << 3) ^ (x[3])} and looked up in the dictionary. The dictionary is first updated with the new value, and then string matching is performed between the 6 elements stored in the dictionary and the current 6 bytes to find the maximum match length. If the match length is greater than 3 and the distance between the two positions does not exceed 65536, the match succeeds, and the first character and the match position are stored.
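The hash-and-chain dictionary can be sketched in software as follows. This Python sketch uses the hash expression exactly as given in the description and models each 6 × 72-bit hardware bucket as a list of up to 6 positions; the eviction policy (drop the oldest entry) and all names are illustrative assumptions, not the patent's actual implementation.

```python
DICT_SIZE = 4096
CHAIN = 6  # candidates kept per bucket to mitigate hash collisions

def hash6(x):
    """Hash of a 6-byte element, as written in the description:
    {(x[0] << 4) ^ (x[1] << 3) ^ (x[2] << 3) ^ (x[3])}, reduced to the
    4096-entry dictionary (the reduction step is our assumption)."""
    return ((x[0] << 4) ^ (x[1] << 3) ^ (x[2] << 3) ^ x[3]) % DICT_SIZE

def dict_insert(dictionary, data, pos):
    """Insert position `pos` into the bucket for the 6 bytes at data[pos:],
    keeping at most CHAIN entries per bucket (newest first, oldest evicted)
    as a software stand-in for the 6-slot hardware bucket."""
    bucket = dictionary.setdefault(hash6(data[pos:pos + 6]), [])
    bucket.insert(0, pos)
    del bucket[CHAIN:]
```

In the real flow, the compressor would string-match the current 6 bytes against every position in the bucket and accept the longest match that exceeds 3 bytes within a 65536-byte distance.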
The dictionary lookup only serves to quickly find where a string appeared earlier; it cannot compute the longest match length. Next, the match length of the string is calculated, as shown in FIG. 3. The sliding-window size is set to 16KB. The dictionary-match output is traversed entry by entry, and whenever match_length is non-zero, the current character is matched against the character at position match_offset until the match fails or the match length exceeds 255. At that point, LZ77 compression ends.
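The match-length extension just described is a simple byte-compare loop. The Python below is an illustrative sketch only: it compares the bytes at the current position against the bytes `offset` positions back until a mismatch or the 255-byte cap; the 16KB sliding-window bookkeeping is omitted.

```python
def match_length(data, cur, offset, limit=255):
    """Extend a candidate match: compare data[cur + k] with
    data[cur - offset + k] for k = 0, 1, ... until the bytes differ,
    the input ends, or the length reaches the 255-byte cap described
    for the LZ77 stage."""
    length = 0
    src = cur - offset
    while (cur + length < len(data)
           and length < limit
           and data[src + length] == data[cur + length]):
        length += 1
    return length
```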
LZ4 conversion follows the LZ77 compression: the LZ77 compression result is converted into the LZ4 standard, whose structure is shown in FIG. 4. Each LZ77 result consists of the current character, the match length, and the match position. The LZ77 output is traversed: while the match length is 0, the literal-sequence length is accumulated; when the match length is non-zero, the {match length, match position, literal sequence length} triple is stored. Characters with match length 0 are then saved again and their lengths counted, and the process repeats. Finally, all character sequences and all {match length, match position, literal sequence length} sequences of the LZ77 stream are obtained.
The two results are then merged and output. The specific process, shown in FIG. 5, is as follows:
The {match length, match position, literal sequence length} triples are traversed. If the literal-sequence length is less than or equal to 15, the literal sequence, match position, and match length are filled in directly. If it is greater than 15, the literal-sequence length continues in extension bytes: each length field is one byte with a maximum of 255, so if the previously calculated literal-sequence length is greater than 255, the first byte is 255 and the second byte holds the literal-sequence length minus 255; if that still exceeds 255, the second field is again 255, and so on, until the remainder is below 255. The literal sequence, match position, and match length are then filled in.
At this point the compression process ends; the data is converted from 8 bits back to 512 bits and stored in a space the same size as the original file (this is possible because the original file is byte-aligned). If the compressed file is larger than the source file, the original file is output without compression. The 8 files of a group are compressed in parallel as described above, the compression results are stored in an output queue, and the next group is then compressed by the same process. Finally, after all groups are compressed, the data is stored in the DRAM, transmitted from the U50 card to the host, and the host fetches and splits the data to obtain the individually compressed files.
Next, the decompression flow is introduced, as shown in FIG. 6. Data organization and transmission at the host end are consistent with the compression flow; the only difference is how the output space is sized. Compression only needs space as large as the original file, whereas decompression must first obtain the original file size and allocate space according to the file information. The data is then transmitted, 64-byte aligned, to the U50 card's DRAM, grouped again, converted per group from a 512-bit stream to 8 bits, and decompressed in parallel. The decompression algorithm is described in detail below.
The currently input data is in LZ4 format, and LZ4 is first converted into LZ77 format; that is, the output is an LZ77-format string carried in a 32-bit stream. This part parses the string in a loop and can use a compile-time pipeline directive. First the high 4 bits of the token are parsed; if they equal 15, the literal length continues in the following bytes: while a byte reads 255, reading continues until a value below 255 is found, and all values read are summed. The literal characters are then read and placed into the low eight bits of the output stream. Next, two bytes of offset are read; the low 4 bits of the first token byte parsed earlier give the match length, and if they equal 15 the following bytes are parsed in the same way. The offset and match_len are placed into the output stream, with the offset in the low 16 bits and match_len in the high 16 bits. The next matching block is then parsed in the same manner.
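The token-parsing loop just described can be sketched in Python as follows. It reads one LZ4 sequence — token byte, literal-length extension bytes, literals, 2-byte offset, match-length extension bytes — and returns the parsed fields. Packing the result into the 32-bit stream used on the FPGA is omitted, and all names are illustrative.

```python
def parse_token(buf, i):
    """Parse one LZ4 sequence header starting at buf[i], as in the
    decompression flow: high nibble = literal length (15 means read
    255-continuation bytes), then the literals, a 2-byte little-endian
    offset, and the low nibble = match length (same extension rule).
    Returns (literals, offset, match_len, next_index)."""
    token = buf[i]; i += 1
    lit_len = token >> 4
    if lit_len == 15:                    # literal-length extension bytes
        while buf[i] == 255:
            lit_len += 255; i += 1
        lit_len += buf[i]; i += 1
    literals = buf[i:i + lit_len]; i += lit_len
    offset = int.from_bytes(buf[i:i + 2], "little"); i += 2
    match_len = token & 0x0F
    if match_len == 15:                  # match-length extension bytes
        while buf[i] == 255:
            match_len += 255; i += 1
        match_len += buf[i]; i += 1
    return literals, offset, match_len, i
```

Looping this function over the buffer and emitting `(match_len << 16) | offset` words would reproduce the 32-bit LZ77 stream described in the text.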
The parsed string is in LZ77 format. The output results are read in sequence: if the high 16 bits are 0, the entry is an original character, which is output directly while the matching-window local_memory is updated; otherwise, a match field has been encountered, so offset and match_len are parsed and the match_len characters located at offset within local_memory are copied into the output. The parsing process is then repeated, as shown in FIG. 7, until all parsing is complete and the original data is obtained.
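The final LZ77 replay can be sketched as below. A `(0, byte)` entry is an original character; a `(match_len, offset)` entry copies `match_len` bytes starting `offset` positions back. Copying byte by byte makes overlapping matches (offset smaller than match length) repeat correctly, as any LZ77 decoder requires. The tuple encoding is an illustrative stand-in for the 32-bit stream.

```python
def lz77_expand(tokens):
    """Replay an LZ77 token stream into the original bytes. A (0, byte)
    token emits a literal; a (match_len, offset) token copies match_len
    bytes starting offset positions back in the output — byte by byte,
    so overlapping copies (offset < match_len) repeat correctly."""
    out = bytearray()
    for match_len, value in tokens:
        if match_len == 0:
            out.append(value)            # literal (original character)
        else:
            start = len(out) - value     # value is the back-offset here
            for k in range(match_len):
                out.append(out[start + k])
    return bytes(out)
```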
Similar to compression, the decompressed data is placed into an output array; after all groups are decompressed, all the data is stored in the U50's DRAM and then transmitted to the host end through PCIE.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention. Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A method for accelerating data compression/decompression of small data block scenes, the method comprising the steps of:
step one, data merging is performed at the host end: after each file is aligned and spliced on 64-byte boundaries, the processed data is transmitted from the host end to the FPGA DRAM through the PCIE bus;
step two, the batch data is grouped, with every 8 files forming one group and the remaining files (fewer than 8) forming a final group on their own; each group of data is compressed in parallel, and for each file the 512-bit data width is converted into 8-bit stream data;
the data compression acceleration method comprises the following steps:
201. performing a first round of compression on the data obtained in step two using the LZ77 algorithm;
202. performing a second round of compression on the data obtained in step 201 using the LZ4 algorithm;
203. for the data obtained in step 202, converting the data from 8 bits back to 512 bits and storing it in a space the same size as the original file;
the data decompression acceleration method comprises the following steps:
211. during decompression, the currently input data is in LZ4 format; the LZ4-format data is converted into LZ77-format data;
212. the LZ77-format data is then parsed to recover the original data;
and step three, the compressed/decompressed data is transmitted from the FPGA DRAM back to the host end, and the host end splits the data to obtain the compressed/decompressed files.
2. A method for accelerating data compression/decompression of small data block scenes according to claim 1, characterized in that: in step 201, the LZ77 compression algorithm compresses in a dictionary manner.
3. A method for accelerating data compression/decompression of small data block scenes according to claim 1, characterized in that: in step 202, through the LZ4 algorithm, all character sequences and { matching length, matching position, literal sequence length } sequences in LZ77 format data are finally obtained.
4. A method for accelerating data compression/decompression of small data block scenes according to claim 1, characterized in that: in step 203, if the file obtained after compression is larger than the source file, the original file is output without compression.
5. A method for accelerating data compression/decompression of small data block scenes according to claim 1, characterized in that: the FPGA accelerator card is a U50 accelerator card with the Xilinx UltraScale+ architecture and integrated ultra-high-bandwidth HBM2 memory technology.
CN202011391958.XA 2020-12-03 2020-12-03 Data compression/decompression acceleration method for small data block scene Pending CN112199343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011391958.XA CN112199343A (en) 2020-12-03 2020-12-03 Data compression/decompression acceleration method for small data block scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011391958.XA CN112199343A (en) 2020-12-03 2020-12-03 Data compression/decompression acceleration method for small data block scene

Publications (1)

Publication Number Publication Date
CN112199343A (en) 2021-01-08

Family

ID=74033849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011391958.XA Pending CN112199343A (en) 2020-12-03 2020-12-03 Data compression/decompression acceleration method for small data block scene

Country Status (1)

Country Link
CN (1) CN112199343A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023509A (en) * 2012-11-14 2013-04-03 无锡芯响电子科技有限公司 Hardware LZ77 compression implementation system and implementation method thereof
CN105183557A (en) * 2015-08-26 2015-12-23 东南大学 Configurable data compression system based on hardware
CN107565972A (en) * 2017-09-19 2018-01-09 郑州云海信息技术有限公司 A kind of compression method, device, equipment and the storage medium of LZ codings


Similar Documents

Publication Publication Date Title
US9768803B2 (en) Hardware data compressor using dynamic hash algorithm based on input block type
US5281967A (en) Data compression/decompression method and apparatus
US9515678B1 (en) Hardware data compressor that directly huffman encodes output tokens from LZ77 engine
US8698657B2 (en) Methods and systems for compressing and decompressing data
US7924183B2 (en) Method and system for reducing required storage during decompression of a compressed file
US7538695B2 (en) System and method for deflate processing within a compression engine
US9509336B1 (en) Hardware data compressor that pre-huffman encodes to decide whether to huffman encode a matched string or a back pointer thereto
US20160336959A1 (en) Hardware data compressor that constructs and uses dynamic-prime huffman code tables
CN108287877B (en) FPGA (field programmable Gate array) compression/decompression system and hardware decompression method for RIB (run in Box) rendering compressed file
US9503122B1 (en) Hardware data compressor that sorts hash chains based on node string match probabilities
US9628111B2 (en) Hardware data compressor with multiple string match search hash tables each based on different hash size
CN111723059B (en) Data compression method and device, terminal equipment and storage medium
EP4030628A1 (en) Near-storage acceleration of dictionary decoding
US20190052284A1 (en) Data compression apparatus, data decompression apparatus, data compression program, data decompression program, data compression method, and data decompression method
US9137336B1 (en) Data compression techniques
CN116015311A (en) Lz4 text compression method based on sliding dictionary implementation
US10027346B2 (en) Hardware data compressor that maintains sorted symbol list concurrently with input block scanning
CN117493386B (en) Database access method and device, storage medium and electronic equipment
US9197243B2 (en) Compression ratio for a compression engine
CN112199343A (en) Data compression/decompression acceleration method for small data block scene
US10491241B1 (en) Data compression scheme utilizing a repetitive value within the data stream
CN118100955B (en) Method for preprocessing compressed data by parallel decompression
CN112597082B (en) Bus data transmission method and electronic equipment
Wang et al. Lightweight Lossless Compression Algorithm for Fast Decompression Application
Höglund Lightweight Real-Time Lossless Software Compression of Trace Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210108