
CN103559020A - Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data - Google Patents


Info

Publication number
CN103559020A
CN103559020A
Authority
CN
China
Prior art keywords
data
block
compressed
queue
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310551802.7A
Other languages
Chinese (zh)
Other versions
CN103559020B (en)
Inventor
郑晶晶
王婷
张常有
詹科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Guangzhou Institute of Software Application Technology Guangzhou GZIS
Original Assignee
Institute of Software of CAS
Guangzhou Institute of Software Application Technology Guangzhou GZIS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS, Guangzhou Institute of Software Application Technology Guangzhou GZIS filed Critical Institute of Software of CAS
Priority to CN201310551802.7A priority Critical patent/CN103559020B/en
Publication of CN103559020A publication Critical patent/CN103559020A/en
Application granted granted Critical
Publication of CN103559020B publication Critical patent/CN103559020B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method for parallel compression and parallel decompression of FASTQ files containing DNA sequence read data. For the compression and decompression of such files, the method combines circular double-buffer queues, circular double memory mapping and memory mapping with block-wise data processing, multi-threaded pipelined compression/decompression, and a read/write-order two-dimensional array, so that a FASTQ file is compressed and decompressed in parallel both across multiple processes and among the multiple threads within each process. The method can be implemented with MPI and OpenMP or with MPI and Pthreads. It makes full use of the computing power of every compute node and of the multi-core CPUs within each node, and overcomes the processor and memory limitations that constrain serial compression and decompression programs.

Description

A parallel compression and decompression method for FASTQ files of DNA sequence read data

Technical Field

The invention relates to the fields of bioinformatics, data compression and high-performance computing, and in particular to a method for the parallel compression and parallel decompression of FASTQ files containing DNA sequence read data.

Background Art

One of the main tasks of bioinformatics is to collect and analyze large amounts of genetic data. These data are critical for genetic research: they help identify the genetic components that prevent or cause disease and support the development of targeted therapies. High-throughput sequencing methods and instruments generate massive amounts of short read data. The common way to store, manage and transmit DNA read data is the FASTQ file format, which mainly contains the DNA reads together with per-base annotation information, such as the quality scores that express the uncertainty of the base-calling process. Read identifiers and other descriptions such as the instrument name are also included in FASTQ files. Compared with other DNA storage formats (such as FASTA), the FASTQ format stores more information, but this also makes file sizes and storage requirements grow sharply. Effective lossless compression and decompression algorithms for base read data and their annotation information are therefore an active research topic.

For the compression of FASTQ data, important progress has been made with the G_SQZ algorithm developed at the Translational Genomics Research Institute (TGen) (Tembe, W. et al. G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 2010;26:2192–2194.) and with the DSRC algorithm of Deorowicz and Grabowski (Deorowicz S, Grabowski S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 2011;27:860–862.). Both algorithms use an index that allows access at regular intervals (blocks for short), so the required information does not have to be decoded from the beginning of the file. G_SQZ mainly Huffman-codes <base, quality score> pairs, while DSRC Huffman-codes the base lines and the quality-score lines separately, supplemented by other fine-grained compression steps (such as run-length processing). The advantage of this class of methods is that part of the data can be decoded randomly for access while the relative order of the data is preserved, and the lossless compression efficiency is high. They represent a class of FASTQ compression methods that, for convenience, are referred to below as block-indexed serial algorithms.

The data to be processed in genome sequence analysis has already reached the terabyte scale, and large sequencing centers are planning or have installed petabyte-scale storage. To reduce storage space and transfer time and to allow real-time analysis of large volumes of genome sequence data, this data must be compressed and decompressed in real time, which requires the computing power of high-performance computing platforms. With the rapid development of such platforms, fully exploiting the multi-core CPUs on each compute node to compress and decompress huge FASTQ files in real time can overcome the processor and memory limitations of serial compression and decompression programs.

Both the G-SQZ and DSRC algorithms described above are serial algorithms; to date there are no published research articles or patents on parallel versions of this class of algorithms for multi-node, multi-core CPU platforms.

Summary of the Invention

Since no research or patents on parallel algorithms for block-indexed serial algorithms such as G-SQZ and DSRC are known, the purpose of the present invention is to provide a parallel compression and decompression method corresponding to this class of block-indexed serial FASTQ compression/decompression algorithms. Using multiple compute nodes and multi-core CPUs, it can be implemented with MPI+OpenMP or with MPI+Pthread. It makes full use of the computing power of high-performance computing platforms, greatly increases the speed of real-time analysis of massive genome sequence data, and provides an important technical basis for the wider application of genetic data.

The technical scheme of the present invention is as follows.

A parallel compression method for FASTQ files containing DNA sequence read data comprises the following steps:

1. Parallel compression: task division across processes

The start and end positions of the data to be processed by each process are determined from the FASTQ file size, the number of parallel compression processes, and the characteristics of each read entry in the FASTQ file (the base information together with the corresponding annotation information; for convenience, called a record below). Every process runs the task-division module, and the raw data to be compressed is distributed approximately evenly across the processes to achieve data parallelism. Because the processes do not need to communicate with one another while working, the efficiency of the data-parallel processing is improved. Each process produces a separate compressed file, and the order of the compressed data matches the process rank.
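As an illustration of this step, the sketch below (C++; the helper names snap_to_record and split_task are hypothetical, not from the patent) splits the input file into nearly equal byte ranges, one per process, and snaps each range start forward to a record boundary so that no record straddles two processes. Record-boundary detection is deliberately simplified: a full FASTQ parser must also handle '@' appearing at the start of quality lines.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <utility>

// Advance `pos` to the start of the next record title line (a line beginning
// with '@'). Simplified: 64-bit file offsets would use fseeko in practice.
static uint64_t snap_to_record(std::FILE* f, uint64_t pos) {
    std::fseek(f, (long)pos, SEEK_SET);
    char buf[1 << 16];
    if (pos != 0 && std::fgets(buf, sizeof(buf), f))   // skip the partial line we landed in
        pos += std::strlen(buf);
    while (std::fgets(buf, sizeof(buf), f)) {
        if (buf[0] == '@') return pos;                 // start of the next record
        pos += std::strlen(buf);
    }
    return pos;                                        // end of file
}

// Byte range [start, end) handled by process `rank` out of `nprocs`.
std::pair<uint64_t, uint64_t> split_task(const char* path, int rank, int nprocs) {
    std::FILE* f = std::fopen(path, "rb");
    std::fseek(f, 0, SEEK_END);
    uint64_t size = (uint64_t)std::ftell(f);
    uint64_t chunk = size / (uint64_t)nprocs;
    uint64_t start = snap_to_record(f, (uint64_t)rank * chunk);
    uint64_t end   = (rank == nprocs - 1) ? size
                                          : snap_to_record(f, (uint64_t)(rank + 1) * chunk);
    std::fclose(f);
    return {start, end};
}
```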

2. Parallel compression: multi-threaded pipelined compression within each process

The per-process processing module contains one raw-data reading thread, one compressed-data writing thread and several compression worker threads; the exact number of worker threads can be set according to the number of CPU cores and the process configuration.

The data handled by each process is divided by the raw-data reading thread into multiple blocks, each containing a fixed number of records (the last block may contain fewer).

Each worker thread owns two circular double-buffer queues: a raw-data circular double-buffer queue and a compressed-data circular double-buffer queue. The two have a similar structure; the buffers differ slightly depending on the data they hold, and the structure of each buffer is described in detail in the Detailed Description. Each raw-data circular double-buffer queue contains two queues, an empty-block buffer queue and a raw-data block queue. Each compressed-data circular double-buffer queue likewise contains an empty-block buffer queue and a compressed-data block queue. Both circular double-buffer queues are handled in the same way.

Taking the raw-data circular double-buffer queue as an example, it is handled as follows (a sketch of the queue structure follows these steps):

(1) Initialization of the raw-data circular double-buffer queue: the empty-block buffer queue is instantiated with a fixed number of empty block buffers, and the raw-data block queue is empty.

(2) The raw-data reading thread reads one raw data block.

(3) An empty block buffer is taken from the head of the empty-block buffer queue.

(4) The acquired empty block buffer is filled with the raw data block.

(5) The filled raw-data block buffer is appended to the tail of the raw-data block queue.

(6) A compression worker thread takes the block data from the buffer at the head of the raw-data block queue and compresses it.

(7) That raw-data block buffer is cleared and returned to the empty-block buffer queue.
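A minimal sketch of one such circular double-buffer queue is given below (C++11; the class and member names are illustrative, not from the patent). It pairs an empty-block buffer queue with a filled-block queue; blocking waits on condition variables stand in for the polling loops described in the embodiments.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

struct Block {
    std::vector<char> data;   // record data or compressed data
    long block_no = -1;       // global block number
    int  n_records = 0;       // number of records stored in this block
};

class DoubleBufferQueue {
public:
    explicit DoubleBufferQueue(size_t n_buffers) {
        for (size_t i = 0; i < n_buffers; ++i) empty_.push_back(new Block());
    }
    ~DoubleBufferQueue() {
        for (Block* b : empty_) delete b;
        for (Block* b : full_) delete b;
    }
    Block* take_empty()        { return pop(empty_, cv_empty_); }   // producer side
    void   put_full(Block* b)  { push(full_, cv_full_, b); }
    Block* take_full()         { return pop(full_, cv_full_); }     // consumer side
    void   put_empty(Block* b) { b->data.clear(); push(empty_, cv_empty_, b); }

private:
    Block* pop(std::deque<Block*>& q, std::condition_variable& cv) {
        std::unique_lock<std::mutex> lk(m_);
        cv.wait(lk, [&] { return !q.empty(); });
        Block* b = q.front(); q.pop_front();
        return b;
    }
    void push(std::deque<Block*>& q, std::condition_variable& cv, Block* b) {
        { std::lock_guard<std::mutex> lk(m_); q.push_back(b); }
        cv.notify_one();
    }
    std::mutex m_;
    std::condition_variable cv_empty_, cv_full_;
    std::deque<Block*> empty_, full_;   // empty-block queue and filled-block queue
};
```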

Within each process, the data is compressed in a parallel pipeline in units of raw data blocks. The pipeline works as follows:

(1) The raw-data reading thread continuously parses and reads raw data blocks according to the characteristics of the records. It cycles through the compression worker threads looking for an empty block buffer in each thread's raw-data circular double-buffer queue; when one is found, the raw data block is placed in it and the buffer is released to the tail of the raw-data block queue of that circular double-buffer queue.

(2) Each compression worker thread continuously takes a raw data block from the head of the raw-data block queue of its own raw-data circular double-buffer queue and compresses it.

(3) Each compression worker thread continuously fills the compressed block data into an empty block buffer taken from its own compressed-data circular double-buffer queue and releases this buffer to the tail of that queue's compressed-data block queue.

(4) The compressed-data writing thread repeatedly looks up, in ascending block-number order, the thread on which each block has finished compression, takes that block's compressed data from the head of the compressed-data block queue of that thread's compressed-data circular double-buffer queue, and writes it to the final compressed file.

The detailed algorithms and termination conditions of these threads are given in the Detailed Description.

In the raw-data reading thread, memory mapping combined with FASTQ data blocking is used to speed up the reading of large data files. Together with the blocking of the DNA read information, the position of each block's data within the memory-mapped region, and when to release and re-create the mapping, are computed from the memory page size and the size of the mapped region. An obvious benefit of memory mapping is that the process can read and write memory directly, essentially without any extra data copies; file I/O such as fread/fwrite requires four data copies between kernel space and user space, whereas memory mapping needs only two: one from the input file into the memory-mapped area and one from the memory-mapped area into the output file. In effect, the process can operate on an ordinary file as if it were memory. A detailed description of this technique is given in the Detailed Description.
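The sketch below (POSIX mmap; names illustrative, error handling such as checking for MAP_FAILED omitted) shows the page-alignment computation assumed here: the requested file range is widened down to a page boundary, and the requested data then starts at a fixed offset inside the mapping.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

struct MappedRegion {
    char*  base;     // start of the mapping (page-aligned)
    size_t length;   // mapped length
    size_t delta;    // offset of the requested start inside the mapping
};

// Map `want` bytes of the file starting at byte `start`; the mapping itself
// begins at the enclosing page boundary.
MappedRegion map_range(int fd, uint64_t start, size_t want) {
    long page = sysconf(_SC_PAGESIZE);
    uint64_t aligned = start - start % (uint64_t)page;    // align down to a page
    size_t len = want + (size_t)(start - aligned);
    void* p = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, (off_t)aligned);
    return { (char*)p, len, (size_t)(start - aligned) };
    // The requested data begins at base + delta; call munmap(base, length)
    // before creating the next mapping window.
}
```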

In the compressed-data writing thread, the order in which the compressed blocks are written to the final compressed file must match the order in which the raw-data reading thread read the raw data blocks. A read/write-order two-dimensional array is used for this: its first dimension is the block number, and its second dimension has size 2, recording for each block the worker thread it was assigned to and a flag indicating that its compression has finished.

The compressed file produced by each process starts with a file header containing configuration information, for example the number of records per block. It is followed by the compressed data of each block in the order of the raw data blocks. The file ends with footer data containing the position index of each block's compressed data within the file, the number of blocks, and the position of the footer within the whole file. This information is used for parallel decoding, and for random access to a specific block so that only part of the data needs to be decoded rather than the entire file.
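A possible on-disk layout of this footer is sketched below (the field widths and serialization order are assumptions; the patent only specifies which information the footer contains).

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Append the footer: the byte offset of every block's compressed data,
// the block count, and the footer's own starting position.
void write_footer(std::FILE* out, const std::vector<uint64_t>& block_offsets) {
    uint64_t footer_pos = (uint64_t)std::ftell(out);
    uint64_t n_blocks = block_offsets.size();
    std::fwrite(block_offsets.data(), sizeof(uint64_t), n_blocks, out);  // block position index
    std::fwrite(&n_blocks, sizeof(n_blocks), 1, out);                    // number of blocks
    std::fwrite(&footer_pos, sizeof(footer_pos), 1, out);                // footer position
}

// Read the footer back; the last two 64-bit words locate the block index.
std::vector<uint64_t> read_footer(std::FILE* in) {
    std::fseek(in, -2 * (long)sizeof(uint64_t), SEEK_END);
    uint64_t n_blocks = 0, footer_pos = 0;
    std::fread(&n_blocks, sizeof(n_blocks), 1, in);
    std::fread(&footer_pos, sizeof(footer_pos), 1, in);
    std::fseek(in, (long)footer_pos, SEEK_SET);
    std::vector<uint64_t> block_offsets(n_blocks);
    std::fread(block_offsets.data(), sizeof(uint64_t), n_blocks, in);
    return block_offsets;
}
```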

A parallel decompression method for FASTQ files containing DNA sequence read data comprises the following steps:

1. Determining the compressed file handled by each process from its process rank

Compressing a FASTQ file yields as many compressed files as there are parallel compression processes. For decompression, the number of parallel decompression processes is set according to the number of compressed files, and the order of the decompressed files produced by the decompression processes is determined by the order of the compressed files. The decompression processes do not need to communicate with one another while working, which improves the efficiency of the data-parallel processing.

2. Reading the tail of the compressed file to obtain the block configuration, block index and block count

Unlike the parallel compression method, the parallel decompression method starts, in each process, by reading from the tail of the compressed file the number of records per block, the position index of each block within the compressed file, the number of blocks, and related information; this is what distinguishes the parallel decompression method from the parallel compression method.

3. Parallel decompression: multi-threaded pipelined decompression within each process

Similarly to the parallel compression method, the per-process processing module of the parallel decompression method contains one compressed-data reading thread, one decompressed-data writing thread and several decompression worker threads; the exact number of worker threads can be set according to the number of CPU cores and the process configuration.

Each decompression worker thread owns two circular double-buffer queues: a compressed-data circular double-buffer queue and a decompressed-data circular double-buffer queue. The two have a similar structure; the buffers differ slightly depending on the data they hold, and the structure of each buffer is described in detail in the Detailed Description. Each compressed-data circular double-buffer queue contains two queues, an empty-block buffer queue and a compressed-data block queue. Each decompressed-data circular double-buffer queue likewise contains an empty-block buffer queue and a decompressed-data block queue. Both circular double-buffer queues are handled in the same way as the raw-data circular double-buffer queue of the parallel compression method described above, so the details are not repeated here.

Within each process, the data is decompressed in a parallel pipeline in units of compressed read-data blocks. The pipeline works as follows:

(1) Using the block position index obtained from the file tail, the compressed-data reading thread continuously reads compressed blocks of known compressed size in ascending block-number order. It cycles through the decompression worker threads looking for an empty block buffer at the head of each thread's compressed-data circular double-buffer queue; when one is found, the compressed block data is placed in it and the buffer is released to the tail of the compressed-data block queue of that circular double-buffer queue.

(2) Each decompression worker thread continuously takes a compressed data block from the head of the compressed-data block queue of its own compressed-data circular double-buffer queue and decompresses it.

(3) Each decompression worker thread continuously fills the decompressed block data into an empty block buffer taken from its own decompressed-data circular double-buffer queue and releases this buffer to the tail of that queue's decompressed-data block queue.

(4) The decompressed-data writing thread repeatedly looks up, in ascending block-number order, the thread on which each block has finished decompression, takes that block's decompressed data from the head of the decompressed-data block queue of that thread's decompressed-data circular double-buffer queue, and writes it to the final decompressed file.

The detailed algorithms and termination conditions of these threads are given in the Detailed Description.

The compressed-data reading thread uses circular double memory mapping combined with data blocking to speed up the reading of large data files. The key technique is circular double memory mapping, which lets the decompression worker threads read compressed data for decompression while the compressed-data reading thread performs memory mapping in parallel. There are two memory mappings, memory mapping 1 and memory mapping 2, and the compressed blocks are placed into them alternately in block order. From the compressed-block position index at the tail of the compressed file and the sizes of the two mapped regions, the mapped region containing each compressed block, and the block's position within that region, are computed in ascending block-number order. The decompression worker threads use this circular double memory-mapped region directly in the compressed-data circular double-buffer queue, which reduces the number of data copies. Before a mapping that is in use can be re-created, the reading thread must wait until the data of that mapping's previous instance has been consumed by all decompression worker threads. A detailed description of this technique is given in the Detailed Description.
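The sketch below (POSIX; names illustrative, not from the patent) shows one way to manage the two mapping slots: before a slot is remapped, the reader waits until every block served from that slot's previous mapping has been marked finished in the read/write-order array.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <atomic>
#include <cstdint>

struct MapSlot {
    void*  addr = nullptr;                       // one of the two mapping slots
    size_t len = 0;
    long   first_block = -1, last_block = -1;    // blocks served from its previous mapping
};

// done[b] is set by a decompression worker when block b has been decompressed
// (this plays the role of the read/write-order array's "finished" flag).
void remap_slot(MapSlot& slot, int fd, uint64_t offset, size_t len,
                const std::atomic<bool>* done) {
    // Wait until every block handed out from this slot's previous mapping is finished.
    for (long b = slot.first_block; b >= 0 && b <= slot.last_block; ++b)
        while (!done[b].load()) { /* poll */ }
    if (slot.addr) munmap(slot.addr, slot.len);

    long page = sysconf(_SC_PAGESIZE);
    uint64_t aligned = offset - offset % (uint64_t)page;     // page-align the start
    slot.len  = len + (size_t)(offset - aligned);
    slot.addr = mmap(nullptr, slot.len, PROT_READ, MAP_SHARED, fd, (off_t)aligned);
    slot.first_block = slot.last_block = -1;     // updated as blocks are handed to workers
}
```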

In the decompressed-data writing thread, the order in which the decompressed blocks are written to the final decompressed file must match the order in which the compressed-data reading thread read the compressed blocks. As in the parallel compression method, a read/write-order two-dimensional array of the same kind records, for each block, the worker thread it was assigned to and a flag indicating that its decompression has finished.

To improve I/O speed, the decompressed-data writing thread also uses memory mapping combined with data blocking: a memory-mapped file whose size is determined by the number of blocks to be decompressed is created, and the decompressed blocks are placed into the mapped region in ascending block-number order. During this process the mapping is re-created according to the position of the written data, the size of the mapped region and a remapping threshold, and the threshold is adjusted accordingly. Details are given in the Detailed Description.

For the compression and decompression of FASTQ files containing DNA sequence read data, the important progress of recent years has come from block-indexed serial algorithms such as G_SQZ and DSRC. No research articles or patents on parallel versions of this class of algorithms are known. As the data used in genome sequence analysis reaches terabyte or even petabyte scale, real-time compression and decompression are required to allow real-time analysis of large volumes of genome sequence data, and this needs the computing power of high-performance computing platforms. It is therefore important to study parallel compression and decompression methods for block-indexed serial algorithms such as G_SQZ and DSRC.

The present invention proposes, for the first time, parallel compression and parallel decompression methods corresponding to block-indexed serial compression/decompression algorithms such as G_SQZ and DSRC. Using circular double-buffer queues, circular double memory mapping and memory mapping combined with block-wise data processing, multi-threaded pipelined compression/decompression and a read/write-order two-dimensional array, a FASTQ file is compressed and decompressed in parallel both across multiple processes and among the threads within each process.

The advantages of the present invention are:

(1) The invention makes full use of the computing power of every compute node and of the multi-core CPUs within each node, and overcomes the processor and memory limitations of serial compression and decompression programs. The parallel compression and decompression method can be implemented flexibly, either with MPI and OpenMP or with MPI and Pthread.

(2) Because the invention works with any compression and decompression algorithm applied inside a data block, the parallel compression and decompression method is not limited to parallelizing the two serial algorithms G_SQZ and DSRC: as long as a serial compression/decompression algorithm has the two characteristics of blocking and indexing, the method is applicable to parallelizing that block-indexed serial algorithm.

(3) The invention makes full use of the computing power of high-performance computing platforms, can greatly increase the speed of real-time analysis of massive genome sequence data, and provides an important technical basis for the wider application of genetic data.

Brief Description of the Drawings

Fig. 1 shows the in-process multi-threaded pipelined compression of the parallel compression method of the invention;

Fig. 2 shows the raw-data circular double-buffer queue of the invention;

Fig. 3 shows the compressed data block buffer of the parallel compression method of the invention;

Fig. 4 shows the in-process multi-threaded pipelined decompression of the parallel decompression method of the invention;

Fig. 5 shows the relationship between the circular double memory mapping and the compressed data block buffer in the parallel decompression method of the invention;

Fig. 6 shows the coordination of the circular double memory mapping between the reading thread and the decompression worker threads in the parallel decompression method of the invention.

In the figures: 1. memory mapping 1; 2. memory mapping 2; 3. time axis; 4. memory-mapped region pointer; 5. block start point within the memory-mapped region; 6. compressed data block length; 7. compressed data block number.

Detailed Description of Embodiments

The present invention provides a method for the parallel compression and decompression of FASTQ files containing DNA sequence read data. To make the purpose, technical scheme and effects of the invention clearer, it is described in further detail below with reference to the drawings. It should be understood that the specific embodiments described here only serve to explain the invention and do not limit it.

The raw-data reading thread of the parallel FASTQ compression method is explained in detail below; its steps are as follows (a sketch of the record parsing in step (8) follows the list):

(1) Open the raw FASTQ file of DNA read data to be compressed.

(2) Obtain the memory page size of the file system of the machine the program is running on.

(3) Set the size of the memory-mapped region according to the memory page size.

(4) According to the range of raw data assigned to the current process by the task-division module, set the starting point of the memory mapping (the starting point must be aligned to a memory-page boundary according to the page size) and the size of the mapped data, and create the memory mapping.

(5) Record the starting position within the memory mapping of the first raw data block of the process.

(6) Cycle through the raw-data circular double-buffer queues of the compression worker threads, looking for an empty block buffer.

(7) If an empty block buffer exists, go to (8); otherwise go to (6).

(8) From the memory-mapped region, read a fixed number of records, record by record and in order, to form one raw data block; fill the empty block buffer with it, increment the block number by 1, and store the number of records in the block. Or, when the end of the mapping is reached, go to (9).

(9) Release this buffer to the tail of the raw-data block queue of this circular double-buffer queue.

(10) Set the thread assignment number of this block in the read/write-order two-dimensional array.

(11) If the end of the data assigned to the process has been reached, go to (15); otherwise go to (12).

(12) From the end position of the current read within the memory mapping, compute the length of the raw data block just read and set the starting position of the next block to be read within the memory-mapped region.

(13) If (the distance from the current read position in the memory mapping to the end of the current mapping is less than 1.5 times the length of the block just read) && (the end of the file has not been reached), go to (14); otherwise go to (8).

(14) Release the previous memory mapping, compute the starting point and size of the next mapping from the current file read position (the starting point of the next block to be read), and create the memory mapping again. Go to (8).

(15) Release the memory mapping and set the reading-thread finished flag.

(16) The raw-data reading thread ends.
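As a sketch of step (8) (names are illustrative; the actual record objects are described next), the function below copies a fixed number of four-line FASTQ records out of the mapped region into one raw data block and returns the offset just past the last record copied.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct RawBlock {
    std::vector<std::string> lines;   // 4 * n_records text lines
    long block_no = -1;
    int  n_records = 0;
};

// Copy up to `records_per_block` four-line records out of the mapped region,
// starting at byte offset `pos`.
size_t fill_block(const char* mapped, size_t mapped_len, size_t pos,
                  int records_per_block, long block_no, RawBlock& out) {
    out.lines.clear();
    out.block_no = block_no;
    out.n_records = 0;
    while (out.n_records < records_per_block && pos < mapped_len) {
        for (int line = 0; line < 4 && pos < mapped_len; ++line) {   // title, sequence, '+', quality
            size_t start = pos;
            while (pos < mapped_len && mapped[pos] != '\n') ++pos;
            out.lines.emplace_back(mapped + start, pos - start);
            if (pos < mapped_len) ++pos;                             // skip '\n'
        }
        ++out.n_records;
    }
    return pos;
}
```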

The raw-data block buffer of the raw-data circular double-buffer queue used in the above steps is shown in Fig. 3. The buffer is implemented as a structure with three fields: a block data pointer, the block number and the number of records in the block. The block data pointer points to an array of records, and each array element points to a record object that holds all the data of the four parts of a FASTQ record (the title part, the DNA sequence part, the "+" part and the quality-score part), namely: for the title part, a title data pointer, the title data length and the reserved title space length, where the title data pointer points to the title data buffer; for the DNA sequence part, a sequence data pointer, the sequence data length and the reserved sequence space length, where the sequence data pointer points to the sequence data buffer; for the "+" part, a plus data pointer, the plus data length and the reserved plus space length, where the plus data pointer points to the plus data buffer; for the quality-score part, a quality data pointer, the quality data length and the reserved quality space length, where the quality data pointer points to the quality data buffer; and, for the DNA read truncation information, a sequence truncation vector and a quality truncation vector.
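Expressed as a C++ structure (the field names are illustrative translations of the description above), the raw-data block buffer and its record objects look roughly as follows:

```cpp
#include <cstddef>
#include <vector>

struct Field {                       // one of title / sequence / plus / quality
    char*  data = nullptr;           // pointer to the field's data buffer
    size_t length = 0;               // bytes currently used
    size_t reserved = 0;             // reserved space length
};

struct FastqRecord {
    Field title, sequence, plus, quality;
    std::vector<size_t> seq_trunc;   // sequence truncation information
    std::vector<size_t> qual_trunc;  // quality truncation information
};

struct RawDataBlockBuffer {
    FastqRecord** records = nullptr; // block data pointer -> array of record objects
    long block_no = -1;              // block number
    int  n_records = 0;              // number of records in the block
};
```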

The compression worker thread of the parallel FASTQ compression method is explained in detail below; its steps are as follows:

(1) Preparatory work of the compression worker thread, including creating and initializing the objects used in the thread.

(2) Take the queue head from the raw-data block queue of the raw-data circular double-buffer queue.

(3) If the obtained queue head is empty, go to (2); otherwise go to (4).

(4) Compress the raw data block in the buffer at the queue head, store the compressed data in a temporary buffer inside the thread, and record the size of the compressed block data.

(5) Release this buffer to the empty-block buffer queue of this circular double-buffer queue.

(6) Take the queue head from the empty-block buffer queue of the compressed-data circular double-buffer queue.

(7) If the obtained queue head is empty, go to (6); otherwise go to (8).

(8) Store the compressed block data cached in the thread's temporary buffer into the empty block buffer at this queue head, and record the compressed data size and the block number.

(9) Release this buffer to the tail of the compressed-data block queue of this circular double-buffer queue.

(10) Set the compression-finished flag of this block in the read/write-order two-dimensional array.

(11) If the raw-data reading thread has finished and all blocks have been processed, go to (13); otherwise go to (2).

(12) If the compressed-data writing thread has finished, go to (13); otherwise go to (2).

(13) Set the finished flag of this compression worker thread.

(14) This compression worker thread ends.

Note that in step (4) above, the data of each raw data block buffer can be compressed, as required, with a block-indexed algorithm such as the DSRC algorithm or the G_SQZ algorithm.

The compressed data block buffer of the compressed-data circular double-buffer queue used in the above steps is implemented as a structure with four fields: a compressed data block pointer, the compressed data block length, the compressed data block number and the number of records in the block, where the compressed data block pointer points to a data buffer.
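As a sketch (field names illustrative), this buffer corresponds to a structure like the following:

```cpp
#include <cstddef>

struct CompressedBlockBuffer {
    unsigned char* data = nullptr;   // pointer to the compressed bit stream
    size_t length = 0;               // compressed data block length in bytes
    long   block_no = -1;            // compressed data block number
    int    n_records = 0;            // number of records in the block
};
```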

The compressed-data writing thread of the parallel FASTQ compression method is explained in detail below; its steps are as follows (a sketch of the writing loop follows the list):

(1) Preparatory work of the compressed-data writing thread, including writing to the header of the compressed file the setting for the number of records per block.

(2) Set the block number block_no = 0.

(3) Look up the compression flag of block block_no in the read/write-order two-dimensional array.

(4) If block block_no has been compressed, go to (5); otherwise go to (3).

(5) Take the queue head from the compressed-data block queue of the compressed-data circular double-buffer queue.

(6) If the queue head is empty, go to (5); otherwise go to (7).

(7) Write the compressed data block at this queue head to the final compressed file.

(8) Release this buffer to the tail of the empty-block buffer queue of this circular double-buffer queue.

(9) Increment block_no by 1.

(10) If the raw-data reading thread has finished and all blocks have been written to the final compressed file, go to (11); otherwise go to (3).

(11) At the end of the compressed file, write the footer information: the position index of each block's compressed bit stream within the compressed file, the total number of blocks, the starting position of the footer, and so on.

(12) Set the compressed-data writing thread finished flag.

(13) The compressed-data writing thread ends.
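The sketch below (C++11; names illustrative, buffer recycling and termination checks omitted) shows the core of this writing loop: the read/write-order array is polled so that blocks leave the per-worker queues strictly in ascending block-number order.

```cpp
#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>
#include <vector>

struct CompressedBlock { std::vector<unsigned char> data; long block_no = -1; };

struct WorkerOutput {            // stands in for one worker's compressed-data
    std::mutex m;                // circular double-buffer queue
    std::deque<CompressedBlock> ready;
};

struct BlockOrder {              // one row of the read/write-order 2-D array
    std::atomic<int> thread_id;  // worker the block was assigned to
    std::atomic<int> done;       // set to 1 when compression of the block finishes
};

void writer_loop(std::FILE* out, long total_blocks,
                 BlockOrder* order, std::vector<WorkerOutput>& workers) {
    for (long b = 0; b < total_blocks; ++b) {
        while (order[b].done.load() == 0) { /* poll: block b not compressed yet */ }
        WorkerOutput& w = workers[(size_t)order[b].thread_id.load()];
        CompressedBlock blk;
        for (bool got = false; !got; ) {            // queue head is block b once ready
            std::lock_guard<std::mutex> lk(w.m);
            if (!w.ready.empty()) { blk = std::move(w.ready.front()); w.ready.pop_front(); got = true; }
        }
        std::fwrite(blk.data.data(), 1, blk.data.size(), out);
    }
}
```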

The compressed-data reading thread of the parallel FASTQ decompression method is explained in detail below; its steps are as follows:

(1) Open the compressed file to be decompressed that was assigned to the process and obtain file descriptor 1, i.e. fd1.

(2) Open the same compressed file assigned to the process again and obtain file descriptor 2, i.e. fd2.

(3) Obtain the memory page size of the file system of the machine the program is running on.

(4) Set the size of the memory-mapped region according to the memory page size.

(5) From the position index of each block's compressed bit stream at the tail of the compressed file, obtain the start and end positions in the compressed bit stream of all blocks that the process has to decompress.

(6) From this start position, set the starting point of the memory mapping (the starting point must be aligned to a memory-page boundary according to the page size), the size of the mapped data and the end point of the mapping, and map fd1 to obtain the address lpBuf1 of memory-mapped region 1. The start and end points of the current mapping are computed relative to the whole compressed file.

(7) Set the current memory-mapped region number current_buffer_symbol to 1.

(8) Initialize the mapping-switch counter reverse_num to 0; initialize the start-block-number and end-block-number variables of the current memory-mapped region current_lpbuffer to 0, and initialize the start-block-number and end-block-number variables of the previous memory-mapped region last_lpbuffer to 0.

(9) Set the block number of the current block to be decompressed to 0.

(10) Look up the compressed bit-stream position index and obtain the start and end positions of the current block to be decompressed within the compressed file.

(11) If the start and end positions of the current block to be decompressed fall within the range of the current mapped region, go to (20); otherwise go to (12).

(12) Increment the mapping-switch counter reverse_num by 1.

(13) Assign the start block number of the current memory-mapped region to the start-block-number variable of the previous memory-mapped region, assign the current block number to the start-block-number variable of the current memory-mapped region, and assign (current block number - 1) to the end-block-number variable of the previous memory-mapped region.

(14) From the start position of the current block to be decompressed within the compressed file, set the starting point of the memory mapping (the starting point must be aligned to a memory-page boundary according to the page size), the size of the mapped data and the end point of the mapping. The start and end points of the current mapping are computed relative to the whole compressed file.

(15) Change the current memory-mapped region number, i.e. alternate between the two mappings: if current_buffer_symbol is 1, change it to 2; if current_buffer_symbol is 2, change it to 1.

(16) If reverse_num >= 2, go to (17); otherwise go to (19).

(17) Using the start and end block numbers recorded for the previous memory-mapped region, keep polling the decompression-finished flags of the corresponding blocks in the read/write-order two-dimensional array until all blocks in that range have been decompressed. Go to (18).

(18) If current_buffer_symbol = 1, release memory mapping 1; otherwise release memory mapping 2.

(19) If current_buffer_symbol = 1, re-map fd1 as memory mapping 1 to obtain the mapping address lpbuf1; otherwise re-map fd2 as memory mapping 2 to obtain the mapping address lpbuf2. The mapping parameters used here are those computed in (14).

(20) Cycle through the compressed-data circular double-buffer queues of the decompression worker threads, looking for an empty block buffer.

(21) If an empty block buffer exists, go to (22); otherwise go to (20).

(22) From current_buffer_symbol, the mapping starting point of the current memory-mapped region, and the start and end positions of the current block to be decompressed, set the four fields of the empty block buffer structure obtained from the compressed-data circular double-buffer queue, namely the block start point within the memory-mapped region, the memory-mapped region pointer, the compressed data block length and the compressed data block number, forming the compressed data block buffer.

(23) Release this buffer to the tail of the compressed-data block queue of this circular double-buffer queue.

(24) Set the thread assignment number of this block in the read/write-order two-dimensional array.

(25) Increment the block number of the current block to be decompressed by 1; if it is greater than the largest block number the process has to decompress, go to (26); otherwise go to (10).

(26) Wait for the writing thread to finish; if it has not finished, keep waiting. When the writing thread has finished, release memory mapping 1 and memory mapping 2 and close fd1 and fd2.

(27) Set the compressed-data reading thread finished flag.

(28) The compressed-data reading thread ends.

The compressed data block buffer of the compressed-data circular double-buffer queue used in the above steps is implemented as a structure with four fields: the memory-mapped region pointer (4), the block start point within the memory-mapped region (5), the compressed data block length (6) and the compressed data block number (7). To save space and the time of copying data repeatedly, pointers are used directly to refer to the memory-mapped region containing the compressed block and to the block's starting point within that region.
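As a sketch (field names illustrative), this decompression-side buffer corresponds to a structure such as:

```cpp
#include <cstddef>

struct MappedCompressedBlock {
    char*  map_base = nullptr;   // memory-mapped region pointer (field 4)
    size_t offset = 0;           // block start point within the mapped region (field 5)
    size_t length = 0;           // compressed data block length (field 6)
    long   block_no = -1;        // compressed data block number (field 7)
};
```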

Fig. 5 shows the relationship between the circular double memory mapping and the compressed data block buffer in the parallel decompression method. Taking block 1 as an example, the figure shows the relationship between each field of the buffer and the two memory-mapped regions, memory mapping (1) and memory mapping (2). It can be seen that the decompression worker threads use the circular double memory-mapped regions directly in the compressed-data circular double-buffer queue, which reduces the number of data copies. Note that the start and end points of each mapping will often not coincide with the start and end of a block, because the starting point must be aligned to a memory-page boundary according to the page size, and the size of the mapped region is a fixed value except for the last mapping, whose size is limited by the end of the file. Some blocks are only partially contained in one mapping; such a block must be mapped again in full in the other mapping.

The decompression worker thread of the parallel FASTQ decompression method is explained in detail below; its steps are as follows:

(1) Preparatory work of the decompression worker thread, including obtaining the information produced when the process processed the compressed file header and tail at start-up, namely the setting for the number of records per block, the number of compressed blocks in the compressed file and the position index of each block within the compressed file, as well as creating and initializing some objects used in this thread.

(2) Take the queue head from the compressed-data block queue of the compressed-data circular double-buffer queue.

(3) If the obtained queue head is empty, go to (2); otherwise go to (4).

(4) Read the four fields of the buffer structure at this queue head, namely the block start point within the memory-mapped region, the memory-mapped region pointer, the compressed data block length and the compressed data block number; decompress the compressed bit stream of this block and store the decompressed data in a dedicated block-record structure object inside the thread.

(5) Release this buffer to the empty-block buffer queue of this circular double-buffer queue.

(6) Take the queue head from the empty-block buffer queue of the decompressed-data circular double-buffer queue.

(7) If the obtained queue head is empty, go to (6); otherwise go to (8).

(8) Store the decompressed block data cached in the thread's block-record structure object into the empty block buffer at this queue head, and record the length of the decompressed data block and the block number.

(9) Release this buffer to the tail of the decompressed-data block queue of this circular double-buffer queue.

(10) Set the decompression-finished flag of this block in the read/write-order two-dimensional array.

(11) Check the read/write-order two-dimensional array; if all blocks have been decompressed, go to (13); otherwise go to (2).

(12) If the decompressed-data writing thread has finished, go to (13); otherwise go to (2).

(13) Set the finished flag of this decompression worker thread.

(14) This decompression worker thread ends.

Note that in step (4) above, the data of each compressed block buffer can be decompressed, according to the compression algorithm used, with a block-indexed algorithm such as the DSRC algorithm or the G_SQZ algorithm.

The decompressed data block buffer of the decompressed-data circular double-buffer queue used in the above steps is implemented as a structure with three fields: a decompressed data block pointer, the decompressed data block length and the decompressed data block number, where the decompressed data block pointer points to a buffer.

Fig. 6 shows the coordination of the circular double memory mapping, memory mapping 1 (1) and memory mapping 2 (2), between the reading thread and the decompression worker threads in the parallel decompression method; the time axis is (3) and the block layout is the same as in Fig. 5. As the figure shows, before memory mapping 1 (1) can be re-created for the second time, all decompression worker threads must have finished processing every block of the original memory mapping 1 (1), i.e. blocks 0 to i; only then can the previous mapping be released and the data of blocks j+1 to k be mapped. The same applies when memory mapping 2 (2) is re-created for the second time to map new compressed blocks: all decompression threads must first have finished processing blocks i+1 to j+1.

The decompressed-data writing thread of the parallel FASTQ decompression method is explained in detail below; its steps are as follows (a sketch of the output mapping and remapping threshold follows the list):

(1)解压缩数据写入线程准备工作,包含解压缩文件名的确定。(1) Decompression data writing thread preparation work, including determination of decompression file name.

(2)获取当前运行机器的文件系统的内存分页大小。(2) Obtain the memory paging size of the file system of the currently running machine.

(3)根据待解压的块数,设置内存映射文件的大小qwFileSize。根据内存分页大小,设置每次内存映射区域的大小,以及重新进行内存映射的阈值。(3) Set the size of the memory-mapped file qwFileSize according to the number of blocks to be decompressed. According to the size of memory paging, set the size of each memory mapping area and the threshold for re-mapping memory.

(4)建立解压缩文件,得到文件描述符fd,并设置文件所占用空间为qwFileSize大小。(4) Create the decompressed file, get the file descriptor fd, and set the space occupied by the file to the size of qwFileSize.

(5)计算本次内存映射大小,对fd进行内存映射,得到内存映射的内存空间地址lpBuf。(5) Calculate the size of this memory mapping, perform memory mapping on fd, and obtain the memory space address lpBuf of the memory mapping.

(6)设置块号block_no=0。(6) Set block number block_no=0.

(7)查找读写顺序二维数组中的block_no号块解压缩处理标志。(7) Look up the decompression-complete flag of block block_no in the read-write order two-dimensional array.

(8)若block_no块解压缩完毕,转(9),否则转(7)。(8) If block_no is decompressed, go to (9), otherwise go to (7).

(9)从解压缩数据循环双缓冲队列的解压缩数据块队列中取得队列头。(9) Obtain the queue head from the decompressed data block queue of the decompressed data circular double buffer queue.

(10)若队列头为空,转(9),否则转(11)。(10) If the queue head is empty, go to (9), otherwise go to (11).

(11)将此队列头的解压缩数据块写入内存映射区域中，内存映射区域数据偏移以及文件数据的偏移均根据写入数据的大小相应地增加。(11) Write the decompressed data block at this queue head into the memory-mapped area; both the data offset within the memory-mapped area and the file data offset are advanced by the size of the data written.

(12)释放此缓冲区到此循环双缓冲队列中的空块缓冲区队列尾部。(12) Release this buffer to the end of the empty block buffer queue in this circular double buffer queue.

(13)若当前内存映射区域写入的数据达到阈值，释放此内存映射。并根据当前文件数据的偏移和文件大小计算内存映射的起点、映射大小、内存映射区域数据偏移以及新的阈值，重新进行内存映射。(13) If the data written into the current memory-mapped area reaches the threshold, release this mapping; then, from the current file data offset and the file size, compute the new mapping start, the mapping size, the data offset within the mapped area and the new threshold, and perform the memory mapping again.

(14)block_no加1。(14) block_no plus 1.

(15)若所有块均已经写入内存映射区域,转(16),否则转(7)。(15) If all blocks have been written into the memory mapping area, go to (16), otherwise go to (7).

(16)释放内存映射将解压缩数据写入最终的解压缩文件中，关闭文件描述符。(16) Release the memory map, so that the decompressed data is written to the final decompressed file, and close the file descriptor.

(17)设置解压缩数据写入线程结束标志。(17) Set the decompressed data write thread end flag.

(18)解压缩数据写入线程结束。(18) The decompressed data write thread ends.
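The consolidated sketch announced above combines steps (2)-(16) on a POSIX system: the decompressed blocks are written in ascending block-number order through a sliding memory-mapped window of the output file. take_block() is a hypothetical stand-in for steps (7)-(12) (the flag-array lookup plus queue handling), the 64 MiB window is an arbitrary example, error handling is omitted, and each block is assumed to fit in one window; this is not the patented implementation itself.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

void write_thread(const char* out_path, size_t total_blocks, uint64_t qwFileSize,
                  const std::function<std::vector<uint8_t>(size_t)>& take_block) {
    // (2) memory page size of the running machine
    const uint64_t page = static_cast<uint64_t>(sysconf(_SC_PAGESIZE));
    // (3) window size, rounded down to a multiple of the page size
    const uint64_t window = (64ull << 20) / page * page;

    // (4) create the output file and reserve qwFileSize bytes
    int fd = open(out_path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, static_cast<off_t>(qwFileSize));

    uint64_t file_off  = 0;                              // next byte to write, file-absolute
    uint64_t map_start = 0;                              // file offset of the current window
    uint64_t map_len   = std::min(window, qwFileSize);
    // (5) map the first window
    uint8_t* lpBuf = static_cast<uint8_t*>(
        mmap(nullptr, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    // (6)-(15) write the blocks strictly in ascending block-number order
    for (size_t block_no = 0; block_no < total_blocks; ++block_no) {
        std::vector<uint8_t> data = take_block(block_no);          // waits until the block is ready

        // (13) remap once this block would run past the current window
        if (file_off + data.size() > map_start + map_len) {
            munmap(lpBuf, map_len);
            map_start = file_off / page * page;                    // page-aligned new start
            map_len   = std::min(window, qwFileSize - map_start);
            lpBuf = static_cast<uint8_t*>(
                mmap(nullptr, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
                     static_cast<off_t>(map_start)));
        }
        // (11) copy the decompressed block into the mapped window and advance the offsets
        std::memcpy(lpBuf + (file_off - map_start), data.data(), data.size());
        file_off += data.size();
    }
    // (16) release the mapping (flushing the data to the file) and close the descriptor
    munmap(lpBuf, map_len);
    ftruncate(fd, static_cast<off_t>(file_off));   // sketch-only: trim to the bytes actually written
    close(fd);
}
```

Writing through a remapped window keeps memory use bounded by the window size while still producing the output strictly in block order.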

Claims (6)

1.一种DNA读序数据FASTQ文件并行压缩方法,其特征在于包括并行压缩进程任务分割部分和压缩进程处理部分,具体如下:1. a DNA reading sequence data FASTQ file parallel compression method is characterized in that comprising parallel compression process task segmentation part and compression process processing part, specifically as follows: (一)并行压缩进程任务分割部分(1) Parallel compression process task division part 根据FASTQ文件大小、并行压缩进程数目、FASTQ文件中每个读序片段——每个记录的数据特点,确定每个压缩进程待处理数据的起始和结束位置;每个进程将待压缩的原始数据近似均匀地分配到各个进程上,以实现数据并行,这样每个进程在处理时相互之间没有通信时间的消耗,提升了数据并行的处理效率;每个进程得到单独的压缩文件,压缩数据的顺序与进程号一致;According to the size of the FASTQ file, the number of parallel compression processes, each reading segment in the FASTQ file - the data characteristics of each record, determine the start and end positions of the data to be processed in each compression process; each process will compress the original The data is approximately evenly distributed to each process to achieve data parallelism, so that each process does not consume communication time with each other during processing, which improves the processing efficiency of data parallelism; each process obtains a separate compressed file, compressed data The sequence is consistent with the process number; (二)压缩进程处理部分负责进程内多线程流水并行压缩(2) The compression process processing part is responsible for multi-thread pipeline parallel compression in the process 每个压缩进程处理部分包含一个原始数据读取线程、一个压缩数据写入线程和多个压缩工作线程;工作线程的具体数目可以根据硬件CPU的核数以及进程设置来设定;Each compression process processing part includes a raw data reading thread, a compressed data writing thread and multiple compression working threads; the specific number of working threads can be set according to the number of hardware CPU cores and process settings; 每个进程所处理的待压缩数据被原始数据读取线程分成多个块,每个块包含特定的固定数目的记录,最末端块少于所述固定数目;The data to be compressed processed by each process is divided into multiple blocks by the original data reading thread, each block contains a specific fixed number of records, and the last block is less than the fixed number; 每个工作线程均有两个循环双缓冲队列,一个是原始数据循环双缓冲队列,另一个是压缩数据循环双缓冲队列;每个原始数据循环双缓冲队列包含两个队列:一个是空块缓冲区队列,一个是原始数据块队列;每个压缩数据循环双缓冲队列也包含两个队列:一个是空块缓冲区队列,另一个是压缩数据块队列;Each worker thread has two circular double buffer queues, one is the original data circular double buffer queue, and the other is the compressed data circular double buffer queue; each original data circular double buffer queue contains two queues: one is the empty block buffer Area queue, one is the original data block queue; each compressed data circular double buffer queue also contains two queues: one is the empty block buffer queue, and the other is the compressed data block queue; 在每个进程内,进行以原始数据块为单位的数据的并行压缩流水线处理,具体流水并行处理流程如下:In each process, the parallel compression pipeline processing of the data in the unit of the original data block is carried out. 
The specific pipeline parallel processing flow is as follows: (1)原始数据读取线程不断地根据记录数据特点解析读取原始数据块,循环依次查找每个压缩工作线程的原始数据循环双缓冲队列中的空块缓冲区,找到后将原始数据块放入,然后释放此块缓冲区到此循环双缓冲队列中的原始数据块队列的末端;(1) The original data reading thread continuously parses and reads the original data block according to the characteristics of the recorded data, loops to find the empty block buffer in the original data circular double buffer queue of each compression worker thread in turn, and puts the original data block in the input, and then release this block buffer to the end of the original data block queue in this circular double buffer queue; 原始数据读取线程采用了内存映射结合数据分块技术;The original data reading thread uses memory mapping combined with data block technology; (2)每个压缩工作线程不断地从本线程的原始数据循环双缓冲队列中的原始数据块队列头获取原始数据块,然后进行压缩处理;(2) Each compression worker thread continuously obtains the original data block from the head of the original data block queue in the original data circular double buffer queue of this thread, and then performs compression processing; (3)每个压缩工作线程不断地将压缩后的块数据填充到获取的本线程的压缩数据循环双缓冲队列中的空块缓冲区中,并释放此缓冲区到此循环双缓冲队列的压缩数据块队列的尾部;(3) Each compression worker thread continuously fills the compressed block data into the empty block buffer in the obtained compressed data circular double buffer queue of this thread, and releases this buffer to the compression of this circular double buffer queue the tail of the data block queue; (4)压缩数据写入线程不断地按照块号从小到大的顺序依次查找已经压缩处理完毕的块数据所在的线程号,获取此线程内的压缩数据循环双缓冲队列中的压缩数据块队列头中的此块压缩数据,写入最终的压缩文件。(4) The compressed data writing thread continuously searches for the thread number of the block data that has been compressed and processed in sequence according to the block number from small to large, and obtains the compressed data block queue head in the compressed data circular double buffer queue in this thread This block in compressed data is written to the final compressed file. 2.根据权利要求1所述的DNA读序数据FASTQ文件并行压缩方法,其特征在于:所述原始数据循环双缓冲队列处理方式如下:2. DNA reading sequence data FASTQ file parallel compression method according to claim 1, is characterized in that: described original data circular double-buffer queue processing mode is as follows: (1)原始数据循环双缓冲队列初始化处理:将空块缓冲区队列实例化,具有特定数目的空块缓冲区,原始数据块队列为空;(1) Original data circular double buffer queue initialization processing: instantiate the empty block buffer queue, with a specific number of empty block buffers, and the original data block queue is empty; (2)原始数据读取线程读取一个原始数据块;(2) The raw data reading thread reads a raw data block; (3)在空块缓冲区队列头获取一个空块缓冲区;(3) Obtain an empty block buffer at the head of the empty block buffer queue; (4)用原始数据块填充获取的这个空块缓冲区;(4) Fill the obtained empty block buffer with the original data block; (5)将这个填充的原始数据块放入原始数据块队列的末端;(5) Put this padded raw data block at the end of the raw data block queue; (6)压缩工作线程在原始数据块队列头中获取一个原始数据块缓冲区中的块数据进行压缩处理;(6) The compression worker thread obtains the block data in an original data block buffer from the head of the original data block queue for compression processing; (7)将此原始数据块缓冲区清空,并放入空块缓冲区队列。(7) Clear the original data block buffer and put it into the empty block buffer queue. 3.根据权利要求1所述的DNA读序数据FASTQ文件并行压缩方法,其特征在于:所述原始数据读取线程采用的内存映射结合数据分块技术,用来提高大文件的读取速度,结合数据分块,其主要是根据内存页面大小以及映射的空间大小,计算每个块的数据在内存映射空间的位置,以及何时进行内存映射空间的释放和重新映射。3. 
DNA sequence data FASTQ file parallel compression method according to claim 1, is characterized in that: the memory mapping that described raw data reading thread adopts combines data block technology, is used for improving the reading speed of large file, Combined with data partitioning, it is mainly based on the size of the memory page and the size of the mapped space to calculate the position of the data of each block in the memory mapped space, and when to release and remap the memory mapped space. 4.一种DNA读序数据FASTQ文件并行解压缩方法,其特征在于包括下列部分:4. A parallel decompression method for DNA reading sequence data FASTQ files, characterized in that it comprises the following parts: (一)根据进程号确定并行解压缩进程需要处理的压缩文件(1) Determine the compressed files to be processed by the parallel decompression process according to the process number 待压缩的FASTQ文件根据设置的并行压缩进程的数目得到相应数目的压缩文件;在并行解压缩中,根据压缩文件的数目设置并行解压缩进程的数目,各个解压缩进程得到的解压缩文件的顺序由压缩文件的顺序决定;每个解压缩进程在处理时相互之间没有通信时间的消耗,提升了数据并行的处理效率;The FASTQ file to be compressed obtains the corresponding number of compressed files according to the number of parallel compression processes set; in parallel decompression, the number of parallel decompression processes is set according to the number of compressed files, and the sequence of decompressed files obtained by each decompression process It is determined by the order of compressed files; each decompression process does not consume communication time with each other during processing, which improves the processing efficiency of data parallelism; (二)读取压缩文件尾部,获取块设置、块索引和块数信息(2) Read the end of the compressed file to obtain block settings, block index and block number information 在每个进程初始从压缩文件的尾部获得块包含的记录数目的设置、每个块在压缩文件中的位置索引、块的数目信息;At the beginning of each process, the setting of the number of records contained in the block, the position index of each block in the compressed file, and the number of blocks are obtained from the end of the compressed file; (三)并行解压缩进程内进行多线程流水并行解压缩(3) Multi-thread pipeline parallel decompression in parallel decompression process 每个并行解压缩进程包含一个压缩数据读取线程、一个解压缩数据写入线程和多个解压缩工作线程;Each parallel decompression process includes a compressed data read thread, a decompressed data write thread and multiple decompression worker threads; 每个解压缩工作线程有两个循环双缓冲队列,一个是压缩数据循环双缓冲队列,一个是解压缩数据循环双缓冲队列;每个压缩数据循环双缓冲队列包含两个队列:一个是空块缓冲区队列,另一个是压缩数据块队列;每个解压缩数据循环双缓冲队列也包含两个队列:一个是空块缓冲区队列,另一个是解压缩数据块队列;Each decompression worker thread has two circular double buffer queues, one is the compressed data circular double buffer queue, and the other is the decompressed data circular double buffer queue; each compressed data circular double buffer queue contains two queues: one is an empty block The buffer queue, the other is the compressed data block queue; each decompressed data circular double buffer queue also contains two queues: one is the empty block buffer queue, and the other is the decompressed data block queue; 在每个进程内,进行以压缩块为单位的数据的并行解压缩流水线处理,具体并行流水处理流程如下:In each process, the parallel decompression pipeline processing of the data in units of compressed blocks is performed. 
The specific parallel pipeline processing flow is as follows: (1)压缩数据读取线程根据压缩文件尾部得到的压缩块的位置索引信息,按照块号从小到大的顺序,不断地读取已知压缩大小的压缩块,循环依次查找每个解压缩工作线程的压缩数据循环双缓冲队列头的空块缓冲区,找到后将压缩块数据放入,并释放此缓冲区到此循环双缓冲队列中的压缩数据块队列的末端;(1) The compressed data reading thread continuously reads the compressed blocks with known compression sizes according to the position index information of the compressed blocks obtained at the end of the compressed file, in order of block numbers from small to large, and searches for each decompression job in turn in a loop The compressed data of the thread is found in the empty block buffer at the head of the circular double-buffer queue. After finding it, put the compressed block data into it, and release this buffer to the end of the compressed data block queue in the circular double-buffer queue; 压缩数据读取线程采用循环双内存映射结合数据分块技术;The compressed data reading thread adopts circular dual memory mapping combined with data block technology; (2)每个解压缩工作线程不断地从本线程的压缩数据循环双缓冲队列中的压缩数据块队列头获取压缩数据块,然后进行解压缩处理;(2) Each decompression worker thread continuously obtains compressed data blocks from the head of the compressed data block queue in the compressed data circular double buffer queue of this thread, and then performs decompression processing; (3)每个解压缩工作线程不断地将解压缩后的块数据填充到获取的本线程的解压缩数据循环双缓冲队列中的空块缓冲区中,并释放此缓冲区到此循环双缓冲队列的解压缩数据块队列尾部;(3) Each decompression worker thread continuously fills the decompressed block data into the empty block buffer in the obtained thread's decompressed data circular double buffer queue, and releases this buffer to this circular double buffer The decompressed data block queue tail of the queue; (4)解压缩数据写入线程不断地按照块号从小到大的顺序依次查找已经压缩处理完毕的块数据所在的线程号,获取此线程内的解压缩数据循环双缓冲队列中的解压缩数据块队列头的此块解压缩数据,写入最终的压缩文件。(4) The decompressed data writing thread continuously searches for the thread number of the block data that has been compressed and processed in sequence according to the block number from small to large, and obtains the decompressed data in this thread and the decompressed data in the circular double buffer queue This block at the head of the block queue decompresses the data, writing the final compressed file. 5.根据权利要求4所述的DNA读序数据FASTQ文件并行解压缩方法,其特征在于:所述压缩数据循环双缓冲队列中的压缩数据块缓冲区使用一个结构体实现,包含四个字段——内存映射区域指针、块在内存映射区域起始点、压缩数据块长度和压缩数据块号;为节省空间以及数据多次拷贝的时间,直接使用指针指向压缩数据块所在的内存映射区域以及块在内存映射区域的起始点。5. the method for parallel decompression of DNA sequence data FASTQ files according to claim 4, characterized in that: the compressed data block buffer in the compressed data circular double buffer queue is implemented using a structure, comprising four fields— —Memory mapping area pointer, starting point of block in memory mapping area, length of compressed data block and number of compressed data block; in order to save space and time for multiple copying of data, directly use the pointer to point to the memory mapping area where the compressed data block is located and where the block is located The starting point of the memory mapped region. 6.根据权利要求4所述的DNA读序数据FASTQ文件并行解压缩方法,其特征在于:所述压缩数据读取线程采用的循环双内存映射技术结合数据分块技术,用来提高大数据文件的读取速度,具体实现如下:6. The method for parallel decompression of DNA sequence data FASTQ files according to claim 4, characterized in that: the cyclic double memory mapping technology adopted by the compressed data reading thread is combined with data block technology to improve the performance of large data files. 
The reading speed of , the specific implementation is as follows: 其中关键技术是循环双内存映射技术,使得解压缩工作线程读取压缩数据进行解压缩和压缩数据读取线程内存映射并行执行,即具有两个内存映射——内存映射1和内存映射2,按照压缩块的顺序依次循环放入这两个内存映射中;根据压缩文件尾部的压缩块数据位置索引信息,以及两个内存映射空间的大小,按照块号从小到大的顺序,依次计算每个压缩块数据所在的内存映射缓冲区以及在内存映射缓冲区内的位置信息;解压缩工作线程直接在压缩数据循环双缓冲队列中使用此循环双内存映射区域,以减少数据拷贝次数;对于正在使用的内存映射,需要等待此内存映射的前一次映射数据被所有的解压缩工作线程使用完毕后,才能重新进行内存映射。The key technology is the circular dual memory mapping technology, which enables the decompression worker thread to read the compressed data for decompression and the memory mapping of the compressed data reading thread to execute in parallel, that is, there are two memory maps——memory map 1 and memory map 2, according to The order of the compressed blocks is put into these two memory maps in turn; according to the compressed block data position index information at the end of the compressed file, and the size of the two memory map spaces, each compressed block is calculated sequentially in the order of block numbers from small to large. The memory-mapped buffer where the block data is located and the location information in the memory-mapped buffer; the decompression worker thread directly uses this circular double-memory-mapped area in the compressed data circular double-buffer queue to reduce the number of data copies; For memory mapping, it is necessary to wait for the previous mapping data of this memory mapping to be used by all decompression worker threads before performing memory mapping again.
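As an illustrative sketch of the bookkeeping described in claims 5 and 6, the fragment below assigns each compressed block, in ascending block-number order and starting from the block position index stored at the tail of the compressed file, to one of the two alternating memory-mapped windows, together with its offset inside that window. Claim 5 lists a pointer to the mapped region as the first field; since those pointers only exist once mmap() has been called, this sketch stores a 0/1 window id instead. All names and the fitting rule are assumptions of the sketch, not the patented implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct CompressedBlockRef {
    int      map_id;          // which of the two mapped windows (0 = memory map 1, 1 = memory map 2)
    uint64_t offset_in_map;   // block start inside that mapped window
    uint64_t length;          // compressed length of the block
    size_t   block_no;        // block number
};

// block_offsets holds the file offset of every compressed block (from the tail index),
// with one extra final entry marking the end of the compressed data area; map_size is
// the size of each of the two alternating windows.
std::vector<CompressedBlockRef> plan_blocks(const std::vector<uint64_t>& block_offsets,
                                            uint64_t map_size) {
    std::vector<CompressedBlockRef> plan;
    int      map_id   = 0;    // start in memory map 1
    uint64_t map_base = 0;    // file offset where the current window begins
    for (size_t i = 0; i + 1 < block_offsets.size(); ++i) {
        const uint64_t begin = block_offsets[i];
        const uint64_t len   = block_offsets[i + 1] - block_offsets[i];
        // A block that no longer fits switches to the other window, which the read
        // thread will remap starting at this block (after the workers drain it).
        if (begin + len > map_base + map_size) {
            map_id  ^= 1;
            map_base = begin;
        }
        plan.push_back({map_id, begin - map_base, len, i});
    }
    return plan;
}
```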
CN201310551802.7A 2013-11-07 2013-11-07 A kind of DNA reads ordinal number according to the compression of FASTQ file in parallel and decompression method Expired - Fee Related CN103559020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310551802.7A CN103559020B (en) 2013-11-07 2013-11-07 A kind of DNA reads ordinal number according to the compression of FASTQ file in parallel and decompression method

Publications (2)

Publication Number Publication Date
CN103559020A true CN103559020A (en) 2014-02-05
CN103559020B CN103559020B (en) 2016-07-06

Family

ID=50013277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310551802.7A Expired - Fee Related CN103559020B (en) 2013-11-07 2013-11-07 A kind of DNA reads ordinal number according to the compression of FASTQ file in parallel and decompression method

Country Status (1)

Country Link
CN (1) CN103559020B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8495662B2 (en) * 2008-08-11 2013-07-23 Hewlett-Packard Development Company, L.P. System and method for improving run-time performance of applications with multithreaded and single threaded routines
CN103077006A (en) * 2012-12-27 2013-05-01 浙江工业大学 Multithreading-based parallel executing method for long transaction
CN103049680A (en) * 2012-12-29 2013-04-17 深圳先进技术研究院 gene sequencing data reading method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEI ZHONG等: "Parallel protein secondary structure prediction schemes using Pthread and OpenMP over hyper-threading technology", 《THE JOURNAL OF SUPERCOMPUTING》 *
ZHAN Ke et al.: "Design and Analysis of Protein Quantification Software Based on MPI and CUDA", Computer Science (计算机科学) *

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103997514A (en) * 2014-04-23 2014-08-20 汉柏科技有限公司 File parallel transmission method and system
CN103984528A (en) * 2014-05-15 2014-08-13 中国人民解放军国防科学技术大学 Multithread concurrent data compression method based on FT processor platform
CN103995988A (en) * 2014-05-30 2014-08-20 周家锐 High-throughput DNA sequencing mass fraction lossless compression system and method
WO2015180203A1 (en) * 2014-05-30 2015-12-03 周家锐 High-throughput dna sequencing quality score lossless compression system and compression method
CN105760706A (en) * 2014-12-15 2016-07-13 深圳华大基因研究院 Compression method for next generation sequencing data
CN105760706B (en) * 2014-12-15 2018-05-29 深圳华大基因研究院 A kind of compression method of two generations sequencing data
CN106100641A (en) * 2016-06-12 2016-11-09 深圳大学 Multithreading quick storage lossless compression method and system thereof for FASTQ data
WO2017214765A1 (en) * 2016-06-12 2017-12-21 深圳大学 Multi-thread fast storage lossless compression method and system for fastq data
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
WO2018039983A1 (en) * 2016-08-31 2018-03-08 华为技术有限公司 Biological sequence data processing method and device
US11360940B2 (en) 2016-08-31 2022-06-14 Huawei Technologies Co., Ltd. Method and apparatus for biological sequence processing fastq files comprising lossless compression and decompression
CN108629157A (en) * 2017-03-22 2018-10-09 深圳华大基因科技服务有限公司 One kind being used for nucleic acid sequencing data compression and encrypted method
CN108629157B (en) * 2017-03-22 2021-08-31 深圳华大基因科技服务有限公司 Method for compressing and encrypting nucleic acid sequencing data
CN107169313A (en) * 2017-03-29 2017-09-15 中国科学院深圳先进技术研究院 The read method and computer-readable recording medium of DNA data files
CN107565975A (en) * 2017-08-30 2018-01-09 武汉古奥基因科技有限公司 The method of FASTQ formatted file Lossless Compressions
CN108363719A (en) * 2018-01-02 2018-08-03 中科边缘智慧信息科技(苏州)有限公司 The transparent compressing method that can configure in distributed file system
CN108363719B (en) * 2018-01-02 2022-10-21 中科边缘智慧信息科技(苏州)有限公司 Configurable transparent compression method in distributed file system
CN110572422A (en) * 2018-06-06 2019-12-13 北京京东尚科信息技术有限公司 Data downloading method and device
CN109062502A (en) * 2018-07-10 2018-12-21 郑州云海信息技术有限公司 A kind of data compression method, device, equipment and computer readable storage medium
CN109547355A (en) * 2018-10-17 2019-03-29 中国电子科技集团公司第四十研究所 A kind of storing and resolving device and method based on ten thousand mbit ethernet mouth receivers
CN109490895A (en) * 2018-10-25 2019-03-19 中国人民解放军海军工程大学 A kind of interference synthetic aperture signal processing system based on blade server
CN109490895B (en) * 2018-10-25 2020-12-29 中国人民解放军海军工程大学 An Interferometric Synthetic Aperture Sonar Signal Processing System Based on Blade Server
CN111294057A (en) * 2018-12-07 2020-06-16 上海寒武纪信息科技有限公司 Data compression method, encoding circuit and arithmetic device
US10778246B2 (en) 2019-01-30 2020-09-15 International Business Machines Corporation Managing compression and storage of genomic data
US10554220B1 (en) 2019-01-30 2020-02-04 International Business Machines Corporation Managing compression and storage of genomic data
CN110247666B (en) * 2019-05-22 2023-08-18 深圳大学 A system and method for hardware parallel compression
CN110247666A (en) * 2019-05-22 2019-09-17 深圳大学 A kind of system and method for hardware concurrent compression
CN110349635B (en) * 2019-06-11 2021-06-11 华南理工大学 Parallel compression method for gene sequencing data quality fraction
CN110299187A (en) * 2019-07-04 2019-10-01 南京邮电大学 A kind of parallelization gene data compression method based on Hadoop
CN111061434A (en) * 2019-12-17 2020-04-24 人和未来生物科技(长沙)有限公司 Gene compression multi-stream data parallel writing and reading method, system and medium
CN111061434B (en) * 2019-12-17 2021-10-01 人和未来生物科技(长沙)有限公司 Gene compression multi-stream data parallel writing and reading method, system and medium
CN111326216B (en) * 2020-02-27 2023-07-21 中国科学院计算技术研究所 A fast partitioning method for big data gene sequencing files
CN111370070B (en) * 2020-02-27 2023-10-27 中国科学院计算技术研究所 Compression processing method for big data gene sequencing file
CN111326216A (en) * 2020-02-27 2020-06-23 中国科学院计算技术研究所 A fast division method for big data gene sequencing files
CN111370070A (en) * 2020-02-27 2020-07-03 中国科学院计算技术研究所 Compression processing method for big data gene sequencing file
CN111767256B (en) * 2020-05-22 2023-10-20 北京和瑞精湛医学检验实验室有限公司 Method for separating sample read data from fastq file
CN111767256A (en) * 2020-05-22 2020-10-13 北京和瑞精准医学检验实验室有限公司 Method for separating sample read data from fastq file
CN111767255A (en) * 2020-05-22 2020-10-13 北京和瑞精准医学检验实验室有限公司 Optimization method for separating sample read data from fastq file
CN111767255B (en) * 2020-05-22 2023-10-13 北京和瑞精湛医学检验实验室有限公司 Optimization method for separating sample read data from fastq file
US12093803B2 (en) 2020-07-01 2024-09-17 International Business Machines Corporation Downsampling genomic sequence data
CN113568736A (en) * 2021-06-24 2021-10-29 阿里巴巴新加坡控股有限公司 Data processing method and device
CN113590051B (en) * 2021-09-29 2022-03-18 阿里云计算有限公司 Data storage and reading method and device, electronic equipment and medium
CN113590051A (en) * 2021-09-29 2021-11-02 阿里云计算有限公司 Data storage and reading method and device, electronic equipment and medium
CN113672876A (en) * 2021-10-21 2021-11-19 南京拓界信息技术有限公司 OTG-based method and device for quickly obtaining evidence of mobile phone
CN114489518A (en) * 2022-03-28 2022-05-13 山东大学 Sequencing data quality control method and system
CN119025523A (en) * 2024-08-14 2024-11-26 北京千辉数据科技有限公司 A data block and parallel reading storage and query method thereof

Also Published As

Publication number Publication date
CN103559020B (en) 2016-07-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160706

Termination date: 20161107
