
CN103559020A - Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data - Google Patents


Info

Publication number
CN103559020A
CN103559020A (application CN201310551802.7A)
Authority
CN
China
Prior art keywords
data
queue
thread
compression
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310551802.7A
Other languages
Chinese (zh)
Other versions
CN103559020B (en)
Inventor
郑晶晶
王婷
张常有
詹科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Guangzhou Institute of Software Application Technology Guangzhou GZIS
Original Assignee
Institute of Software of CAS
Guangzhou Institute of Software Application Technology Guangzhou GZIS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS, Guangzhou Institute of Software Application Technology Guangzhou GZIS filed Critical Institute of Software of CAS
Priority to CN201310551802.7A priority Critical patent/CN103559020B/en
Publication of CN103559020A publication Critical patent/CN103559020A/en
Application granted granted Critical
Publication of CN103559020B publication Critical patent/CN103559020B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a method for the parallel compression and parallel decompression of FASTQ files containing DNA (deoxyribonucleic acid) sequence read data. Targeting the compression and decompression of such files, the method uses circular double-buffer queues, circular double memory mapping and memory mapping, combined with data partitioning, multi-thread pipelined parallel compression and decompression, and a read/write-order two-dimensional array, to achieve parallel compression and decompression both across the processes handling a FASTQ file and among the threads within each process. The method can be implemented on MPI and OpenMP, or on MPI and Pthread (POSIX threads). By fully using all compute nodes and the computing power of the multi-core CPUs within each node, it removes the processor and memory constraints that limit serial compression and decompression programs.

Description

Method for parallel compression and decompression of FASTQ files of DNA sequence read data
Technical field
The present invention relates to bioinformatics, data compression, and high-performance computing, and in particular to a method for the parallel compression and parallel decompression of FASTQ files of DNA sequence read data.
Background art
One of the main tasks of bioinformatics is to collect and analyze large amounts of genetic data. These data are essential to gene research: they help identify the gene combinations that prevent or cause disease and support the design of targeted therapies. High-throughput sequencing methods and instruments produce massive quantities of short sequence reads. The common way to store, manage, and transmit DNA sequence reads is the FASTQ format, which holds the read data together with annotations for each DNA base, for example the Quality Score values that express the uncertainty of the base-calling process. A FASTQ file also contains read identifiers and other descriptions such as the instrument name. Compared with other DNA storage formats (for example FASTA), FASTQ can store more information, but file sizes and storage requirements grow sharply as a result. Research on algorithms for effective lossless compression and decompression of base read data and its annotations is currently an active topic.
For the compression of FASTQ file data, notable progress has been made by the G_SQZ algorithm of the Translational Genomics Research Institute (TGen) in the United States (Tembe W. et al. G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 2010, 26:2192-2194) and by the DSRC algorithm of Deorowicz et al. (Deorowicz S, Grabowski S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 2011, 27:860-862). Both algorithms use an index so that regularly spaced intervals (blocks for short) can be accessed without decoding from the start of the file. G_SQZ mainly applies Huffman coding to <base, Quality Score> pairs, while DSRC applies Huffman coding separately to the base lines and the Quality Score lines, supplemented by finer-grained processing such as run-length encoding. The advantages of this class of methods are that partial data can be decoded at random while the relative order of the data is preserved, and that lossless compression efficiency is high. They represent a class of FASTQ compression methods referred to below, for convenience, as block-index serial algorithms.
The data involved in genome sequence analysis has reached the terabyte scale, and large sequencing centers are planning or installing petabyte-scale storage. For these massive data sets, reducing storage space and transmission time and supporting real-time analysis of large volumes of genome sequence data requires real-time compression and decompression, which in turn requires the computing power of high-performance computing platforms. With the rapid development of such platforms, fully using the multi-core CPUs on every compute node to compress and decompress massive FASTQ files in real time can remove the processor and memory limits suffered by serial compression and decompression programs.
The G-SQZ and DSRC algorithms above are both serial; to date, no research articles or patents on multi-node, multi-core parallel versions of this class of algorithm have been published.
Summary of the invention
Given the absence of published research or patents on parallel versions of block-index serial algorithms such as G-SQZ and DSRC, the object of the present invention is to provide a parallel compression and decompression method corresponding to this class of block-index serial FASTQ algorithms. The method exploits many compute nodes and multi-core CPUs, can be implemented on MPI+OpenMP or on MPI+Pthread, makes full use of the computing power of high-performance computing platforms, significantly accelerates real-time analysis of massive genome sequences, and provides an important technical foundation for broader applications of genetic data.
The technical scheme of the present invention is described below.
A parallel compression method for a FASTQ file of DNA sequence read data comprises the following steps:
1. Task division among parallel compression processes
The start and end positions of each process's data are determined from the FASTQ file size, the number of parallel compression processes, and the structure of each read fragment in the FASTQ file (the base information plus its corresponding annotations; for convenience of description, called a record below). Every process runs the task-division module, so the raw data to be compressed is distributed approximately evenly across the processes, achieving data parallelism. The processes therefore work without communicating with one another, which improves data-parallel efficiency. Each process produces an independent compressed file, and the order of the compressed data follows the process rank. A minimal sketch of this division appears below.
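A minimal sketch of this communication-free division, assuming an MPI implementation (this is an illustration, not the patent's code; the fixed file size is a stand-in, and the adjustment of each cut point to the next record boundary is elided):

```cpp
// Each MPI rank derives its own byte range of the FASTQ file from the file
// size and the process count alone, so the ranks never need to communicate.
#include <mpi.h>
#include <cstdint>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int64_t file_size = 1000000000;  // assumed input size in bytes
    const int64_t chunk = file_size / nprocs;
    int64_t begin = rank * chunk;          // then advanced to a record start
    int64_t end = (rank == nprocs - 1) ? file_size : (rank + 1) * chunk;

    // Each rank compresses [begin, end) independently; the rank order fixes
    // the order of the per-process compressed output files.
    std::printf("rank %d compresses bytes [%lld, %lld)\n", rank,
                (long long)begin, (long long)end);
    MPI_Finalize();
    return 0;
}
```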
2. Multi-thread pipelined parallel compression within each process
The processing module of each process comprises one raw-data read thread, one compressed-data write thread, and a plurality of compression worker threads; the number of worker threads can be configured according to the CPU core count and the process settings.
The raw-data read thread divides the data handled by each process into blocks, each containing a fixed number of records (the last block may contain fewer).
Every worker thread owns two circular double-buffer queues: a raw-data circular double-buffer queue and a compressed-data circular double-buffer queue. The two have similar structures; only the buffer layout differs slightly with the data stored, as detailed in the embodiments below. Each raw-data circular double-buffer queue consists of two queues, an empty-buffer queue and a raw-data block queue; each compressed-data circular double-buffer queue likewise consists of an empty-buffer queue and a compressed-data block queue. Both circular double-buffer queues are operated in the same way.
Taking the raw-data circular double-buffer queue as an example, its operation is as follows (a code sketch follows the list):
(1) Initialization: the empty-buffer queue is instantiated with a given number of empty block buffers; the raw-data block queue starts empty.
(2) The raw-data read thread reads one raw data block.
(3) An empty block buffer is obtained from the head of the empty-buffer queue.
(4) The raw data block is placed into the empty block buffer obtained.
(5) The filled block is appended to the tail of the raw-data block queue.
(6) A compression worker thread takes a block buffer from the head of the raw-data block queue and compresses its data.
(7) The raw-data block buffer is cleared and returned to the empty-buffer queue.
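A minimal C++ sketch of one such circular double-buffer queue, under stated assumptions: Block is a simplified placeholder, and blocking on a condition variable stands in for the polling of the stepwise description above. The read thread uses acquire_empty/push_filled (steps 2-5); a worker uses pop_filled/release_empty (steps 6-7).

```cpp
// One circular double-buffer queue: a pool of empty block buffers plus a
// queue of filled blocks, both protected by one mutex.
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <memory>
#include <mutex>
#include <vector>

struct Block {
    uint64_t block_no = 0;
    std::vector<uint8_t> data;  // simplified block payload
};

class DoubleBufferQueue {
public:
    explicit DoubleBufferQueue(size_t n_buffers) {
        for (size_t i = 0; i < n_buffers; ++i)  // step (1): pre-allocate
            empty_.push_back(std::make_unique<Block>());
    }
    std::unique_ptr<Block> acquire_empty() { return pop(empty_); }  // (3)
    void push_filled(std::unique_ptr<Block> b) {                    // (5)
        push(filled_, std::move(b));
    }
    std::unique_ptr<Block> pop_filled() { return pop(filled_); }    // (6)
    void release_empty(std::unique_ptr<Block> b) {                  // (7)
        b->data.clear();
        push(empty_, std::move(b));
    }

private:
    using Q = std::deque<std::unique_ptr<Block>>;
    void push(Q &q, std::unique_ptr<Block> b) {
        std::lock_guard<std::mutex> lk(mu_);
        q.push_back(std::move(b));
        cv_.notify_all();
    }
    std::unique_ptr<Block> pop(Q &q) {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&q] { return !q.empty(); });
        std::unique_ptr<Block> b = std::move(q.front());
        q.pop_front();
        return b;
    }
    Q empty_, filled_;
    std::mutex mu_;
    std::condition_variable cv_;
};
```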
Within each process, compression is pipelined in parallel in units of raw data blocks. The concrete pipeline is as follows:
(1) The raw-data read thread continually parses and reads raw data blocks according to the record structure, cycles through the compression worker threads looking for an empty block buffer in each thread's raw-data circular double-buffer queue, places the raw data block into the buffer once one is found, and releases the buffer to the tail of that queue's raw-data block queue.
(2) Each compression worker thread continually takes raw data blocks from the head of its own raw-data block queue and compresses them.
(3) Each compression worker thread continually places each compressed block into an empty buffer obtained from its compressed-data circular double-buffer queue and releases the buffer to the tail of that queue's compressed-data block queue.
(4) The compressed-data write thread continually looks up, in increasing block-number order, the worker thread that finished compressing each block, takes that block's compressed data from the head of the worker's compressed-data block queue, and writes it to the final compressed file.
The specific algorithm and termination condition of each thread are given in the embodiments below.
In the raw-data read thread, memory mapping is combined with FASTQ partitioning to improve the reading speed of large data files. Using the block structure of the read fragments, the page size, and the size of the mapped window, the thread computes each block's position within the mapped space and decides when to release the mapping and remap. One clear benefit of memory mapping is that the process can read and write the mapped memory directly, with essentially no extra data copying: file I/O through fread/fwrite copies data four times between kernel space and user space, whereas memory mapping copies only twice, once from the input file into the mapped region and once from the mapped region to the output file. In effect, the process operates on an ordinary file as if it were accessing memory. The embodiments describe this technique in detail; a sketch follows.
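A sketch of the page-aligned mapping this paragraph describes, assuming POSIX mmap; MappedWindow and map_window are illustrative names, not the patent's:

```cpp
// Map a window of the input file page-aligned, so records can be parsed
// directly out of the mapping with no read()/fread() copy in between.
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

struct MappedWindow {
    char *base = nullptr;  // address returned by mmap (or nullptr on failure)
    size_t len = 0;        // mapped length
    off_t file_off = 0;    // page-aligned file offset the window starts at
};

MappedWindow map_window(int fd, off_t desired_off, size_t desired_len) {
    long page = sysconf(_SC_PAGESIZE);
    off_t aligned = desired_off - (desired_off % page);  // align to page
    size_t len = desired_len + (size_t)(desired_off - aligned);
    void *p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, aligned);
    if (p == MAP_FAILED) return MappedWindow{};
    return MappedWindow{(char *)p, len, aligned};
}

void unmap_window(MappedWindow &w) {
    if (w.base) munmap(w.base, w.len);  // release before remapping
    w = MappedWindow{};
}
```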
In the compressed-data write thread, compressed blocks must be written to the final compressed file in the same order in which the raw-data read thread read the raw data blocks. A read/write-order two-dimensional array is used for this: its first dimension indexes the block number, and its second dimension (of size 2) records the worker thread each block was assigned to and a flag marking completion of its compression.
Each process's compressed file begins with a file header holding configuration information such as the number of records per block, followed by the compressed data of each block in original block order. The file ends with a footer containing the location index of each block's compressed data within the file, the number of blocks, and the position of the footer within the whole file. This information supports parallel decoding and random access: a specific block can be decoded on its own, without decompressing the whole file. An illustrative rendering of this layout follows.
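An illustrative rendering of that layout; the field names and widths are assumptions chosen for clarity, not the patent's on-disk specification:

```cpp
// Per-process compressed file: header, compressed blocks in order, footer.
#include <cstdint>
#include <vector>

struct FileHeader {
    uint32_t records_per_block;  // configuration written at the file head
};

struct FileFooter {  // written after the last compressed block
    std::vector<uint64_t> block_offsets;  // each block's byte offset
    uint64_t block_count;                 // total number of blocks
    uint64_t footer_offset;               // where the footer itself starts
};

// A decompressor reads footer_offset and block_count from the file tail,
// loads block_offsets, and can then decode any single block at random.
```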
A parallel decompression method for a FASTQ file of DNA sequence read data comprises the following steps:
1. Determining each process's compressed file from the process rank
Compressing a FASTQ file yields as many compressed files as there were parallel compression processes. For decompression, the number of parallel decompression processes is set according to the number of compressed files, and the order of each process's decompressed output is determined by the order of the compressed files. The decompression processes do not communicate with one another while working, which improves data-parallel efficiency.
2. Reading the compressed file footer to obtain the block settings, block index, and block count
Unlike the parallel compression method, the parallel decompression method begins by reading, from the footer of each process's compressed file, the records-per-block setting, the location index of every block, and the number of blocks.
3. Multi-thread pipelined parallel decompression within each process
Analogously to the parallel compression method, the processing module of each decompression process comprises one compressed-data read thread, one decompressed-data write thread, and a plurality of decompression worker threads; the number of worker threads can be configured according to the CPU core count and the process settings.
Each decompression worker thread owns two circular double-buffer queues: a compressed-data circular double-buffer queue and a decompressed-data circular double-buffer queue. The two have similar structures; only the buffer layout differs slightly with the data stored, as detailed in the embodiments below. Each compressed-data circular double-buffer queue consists of an empty-buffer queue and a compressed-data block queue; each decompressed-data circular double-buffer queue consists of an empty-buffer queue and a decompressed-data block queue. Both are operated exactly like the raw-data circular double-buffer queue of the parallel compression method described above, so the details are not repeated.
Within each process, decompression of the sequence read data is pipelined in units of compressed blocks. The pipeline works as follows:
(1) The compressed-data read thread continually reads compressed blocks of known size, in increasing block-number order, using the block location index obtained from the file footer; it cycles through the decompression worker threads looking for an empty buffer at the head of each thread's compressed-data circular double-buffer queue, places the block's data into the buffer once one is found, and releases the buffer to the tail of that queue's compressed-data block queue.
(2) Each decompression worker thread continually takes compressed blocks from the head of its own compressed-data block queue and decompresses them.
(3) Each decompression worker thread continually places each decompressed block into an empty buffer obtained from its decompressed-data circular double-buffer queue and releases the buffer to the tail of that queue's decompressed-data block queue.
(4) The decompressed-data write thread continually looks up, in increasing block-number order, the worker thread that finished each block, takes that block's decompressed data from the head of the worker's decompressed-data block queue, and writes it to the final decompressed file.
The specific algorithm and termination condition of each thread are given in the embodiments below.
The compressed-data read thread combines circular double memory mapping with block partitioning to improve the reading speed of large files. The key technique is the circular double memory mapping, which lets the decompression worker threads read and decompress compressed data in parallel with the read thread's mapping work. Two memory mappings are kept, memory mapping 1 and memory mapping 2, and compressed blocks are placed into them alternately, in block order. From the block location index at the end of the compressed file and the sizes of the two mapped windows, the mapped region and the position within it are computed for each compressed block in increasing block-number order. The decompression worker threads use the circularly double-mapped regions directly through the compressed-data circular double-buffer queues, reducing data copies. Before a mapping can be reused, all decompression worker threads must have finished with the data it currently holds. The embodiments describe this technique in detail; a sketch of the rotation rule follows.
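A sketch of the rotation rule under stated assumptions: atomic counters stand in for the patent's per-block flags in the read/write-order array, next_off must already be page-aligned, and error handling is elided.

```cpp
// Two windows alternate; a window is remapped only after every compressed
// block it holds has been consumed by the decompression worker threads.
#include <sys/mman.h>
#include <atomic>
#include <cstddef>
#include <cstdint>

struct Window {
    void *base = nullptr;
    size_t len = 0;
    std::atomic<uint64_t> blocks_issued{0};  // blocks handed to workers
    std::atomic<uint64_t> blocks_done{0};    // blocks workers finished
};

// Called by the compressed-data read thread before reusing window w.
void rotate(Window &w, int fd, off_t next_off, size_t next_len) {
    while (w.blocks_done.load() < w.blocks_issued.load()) {
        // spin until all blocks mapped in w have been decompressed
    }
    if (w.base) munmap(w.base, w.len);
    w.base = mmap(nullptr, next_len, PROT_READ, MAP_PRIVATE, fd, next_off);
    w.len = next_len;
    w.blocks_issued = 0;
    w.blocks_done = 0;
}
```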
In the decompressed-data write thread, blocks must be written to the final decompressed file in the same order in which the compressed-data read thread read the compressed blocks. As in the parallel compression method, the same kind of read/write-order two-dimensional array records the worker thread each block was assigned to and a flag marking completion of its decompression.
To improve I/O speed, the decompressed-data write thread also combines memory mapping with block partitioning: it creates a memory-mapped file whose size is derived from the number of blocks to decompress, places decompressed blocks into the mapped space in increasing block-number order, and remaps when needed according to the write position, the mapped window size, and a remapping threshold, adjusting the threshold as it goes. The embodiments give the details.
For the compression and decompression of FASTQ files of DNA sequence read data, the important recent progress has been block-index serial algorithms such as G_SQZ and DSRC, and no research articles or patents on parallel versions of this class of algorithm have yet been published. As genome analysis data reaches terabyte and even petabyte scale, real-time compression and decompression, backed by the computing power of high-performance computing platforms, becomes necessary for real-time analysis of large volumes of genome sequence data. Studying parallel compression and decompression for block-index serial algorithms such as G_SQZ and DSRC is therefore significant.
The present invention is the first to propose parallel compression and parallel decompression methods corresponding to block-index serial compression algorithms such as G_SQZ and DSRC. Using circular double-buffer queues, circular double memory mapping and memory mapping, combined with data partitioning, multi-thread pipelined parallel compression and decompression, and the read/write-order two-dimensional array, it achieves parallel compression and decompression across the processes of a FASTQ file and among the threads within each process.
The advantages of the present invention are:
(1) The invention fully uses the computing power of the multi-core CPUs in each compute node, removing the processor and memory limits suffered by serial compression and decompression programs. The implementation is flexible: it can be built on MPI and OpenMP, or on MPI and Pthread.
(2) Because the invention accommodates any within-block compression and decompression algorithm, the method is not limited to parallelizing G_SQZ and DSRC: any serial compression and decompression algorithm with the two features of blocking and indexing can be parallelized in this way.
(3) The invention fully exploits the computing power of high-performance computing platforms, markedly accelerating real-time analysis of massive genome sequences and laying an important technical foundation for broader applications of genetic data.
Brief description of the drawings
Fig. 1 shows the in-process multi-thread pipelined parallel compression of the parallel compression method of the present invention;
Fig. 2 shows the raw-data circular double-buffer queue of the present invention;
Fig. 3 shows the compressed-data block buffer of the parallel compression method of the present invention;
Fig. 4 shows the in-process multi-thread pipelined parallel decompression of the parallel decompression method of the present invention;
Fig. 5 shows the relation between the circular double memory mapping and the compressed-data block buffer in the parallel decompression method of the present invention;
Fig. 6 shows the cooperation between the read thread and the decompression worker threads under the circular double memory mapping in the parallel decompression method of the present invention;
In the figures: 1. memory mapping 1; 2. memory mapping 2; 3. time axis; 4. memory-mapped region pointer; 5. block start offset in the mapped region; 6. compressed block length; 7. compressed block number.
Detailed description of the embodiments
The present invention provides a parallel compression and decompression method for FASTQ files of DNA sequence read data. To make the object, technical scheme, and effects of the invention clearer and more definite, the invention is described in more detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The raw-data read thread of the FASTQ parallel compression method is explained in detail below; its concrete steps are as follows:
(1) Open the original FASTQ file of DNA sequence read data that is to be compressed.
(2) Obtain the page size of the file system of the current machine.
(3) Set the memory-mapped window size according to the page size.
(4) From the range of raw data assigned to the current process by the task-division module, set the mapping start point (which must be aligned to a page boundary according to the page size) and the mapping length, and perform the memory mapping.
(5) Record the position, within the mapping, of the process's first raw data block to be compressed.
(6) Cycle through the raw-data circular double-buffer queues of the compression worker threads, looking for an empty block buffer.
(7) If an empty block buffer exists, go to (8); otherwise go to (6).
(8) From the mapped region, read a set number of records in order, one record at a time, to form a raw data block; fill the empty block buffer obtained, increment the block number, and store the block's record count. If the end of the mapping is reached, go to (9).
(9) Release the buffer to the tail of the raw-data block queue of the circular double-buffer queue.
(10) Record the block's assigned thread number in the read/write-order two-dimensional array.
(11) If the end of the data assigned to this process has been reached, go to (15); otherwise go to (12).
(12) From the current read position in the mapping, compute the length of the raw data block just read, and set the start position of the next block within the mapped region.
(13) If the distance from the current read position to the end of the mapping is less than 1.5 times the length of the block just read, and the end of the file has not been reached, go to (14); otherwise go to (8).
(14) Release the previous memory mapping; from the file position of the next block to read, compute the start point and size of the next mapping, and remap. Go to (8).
(15) Release the memory mapping and set the read-thread end flag.
(16) The raw-data read thread ends.
The raw-data block buffer of the raw-data circular double-buffer queue used in the above steps is shown in Fig. 3. The buffer is implemented as a structure with three fields: a block-data pointer, a block number, and the record count of the block. The block-data pointer points to a record array, each element of which points to a record object holding the complete data of the four FASTQ parts (title part, DNA sequence part, "+" part, Quality Score part), respectively: the title part, with a title data pointer, title data length, and title reserved-space length, where the title data pointer points to the title data buffer; the DNA sequence part, with a sequence data pointer, sequence data length, and sequence reserved-space length, where the sequence data pointer points to the sequence data buffer; the "+" part, with a plus data pointer, plus data length, and plus reserved-space length, where the plus data pointer points to the plus data buffer; and the Quality Score part, with a Quality data pointer, Quality data length, and Quality reserved-space length, where the Quality data pointer points to the Quality data buffer. The record object also holds the DNA read cut-off information: the sequence cut-off information vector and the quality cut-off information vector. A sketch of this layout follows.
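A condensed C++ sketch of this buffer layout, with std::vector standing in for each pointer/length/reserved-space triple; the names are illustrative, not the patent's:

```cpp
// One raw data block: an array of records, each holding the four FASTQ
// parts separately plus the read cut-off information vectors.
#include <cstdint>
#include <vector>

struct Field {               // title, sequence, "+", or Quality Score part
    std::vector<char> data;  // pointer + length + reserved space in one
};

struct FastqRecord {
    Field title, sequence, plus, quality;
    std::vector<uint32_t> seq_cutoffs;      // sequence cut-off info vector
    std::vector<uint32_t> quality_cutoffs;  // quality cut-off info vector
};

struct RawBlockBuffer {
    uint64_t block_no = 0;
    std::vector<FastqRecord> records;  // record count = records.size()
};
```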
The compression worker thread of the FASTQ parallel compression method is explained in detail below; its concrete steps are as follows:
(1) Perform the worker thread's preliminary preparation, including creating and initializing the objects used inside the thread.
(2) Get the head of the raw-data block queue of the raw-data circular double-buffer queue.
(3) If the queue head obtained is empty, go to (2); otherwise go to (4).
(4) Compress the raw data block in the queue-head buffer, store the compressed data in the thread's auxiliary buffer, and record the size of the compressed block.
(5) Release the buffer back to the empty-buffer queue of the circular double-buffer queue.
(6) Get the head of the empty-buffer queue of the compressed-data circular double-buffer queue.
(7) If the queue head obtained is empty, go to (6); otherwise go to (8).
(8) Move the compressed block cached in the thread's auxiliary buffer into the empty block buffer at the queue head, and record the compressed data size and block number.
(9) Release the buffer to the tail of the compressed-data block queue of the circular double-buffer queue.
(10) Set the block's compression-finished flag in the read/write-order two-dimensional array.
(11) If the raw-data read thread has ended and every block has been processed, go to (13); otherwise go to (2).
(12) If the compressed-data write thread has ended, go to (13); otherwise go to (2).
(13) Set this compression worker thread's end flag.
(14) The compression worker thread ends.
Note that the raw data in each block buffer in step (4) can be compressed, as required, with any block-index algorithm of this class, such as the DSRC or G_SQZ algorithm.
The compressed-data block buffer of the compressed-data circular double-buffer queue used in the above steps is implemented as a structure with four fields: a compressed-data block pointer (pointing to a data buffer), the compressed block length, the compressed block number, and the record count of the block. A sketch follows.
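A matching sketch of this buffer, again with std::vector standing in for the pointer/length pair; names are illustrative:

```cpp
#include <cstdint>
#include <vector>

struct CompressedBlockBuffer {
    std::vector<uint8_t> data;  // compressed block pointer + length
    uint64_t block_no = 0;      // compressed block number
    uint32_t record_count = 0;  // records contained in the block
};
```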
The compressed-data write thread of the FASTQ parallel compression method is explained in detail below (a sketch of its ordering logic follows the list); its concrete steps are as follows:
(1) Perform the write thread's preliminary work, including writing the records-per-block configuration information at the head of the compressed file.
(2) Set the block number block_no = 0.
(3) Look up the compression-finished flag of block block_no in the read/write-order two-dimensional array.
(4) If block block_no has been compressed, go to (5); otherwise go to (3).
(5) Get the head of the compressed-data block queue of the compressed-data circular double-buffer queue.
(6) If the queue head is empty, go to (5); otherwise go to (7).
(7) Write the compressed block at the queue head into the final compressed file.
(8) Release the buffer to the tail of the empty-buffer queue of the circular double-buffer queue.
(9) Increment block_no.
(10) If the raw-data read thread has ended and every block has been written to the final compressed file, go to (11); otherwise go to (3).
(11) At the end of the compressed file, write the footer information: the location index of each block's compressed stream within the file, the total block count, and the start position of the footer.
(12) Set the compressed-data write-thread end flag.
(13) The compressed-data write thread ends.
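A minimal sketch of the ordering logic of steps (2)-(10), assuming the read/write-order array is an array of atomic entries; the queue interaction and the actual file write are elided, so the sketch only shows how strict block order is enforced. The busy-wait mirrors the stepwise loop above.

```cpp
// Emit blocks strictly in block-number order, consulting the per-block
// (worker, done) entries that the worker threads fill in.
#include <atomic>
#include <cstdint>
#include <vector>

struct OrderEntry {
    std::atomic<int> worker{-1};    // thread the block was assigned to
    std::atomic<bool> done{false};  // set when the block's compression ends
};

void writer_loop(std::vector<OrderEntry> &order, uint64_t total_blocks) {
    for (uint64_t block_no = 0; block_no < total_blocks; ++block_no) {
        while (!order[block_no].done.load()) {
            // steps (3)-(4): wait for block block_no to finish compressing
        }
        int worker = order[block_no].worker.load();
        (void)worker;
        // steps (5)-(8): pop the block from that worker's compressed-data
        // block queue and append it to the final compressed file (elided).
    }
    // step (11): append the footer (block index, block count, footer start).
}
```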
The compressed-data read thread of the FASTQ parallel decompression method is explained in detail below; its concrete steps are as follows:
(1) Open the compressed file to be decompressed that was assigned to the process, obtaining file descriptor 1, fd1.
(2) Open the same compressed file again, obtaining file descriptor 2, fd2.
(3) Obtain the page size of the file system of the current machine.
(4) Set the memory-mapped window size according to the page size.
(5) From the location index of each block's compressed stream in the file footer, obtain the start and end positions, within the compressed stream, of all blocks the process must decompress.
(6) From that start position, set the mapping start point (aligned to a page boundary according to the page size), the mapping length, and the mapping end point, and map fd1 to obtain the memory address lpBuf1 of mapped region 1. The mapping start and end points are computed relative to the whole compressed file.
(7) Set the current mapped-region indicator current_buffer_symbol to 1.
(8) Initialize the mapped-region switch counter reverse_num to 0; initialize the start-block and end-block variables of the current mapped region (current_lpbuffer) and of the previous mapped region (last_lpbuffer) to 0.
(9) Set the number of the current block to decompress to 0.
(10) Look up the compressed-stream location index to obtain the start and end positions, within the compressed file, of the current block to decompress.
(11) If both positions fall within the current mapped region, go to (20); otherwise go to (12).
(12) Increment the mapped-region switch counter reverse_num.
(13) Assign the current mapped region's start-block number to the previous region's start-block variable, assign the current block number to the current region's start-block variable, and assign (current block number - 1) to the previous region's end-block variable.
(14) From the start position of the current block to decompress within the compressed file, set the next mapping's start point (aligned to a page boundary according to the page size), mapping length, and end point, again relative to the whole compressed file.
(15) Switch the current mapped-region number, i.e. rotate the two mappings: if current_buffer_symbol is 1, change it to 2; if it is 2, change it to 1.
(16) If reverse_num >= 2, go to (17); otherwise go to (19).
(17) Using the start-block and end-block numbers recorded for the previous mapped region, repeatedly query the decompression-finished flags of the corresponding blocks in the read/write-order two-dimensional array until every block in the range has been decompressed. Go to (18).
(18) If current_buffer_symbol = 1, release memory mapping 1; otherwise release memory mapping 2.
(19) If current_buffer_symbol = 1, remap fd1 to obtain mapping address lpbuf1; otherwise remap fd2 to obtain mapping address lpbuf2. The parameters of the new mapping are those computed in (14).
(20) Cycle through the compressed-data circular double-buffer queues of the decompression worker threads, looking for an empty block buffer.
(21) If an empty block buffer exists, go to (22); otherwise go to (20).
(22) From current_buffer_symbol, the mapping start point of the current region, and the start and end positions of the current block, set the four fields of the empty block buffer obtained from the compressed-data circular double-buffer queue (the block's start offset in the mapped region, the mapped-region pointer, the compressed block length, and the compressed block number), forming a compressed-data block buffer.
(23) Release the buffer to the tail of the compressed-data block queue of the circular double-buffer queue.
(24) Record the block's assigned thread number in the read/write-order two-dimensional array.
(25) Increment the number of the current block to decompress; if it exceeds the largest block number the process must decompress, go to (26); otherwise go to (10).
(26) Wait for the write thread to finish; keep waiting until it has. Once the write thread finishes, release memory mappings 1 and 2 and close fd1 and fd2.
(27) Set the compressed-data read-thread end flag.
(28) The compressed-data read thread ends.
The compressed-data block buffer of the compressed-data circular double-buffer queue used in the above steps is implemented as a structure with four fields: the mapped-region pointer (4), the block's start offset in the mapped region (5), the compressed block length (6), and the compressed block number (7). To save space and the time of repeated copying, the pointer refers directly to the mapped region holding the block, and the offset locates the block within that region.
Fig. 5 shows the relation between the circular double memory mapping and the compressed-data block buffers in the parallel decompression method, using block 1 as an example to illustrate each buffer field and the two mapped regions, memory mapping (1) and memory mapping (2). As the figure shows, the decompression worker threads use the circularly double-mapped regions directly through the compressed-data circular double-buffer queues, which reduces data copies. Note that the start and end of each mapping generally do not coincide with block boundaries: the mapping start must be aligned to a page boundary, and except for the last mapping, whose size is limited by the end of the file, every mapping has the same fixed size. A block that is only partly contained in one mapping must therefore be remapped whole in the other mapping. A sketch of this zero-copy buffer follows.
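A sketch of this zero-copy buffer; the struct and accessor names are illustrative:

```cpp
#include <cstdint>

struct MappedCompressedBlock {
    const char *map_base;  // (4) memory-mapped region pointer
    uint64_t start;        // (5) block start offset inside the region
    uint64_t length;       // (6) compressed block length
    uint64_t block_no;     // (7) compressed block number
    // The compressed bytes are read in place; nothing is copied out of
    // the mapping until decompression itself runs.
    const char *bytes() const { return map_base + start; }
};
```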
The decompression worker thread of the FASTQ parallel decompression method is explained in detail below; its concrete steps are as follows:
(1) Perform the worker thread's preliminary preparation, including obtaining the configuration the process read at start-up from the compressed file's header and footer (the records-per-block setting, the number of compressed blocks, and the location index of every block within the file), and creating and initializing the objects used inside the thread.
(2) Get the head of the compressed-data block queue of the compressed-data circular double-buffer queue.
(3) If the queue head obtained is empty, go to (2); otherwise go to (4).
(4) Read the four fields of the queue-head buffer structure (the block's start offset in the mapped region, the mapped-region pointer, the compressed block length, and the compressed block number), decompress the block's compressed stream, and store the decompressed data in the thread's dedicated block record structure object.
(5) Release the buffer back to the empty-buffer queue of the circular double-buffer queue.
(6) Get the head of the empty-buffer queue of the decompressed-data circular double-buffer queue.
(7) If the queue head obtained is empty, go to (6); otherwise go to (8).
(8) Move the decompressed block cached in the thread's block record structure object into the empty block buffer at the queue head, and record the decompressed block's length and block number.
(9) Release the buffer to the tail of the decompressed-data block queue of the circular double-buffer queue.
(10) Set the block's decompression-finished flag in the read/write-order two-dimensional array.
(11) Query the read/write-order two-dimensional array; if every block has been decompressed, go to (13); otherwise go to (2).
(12) If the decompressed-data write thread has ended, go to (13); otherwise go to (2).
(13) Set this decompression worker thread's end flag.
(14) The decompression worker thread ends.
Note that the compressed block in each buffer in step (4) can be decompressed, according to the compression algorithm used, with any block-index algorithm of this class, such as the DSRC or G_SQZ algorithm.
The decompressed-data block buffer of the decompressed-data circular double-buffer queue used in the above steps is implemented as a structure with three fields: a decompressed-data block pointer (pointing to a buffer), the decompressed block length, and the decompressed block number.
Fig. 6 shows the cooperation between the read thread and the decompression worker threads under the two circulating memory mappings, memory mapping 1 (1) and memory mapping 2 (2); the time axis is (3) and the block layout is the same as in Fig. 5. As the figure shows, before memory mapping 1 (1) can be remapped for the second time, all blocks of the original mapping 1 must have been processed by the decompression worker threads: only after blocks 0 through i are finished can the previous mapping be released and the data of blocks j+1 through k be mapped. The same holds when memory mapping 2 (2) is remapped to new compressed blocks: all of blocks i+1 through j+1 must have been processed first.
The decompressed-data write thread of the FASTQ parallel decompression method is explained in detail below (a sketch of its memory-mapped output path follows the list); its concrete steps are as follows:
(1) Perform the write thread's preliminary work, including determining the name of the decompressed file.
(2) Obtain the page size of the file system of the current machine.
(3) From the number of blocks to decompress, set the size qwFileSize of the memory-mapped file; from the page size, set the size of each mapped window and the threshold for remapping.
(4) Create the decompressed file, obtain its file descriptor fd, and set the file's allocated space to qwFileSize.
(5) Compute the size of this mapping, map fd, and obtain the mapped memory address lpBuf.
(6) Set the block number block_no = 0.
(7) Look up the decompression-finished flag of block block_no in the read/write-order two-dimensional array.
(8) If block block_no has been decompressed, go to (9); otherwise go to (7).
(9) Get the head of the decompressed-data block queue of the decompressed-data circular double-buffer queue.
(10) If the queue head is empty, go to (9); otherwise go to (11).
(11) Write the decompressed block at the queue head into the mapped region, advancing both the mapped-region offset and the file offset by the size of the data written.
(12) Release the buffer to the tail of the empty-buffer queue of the circular double-buffer queue.
(13) If the data written into the current mapped region reaches the threshold, release the mapping; compute the new mapping's start point, size, region offset, and new threshold from the current file offset and file size, and remap.
(14) Increment block_no.
(15) If every block has been written into the mapped region, go to (16); otherwise go to (7).
(16) Release the memory mapping, flushing the decompressed data into the final decompressed file, and close the file descriptor.
(17) Set the decompressed-data write-thread end flag.
(18) The decompressed-data write thread ends.
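A sketch of steps (11) and (13) under simplifying assumptions: the output file is already sized (e.g. with ftruncate), the remap offset stays page-aligned, and error handling is elided; the names are illustrative.

```cpp
// Copy decompressed blocks into a writable mapping of the output file,
// remapping a fresh window once writes cross the threshold.
#include <sys/mman.h>
#include <cstdint>
#include <cstring>

struct OutMap {
    char *base;      // current writable window
    size_t len;      // window length
    off_t file_off;  // page-aligned file offset the window starts at
    size_t used;     // bytes already written into the window
};

void write_block(int fd, OutMap &m, const char *src, size_t n,
                 size_t remap_threshold, size_t window_len) {
    if (m.used + n > remap_threshold) {              // step (13): remap
        off_t new_off = m.file_off + (off_t)m.used;  // assumed page-aligned
        munmap(m.base, m.len);
        char *p = (char *)mmap(nullptr, window_len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, new_off);
        m = OutMap{p, window_len, new_off, 0};
    }
    std::memcpy(m.base + m.used, src, n);            // step (11): copy in
    m.used += n;
}
```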

Claims (6)

1. A parallel compression method for a FASTQ file of DNA sequence read data, characterized by comprising a parallel compression task-division part and a compression-process processing part, as follows:
(1) Parallel compression task-division part
The start and end positions of each compression process's pending data are determined from the FASTQ file size, the number of parallel compression processes, and the data characteristics of each read fragment (each record) in the FASTQ file; the raw data to be compressed is distributed approximately evenly across the processes to achieve data parallelism, so the processes work without communicating with one another and data-parallel efficiency is improved; each process produces an independent compressed file, and the order of the compressed data follows the process rank;
(2) The compression-process processing part performs multi-thread pipelined parallel compression within each process
Each compression-process processing part comprises one raw-data read thread, one compressed-data write thread, and a plurality of compression worker threads; the number of worker threads can be configured according to the CPU core count and the process settings;
The raw-data read thread divides the data each process must compress into blocks, each containing a fixed number of records, the last block possibly containing fewer than said fixed number;
Every worker thread owns two circular double-buffer queues, one raw-data circular double-buffer queue and one compressed-data circular double-buffer queue; each raw-data circular double-buffer queue comprises two queues, an empty-buffer queue and a raw-data block queue; each compressed-data circular double-buffer queue likewise comprises two queues, an empty-buffer queue and a compressed-data block queue;
Within each process, compression is pipelined in units of raw data blocks, as follows:
(1) the raw-data read thread continually parses and reads raw data blocks according to the record structure, cycles through the compression worker threads looking for an empty block buffer in each thread's raw-data circular double-buffer queue, places the raw data block into the buffer once one is found, and releases the buffer to the tail of that queue's raw-data block queue;
the raw-data read thread uses memory mapping combined with data partitioning;
(2) each compression worker thread continually takes raw data blocks from the head of its own raw-data block queue and compresses them;
(3) each compression worker thread continually places each compressed block into an empty buffer obtained from its compressed-data circular double-buffer queue and releases the buffer to the tail of that queue's compressed-data block queue;
(4) the compressed-data write thread continually looks up, in increasing block-number order, the worker thread that finished compressing each block, takes that block's compressed data from the head of the worker's compressed-data block queue, and writes it to the final compressed file.
2. The FASTQ parallel compression method according to claim 1, characterized in that the raw-data circular double-buffer queue is operated as follows:
(1) initialization: the empty-buffer queue is instantiated with a given number of empty block buffers, and the raw-data block queue starts empty;
(2) the raw-data read thread reads one raw data block;
(3) an empty block buffer is obtained from the head of the empty-buffer queue;
(4) the raw data block is placed into the empty block buffer obtained;
(5) the filled block is appended to the tail of the raw-data block queue;
(6) a compression worker thread takes a block buffer from the head of the raw-data block queue and compresses its data;
(7) the raw-data block buffer is cleared and returned to the empty-buffer queue.
3. The FASTQ parallel compression method according to claim 1, characterized in that the raw-data read thread uses memory mapping combined with data partitioning to improve the reading speed of large files: from the page size and the size of the mapped window, it computes each block's position within the mapped space and decides when to release the mapping and remap.
4. A parallel decompression method for a FASTQ file of DNA sequence read data, characterized by comprising the following parts:
(1) Determining, from the process rank, the compressed files each parallel decompression process must handle
Compressing the FASTQ file yields as many compressed files as there were parallel compression processes; for parallel decompression, the number of decompression processes is set according to the number of compressed files, and the order of each process's decompressed output is determined by the order of the compressed files; the decompression processes work without communicating with one another, which improves data-parallel efficiency;
(2) Reading the compressed file footer to obtain the block settings, block index, and block count
Each process initially reads, from the footer of its compressed file, the records-per-block setting, the location index of every block within the compressed file, and the number of blocks;
(3) Each parallel decompression process performs multi-thread pipelined parallel decompression
Each parallel decompression process comprises one compressed-data read thread, one decompressed-data write thread, and a plurality of decompression worker threads;
Each decompression worker thread owns two circular double-buffer queues, one compressed-data circular double-buffer queue and one decompressed-data circular double-buffer queue; each compressed-data circular double-buffer queue comprises two queues, an empty-buffer queue and a compressed-data block queue; each decompressed-data circular double-buffer queue likewise comprises two queues, an empty-buffer queue and a decompressed-data block queue;
Within each process, decompression is pipelined in units of compressed blocks, as follows:
(1) the compressed-data read thread continually reads compressed blocks of known size, in increasing block-number order, using the block location index obtained from the compressed file footer; it cycles through the decompression worker threads looking for an empty block buffer at the head of each thread's compressed-data circular double-buffer queue, places the block's data into the buffer once one is found, and releases the buffer to the tail of that queue's compressed-data block queue;
the compressed-data read thread uses circular double memory mapping combined with data partitioning;
(2) each decompression worker thread continually takes compressed blocks from the head of its own compressed-data block queue and decompresses them;
(3) each decompression worker thread continually places each decompressed block into an empty buffer obtained from its decompressed-data circular double-buffer queue and releases the buffer to the tail of that queue's decompressed-data block queue;
(4) the decompressed-data write thread continually looks up, in increasing block-number order, the worker thread that finished each block, takes that block's decompressed data from the head of the worker's decompressed-data block queue, and writes it to the final decompressed file.
5. The FASTQ parallel decompression method according to claim 4, characterized in that the compressed-data block buffer of the compressed-data circular double-buffer queue is implemented as a structure with four fields: the mapped-region pointer, the block's start offset in the mapped region, the compressed block length, and the compressed block number; to save space and the time of repeated copying, the pointer refers directly to the mapped region holding the block, and the offset locates the block within that region.
6. The FASTQ parallel decompression method according to claim 4, characterized in that the compressed-data read thread uses circular double memory mapping combined with data partitioning to improve the reading speed of large data files, implemented as follows:
The key technique is the circular double memory mapping, which lets the decompression worker threads read and decompress compressed data in parallel with the read thread's mapping work; two memory mappings, memory mapping 1 and memory mapping 2, are kept, and compressed blocks are placed into them alternately in block order; from the block location index at the end of the compressed file and the sizes of the two mapped windows, the mapped region and the position within it are computed for each compressed block in increasing block-number order; the decompression worker threads use the circularly double-mapped regions directly through the compressed-data circular double-buffer queues, reducing data copies; before a mapping can be reused, all decompression worker threads must have finished with the data the mapping currently holds.
CN201310551802.7A 2013-11-07 2013-11-07 Method for parallel compression and decompression of a FASTQ file containing DNA sequence read data Expired - Fee Related CN103559020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310551802.7A CN103559020B (en) 2013-11-07 2013-11-07 Method for parallel compression and decompression of a FASTQ file containing DNA sequence read data

Publications (2)

Publication Number Publication Date
CN103559020A 2014-02-05
CN103559020B 2016-07-06

Family ID: 50013277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310551802.7A Expired - Fee Related CN103559020B (en) Method for parallel compression and decompression of a FASTQ file containing DNA sequence read data

Country Status (1)

Country Link
CN (1) CN103559020B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8495662B2 (en) * 2008-08-11 2013-07-23 Hewlett-Packard Development Company, L.P. System and method for improving run-time performance of applications with multithreaded and single threaded routines
CN103077006A (en) * 2012-12-27 2013-05-01 浙江工业大学 Multithreading-based parallel executing method for long transaction
CN103049680A (en) * 2012-12-29 2013-04-17 深圳先进技术研究院 gene sequencing data reading method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEI ZHONG et al.: "Parallel protein secondary structure prediction schemes using Pthread and OpenMP over hyper-threading technology", The Journal of Supercomputing *
詹科 (ZHAN Ke) et al.: "Design and analysis of protein quantification software based on MPI and CUDA", Computer Science *

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103997514A (en) * 2014-04-23 2014-08-20 汉柏科技有限公司 File parallel transmission method and system
CN103984528A (en) * 2014-05-15 2014-08-13 中国人民解放军国防科学技术大学 Multithread concurrent data compression method based on FT processor platform
CN103995988A (en) * 2014-05-30 2014-08-20 周家锐 High-throughput DNA sequencing mass fraction lossless compression system and method
WO2015180203A1 (en) * 2014-05-30 2015-12-03 周家锐 High-throughput dna sequencing quality score lossless compression system and compression method
CN105760706A (en) * 2014-12-15 2016-07-13 深圳华大基因研究院 Compression method for next generation sequencing data
CN105760706B (en) * 2014-12-15 2018-05-29 深圳华大基因研究院 A kind of compression method of two generations sequencing data
CN106100641A (en) * 2016-06-12 2016-11-09 深圳大学 Multithreading quick storage lossless compression method and system thereof for FASTQ data
WO2017214765A1 (en) * 2016-06-12 2017-12-21 深圳大学 Multi-thread fast storage lossless compression method and system for fastq data
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
WO2018039983A1 (en) * 2016-08-31 2018-03-08 华为技术有限公司 Biological sequence data processing method and device
US11360940B2 2016-08-31 2022-06-14 Huawei Technologies Co., Ltd. Method and apparatus for processing biological sequence FASTQ files, comprising lossless compression and decompression
CN108629157A (en) * 2017-03-22 2018-10-09 深圳华大基因科技服务有限公司 One kind being used for nucleic acid sequencing data compression and encrypted method
CN108629157B (en) * 2017-03-22 2021-08-31 深圳华大基因科技服务有限公司 Method for compressing and encrypting nucleic acid sequencing data
CN107169313A (en) * 2017-03-29 2017-09-15 中国科学院深圳先进技术研究院 The read method and computer-readable recording medium of DNA data files
CN107565975A (en) * 2017-08-30 2018-01-09 武汉古奥基因科技有限公司 The method of FASTQ formatted file Lossless Compressions
CN108363719A (en) * 2018-01-02 2018-08-03 中科边缘智慧信息科技(苏州)有限公司 The transparent compressing method that can configure in distributed file system
CN108363719B (en) * 2018-01-02 2022-10-21 中科边缘智慧信息科技(苏州)有限公司 Configurable transparent compression method in distributed file system
CN110572422A (en) * 2018-06-06 2019-12-13 北京京东尚科信息技术有限公司 Data downloading method and device
CN109062502A (en) * 2018-07-10 2018-12-21 郑州云海信息技术有限公司 A kind of data compression method, device, equipment and computer readable storage medium
CN109547355A (en) * 2018-10-17 2019-03-29 中国电子科技集团公司第四十研究所 A kind of storing and resolving device and method based on ten thousand mbit ethernet mouth receivers
CN109490895A (en) * 2018-10-25 2019-03-19 中国人民解放军海军工程大学 A kind of interference synthetic aperture signal processing system based on blade server
CN109490895B (en) * 2018-10-25 2020-12-29 中国人民解放军海军工程大学 Interferometric synthetic aperture sonar signal processing system based on blade server
CN111294057A (en) * 2018-12-07 2020-06-16 上海寒武纪信息科技有限公司 Data compression method, encoding circuit and arithmetic device
US10778246B2 (en) 2019-01-30 2020-09-15 International Business Machines Corporation Managing compression and storage of genomic data
US10554220B1 (en) 2019-01-30 2020-02-04 International Business Machines Corporation Managing compression and storage of genomic data
CN110247666B (en) * 2019-05-22 2023-08-18 深圳大学 System and method for hardware parallel compression
CN110247666A (en) * 2019-05-22 2019-09-17 深圳大学 A kind of system and method for hardware concurrent compression
CN110349635B (en) * 2019-06-11 2021-06-11 华南理工大学 Parallel compression method for gene sequencing data quality fraction
CN110299187A (en) * 2019-07-04 2019-10-01 南京邮电大学 A kind of parallelization gene data compression method based on Hadoop
CN111061434A (en) * 2019-12-17 2020-04-24 人和未来生物科技(长沙)有限公司 Gene compression multi-stream data parallel writing and reading method, system and medium
CN111061434B (en) * 2019-12-17 2021-10-01 人和未来生物科技(长沙)有限公司 Gene compression multi-stream data parallel writing and reading method, system and medium
CN111326216A (en) * 2020-02-27 2020-06-23 中国科学院计算技术研究所 Rapid partitioning method for big data gene sequencing file
CN111370070B (en) * 2020-02-27 2023-10-27 中国科学院计算技术研究所 Compression processing method for big data gene sequencing file
CN111326216B (en) * 2020-02-27 2023-07-21 中国科学院计算技术研究所 Rapid partitioning method for big data gene sequencing file
CN111370070A (en) * 2020-02-27 2020-07-03 中国科学院计算技术研究所 Compression processing method for big data gene sequencing file
CN111767255B (en) * 2020-05-22 2023-10-13 北京和瑞精湛医学检验实验室有限公司 Optimization method for separating sample read data from fastq file
CN111767256B (en) * 2020-05-22 2023-10-20 北京和瑞精湛医学检验实验室有限公司 Method for separating sample read data from fastq file
CN111767256A (en) * 2020-05-22 2020-10-13 北京和瑞精准医学检验实验室有限公司 Method for separating sample read data from fastq file
CN111767255A (en) * 2020-05-22 2020-10-13 北京和瑞精准医学检验实验室有限公司 Optimization method for separating sample read data from fastq file
US12093803B2 (en) 2020-07-01 2024-09-17 International Business Machines Corporation Downsampling genomic sequence data
CN113568736A (en) * 2021-06-24 2021-10-29 阿里巴巴新加坡控股有限公司 Data processing method and device
CN113590051A (en) * 2021-09-29 2021-11-02 阿里云计算有限公司 Data storage and reading method and device, electronic equipment and medium
CN113590051B (en) * 2021-09-29 2022-03-18 阿里云计算有限公司 Data storage and reading method and device, electronic equipment and medium
CN113672876A (en) * 2021-10-21 2021-11-19 南京拓界信息技术有限公司 OTG-based method and device for quickly obtaining evidence of mobile phone
CN114489518A (en) * 2022-03-28 2022-05-13 山东大学 Sequencing data quality control method and system

Also Published As

Publication number Publication date
CN103559020B (en) 2016-07-06

Similar Documents

Publication Publication Date Title
CN103559020B (en) Method for parallel compression and decompression of a FASTQ file containing DNA sequence read data
US8463820B2 (en) System and method for memory bandwidth friendly sorting on multi-core architectures
KR101559450B1 (en) Methods and apparatus for storage and translation of entropy encoded software embedded within a memory hierarchy
CN101717817B (en) Method for accelerating RNA secondary structure prediction based on stochastic context-free grammar
WO2022082879A1 (en) Gene sequencing data processing method and gene sequencing data processing device
You et al. Spatial join query processing in cloud: Analyzing design choices and performance comparisons
Huang et al. LW-FQZip 2: a parallelized reference-based compression of FASTQ files
Chacón et al. Boosting the FM-index on the GPU: Effective techniques to mitigate random memory access
TW202230138A (en) Accelerator, method of dictionary decoding and article comprising non-transitory storage medium
US9626428B2 (en) Apparatus and method for hash table access
Ling et al. Design and implementation of a CUDA-compatible GPU-based core for gapped BLAST algorithm
Li et al. Swpepnovo: An efficient de novo peptide sequencing tool for large-scale ms/ms spectra analysis
WO2014132608A1 (en) Parallel processing device, parallel processing method, and parallel processing program storage medium
CN103995827A High-performance sorting method for the MapReduce computing framework
US20240005133A1 (en) Hardware acceleration framework for graph neural network quantization
Gong et al. ETTE: Efficient tensor-train-based computing engine for deep neural networks
CN110767265A Parallel acceleration method for sorting big-data genome alignment files
CN103543989A Adaptive parallel processing method for variable-length feature extraction from big data
Wang et al. FD-CNN: A Frequency-Domain FPGA Acceleration Scheme for CNN-Based Image-Processing Applications
US20230385258A1 (en) Dynamic random access memory-based content-addressable memory (dram-cam) architecture for exact pattern matching
Chacón et al. FM-index on GPU: A cooperative scheme to reduce memory footprint
Wang et al. Optimizing GPU-Based Graph Sampling and Random Walk for Efficiency and Scalability
Quirino et al. Efficient filter-based algorithms for exact set similarity join on GPUs
KR20200118170A (en) System and method for low latency hardware memory management
WO2015143708A1 (en) Method and apparatus for constructing suffix array

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2016-07-06

Termination date: 2016-11-07