
CN115691682B - Gene depth information data compression method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115691682B
Authority
CN
China
Prior art keywords
depth information
file
gene
target
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211318287.3A
Other languages
Chinese (zh)
Other versions
CN115691682A (en
Inventor
周煌凯
高川
艾鹏
罗玥
孙鹏鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Gene Denovo Biotechnology Co ltd
Original Assignee
Guangzhou Gene Denovo Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Gene Denovo Biotechnology Co ltd filed Critical Guangzhou Gene Denovo Biotechnology Co ltd
Priority to CN202211318287.3A priority Critical patent/CN115691682B/en
Publication of CN115691682A publication Critical patent/CN115691682A/en
Application granted granted Critical
Publication of CN115691682B publication Critical patent/CN115691682B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses a method, a device, electronic equipment and a storage medium for compressing gene depth information data, which relate to the technical field of biological information, and the method comprises the following steps: receiving sequencing files of each sample and converting the sequencing files into bam files; converting the bam file of each sample into a depth information file using SAMtools; compressing the depth information file to obtain a compressed file, which specifically comprises: filtering the depth information through the position information of the genes to obtain a first depth information processing file; merging the continuous sites with the same depth information; integrating the interval depth information of each gene into one row to obtain a second depth information processing file of each sample; and merging the second depth information processing files of all the samples into one file to obtain a compressed file. The embodiment of the invention can greatly reduce the storage consumption under the condition of ensuring the integrity of the basic information and the depth information of the genes.

Description

Gene depth information data compression method, device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of biological information, in particular to a method and a device for compressing gene depth information data, electronic equipment and a storage medium.
Background
Currently, the most widely used compressed file format for depth information is the bw (bigWig) format, which has the following disadvantages:
1. The bw format does not contain gene information, so additional software is needed to obtain all the depth information of a given gene; it is also inconvenient to perform joint analysis with other data such as expression level tables.
2. The bw files of different samples cannot be merged.
Disclosure of Invention
In order to overcome the defects of the prior art, the embodiment of the invention aims to provide a method, a device, an electronic device and a storage medium for compressing gene depth information data, which can greatly reduce the storage consumption under the condition of ensuring the integrity of basic information and depth information of genes.
To solve the above problems, a first aspect of an embodiment of the present invention discloses a method for compressing gene depth information data, which includes the steps of:
receiving a sequencing file of each sample and converting the sequencing file into a bam file;
Converting the bam file of each sample into a depth information file using SAMtools;
compressing the depth information file to obtain a compressed file;
The compressing the depth information file to obtain a compressed file includes:
filtering the depth information through the position information of the genes to obtain a first depth information processing file;
Merging continuous sites with the same depth information in the first depth information processing file;
Integrating the interval depth information of each gene into one row to obtain a second depth information processing file of each sample;
and merging the second depth information processing files of all the samples into one file to obtain a compressed file.
Optionally, in the first aspect of the embodiment of the present invention, filtering the depth information through the position information of the genes to obtain a first depth information processing file includes:
determining target depth information of the gene region;
and screening the positions in the depth information file of each sample according to the target depth information, obtaining the positions of the target depth information, and recording them as the first depth information processing file.
Optionally, in the first aspect of the embodiment of the present invention, merging consecutive sites with the same depth information in the first depth information processing file includes:
determining each depth value and its number of consecutive sites in the first depth information processing file for each gene region.
Optionally, in the first aspect of the embodiment of the present invention, integrating the interval depth information of each gene into one row to obtain a second depth information processing file of each sample includes:
recording the depth information of each target interval of a target gene, wherein the target interval depth information comprises the depth value of the target interval and its number of consecutive sites and is expressed in the form a×b, where a represents the depth value of the target interval, b represents the number of consecutive sites having that depth, and × is a connector; other symbols, such as &, #, @, and the like, may also be used as the connector;
displaying all target interval depth information of the target gene in the same row, with the different target intervals separated by a separator, so as to obtain the target gene depth information corresponding to the target gene;
and obtaining the target gene depth information of all genes of each sample, and generating the second depth information processing file of each sample.
Optionally, in the first aspect of the embodiment of the present invention, merging the second depth information processing files of all samples into one file to obtain a compressed file includes:
combining the second depth information processing files of all the samples, wherein the merging rule is that the target gene depth information of the same gene is displayed in the same row and the target gene depth information of the same gene from different samples is displayed in different columns, so as to obtain the compressed file.
Optionally, in the first aspect of the embodiment of the present invention, the compressed file includes basic columns and additional columns, and the first row of the compressed file is header information;
wherein the basic columns comprise a chromosome column, a gene start position column, a gene end position column, a gene name column and a strand (positive/negative) marker column;
the header of each additional column is a sample name, and the content of each additional column is the target gene depth information of the corresponding sample for the corresponding gene name.
Optionally, in the first aspect of the embodiment of the present invention, after receiving a sequencing file of each sample and converting the sequencing file into a bam file, the method further includes:
performing a first sorting of the contents of the bam file by chromosome;
performing a second sorting of the contents of the first-sorted bam file by the start positions of the genes;
and/or,
after the compressed file is obtained, the method further includes:
recompressing the compressed file with bgzip, and/or reading the compressed file quickly through the index built with tabix.
The second aspect of the embodiment of the invention discloses a gene depth information data compression device, which comprises:
a first conversion unit for receiving a sequencing file of each sample and converting the sequencing file into a bam file;
a second conversion unit for converting the bam file of each sample into a depth information file using SAMtools;
The compression unit is used for compressing the depth information file to obtain a compressed file;
The compression unit includes:
The filtering subunit is used for filtering the depth information according to the position information of the genes to obtain a first depth information processing file;
the depth merging subunit is used for merging continuous sites with the same depth information in the first depth information processing file;
A section merging subunit, configured to integrate the section depth information of each gene into one row, to obtain a second depth information processing file of each sample;
And the sample merging subunit is used for merging the second depth information processing files of all the samples into one file to obtain a compressed file.
A third aspect of an embodiment of the present invention discloses an electronic device, including: a memory storing executable program code; a processor coupled to the memory; the processor invokes the executable program code stored in the memory to perform a method for compressing data of depth information of a gene disclosed in the first aspect of the embodiment of the present invention.
A fourth aspect of the embodiments of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute a method of compressing gene depth information data disclosed in the first aspect of the embodiments of the present invention.
A fifth aspect of the embodiments of the present invention discloses a computer program product which, when run on a computer, causes the computer to perform a method of data compression of genetic depth information disclosed in the first aspect of the embodiments of the present invention.
A sixth aspect of the embodiment of the present invention discloses an application publishing platform, which is configured to publish a computer program product, where the computer program product when run on the computer causes the computer to execute a method for compressing gene depth information data disclosed in the first aspect of the embodiment of the present invention.
Compared with the prior art, the embodiment of the invention has the beneficial effects that:
The embodiment of the invention converts the sample depth information data of a bam file into a compressed file, taking into account the characteristics of transcriptome and other expression-related omics technologies. The specific procedure is as follows: first, the bam file is converted by SAMtools into a depth information file covering each site; the depth information is then filtered by the position information of the genes; consecutive sites with the same depth are merged; and finally the depth information of all intervals of each gene is integrated into one row, so that each sample yields one compressed depth information file. Storage consumption is thereby greatly reduced while the basic information and depth information of the genes remain complete.
Drawings
FIG. 1 is a schematic flow chart of a method for compressing gene depth information data according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of merging depth information files according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a structure of a genetic depth information data compression device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
This detailed description is merely illustrative of the embodiments of the invention and is not intended to limit the embodiments of the invention, since modifications of the embodiments can be made by those skilled in the art without creative contribution as required after reading the specification, but are protected by the patent laws within the scope of the claims of the embodiments of the invention.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the embodiments of the present invention.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In embodiments of the invention, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
According to the embodiment of the invention, the sample depth information data of the bam file is converted into a compressed file, taking into account the characteristics of transcriptome and other expression-related omics technologies, so that storage consumption is greatly reduced while the basic information and depth information of the genes remain complete. A detailed description is given below with reference to the accompanying drawings.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of a method for compressing genetic depth information according to an embodiment of the invention. As shown in fig. 1, the gene depth information data compression method comprises the following steps:
S110, receiving a sequencing file of each sample and converting the sequencing file into a bam file.
The sequencing file of the present invention is preferably a file in GTF format (Gene Transfer Format, mainly used to annotate genomes) or GFF format (General Feature Format, mainly used to annotate genes). In other embodiments, other types of files, such as fastq or fasta files, or files converted to the sam (Sequence Alignment/Map format) format, may be used.
Taking a sequencing file in GTF format as an example, the GTF file is first converted into a bam file (the binary format of a sam file). The bam file contains information such as the chromosome number, start position, end position, strand (positive/negative) marker and depth information of the genome.
The contents of the bam file are then sorted, first by chromosome and then by start position. The purpose of the sorting is to facilitate the later compression. Of course, in other embodiments, the sorting operation may be omitted.
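The two-level ordering described above can be illustrated with a minimal Python sketch. The record layout and the gene names are hypothetical, and plain lexicographic ordering of chromosome names is an assumption; in practice a bam file is usually coordinate-sorted with a tool such as `samtools sort` rather than in Python.

```python
from typing import List, Tuple

# Hypothetical record layout: (chromosome, start, end, gene_id, strand).
Record = Tuple[str, int, int, str, str]


def sort_records(records: List[Record]) -> List[Record]:
    """Sort first by chromosome name, then by start position."""
    return sorted(records, key=lambda r: (r[0], r[1]))


# Illustrative, made-up records.
records = [
    ("A02", 500, 900, "geneC", "+"),
    ("A01", 3000, 4200, "geneB", "-"),
    ("A01", 100, 800, "geneA", "+"),
]
sorted_records = sort_records(records)
for rec in sorted_records:
    print(rec)
```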
S120, converting the bam file of each sample into a depth information file using SAMtools.
The sorted whole-genome alignment result file (in bam format) of each sample is converted to depth information using the SAMtools depth program. By default, samtools outputs a depth information file with three columns: the first column is the chromosome ID, the second column is the position, and the third column is the depth. Each row of the output depth information file represents a gene region (the region between a given start position and a given end position on a chromosome is denoted a gene region, and one gene region is treated as one gene, that is, each gene comprises one gene region; it can be understood that, for each sample, the number of rows of the depth information file excluding the header row equals the number of its start positions). The depth information file contains a large amount of redundant information, so the storage it ultimately occupies is large.
Therefore, by the compression of step S130, the storage consumption is greatly reduced while ensuring that the basic information of the gene and the depth information are complete.
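As a minimal sketch of how the three-column depth output might be consumed downstream, the following Python function parses tab-separated lines into (chromosome, position, depth) tuples. The function name and the sample lines are hypothetical; such text would typically come from running `samtools depth` on a sorted bam file.

```python
from typing import Iterable, Iterator, Tuple


def parse_depth_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (chromosome, position, depth) from three-column, tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and any header or comment line
        chrom, pos, depth = line.split("\t")[:3]
        yield chrom, int(pos), int(depth)


# Made-up example text in the three-column layout described above.
example = "A01\t101\t0\nA01\t102\t0\nA01\t103\t4\n"
sites = list(parse_depth_lines(example.splitlines()))
print(sites)
```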
And S130, compressing the depth information file to obtain a compressed file.
In a preferred embodiment of the present invention, redundancy is removed by position filtering, depth information merging, depth information interval merging, and sample merging, resulting in a compressed file similar to a bed format file, referred to as a genedepth file.
Referring to fig. 2, the method specifically includes the following steps:
S131, filtering the depth information through the position information of the genes to obtain a first depth information processing file.
The depth information file output by samtools contains the depth information of the entire genome, so the larger the genome, the larger the result file. For transcriptome and other expression-related omics technologies, only the depth information of gene regions needs to be recorded. In most genomes, gene regions account for only a small portion of the entire genome, so a large number of sites can be filtered out by screening the positions of the depth information.
Specifically, the location of the depth information may be filtered according to the target depth information of each gene region (the target depth information of each gene region is determined through statistics or experience), only the location information corresponding to the target depth information is reserved, and the filtered depth information file is recorded as the first depth information processing file.
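One possible reading of the position filtering in step S131 is sketched below: only sites that fall inside an annotated gene region are kept, grouped by gene. The type aliases, the function name, the 1-based inclusive coordinates and the sample data are all assumptions made for illustration, and the nested loop is a deliberately simple scan rather than an efficient interval lookup.

```python
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int, int]       # (chromosome, position, depth)
Gene = Tuple[str, int, int, str]  # (chromosome, start, end, gene_id), 1-based inclusive (assumed)


def filter_sites_by_genes(sites: Iterable[Site], genes: List[Gene]) -> Dict[str, List[int]]:
    """Keep only the depths of sites inside a gene region, grouped by gene ID."""
    per_gene: Dict[str, List[int]] = {gene[3]: [] for gene in genes}
    for chrom, pos, depth in sites:
        for g_chrom, start, end, gene_id in genes:
            if chrom == g_chrom and start <= pos <= end:
                per_gene[gene_id].append(depth)
    return per_gene


# Made-up data: seven sites, of which positions 3 to 5 lie inside geneA.
genes = [("A01", 3, 5, "geneA")]
sites = [("A01", p, d) for p, d in zip(range(1, 8), [9, 9, 0, 1, 1, 9, 9])]
kept = filter_sites_by_genes(sites, genes)
print(kept)
```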
S132, merging the continuous sites with the same depth information in the first depth information processing file.
For transcriptome and other expression-related omics technologies, reads (sequencing reads) are mainly concentrated in exon regions; intron regions contain few reads, and where reads are present their coverage is low and shows obvious regularity, namely that a large number of consecutive sites have the same depth. The same pattern appears in the exon regions of genes with low expression levels. Based on this regularity, sites with the same consecutive depth are de-duplicated, and only one depth value and the number of consecutive sites are retained.
That is, each depth value and its number of consecutive sites are determined, and each such pair is recorded as one interval.
Illustratively, when the depth information of a certain gene region (denoted as gene a) is 0000011000551110111333 … …, assuming that the target depth information of the gene region is 0 and 1, the depth information obtained after the position filtering in step S131 is 00000110001110111 … …, and then the determined depth information includes six parts after the step S132, wherein the depth information of the first part is 0 and the number is 5; the depth information of the second part is 1, and the number is 2; the depth information of the third part is 0, and the number is 3; the depth information of the fourth part is 1, and the number is 3; the depth information of the fifth part is 0, and the number is 1; the depth information of the sixth section is 1 and the number is 3.
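The merging in this example amounts to run-length encoding of the filtered depth sequence. A minimal Python sketch, using the filtered sequence from the example above (the function name is hypothetical):

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def run_length_encode(depths: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal depths into (depth, number_of_consecutive_sites) pairs."""
    return [(depth, sum(1 for _ in group)) for depth, group in groupby(depths)]


# Filtered depth sequence of gene A from the example, one digit per site.
filtered = [int(c) for c in "00000110001110111"]
runs = run_length_encode(filtered)
print(runs)
```

Each tuple corresponds to one of the six parts listed in the example: a depth value followed by its number of consecutive sites.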
Wherein the order of step S131 and step S132 may be interchanged.
And S133, integrating the interval depth information of each gene into one row to obtain a second depth information processing file of each sample.
Since a conventional bed file can record only one interval per line, one gene needs many lines to be recorded, and a great deal of redundancy exists. To further remove this redundancy, we designed a format in which each gene occupies one line, corresponding to the number of lines of the depth information file, so that the redundancy of the recorded interval information is minimized. The depth information of all intervals is then combined into one column, with the depth information of each interval separated by a separator such as a comma. Each interval's depth information is recorded in the form a×b, where a represents the depth value of the target interval, b represents the number of consecutive sites having that depth, and × is a connector. For an interval whose number of consecutive sites is 1, only the depth value may be retained.
Here, different genes are distinguished by their target interval depth information. Each gene includes one or more pieces of interval depth information, and the interval depth information of the target gene is defined as the target interval depth information. For example, the target interval depth information of gene A is 0×5, 1×2, 0×3, 1×3, 0×1 and 1×3; after these are separated by a separator (a comma), the resulting set of target interval depth information is recorded as the target gene depth information of gene A. In this way the target gene depth information of all genes of one sample can be obtained and recorded as the second depth information processing file, as shown in Table 1.
Table 1 Interval depth information table of sample A
In Table 1, the first row is the header. The first column is the chromosome number, with header #CHROM; the second column is the start position, with header Start; the third column is the end position, with header End; the fourth column is the gene name, with header ID; the fifth column is the strand (positive/negative) marker, with header Str; the sixth column is the interval depth information of each gene, with the sample name as header.
The first to fifth columns above are called the basic information columns of the second depth information processing file and may be extracted directly from a sequencing file such as a GTF file or GFF file.
All samples are processed in the steps S131-S133 by the same method to obtain a second depth information processing file of each sample.
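The a×b record format of step S133 can be sketched as a pair of Python helpers: one that joins (depth, count) pairs into a single field and one that expands the field back into per-site depths. The function names and the `omit_single` switch are hypothetical; the switch models the optional shortening of intervals whose count is 1.

```python
from typing import List, Tuple


def format_runs(runs: List[Tuple[int, int]], connector: str = "×",
                separator: str = ",", omit_single: bool = False) -> str:
    """Join (depth, count) pairs into one field such as '0×5,1×2'."""
    parts = []
    for depth, count in runs:
        if omit_single and count == 1:
            parts.append(str(depth))  # keep only the depth for a single site
        else:
            parts.append(f"{depth}{connector}{count}")
    return separator.join(parts)


def parse_runs(field: str, connector: str = "×", separator: str = ",") -> List[int]:
    """Expand a field back into the per-site depth list (empty field gives an empty list)."""
    depths: List[int] = []
    if not field:
        return depths
    for part in field.split(separator):
        if connector in part:
            depth, count = part.split(connector)
            depths.extend([int(depth)] * int(count))
        else:
            depths.append(int(part))
    return depths


# Intervals of gene A from the example in the text.
runs = [(0, 5), (1, 2), (0, 3), (1, 3), (0, 1), (1, 3)]
field = format_runs(runs)
compact = format_runs(runs, omit_single=True)
restored = parse_runs(compact)
```

Because `parse_runs` recovers the original per-site depths from either form, the sketch also illustrates that this encoding loses no depth information.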
And S134, merging the second depth information processing files of all the samples into one file, namely, a compressed file.
Most transcriptome and other expression-related omics experiments process many samples at the same time. Each sample therefore has its own second depth information processing file, and each of these files contains the position information of the genes, which is also redundant. Since the gene regions are identical across samples, the tables can conveniently be merged, with the depth information of each sample stored as one column.
The conventional samtools depth format and the bed format cannot be directly combined because the interval information of different samples is not completely consistent, so that the depth information of the different samples can only be stored independently, redundancy exists in the information, and meanwhile, the reading, viewing and analysis of the data are troublesome.
In a preferred embodiment of the present invention, the second depth information processing files of all samples are combined to obtain a compressed file. The principle of merging is that the target interval depth information of the same gene is displayed on the same row, and the target interval depth information of the same gene of different samples is displayed in different columns.
For example, Table 2 shows the second depth information processing file of sample B; the second depth information processing files of sample A and sample B are combined to obtain a compressed file, as shown in Table 3.
Table 2 section depth information table of sample B
As can be seen from Tables 1 to 3, when a sample has no interval depth information for one or more genes (for example, sample B has no interval depth information for the gene whose ID is braa01g000060.3c), the corresponding entry is left blank in the merged file, i.e., the compressed file.
TABLE 3 section depth information table of merge file
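The merging rule of step S134 (one row per gene, one additional column per sample, blank where a sample has no data) can be sketched in Python as follows. The header names follow the description of Table 1; the tab separator, the function name, the gene IDs and the depth fields are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

GeneInfo = Tuple[str, int, int, str, str]  # (chromosome, start, end, gene_id, strand)


def merge_samples(gene_info: List[GeneInfo],
                  per_sample: Dict[str, Dict[str, str]]) -> List[str]:
    """Build merged rows: five basic columns, then one depth column per sample."""
    sample_names = list(per_sample)
    rows = ["\t".join(["#CHROM", "Start", "End", "ID", "Str"] + sample_names)]
    for chrom, start, end, gene_id, strand in gene_info:
        base = [chrom, str(start), str(end), gene_id, strand]
        # A sample without data for this gene contributes an empty (blank) cell.
        extra = [per_sample[name].get(gene_id, "") for name in sample_names]
        rows.append("\t".join(base + extra))
    return rows


# Made-up genes and depth fields for two samples named A and B.
gene_info = [("A01", 100, 800, "geneA", "+"), ("A01", 3000, 4200, "geneB", "-")]
per_sample = {
    "A": {"geneA": "0×5,1×2", "geneB": "3×4"},
    "B": {"geneA": "2×7"},  # sample B has no record for geneB
}
rows = merge_samples(gene_info, per_sample)
```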
After the compressed file is obtained, it can be recompressed with bgzip and then read more quickly through the index built by tabix. The header information of the compressed file can be conveniently viewed through the -h/-H parameters of tabix.
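A minimal sketch of how the bgzip and tabix steps might be scripted is given below. It only builds the command lines; the file name, the region and the column choices (chromosome in column 1, start in column 2, end in column 3) are assumptions based on the column layout described for the compressed file.

```python
import shlex
from typing import List


def build_index_commands(path: str) -> List[List[str]]:
    """Commands to bgzip-compress a merged file and index it with tabix."""
    return [
        ["bgzip", "-f", path],
        # -s, -b and -e give the sequence, start and end columns (assumed layout).
        ["tabix", "-f", "-s", "1", "-b", "2", "-e", "3", path + ".gz"],
    ]


def build_query_command(gz_path: str, region: str) -> List[str]:
    """Command to fetch the rows overlapping a region, with the header (-h)."""
    return ["tabix", "-h", gz_path, region]


# Hypothetical file name and region.
commands = build_index_commands("merged.genedepth")
query = build_query_command("merged.genedepth.gz", "A01:1-5000")
for cmd in commands + [query]:
    print(shlex.join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` on a machine where bgzip and tabix are installed.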
Table 4 shows the results of a single sample format conversion (only conversion of expressed genes was considered) test with RNAseq sequencing samples of 6 persons (genome size: 3.09G, total gene number: 20289).
Table 4: single sample format conversion result statistics table
sample     bam    Depth1   Depth2   mem(Gb)   runtime(h)   expressed_genes
sibrg1-1   2.3G   1.87G    31M      1.506     3.52         14930(73.59%)
sibrg1-2   2.0G   1.86G    30M      1.501     3.23         14955(73.71%)
sibrg1-3   2.4G   1.84G    28M      1.436     3.03         14865(73.27%)
siscr1     2.0G   1.84G    28M      1.498     3.02         14811(73.00%)
siscr2     2.0G   1.84G    28M      1.561     2.21         14871(73.30%)
siscr3     2.2G   1.85G    29M      1.467     3.43         14936(73.62%)
In Table 4, sample is the sample name; bam is the size of the bam file (the file obtained in step S110 of the embodiment of the present invention); Depth1 is the size of the depth statistics file after bgzip compression, that is, the size of the file obtained in step S120 after compression with bgzip; Depth2 is the size of the second depth information processing file after bgzip compression, that is, the size of the file obtained in step S133; in addition, the second depth information processing files of all samples are merged into one compressed file (compressed with bgzip), whose size is recorded as Depth; mem(Gb) is the memory consumed; runtime(h) is the running time; expressed_genes is the number of expressed genes.
All Depth2 files are merged into one compressed file of 176M; the merging takes less than 2 minutes and consumes less than 1G of memory.
As can be seen from table 4:
1. Compared with the conventional samtools depth statistics, the ratio of the storage occupied by depth1 to that occupied by depth2 is about 60:1; the embodiment of the invention can therefore greatly reduce storage consumption relative to the conventional samtools depth statistics.
2. For a single sample, the size ratio of the bam file to the depth1 file is about 1:0.85, and the size ratio of the bam file to the depth2 file is about 100:1. After the samples are merged, the ratio of the size of the merged file to the total size of all samples' depth2 files (174M) is about 1:1, so storage is reduced by 99% while the basic information and depth information of the genes remain complete.
Moreover, since a genedepth format file has one line per gene, merging and splitting different samples is very convenient. The file is also easy to combine with expression level tables, functional annotation tables and the like, which can greatly reduce the workload of downstream joint analysis.
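The ratios discussed above can be recomputed from the figures in Table 4 with a short Python sketch. It assumes that "G" and "M" are binary multiples (1G = 1024M), which the text does not state, so the per-sample values need not coincide exactly with the rounded ratios quoted above.

```python
MB_PER_GB = 1024  # assumption: "G" and "M" are binary multiples

# (bam in G, Depth1 in G, Depth2 in M), copied from Table 4.
table4 = {
    "sibrg1-1": (2.3, 1.87, 31),
    "sibrg1-2": (2.0, 1.86, 30),
    "sibrg1-3": (2.4, 1.84, 28),
    "siscr1": (2.0, 1.84, 28),
    "siscr2": (2.0, 1.84, 28),
    "siscr3": (2.2, 1.85, 29),
}

# Per-sample size ratios relative to the Depth2 file.
depth1_to_depth2 = {name: d1 * MB_PER_GB / d2 for name, (bam, d1, d2) in table4.items()}
bam_to_depth2 = {name: bam * MB_PER_GB / d2 for name, (bam, d1, d2) in table4.items()}

# Total size of all per-sample Depth2 files, in M.
total_depth2_mb = sum(d2 for _, _, d2 in table4.values())

for name in table4:
    print(name, round(depth1_to_depth2[name], 1), round(bam_to_depth2[name], 1))
print("total Depth2 (M):", total_depth2_mb)
```

Under this assumption the Depth1-to-Depth2 ratios fall between roughly 60:1 and 70:1, and the Depth2 sizes sum to 174M.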
Example 2
Referring to fig. 3, fig. 3 is a schematic structural diagram of a genetic depth information data compression device according to an embodiment of the present invention. As shown in fig. 3, the gene depth information data compression apparatus may include:
A first conversion unit 210 for receiving a sequencing file of each sample and converting the sequencing file into a bam file;
A second converting unit 220 for converting the bam file of each sample into a depth information file using SAMtools;
A compression unit 230, configured to compress the depth information file to obtain a compressed file;
The compression unit includes:
The filtering subunit is used for filtering the depth information according to the position information of the genes to obtain a first depth information processing file;
the depth merging subunit is used for merging continuous sites with the same depth information in the first depth information processing file;
A section merging subunit, configured to integrate the section depth information of each gene into one row, to obtain a second depth information processing file of each sample;
And the sample merging subunit is used for merging the second depth information processing files of all the samples into one file to obtain a compressed file.
Example 3
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the invention. As shown in fig. 4, the electronic device may include:
a memory 310 in which executable program code is stored;
A processor 320 coupled to the memory 310;
wherein the processor 320 invokes executable program code stored in the memory 310 to perform some or all of the steps in a method for compressing gene depth information data according to the first embodiment.
The embodiment of the invention discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute part or all of the steps in a gene depth information data compression method in the first embodiment.
The embodiment of the invention also discloses a computer program product, wherein the computer program product enables the computer to execute part or all of the steps in the gene depth information data compression method in the first embodiment.
The embodiment of the invention also discloses an application release platform, wherein the application release platform is used for releasing a computer program product, and the computer program product enables the computer to execute part or all of the steps in the gene depth information data compression method in the first embodiment when running on the computer.
In the various embodiments of the present invention, it should be understood that the magnitude of the sequence numbers of the processes does not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-accessible memory. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like, and in particular may be a processor in a computer device) to execute some or all of the steps of the methods according to the embodiments of the present invention.
In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A and that B can be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
Those of ordinary skill in the art will appreciate that some or all of the steps of the various methods of the described embodiments may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
The gene depth information data compression method, apparatus, electronic device, and storage medium disclosed in the embodiments of the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the above description of the embodiments is intended only to help in understanding the method of the present invention and its core ideas. Meanwhile, those skilled in the art may, in accordance with the ideas of the present invention, make changes to the specific implementations and the scope of application. In view of the above, the content of this description should not be construed as limiting the present invention.

Claims (8)

1. A method for compressing gene depth information data, comprising the steps of:
receiving a sequencing file of each sample and converting the sequencing file into a bam file;
converting the bam file of each sample into a depth information file using SAMtools;
compressing the depth information file to obtain a compressed file;
wherein compressing the depth information file to obtain the compressed file comprises:
filtering the depth information by the position information of the genes to obtain a first depth information processing file;
merging consecutive sites having the same depth information in the first depth information processing file;
integrating the interval depth information of each gene into one row to obtain a second depth information processing file of each sample;
merging the second depth information processing files of all samples into one file to obtain the compressed file;
wherein merging consecutive sites having the same depth information in the first depth information processing file comprises:
determining, in the first depth information processing file corresponding to each gene region, the depth information and the number of consecutive sites having that depth information;
wherein integrating the interval depth information of each gene into one row to obtain the second depth information processing file of each sample comprises:
recording the depth information of each target interval of a target gene, wherein the depth information of a target interval comprises the depth value of the target interval and its number of consecutive sites and is expressed in the form a×b, where a represents the depth value of the target interval, b represents the number of consecutive sites having that depth, and × is a connector;
displaying all target interval depth information of the target gene in the same row, with different target interval depth information of the target gene separated by separation marks, to obtain target gene depth information corresponding to the target gene;
and obtaining the target gene depth information of all genes of each sample to generate the second depth information processing file of each sample.
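The merging and single-row integration steps of claim 1 amount to a run-length encoding of each gene's per-site depths. The following is a minimal sketch under stated assumptions: the ASCII letter `x` stands in for the connector, a comma stands in for the separation mark (the claim names neither character for the separator), and the function names are hypothetical. It is an illustration of the idea, not the patented implementation.

```python
from itertools import groupby
from typing import Iterable, List, Tuple

CONNECTOR = "x"  # assumed stand-in for the connector between depth and run length
SEPARATOR = ","  # assumed separation mark between target intervals


def run_lengths(depths: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal depths into (depth, site_count) pairs."""
    return [(depth, sum(1 for _ in run)) for depth, run in groupby(depths)]


def encode_gene(depths: Iterable[int]) -> str:
    """Render one gene's per-site depths as a single-row string."""
    return SEPARATOR.join(
        f"{a}{CONNECTOR}{b}" for a, b in run_lengths(depths)
    )


def decode_gene(encoded: str) -> List[int]:
    """Expand an encoded row back into per-site depths."""
    depths: List[int] = []
    if not encoded:
        return depths
    for token in encoded.split(SEPARATOR):
        a, b = token.split(CONNECTOR)
        depths.extend([int(a)] * int(b))
    return depths


if __name__ == "__main__":
    row = encode_gene([10, 10, 10, 12, 12, 0])
    print(row)               # 10x3,12x2,0x1
    print(decode_gene(row))  # [10, 10, 10, 12, 12, 0]
```

Because the encoding stores only the depth and the length of each run, a gene whose coverage is flat over long stretches shrinks to a handful of tokens, and the original per-site values remain recoverable.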
2. The method for compressing gene depth information data according to claim 1, wherein filtering the depth information by the position information of the genes to obtain the first depth information processing file comprises:
determining target depth information of the gene region;
and screening the positions in the depth information file corresponding to each sample according to the target depth information to obtain the positions of the target depth information, and recording them as the first depth information processing file.
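One way to picture the screening step of claim 2 is to keep only the depth records whose position falls inside a gene region. The sketch below assumes records of the form (chromosome, 1-based position, depth) and gene regions with inclusive start and end coordinates. The type aliases and the function name are hypothetical, and the linear scan over genes is chosen for clarity rather than speed.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

DepthRecord = Tuple[str, int, int]      # (chromosome, 1-based position, depth)
GeneRegion = Tuple[str, int, int, str]  # (chromosome, start, end, gene name)


def filter_by_genes(
    records: Iterable[DepthRecord], genes: Iterable[GeneRegion]
) -> Iterator[Tuple[str, str, int, int]]:
    """Yield (gene, chromosome, position, depth) for sites inside a gene."""
    # Group gene regions by chromosome so each record checks only its own.
    by_chrom: Dict[str, List[Tuple[int, int, str]]] = {}
    for chrom, start, end, name in genes:
        by_chrom.setdefault(chrom, []).append((start, end, name))

    for chrom, pos, depth in records:
        for start, end, name in by_chrom.get(chrom, ()):
            if start <= pos <= end:  # inclusive coordinates (assumption)
                yield name, chrom, pos, depth


if __name__ == "__main__":
    records = [("chr1", 99, 5), ("chr1", 100, 7), ("chr2", 100, 3)]
    genes = [("chr1", 100, 101, "GENE_A")]
    print(list(filter_by_genes(records, genes)))
    # [('GENE_A', 'chr1', 100, 7)]
```

For genome-scale input, a sorted sweep or an interval index would likely replace the inner loop; the filtering rule itself stays the same.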
3. The method for compressing gene depth information data according to claim 1, wherein merging the second depth information processing files of all samples into one file to obtain the compressed file comprises:
merging the second depth information processing files of all samples, wherein the merging rule is that the target gene depth information of the same gene is displayed in the same row and the target gene depth information of the same gene from different samples is displayed in different columns, to obtain the compressed file.
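The merging rule of claim 3 (one row per gene, one column per sample) can be sketched as a small pivot. The example assumes each sample's second depth information processing file has already been loaded as a mapping from gene name to encoded depth string. The function name, the `"gene"` header label, and the `"."` placeholder for a gene missing from a sample are assumptions made for illustration.

```python
from typing import Dict, List


def merge_samples(
    per_sample: Dict[str, Dict[str, str]], missing: str = "."
) -> List[List[str]]:
    """Pivot {sample: {gene: encoded}} into rows of [gene, s1, s2, ...]."""
    samples = list(per_sample)

    # Collect genes in first-seen order so the output is deterministic.
    genes: List[str] = []
    seen = set()
    for sample in samples:
        for gene in per_sample[sample]:
            if gene not in seen:
                seen.add(gene)
                genes.append(gene)

    rows = [["gene"] + samples]  # header row: one column per sample
    for gene in genes:
        rows.append(
            [gene] + [per_sample[s].get(gene, missing) for s in samples]
        )
    return rows


if __name__ == "__main__":
    data = {
        "S1": {"GENE_A": "10x3", "GENE_B": "4x2"},
        "S2": {"GENE_A": "8x3"},
    }
    for row in merge_samples(data):
        print("\t".join(row))
```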
4. The method for compressing gene depth information data according to claim 3, wherein the compressed file includes basic columns and additional columns, and the first row of the compressed file is header information;
wherein the basic columns comprise a chromosome column, a gene start position column, a gene end position column, a gene name column, and a strand (positive/negative) mark column;
and the header of each additional column is a sample name, and the content displayed in each additional column is the target gene depth information of the corresponding sample for the corresponding gene name.
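To make the layout of claim 4 concrete, the sketch below reads such a table back: five basic columns followed by one column per sample, with the first row as the header. Tab-delimited text, the header labels in the sample input, and the function name are all assumptions; the claim fixes only which columns exist.

```python
from typing import Dict, List, Tuple

N_BASIC = 5  # chromosome, gene start, gene end, gene name, strand mark


def parse_compressed(text: str) -> Tuple[List[str], Dict[str, Dict[str, str]]]:
    """Return (sample names, {gene name: {sample: encoded depth string}})."""
    lines = text.strip("\n").split("\n")
    header = lines[0].split("\t")
    samples = header[N_BASIC:]  # additional columns are headed by sample names

    table: Dict[str, Dict[str, str]] = {}
    for line in lines[1:]:
        cells = line.split("\t")
        gene_name = cells[3]  # fourth basic column holds the gene name
        table[gene_name] = dict(zip(samples, cells[N_BASIC:]))
    return samples, table


if __name__ == "__main__":
    text = (
        "chrom\tstart\tend\tgene\tstrand\tS1\tS2\n"
        "chr1\t100\t104\tGENE_A\t+\t10x3,12x2\t8x5\n"
    )
    samples, table = parse_compressed(text)
    print(samples)                # ['S1', 'S2']
    print(table["GENE_A"]["S2"])  # 8x5
```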
5. The method for compressing gene depth information data according to any one of claims 1 to 4, further comprising, after receiving the sequencing file of each sample and converting the sequencing file into a bam file:
performing a first sorting of the contents of the bam file by chromosome;
performing a second sorting of the contents of the first-sorted bam file by gene start position;
and/or,
after the compressed file is obtained, the method further comprises:
recompressing the compressed file with bgzip, and/or reading the compressed file quickly by means of an index built with tabix.
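The two sorting passes of claim 5 produce an ordering by chromosome first and start position second, which can be expressed as a single sort on a composite key. The sketch below is illustrative only: it sorts plain (chromosome, start) tuples rather than a real bam file, and the chromosome ordering (numeric chromosomes in numeric order, then all others alphabetically) is an assumption, since the claim does not specify one.

```python
import re
from typing import List, Tuple


def chrom_key(chrom: str) -> Tuple[int, int, str]:
    """Order numeric chromosomes numerically, then the rest alphabetically."""
    match = re.fullmatch(r"(?:chr)?(\d+)", chrom)
    if match:
        return (0, int(match.group(1)), "")
    return (1, 0, chrom)


def sort_records(records: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sort by chromosome, then by start position within each chromosome."""
    return sorted(records, key=lambda r: (chrom_key(r[0]), r[1]))


if __name__ == "__main__":
    recs = [("chr2", 5), ("chr10", 1), ("chr2", 1), ("chrX", 3)]
    print(sort_records(recs))
    # [('chr2', 1), ('chr2', 5), ('chr10', 1), ('chrX', 3)]
```

A position-sorted layout is also what makes the later indexing step workable, since tabix indexes bgzip-compressed, tab-delimited files that are sorted by sequence name and position.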
6. A gene depth information data compression apparatus, comprising:
a first conversion unit, configured to receive a sequencing file of each sample and convert the sequencing file into a bam file;
a second conversion unit, configured to convert the bam file of each sample into a depth information file using SAMtools;
a compression unit, configured to compress the depth information file to obtain a compressed file;
wherein the compression unit comprises:
a filtering subunit, configured to filter the depth information by the position information of the genes to obtain a first depth information processing file;
a depth merging subunit, configured to merge consecutive sites having the same depth information in the first depth information processing file;
an interval merging subunit, configured to integrate the interval depth information of each gene into one row to obtain a second depth information processing file of each sample;
a sample merging subunit, configured to merge the second depth information processing files of all samples into one file to obtain the compressed file;
wherein the depth merging subunit is configured for:
determining, in the first depth information processing file corresponding to each gene region, the depth information and the number of consecutive sites having that depth information;
and the interval merging subunit is configured for:
recording the depth information of each target interval of a target gene, wherein the depth information of a target interval comprises the depth value of the target interval and its number of consecutive sites and is expressed in the form a×b, where a represents the depth value of the target interval, b represents the number of consecutive sites having that depth, and × is a connector;
displaying all target interval depth information of the target gene in the same row, with different target interval depth information of the target gene separated by separation marks, to obtain target gene depth information corresponding to the target gene;
and obtaining the target gene depth information of all genes of each sample to generate the second depth information processing file of each sample.
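As an illustration of the hand-off between the second conversion unit and the compression unit, the sketch below parses the text of a depth information file into records. It assumes the three-column, tab-separated layout that `samtools depth` writes for a single input file (reference name, 1-based position, depth). The function name is hypothetical, and the claim itself does not specify the file layout.

```python
from typing import Iterable, Iterator, Tuple


def parse_depth_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (chromosome, position, depth) from tab-separated depth lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        chrom, pos, depth = line.split("\t")[:3]
        yield chrom, int(pos), int(depth)


if __name__ == "__main__":
    sample_lines = ["chr1\t100\t7\n", "chr1\t101\t7\n"]
    print(list(parse_depth_lines(sample_lines)))
    # [('chr1', 100, 7), ('chr1', 101, 7)]
```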
7. An electronic device, comprising: a memory storing executable program code; a processor coupled to the memory; the processor invokes the executable program code stored in the memory for performing the gene depth information data compression method of any one of claims 1-5.
8. A computer-readable storage medium, characterized in that it stores a computer program, wherein the computer program causes a computer to execute the gene depth information data compression method according to any one of claims 1 to 5.
CN202211318287.3A 2022-10-26 2022-10-26 Gene depth information data compression method, device, electronic equipment and storage medium Active CN115691682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211318287.3A CN115691682B (en) 2022-10-26 2022-10-26 Gene depth information data compression method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115691682A CN115691682A (en) 2023-02-03
CN115691682B CN115691682B (en) 2024-09-10

Family

ID=85099621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211318287.3A Active CN115691682B (en) 2022-10-26 2022-10-26 Gene depth information data compression method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115691682B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant