
CN104428423A - Method and system for determining integration manner of foreign gene in human genome - Google Patents

Info

Publication number
CN104428423A
Authority
CN
China
Prior art keywords
sequencing
assembling
result
human genome
impurities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201280074522.5A
Other languages
Chinese (zh)
Inventor
曾玺
李伟阳
陈盛培
蒋慧
汪建
王俊
杨焕明
张秀清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd
Publication of CN104428423A
Legal status: Pending

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

Provided are a method, a system, and a computer-readable medium for determining the integration manner of a foreign gene in the human genome. The method comprises: capturing, with a capture probe, DNA fragments that may contain an integrated foreign gene fragment from a human genomic nucleic acid sample; sequencing the captured DNA fragments to obtain a sequencing result; performing a first filtering (removal of impurities) on the sequencing result; performing a first alignment of the filtered sequencing result against a known human genome sequence and the foreign gene sequence to obtain sequencing data that may contain a foreign gene integration fragment; assembling that sequencing data to obtain an assembly result; performing a second filtering on the assembly result; and performing a second alignment on the filtered assembly result and determining the integration manner of the foreign gene in the human genome based on the result of the second alignment.

Description

Method and system for determining integration manner of foreign gene in human genome
Priority information
None
Technical field
The present invention belongs to the field of biotechnology. Specifically, it relates to a bioinformatic analysis method for detecting the integration mode of a pathogen genome in the human genome and, more particularly, to a method, a system, and a computer-readable medium for determining the integration mode of a foreign gene in the human genome.
Background
It is known that viruses such as HBV, HIV, and HPV can integrate their own DNA into the human genome. HBV, which replicates through reverse transcription, causes inflammation at the site of infection through its replication, which in turn triggers carcinogenesis of the cells. Among the HPV viruses, high-risk HPV16 causes inflammation after infecting the cervix of a patient and promotes abnormal growth of cervical cells, leading to cancer; HPV16 integration is pronounced in cancerous tissue.
For many years, methods for studying gene integration have remained at the stage of PCR detection of the HPV16 integration status in cervical cancer and precancerous lesions, and Alu-based methods, among others, were once widely used. With the development of high-throughput sequencing, improvements in sequencing and in information analysis methods have provided a basis for analyzing pathogen integration sites. However, conventional bioinformatic analysis is still largely limited to paired-end (PE) sequence alignment: the approximate insertion position is inferred from the alignment positions of the PE reads, and the exact position cannot be determined. The methods currently used for such research therefore still need improvement.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the prior art. To this end, it proposes a method that can effectively find the exact integration sites of pathogen genome fragments across the whole genome.
In one aspect, the present invention proposes a method for determining the integration mode of a foreign gene in the human genome. According to embodiments of the invention, the method includes: capturing, with a capture probe, DNA fragments that may contain an integrated foreign gene fragment from a human genomic nucleic acid sample; sequencing the captured DNA fragments to obtain a sequencing result consisting of multiple sequencing data (reads); performing a first filtering (removal of impurities) on the sequencing result to obtain a filtered sequencing result; performing a first alignment of the filtered sequencing result against a known human genome sequence and the foreign gene sequence to obtain sequencing data that may contain a foreign gene integration fragment; assembling the sequencing data that may contain a foreign gene integration fragment to obtain an assembly result consisting of multiple assembled data; performing a second filtering on the assembly result to obtain a filtered assembly result; and performing a second alignment of the filtered assembly result against the known human genome sequence and the foreign gene sequence and, based on the result of the second alignment, determining the integration mode of the foreign gene in the human genome. With this method, the integration mode of a foreign gene, such as a pathogen genome, in the human genome can be determined effectively.
In a second aspect, the present invention proposes a system for determining the integration mode of a foreign gene in the human genome. According to embodiments of the invention, the system includes: a capture device, adapted to capture, with a capture probe, DNA fragments that may contain an integrated foreign gene fragment from a human genomic nucleic acid sample; a sequencing device, connected to the capture device and adapted to sequence the captured DNA fragments to obtain a sequencing result consisting of multiple sequencing data; a first filtering device, connected to the sequencing device and adapted to perform a first filtering on the sequencing result to obtain a filtered sequencing result; a first alignment device, connected to the first filtering device and adapted to perform a first alignment of the filtered sequencing result against a known human genome sequence and the foreign gene sequence to obtain sequencing data that may contain a foreign gene integration fragment; an assembly device, connected to the first alignment device and adapted to assemble the sequencing data that may contain a foreign gene integration fragment to obtain an assembly result consisting of multiple assembled data; a second filtering device, connected to the assembly device and adapted to perform a second filtering on the assembly result to obtain a filtered assembly result; a second alignment device, connected to the second filtering device and adapted to perform a second alignment of the filtered assembly result against the known human genome sequence and the foreign gene sequence; and an analysis device, adapted to determine the integration mode of the foreign gene in the human genome based on the result of the second alignment. The system according to embodiments of the invention can effectively carry out the method described above, and can thus effectively determine the integration mode of a foreign gene, such as a pathogen genome, in the human genome.
In yet another aspect, the present invention proposes a computer-readable medium. According to embodiments of the invention, instructions are stored on the computer-readable medium, and the instructions are adapted to be executed by a processor to determine the integration mode of a foreign gene in the human genome through the following steps: performing a first filtering on a sequencing result to obtain a filtered sequencing result; performing a first alignment of the filtered sequencing result against a known human genome sequence and the foreign gene sequence to obtain sequencing data that may contain a foreign gene integration fragment; assembling the sequencing data that may contain a foreign gene integration fragment to obtain an assembly result consisting of multiple assembled data; performing a second filtering on the assembly result to obtain a filtered assembly result; and performing a second alignment of the filtered assembly result against the known human genome sequence and the foreign gene sequence and, based on the result of the second alignment, determining the integration mode of the foreign gene in the human genome. The sequencing result is obtained as follows: DNA fragments that may contain an integrated foreign gene fragment are captured from a human genomic nucleic acid sample with a capture probe, and the captured DNA fragments are sequenced to obtain a sequencing result consisting of multiple sequencing data. With this computer-readable medium, the integration mode of a foreign gene, such as a pathogen genome, in the human genome can be determined effectively.
Additional aspects and advantages of the invention are set forth in part in the description that follows; in part they will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken together with the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of a method for determining the integration mode of a foreign gene in the human genome according to one embodiment of the invention;
Fig. 2 is a schematic flow chart of a method for determining the integration mode of a foreign gene in the human genome according to another embodiment of the invention; and
Fig. 3 is a schematic structural diagram of a system for determining the integration mode of a foreign gene in the human genome according to one embodiment of the invention.
Detailed description
Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
The terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of the invention, "multiple" means two or more, unless specifically defined otherwise. As used herein, the terms "above" and "below" include the stated number itself; for example, "more than 80%" means ≥ 80% and "less than 2%" means ≤ 2%.
PER assembly: as used herein, PER refers to the assembly of paired-end (PE) sequencing data, that is, assembling each pair of PE reads obtained by paired-end sequencing according to the overlap between the two sequences.
Replacement: as used herein, the phenomenon in which a pathogen genome DNA fragment is inserted into the human genome while the human genomic DNA at the insertion position is deleted.
PCR duplicates: repeated amplification products generated during PCR.
Adapter: a sequencing adapter; adapter sequence appears in some of the sequence data coming off the sequencer.
BWA: abbreviation of Burrows-Wheeler Aligner, a sequence alignment program.
SOAP: abbreviation of Short Oligonucleotide Analysis Package, an alignment program.
In the present invention, unless otherwise expressly specified and limited, terms such as "connected" and "connection" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or an internal connection between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the circumstances.
Method for determining the integration mode of a foreign gene in the human genome
According to the first aspect of the invention, a method for determining the integration mode of a foreign gene in the human genome is proposed. According to embodiments of the invention and with reference to Fig. 1, the method includes:
Capture step S100: DNA fragments that may contain an integrated foreign gene fragment are captured from a human genomic nucleic acid sample with a capture probe. According to embodiments of the invention, the type of foreign gene that can be analyzed with the method is not particularly limited, as long as it can integrate into the human genome and its gene sequence is known or can be obtained. According to embodiments of the invention, the foreign gene under study may be a pathogen genome. According to a specific example of the invention, the pathogen is HBV. The integration of a pathogen such as HBV into the human genome can thus be analyzed effectively.
Sequencing step S200: the captured DNA fragments are sequenced to obtain a sequencing result consisting of multiple sequencing data. According to embodiments of the invention, the way in which the captured DNA fragments are sequenced is not particularly limited. According to embodiments of the invention, sequencing is carried out on a second-generation sequencing platform, and the sequencing library can be sequenced with at least one device selected from HiSeq 2000, SOLiD, 454, and single-molecule sequencing devices. The high throughput and deep sequencing offered by these devices further improve the efficiency of determining the integration mode of the foreign gene in the human genome. A person skilled in the art will of course understand that other sequencing methods and devices can also be used, such as third-generation sequencing technologies and more advanced sequencing technologies that may be developed later. According to embodiments of the invention, the length of the sequencing data obtained is not particularly limited; a sequencing length of 100 bp is preferred, which further improves the analysis.
First filtering step S300: a first filtering (removal of impurities) is performed on the sequencing result to obtain a filtered sequencing result. According to embodiments of the invention, the type of first filtering is not particularly limited; for example, it may include at least one of removing PCR duplicates, removing low-quality sequencing data, and removing sequencing data containing adapters. This further improves analysis efficiency.
First alignment step S400: the filtered sequencing result is aligned against a known human genome sequence and the foreign gene sequence (first alignment) to obtain sequencing data that may contain a foreign gene integration fragment. According to embodiments of the invention, the first alignment can be carried out with SOAP. This further improves analysis efficiency.
Assembly step S500: the sequencing data that may contain a foreign gene integration fragment are assembled to obtain an assembly result consisting of multiple assembled data. According to embodiments of the invention, the assembly is based on the overlap between sequencing data.
Second filtering step S600: a second filtering is performed on the assembly result to obtain a filtered assembly result. According to embodiments of the invention, the second filtering includes removing duplicated assembled data.
Second alignment step S700: the filtered assembly result is aligned against the known human genome sequence and the foreign gene sequence (second alignment). According to embodiments of the invention, the second alignment is carried out with BWA.
Analysis step S800: based on the result of the second alignment, the integration mode of the foreign gene in the human genome is determined. According to embodiments of the invention, this further includes selecting the assembled data that can be aligned to both the known human genome sequence and the foreign gene sequence; such assembled data contain both human genome breakpoint information and foreign gene breakpoint information. According to embodiments of the invention, the human genome breakpoint information and the foreign gene breakpoint information can further be used to judge whether a replacement mutation is present, or to determine at least one of the insertion length and the type of the foreign gene in the human genome, for example the insertion length and type of at least part of the foreign gene inserts in the human genome.
The method for determining the integration mode of a foreign gene in the human genome according to embodiments of the invention is explained in detail below with reference to Fig. 2, taking HBV as an example. As shown in Fig. 2, it includes the following steps:
1. Acquisition and sequencing of pathogen genome integration sequences
Methods for obtaining pathogen nucleic acid integration sequences include, but are not limited to, the following: DNA fragments that may contain an integrated pathogen genome fragment are captured from the sample using capture probe technology, and the captured sequences are then sequenced.
2. Removing PCR duplicates, low-quality sequencing data, and sequencing data containing adapters
Strategy for removing PCR duplicates: two sequences that are exactly identical are regarded as duplicates. If either read of a PE pair is a duplicate, the whole pair is removed.
Strategy for removing low-quality sequencing data: a read is regarded as low-quality if bases with a sequencing quality value of 5 or less account for 50% or more of its total number of bases. If either read of a PE pair is low-quality, the whole pair is removed.
Strategy for removing sequencing data containing adapters: a read that contains a stretch of adapter sequence is regarded as adapter-containing. If either read of a PE pair is adapter-containing, the whole pair is removed.
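The three strategies of step 2 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the invention: the adapter sequence `ADAPTER` and the function names are hypothetical, the first occurrence of a duplicated read is kept, and the 50% threshold is treated as inclusive, following the convention stated in the description that "more than" includes the number itself.

```python
ADAPTER = "AGATCGGAAGAGC"  # hypothetical adapter; substitute the sequence used in library preparation


def is_low_quality(quals, q_cutoff=5, min_fraction=0.5):
    """True if bases with quality <= q_cutoff make up at least min_fraction of the read."""
    low = sum(1 for q in quals if q <= q_cutoff)
    return low >= min_fraction * len(quals)


def first_filter(pairs, adapter=ADAPTER):
    """pairs: list of ((seq1, quals1), (seq2, quals2)). Returns the pairs passing all three filters."""
    seen = set()  # read sequences encountered so far
    kept = []
    for pair in pairs:
        seqs = [seq for seq, _ in pair]
        duplicated = any(seq in seen for seq in seqs)
        seen.update(seqs)
        if duplicated:
            continue  # one mate repeats an earlier read: drop the whole pair
        if any(is_low_quality(quals) for _, quals in pair):
            continue  # one mate is low-quality: drop the whole pair
        if any(adapter in seq for seq in seqs):
            continue  # one mate contains adapter sequence: drop the whole pair
        kept.append(pair)
    return kept
```

In every case the unit that is removed is the pair, never a single mate, so that the paired relationship needed for the later assembly is preserved.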
3. SOAP alignment, selection of the required sequencing data, and calculation of alignment rates
The processed sequencing data are aligned separately to the human genome hg19 and to the pathogen genome. Because a pathogen genome generally has several subtypes, a suitable subtype is generally chosen as the pathogen reference genome as required. After alignment, the sequencing data that may contain a pathogen genome integration fragment are selected by analyzing the pairing relationship between the two alignment results. The proportion of useful sequences in the raw sequencing data is calculated, together with the human genome alignment rate and the pathogen genome alignment rate among the useful sequences.
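The selection and rate calculation of step 3 might look like the sketch below. The text does not spell out the exact pairing rule, so the rule used here is an assumption: a pair is kept as a candidate unless both mates align only to the human genome, both align only to the pathogen genome, or neither aligns anywhere. All function and variable names are hypothetical; reads are identified by `(pair_id, mate_number)`.

```python
def mate_class(read_id, human_hits, pathogen_hits):
    """Classify one read: 'H' human only, 'P' pathogen only, 'HP' both, 'U' unaligned."""
    label = ("H" if read_id in human_hits else "") + ("P" if read_id in pathogen_hits else "")
    return label or "U"


def select_candidates(pair_ids, human_hits, pathogen_hits):
    """Return IDs of pairs whose mates are not explained by one genome alone (assumed rule)."""
    candidates = []
    for pid in pair_ids:
        classes = {mate_class((pid, m), human_hits, pathogen_hits) for m in (1, 2)}
        if classes not in ({"H"}, {"P"}, {"U"}):
            candidates.append(pid)
    return candidates


def alignment_rates(n_raw_reads, clean_read_ids, human_hits, pathogen_hits):
    """Fraction of useful reads in the raw data, and per-genome alignment rates among them."""
    n_clean = len(clean_read_ids)
    return {
        "useful_fraction": n_clean / n_raw_reads,
        "human_rate": len(human_hits & clean_read_ids) / n_clean,
        "pathogen_rate": len(pathogen_hits & clean_read_ids) / n_clean,
    }
```

Here `human_hits` and `pathogen_hits` are sets of read IDs reported as aligned by the two alignment runs; how they are parsed from the aligner output is left out of the sketch.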
4. PER assembly
PER assembly is performed on the sequencing data obtained in step 3 that may contain a pathogen genome integration fragment. PER refers to the assembly of paired-end (PE) sequencing data, that is, assembling each pair of PE reads obtained by paired-end sequencing according to the overlap between the two sequences.
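A minimal sketch of overlap-based merging of one read pair is given below. It assumes exact matching in the overlap and a hypothetical minimum overlap length `min_overlap`; neither is stated in the text, and a practical tool would tolerate mismatches and use base qualities.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge read1 with the reverse complement of read2 using the longest exact
    suffix/prefix overlap of at least min_overlap bases. Returns None if there is none."""
    mate = reverse_complement(read2)
    for k in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-k:] == mate[:k]:
            return read1 + mate[k:]
    return None
```

For the two mates to overlap at all, the library fragment has to be shorter than the combined read length, which is why fragment length matters for the assembly success rate.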
5. Removing duplicated sequences again
PER assembly yields a set of assembled sequences, and duplicates are removed from this set once more. The strategy used here is the de-duplication strategy for single-end (SE) sequencing data: when a sequence is a duplicate, that sequence is removed.
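The single-end de-duplication of step 5 reduces to keeping one copy of each assembled sequence. The sketch below assumes that the first occurrence is kept and that only exact, same-strand duplicates count; the function name is hypothetical.

```python
def dedup_sequences(seqs):
    """Return the sequences in their original order with exact duplicates removed."""
    seen = set()
    unique = []
    for s in seqs:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique
```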
6. BWA re-alignment and extraction of breakpoint information
The de-duplication of step 5 yields a set of sequences. This set is then aligned again, separately, to the human genome hg19 and to the pathogen genome using BWA. By analyzing the result files of the two alignments, the sequences that align to both the human genome hg19 and the pathogen genome are selected; these sequences contain breakpoint information. Their alignments to the human genome hg19 and to the pathogen genome are analyzed separately to obtain the distribution of pathogen genome integration fragments on the human genome and their distribution on the pathogen genome.
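One way to read breakpoint coordinates off the two alignments, and to tally the sequences supporting each insertion site, is sketched below. It is a simplified illustration: the `Hit` record, the forward-strand assumption, the use of the longest hit per genome, the `site` label grouping the left and right endpoints of one insertion, and the per-million `scale` are all assumptions introduced here, not details given in the text. Scaling by the number of effective reads is one possible reading of normalizing support counts by data volume.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Hit:
    """One local alignment of an assembled sequence (forward strand assumed)."""
    genome: str   # "human" or "pathogen"
    ref: str      # chromosome or pathogen reference name
    q_start: int  # 0-based start on the assembled sequence
    q_end: int    # end on the assembled sequence (exclusive)
    r_start: int  # 1-based first aligned reference position
    r_end: int    # 1-based last aligned reference position


def junction(hits):
    """Return breakpoint coordinates for a sequence that hits both genomes, else None."""
    human = [h for h in hits if h.genome == "human"]
    pathogen = [h for h in hits if h.genome == "pathogen"]
    if not human or not pathogen:
        return None
    h = max(human, key=lambda x: x.q_end - x.q_start)     # longest human hit
    p = max(pathogen, key=lambda x: x.q_end - x.q_start)  # longest pathogen hit
    if h.q_start <= p.q_start:
        # human sequence first, then pathogen: left endpoint of the insertion
        return {"side": "left", "human": (h.ref, h.r_end), "pathogen": (p.ref, p.r_start)}
    # pathogen sequence first, then human: right endpoint of the insertion
    return {"side": "right", "human": (h.ref, h.r_start), "pathogen": (p.ref, p.r_end)}


def tally_support(records, n_effective_reads, scale=1_000_000):
    """records: iterable of (seq_id, site, side); site is any hashable label for one insertion."""
    table = defaultdict(lambda: {"left": 0, "right": 0, "ids": []})
    for seq_id, site, side in records:
        table[site][side] += 1
        table[site]["ids"].append(seq_id)
    factor = scale / n_effective_reads
    return {
        site: {
            "left": e["left"], "right": e["right"], "total": e["left"] + e["right"],
            "left_norm": e["left"] * factor, "right_norm": e["right"] * factor,
            "total_norm": (e["left"] + e["right"]) * factor,
            "ids": e["ids"],
        }
        for site, e in table.items()
    }
```

Parsing the aligner output into `Hit` records is deliberately left out; reverse-strand alignments would need their query coordinates flipped before `junction` is applied.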
The distribution information here includes, but is not limited to: the alignment position; the number of sequencing data supporting the left endpoint of a given insertion breakpoint; the number supporting its right endpoint; the total number of supporting sequencing data; the corresponding left-endpoint, right-endpoint, and total support counts after normalization for data volume; and the IDs of all sequencing data supporting the insertion breakpoint. Normalization is performed according to the number of effective sequencing data.
"Sequencing data supporting the left endpoint of an insertion breakpoint" and "sequencing data supporting the right endpoint of an insertion breakpoint", as used here, both refer only to insertion breakpoints in the human genome; for the pathogen genome, no insertion breakpoint position is defined.
7. Checking for replacement-type variation
By analyzing the relationship between the human genome breakpoints and the pathogen genome breakpoints, the data are checked for replacement-type variation.
8. Calculating the length and type of the pathogen genome insert
By analyzing the relationship between the human genome breakpoints and the pathogen genome breakpoints, the breakpoints for which insert information can be calculated are identified, and the length and type of the pathogen genome insert at these breakpoints are calculated.
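Steps 7 and 8 can be illustrated with the sketch below, which assumes that the left and right endpoints of the same insertion have already been paired and that each endpoint carries a human position and a pathogen position. The dictionary keys and the function name are hypothetical, the pathogen genome is treated as linear, and the insert "type" is not computed because this section does not define it.

```python
def describe_insertion(left, right):
    """left/right: dicts with 'human_pos' and 'pathogen_pos' for the two ends of one insertion.

    left['human_pos'] is the last human base before the insert and
    right['human_pos'] is the first human base after it (1-based coordinates).
    """
    deleted = right["human_pos"] - left["human_pos"] - 1
    insert_length = abs(right["pathogen_pos"] - left["pathogen_pos"]) + 1
    return {
        "deleted_human_bases": max(deleted, 0),
        "is_replacement": deleted > 0,  # human DNA is missing at the insertion position
        "insert_length": insert_length,
    }
```

A positive number of deleted human bases corresponds to the replacement phenomenon defined earlier in the description; a value of zero indicates a plain insertion under these assumptions.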
9. Calculating the actual capture efficiency
The actual efficiency of probe capture is calculated by combining the results of the re-alignment with the initial sequence information.
The method for determining the integration mode of a foreign gene in the human genome according to embodiments of the invention can find exact insertion positions of pathogen genome fragments across the whole human genome. It can also report possible replacement types for some insertions and the type of some pathogen genome inserts. The method is fast and easy to use: taking 5G of raw data as an example, the analysis can be completed within two days.
Through extensive and in-depth research, the inventors have for the first time constructed a bioinformatic detection method for detecting the integration mode of a pathogen genome in a sample to be tested, together with its application. Specifically, the inventors aligned, screened, assembled, and re-aligned sequences obtained by sequence capture technology, thereby establishing a complete bioinformatic detection workflow. Using this workflow, signals of the integration mode of the pathogen genome in the human genome were detected, and the present invention was completed on this basis.
According to embodiments of the invention, the type of sample that can be processed is not particularly limited, as long as it contains nucleic acid. The type of nucleic acid is not particularly limited either: it may be deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), preferably DNA. A person skilled in the art will understand that RNA can be converted by conventional means into cDNA with the corresponding sequence for subsequent detection and analysis. According to embodiments of the invention, the source of the sample is not particularly limited. According to an example of the invention, a cancer tissue sample can be used as the test sample, so that the DNA sequences of pathogen genome inserts can be extracted from it and the insertion of pathogen genome fragments can then be detected and analyzed. According to embodiments of the invention, examples of samples that can be used include, but are not limited to, patient plasma, cancer tissue cells, and paracancerous tissue cells.
In one embodiment of the invention, a DNA library must be prepared before probes are applied for sequence capture; methods for preparing sample libraries are well known to those skilled in the art. "DNA library preparation" refers to fragmenting the target segments of the genome to obtain a mixture of DNA fragments of a certain size. According to embodiments of the invention, the method and equipment for capturing specific sequences from a sample are not particularly limited either, and commercial probes can be used. In the present invention, a sequencing sequence refers to a sequence fragment output by the sequencer, that is, a sequencing datum (read). In one embodiment of the invention, the DNA fragments used for sequencing are captured by specific probes. A probe must be pure and must not be affected by nucleic acids of other, different sequences. Typical probes are cloned DNA sequences or DNA obtained by PCR amplification; synthetic oligonucleotides, or RNA obtained by in vitro transcription of cloned DNA sequences, can also be used as probes. Probe length can range from 20 to 500 mer, preferably 50 to 300 mer, more preferably 250 mer. Probe design and synthesis methods are known to those skilled in the art; probes can be produced by chemical synthesis or purchased commercially.
In the present invention, obtaining sequencing sequence from sample can be carried out using the method for sequencing, and the sequencing can be carried out by any sequence measurement, including but not limited to dideoxy chain termination;It is preferred that high-throughout sequence measurement, including but not limited to second generation sequencing technologies either single-molecule sequencing technology.Heretofore described second generation microarray dataset(Metzker ML. Sequencing technologies-the next generation. Nat Rev Genet. 2010 Jan;ll(l):31-46) include but is not limited to Illumina (such as GA series, HiSeq series), (such as GS is serial by Life Technologies (such as SOLID series, semiconductor sequencing series) and Roche)The microarray dataset provided Deng company.
In the present invention, the sequencing type includes but is not limited to paired-end (PE) sequencing, and the read length includes but is not limited to 100 bp. In one embodiment of the present invention, the sequencing platform is Illumina/Solexa and the sequencing type is paired-end, yielding 100 bp DNA sequences with a paired positional relationship.
In the present invention, the main library length of the sequencing sequences should be less than 200 bp. Only then is the assembly success rate high enough to ensure sufficient data for the subsequent analysis. In some embodiments of the present invention, the sequencing yield is about 5 G for plasma samples and about 1 G for tissue cell samples. The larger the data volume, the more complete the insert information that can be detected.
Using the method of the present invention, pathogen genome insertion signals can be determined from data produced by next-generation sequencing; the signals include but are not limited to position, genotype, number and length.
In the present invention, the human genome sequence used for the SOAP2 and BWA alignments is the human genome reference sequence, build 37, in the NCBI database (hg19; NCBI Build 37).
In some embodiments of the present invention, the pathogen genome reference sequences used for the SOAP2 and BWA alignments are 23 genome sequences selected from 8 subtypes of the pathogen.
In the present invention, the alignment comprises an alignment before PER assembly and an alignment after PER assembly. The alignment before PER assembly allows 5 mismatched bases. It may be performed with any alignment program available to those skilled in the art, such as the Short Oligonucleotide Analysis Package (SOAP) or BWA (Burrows-Wheeler Aligner 0.5.8c (r1536)): the sequencing sequences are aligned to the reference genome sequences, and the sequencing data are classified according to how they align to the pathogen genome and to the human genome. The alignment may be run with the default parameters provided by the program, or with parameters selected as needed by those skilled in the art. In one embodiment of the present invention, the alignment software used is SOAPaligner/soap2. The alignment after PER assembly may be performed with any alignment program that can be set to allow gaps of sufficient length, again with the default parameters or with parameters selected as needed by those skilled in the art. For example, it may be performed with the BWA-SW option of BWA (Burrows-Wheeler Aligner), which is available to those skilled in the art, and the alignment positions are determined from the alignment.
In some embodiments of the present invention, for step 3, the conditions for extracting sequencing data that may contain pathogen genome insert information (meeting any one of the following is sufficient) are:
1) in a pair of PE reads, allowing at most 5 mismatches, one read aligns to the human genome and the other aligns to the pathogen genome;
2) in a pair of PE reads, allowing at most 5 mismatches, one read aligns to the human genome and the other aligns to no reference sequence;
3) in a pair of PE reads, allowing at most 5 mismatches, one read aligns to the pathogen genome and the other aligns to no reference sequence;
4) in a pair of PE reads, allowing at most 5 mismatches, neither read aligns to any reference sequence.
In some embodiments of the present invention, the Soap alignment step in Fig. 2 (selecting the required sequencing data and calculating the alignment rates) outputs a file that includes but is not limited to the following items: yield, alignment rate, contamination rate and effective sequencing data ratio. Effective sequencing data are what remains of the raw sequencer output after contaminated sequencing data have been removed. Contaminated sequencing data here means PCR-duplicate reads, adapter-containing reads and low-quality reads.
In some embodiments of the present invention, in the Soap alignment step in Fig. 2 (selecting the required sequencing data and calculating the alignment rates), the insert size standard deviation (sd) for the Soap alignment is set to 30. This keeps the utilization of the sequencing data as high as possible. With reference to Table 1 below, let the number of sequencing data falling in cell n of the grid be V_n, let the human genome alignment rate be alnRate_human and the pathogen genome alignment rate be alnRate_virus, and let V_all = V_1 + V_2 + ... + V_9. The alignment rates are then calculated as:
alnRate_virus = (V_1 + V_2 + V_3 + V_4 + V_5 + V_6) / V_all
alnRate_human = (V_1 + V_2 + V_4 + V_5 + V_7 + V_8) / V_all
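The two alignment-rate formulas above can be sketched in Python as follows. This is only an illustration: the function name `alignment_rates` and the dictionary layout (cell number 1 to 9 mapped to a count) are assumptions made for the example, not part of the described method.

```python
def alignment_rates(cell_counts):
    """Compute (alnRate_human, alnRate_virus) from nine-grid counts.

    cell_counts maps a cell number (1..9) of the classification grid to
    the number of sequencing data in that cell; missing cells count as 0.
    The dictionary layout is a hypothetical choice for this sketch.
    """
    v = [cell_counts.get(n, 0) for n in range(1, 10)]  # v[0] is V_1
    v_all = sum(v)
    if v_all == 0:
        raise ValueError("no sequencing data in the grid")
    # Cells used by each formula, as given in the text above.
    virus_cells = (1, 2, 3, 4, 5, 6)
    human_cells = (1, 2, 4, 5, 7, 8)
    aln_rate_virus = sum(v[n - 1] for n in virus_cells) / v_all
    aln_rate_human = sum(v[n - 1] for n in human_cells) / v_all
    return aln_rate_human, aln_rate_virus
```

Calling `alignment_rates({1: 50, 5: 30, 9: 20})` would, for instance, count cells 1 and 5 toward both rates and cell 9 toward neither.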
In some embodiments of the present invention, in the Soap alignment step in Fig. 2 (selecting the required sequencing data and calculating the alignment rates), the conditions for extracting sequencing data that may contain pathogen genome insert information (meeting any one of the following is sufficient) are: 1) in a pair of PE reads, allowing at most 5 mismatches, one read aligns to the human genome and the other aligns to the pathogen genome; 2) in a pair of PE reads, allowing at most 5 mismatches, one read aligns to the human genome and the other aligns to no reference sequence; 3) in a pair of PE reads, allowing at most 5 mismatches, one read aligns to the pathogen genome and the other aligns to no reference sequence; 4) in a pair of PE reads, allowing at most 5 mismatches, neither read aligns to any reference sequence.
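The four extraction conditions can be expressed as a small predicate. The sketch below assumes that each read of a pair has already been labelled with the reference it aligns to (with at most 5 mismatches); the labels `"human"`, `"pathogen"` and `"unmapped"` and the function name are hypothetical and chosen only for illustration.

```python
HUMAN = "human"
PATHOGEN = "pathogen"
UNMAPPED = "unmapped"

# Unordered label combinations that satisfy conditions 1) to 4).
_CANDIDATE_COMBINATIONS = {
    frozenset([HUMAN, PATHOGEN]),   # condition 1
    frozenset([HUMAN, UNMAPPED]),   # condition 2
    frozenset([PATHOGEN, UNMAPPED]),  # condition 3
    frozenset([UNMAPPED]),          # condition 4: both reads unmapped
}


def may_contain_insert(read1_label, read2_label):
    """Return True if a PE pair meets any of the four conditions."""
    return frozenset([read1_label, read2_label]) in _CANDIDATE_COMBINATIONS
```

Pairs whose two reads both align to the human genome, or both to the pathogen genome, are the only combinations rejected by this predicate.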
In some embodiments of the present invention, "calculating the useful sequences in the raw sequencing data" in the SOAP alignment step means removing from the raw sequencing data the PE read pairs that contain PCR duplicates, low-quality reads or adapter sequence. In some embodiments of the present invention, the sequences to be selected are those in cells 5, 6, 8 and 9 of Table 1 below. The rows of Table 1 represent the result of aligning the sequences to the human genome. PE means that both reads of a pair align to the reference sequence within the set insert size range; SE means that only one read of the pair aligns to the reference sequence, or that both reads align but their positions are not within the set insert size range; Unmap means that neither read of the pair aligns to the reference sequence.
Table 1. Nine-grid classification of sequencing data
In the present invention, regarding the step of removing duplicate sequences again in Fig. 2: although the PCR duplicates present in the PE sequencing data were already filtered in the first impurity removal (step 2), that filtering is incomplete. Because the overlapping parts of some read pairs may contain mismatches and yet still be assembled, duplicates may appear among the assembled sequences.
In some embodiments of the present invention, in the BWA re-alignment and breakpoint extraction step, the re-alignment with BWA 0.5.8c (r1536) uses the BWA-SW option, which is suited to long-sequence alignment and supports alignment with a high error tolerance. It uses a heuristic Smith-Waterman-like algorithm to search for high-scoring alignment positions. All parameters used in the alignment are the default values of this software version; for details see http://bio-bwa.sourceforge.net/bwa.shtml.
In the BWA re-alignment and breakpoint extraction step of the present invention, after the BWA re-alignment is complete the alignment results must be processed to extract breakpoint information. The alignment result selected for each sequence must meet the following conditions:
1) The sequence aligns to both the human genome and the pathogen genome.
2) Whichever type of reference genome it is aligned to, the length of the part that aligns to the reference sequence must be greater than or equal to 30 bp.
3) Whichever type of reference genome it is aligned to, the length of the part that may be an insert must be greater than or equal to 5 bp.
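The three selection conditions can be sketched as a filter over per-sequence alignment summaries. The record type `AssembledAlignment`, its fields, and the reading of "the part that may be an insert" as the part of the sequence not aligned to a given reference are all assumptions made for this illustration; the source does not specify a data layout.

```python
from dataclasses import dataclass


@dataclass
class AssembledAlignment:
    """Hypothetical summary of one assembled sequence after re-alignment."""
    read_length: int        # length of the assembled sequence
    human_aligned: int      # bases aligned to the human reference (0 if none)
    pathogen_aligned: int   # bases aligned to the pathogen reference (0 if none)


def passes_breakpoint_filter(aln, min_aligned=30, min_insert=5):
    """Apply conditions 1) to 3) to one alignment summary."""
    for aligned in (aln.human_aligned, aln.pathogen_aligned):
        # Conditions 1 and 2: aligned to this reference over >= 30 bp.
        if aligned < min_aligned:
            return False
        # Condition 3 (as interpreted here): the remainder, which may be
        # the insert relative to this reference, is >= 5 bp.
        if aln.read_length - aligned < min_insert:
            return False
    return True
```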
In some embodiments of the present invention, in the BWA re-alignment and breakpoint extraction step, the BWA alignment follows a basic strategy of truncation. That is, when for example the first half of a read cannot be aligned to the reference sequence, the aligner directly clips off the unaligned part of the read and then continues the alignment.
In some embodiments of the present invention, in the BWA re-alignment and breakpoint extraction step:
Several situations may be encountered when extracting breakpoint information from the alignment results; they are listed below together with how the present invention handles them. All of the situations below are ones that may occur when a sequence is aligned to the human genome. Note that the alignment position reported by BWA is the leftmost position of the aligned sequence. Let the alignment position reported by the aligner be p.
1) The upstream part of a sequence is pathogen genome sequence, the downstream part is human DNA sequence, and no other type of base is present. The pathogen genome insertion position is then the alignment position p reported by the aligner.
2) The upstream part of a sequence is human DNA sequence, the downstream part is pathogen genome sequence, and no other type of base is present. Let the length of the upstream human DNA part be x; the pathogen genome insertion position is then p + x.
3) The upstream and downstream parts of a sequence are both human genome sequence and the middle part is a pathogen genome insert. Let the length of the upstream human DNA part be x_1 and the length of the downstream DNA part be x_2; the pathogen genome insertion position is then p + x_1.
4) The upstream and downstream parts of a sequence are both pathogen genome sequence and the middle part is human genome sequence. Such a sequence carries the signal of two pathogen genome fragment integration points, so two insertion positions can be extracted from its alignment result. Let the length of the upstream part be y_1, the length of the downstream part be y_2 and the length of the human genome sequence in the middle be y; the pathogen genome fragment insertion positions are then p and p + y.
In one embodiment of the present invention, the output file of the BWA re-alignment and breakpoint extraction step gives the left-support read count, the right-support read count, the total support read count and the corresponding normalized counts. These items are explained one by one below.
Left-support read count: the number of reads supporting the left end of a given insertion breakpoint. Specifically, with respect to the reference sequence used in the alignment, it is the number of sequences upstream of the pathogen genome insert.
Right-support read count: the number of reads supporting the right end of a given insertion breakpoint. Specifically, with respect to the reference sequence used in the alignment, it is the number of sequences downstream of the pathogen genome insert.
Total support read count: the sum of the left-support and right-support read counts.
Normalized left-support count: let the left-support read count before normalization be V_L and the number of useful read pairs be b M (million) pairs; the normalized left-support count is then V_L / b.
Normalized right-support count: let the right-support read count before normalization be V_R and the number of useful read pairs be b M pairs; the normalized right-support count is then V_R / b.
Normalized total support count: the sum of the normalized left-support count and the normalized right-support count.
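The four breakpoint cases and the support-count normalization can be sketched as follows. The representation of an aligned sequence as an ordered list of `(origin, length)` segments, and both function names, are assumptions introduced for this example. The single rule used here (each junction lies at p plus the number of human bases before it) is a generalization that reproduces cases 1) to 4); it is not claimed to be the implementation used in the source.

```python
def insertion_positions(p, segments):
    """Return insertion positions on the human reference.

    p        -- leftmost aligned position on the human reference
    segments -- ordered (origin, length) tuples, origin being "human"
                or "pathogen" (hypothetical representation)
    """
    positions = []
    human_bases_seen = 0
    for index, (origin, length) in enumerate(segments):
        # A change of origin between neighbouring segments is a junction.
        if index > 0 and origin != segments[index - 1][0]:
            position = p + human_bases_seen
            if position not in positions:
                positions.append(position)
        if origin == "human":
            human_bases_seen += length
    return positions


def normalized_support(left, right, useful_pairs):
    """Normalize support counts per million useful read pairs."""
    b = useful_pairs / 1_000_000  # useful pairs expressed in millions
    left_norm = left / b
    right_norm = right / b
    return left_norm, right_norm, left_norm + right_norm
```

With p = 1000, a sequence made of 40 human bases, a pathogen insert and 30 further human bases yields the single position 1040, matching case 3).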
In some embodiments of the present invention, in the BWA re-alignment and breakpoint extraction step:
The sequencing data in the resulting output file must be deduplicated once more, because the deduplication operations of step 2 and step 5 did not take into account that a reverse-complement sequence may also be a duplicate. A final check is therefore performed on the sequences in the result to remove redundant sequences and obtain the optimal result.
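One way to treat a sequence and its reverse complement as the same record is to reduce both to a shared canonical key. The sketch below assumes that the first occurrence is kept and later copies are dropped, which the text does not state explicitly; the function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def deduplicate_strand_aware(sequences):
    """Drop exact duplicates, counting reverse complements as duplicates.

    Keeps the first occurrence of each sequence (an assumption).
    """
    seen = set()
    kept = []
    for seq in sequences:
        # The lexicographically smaller of the two strands is the key,
        # so a sequence and its reverse complement share one key.
        key = min(seq, reverse_complement(seq))
        if key not in seen:
            seen.add(key)
            kept.append(seq)
    return kept
```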
In some embodiments of the present invention, the step of checking for the replacement type works as follows: when two breakpoints on the human genome each show only left-support reads or only right-support reads, the sequence between the two breakpoint positions is the human genome sequence that has been replaced. The alignment positions of the left-support reads and of the right-support reads on the pathogen genome are then located, and the sequence between the two positions found is the pathogen genome sequence that replaced it. In the present invention, all replacement cases are output by combining the qualifying breakpoints on the human genome in every possible way.
In some embodiments of the present invention, the step of calculating the length and type of the pathogen genome insert works as follows: the left-end support sequence and the right-end support sequence of a pathogen genome insert are found, and the alignment positions of these two support sequences on the pathogen genome are then located; the sequence between the two positions is the insert.
In the step of calculating the actual capture efficiency of the present invention, the number of sequences taking part in the BWA re-alignment that align both to the human genome hg19 and to the pathogen genome is counted and denoted A. The number of useful PE read pairs in the raw sequencing data is denoted B. The effective capture rate is then A/B.
In the present invention, the breakpoint information finally obtained includes the breakpoints of the pathogen genome fragment on the human genome and the breakpoints of the pathogen genome fragment on the pathogen genome.
In the present invention, the bioinformatic analysis workflow is in theory applicable to signal detection for any case in which pathogen genetic material, DNA or RNA, is inserted into the human genome.
System for determining the integration manner of a foreign gene in the human genome
According to another aspect of the invention, the present invention provides a system for determining the integration manner of a foreign gene in the human genome. With reference to Fig. 3, the system includes a capture device 100, a sequencing device 200, a first impurity-removal device 300, a first alignment device 400, an assembly device 500, a second impurity-removal device 600, a second alignment device 700 and an analysis device 800. As shown in Fig. 3, these devices are connected in sequence along the workflow.
According to embodiments of the present invention, the capture device 100 uses capture probes to capture, from a human genomic nucleic acid sample, DNA fragments that may contain an integrated foreign gene fragment. According to embodiments of the present invention, the type of foreign gene that can be analyzed with the method of the invention is not particularly limited, as long as it can integrate into the human genome and its gene sequence can be obtained or is known. According to embodiments of the present invention, the foreign gene that can be studied is a pathogen genome. In addition, according to a specific example of the present invention, the pathogen is HBV. The integration of a pathogen such as HBV into the human genome can thus be analyzed effectively.
According to embodiments of the present invention, the sequencing device 200 sequences the captured DNA fragments to obtain a sequencing result consisting of multiple sequencing data. According to embodiments of the present invention, the manner of sequencing the captured DNA fragments is not particularly limited. According to embodiments of the present invention, sequencing is performed on a second-generation sequencing platform. According to embodiments of the present invention, the genome sequencing library may be sequenced with at least one selected from HiSeq 2000, SOLiD, 454 and single-molecule sequencing devices. The high-throughput, deep-sequencing characteristics of these devices can thereby be exploited, further improving the efficiency of determining single-cell chromosomal aneuploidy. Of course, those skilled in the art will understand that other sequencing methods and devices, such as third-generation sequencing technologies and more advanced sequencing technologies that may be developed later, can also be used for genome sequencing. According to embodiments of the present invention, the length of the sequencing data obtained by genome sequencing is not particularly limited. According to embodiments of the present invention, the read length is preferably 100 bp, which can further improve the analysis.
According to embodiments of the present invention, the first impurity-removal device 300 performs a first impurity removal on the sequencing result to obtain a sequencing result that has undergone the first impurity removal. According to embodiments of the present invention, the type of the first impurity removal is not particularly limited; for example, it may further include at least one of removing PCR duplicates, removing low-quality sequencing data and removing adapter-containing sequencing data. Analysis efficiency can thereby be further improved.
According to embodiments of the present invention, the first alignment device 400 performs a first alignment of the sequencing result that has undergone the first impurity removal against the known human genome sequence and foreign gene sequence, to obtain sequencing data that may contain foreign gene integration fragments. According to embodiments of the present invention, the first alignment may be performed with SOAP. Analysis efficiency can thereby be further improved.
According to embodiments of the present invention, the assembly device 500 assembles the sequencing data that may contain foreign gene integration fragments to obtain an assembly result consisting of multiple assembled data. According to embodiments of the present invention, the assembly is performed on the basis of the overlap between sequencing data.
According to embodiments of the present invention, the second impurity-removal device 600 performs a second impurity removal on the assembly result to obtain an assembly result that has undergone the second impurity removal. According to embodiments of the present invention, the second impurity removal further includes removing duplicate assembled data.
According to embodiments of the present invention, the second alignment device 700 performs a second alignment of the assembly result that has undergone the second impurity removal against the known human genome sequence and foreign gene sequence. According to embodiments of the present invention, the second alignment is performed with BWA.
The analysis device 800 determines the integration manner of the foreign gene in the human genome on the basis of the second alignment result. According to embodiments of the present invention, this determination further includes selecting the assembled data that align to both the known human genome sequence and the known foreign gene sequence; these assembled data contain human genome breakpoint information and foreign gene breakpoint information. According to embodiments of the present invention, it is further possible, on the basis of the human genome breakpoint information and the foreign gene breakpoint information, to judge whether a replacement mutation is present, or to determine the insertion length and type of at least part of the foreign gene in the human genome.
It should be noted that, as those skilled in the art will understand, the features and advantages described above for the method of determining the integration manner of a foreign gene in the human genome also apply to the system for determining the integration manner of a foreign gene in the human genome; for convenience of description they are not repeated here.
Computer-readable medium
In still another aspect of the invention, the present invention provides a computer-readable medium. According to embodiments of the present invention, instructions are stored on the computer-readable medium, and the instructions are suitable for being executed by a processor to determine the integration manner of a foreign gene in the human genome through the following steps: performing a first impurity removal on a sequencing result to obtain a sequencing result that has undergone the first impurity removal; performing a first alignment of the sequencing result that has undergone the first impurity removal against the known human genome sequence and foreign gene sequence to obtain sequencing data that may contain foreign gene integration fragments; assembling the sequencing data that may contain foreign gene integration fragments to obtain an assembly result consisting of multiple assembled data; performing a second impurity removal on the assembly result to obtain an assembly result that has undergone the second impurity removal; and performing a second alignment of the assembly result that has undergone the second impurity removal against the known human genome sequence and foreign gene sequence and, on the basis of the second alignment result, determining the integration manner of the foreign gene in the human genome. The sequencing result is obtained as follows: capture probes are used to capture, from a human genomic nucleic acid sample, DNA fragments that may contain an integrated foreign gene fragment; the captured DNA fragments are sequenced to obtain a sequencing result consisting of multiple sequencing data. With this computer-readable medium, the integration manner of a foreign gene such as a pathogen genome in the human genome can be determined effectively.
For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that the parts of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
Those skilled in the art will understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, may each exist physically on their own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software function module. If the integrated module is implemented as a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like.
It should be noted that, as those skilled in the art will understand, the features and advantages described above for the method of determining the integration manner of a foreign gene in the human genome also apply to the computer-readable medium; for convenience of description they are not repeated here. The solution of the present invention is explained below with reference to examples. Those skilled in the art will understand that the following examples merely illustrate the present invention and should not be taken as limiting its scope. Where no specific technique or condition is indicated in the examples, the work was carried out according to the techniques or conditions described in the literature in the art (for example J. Sambrook et al., translated by Huang Peitang et al., Molecular Cloning: A Laboratory Manual, 3rd edition, Science Press) or according to the product instructions. Reagents and instruments for which no manufacturer is indicated are conventional, commercially available products, which can for example be purchased from Illumina.
Example 1
Sample library preparation
1. Sample source
The sample is liver cancer tissue from one patient. Whole-genome sequencing information, and breakpoint information found from the whole-genome data, are available for this patient's liver cancer tissue.
2. Preliminary experiments
The preliminary experimental part comprises the following steps:
(1) DNA extraction
(2) Sample library preparation
The library was constructed according to Illumina's standard library preparation protocol (Paired-End Sample Preparation Guide): genomic DNA was fragmented with a Covaris S2, followed by end repair, A-tailing, adapter ligation and PCR of the adapter-ligated fragments to obtain the sample library.
(3) Preparation of HBV capture probes
Probes were designed for HBV types B and C. Each probe is 60 bp long, and by design adjacent probes overlap by 55 bp. Specifically, the HBV probes were prepared as follows: primer design, PCR, PCR purification and electrophoresis check, fragmentation of the PCR products, electrophoresis check of the fragmented products, and probe storage.
(4) Hybridization of HBV capture probes with the sample library, and sequencing
Using the NimbleGen hybridization platform, the HBV capture probes were hybridized with the sample library, followed by post-hybridization elution, PCR amplification and loading of the PCR products for sequencing. Sequencing was then performed on the HiSeq 2000 platform, with cBot and HiSeq 2000 (PE sequencing) operated according to the official Illumina/Solexa instructions. Main library length: 170 bp; read length: 100 bp; sequencing yield: 1 Gbp.
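The probe design rule in step (3) above (60 bp probes, 55 bp overlap between neighbours) implies a 5 bp step between probe start positions. The sketch below tiles a linear reference string under that rule. The function name is hypothetical, and treating the reference as linear is a simplification made for the example; it says nothing about how the probes were actually produced, which the text describes as PCR followed by fragmentation.

```python
def tile_probes(reference, probe_length=60, overlap=55):
    """Return (start, sequence) tuples for probes tiled along a reference.

    Adjacent probes share `overlap` bases, so consecutive start
    positions differ by probe_length - overlap.
    """
    step = probe_length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than probe_length")
    last_start = len(reference) - probe_length
    return [
        (start, reference[start:start + probe_length])
        for start in range(0, last_start + 1, step)
    ]
```

For a 70 bp reference this gives three probes, starting at positions 0, 5 and 10.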
3. Bioinformatic analysis
(1) Removing PCR duplicates, low-quality sequencing data and adapter-containing sequencing data
After the raw data come off the sequencer, PCR duplicates, low-quality sequencing data and adapter-containing sequencing data are removed from them.
Strategy for removing PCR duplicates: when two sequences are identical, they are regarded as duplicates. When one read of a PE pair is a duplicate, the whole pair is removed.
Strategy for removing low-quality sequencing data: when the bases with a sequencing quality value of 5 or less make up more than 50% of the total number of bases in a read, the read is regarded as low quality. When one read of a PE pair is low quality, the whole pair is removed.
Strategy for removing adapter-containing sequencing data: when a read contains a stretch of adapter sequence, it is regarded as adapter-containing. When one read of a PE pair is adapter-containing, the whole pair is removed.
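The three removal strategies can be sketched as a single pass over the read pairs. Several details below are assumptions made for illustration only: the in-memory layout of a pair, exact substring matching for the adapter, the adapter string being supplied by the caller, and keeping the first occurrence of a duplicated read while dropping later pairs that repeat it.

```python
def is_low_quality(qualities, max_quality=5, max_fraction=0.5):
    """True if bases with quality <= 5 exceed 50% of the read."""
    low = sum(1 for q in qualities if q <= max_quality)
    return low > max_fraction * len(qualities)


def filter_pairs(pairs, adapter):
    """Filter PE pairs; each pair is ((seq1, quals1), (seq2, quals2)).

    A pair is dropped as soon as either of its reads is low quality,
    contains the adapter, or repeats a read seen in a kept pair.
    """
    seen = set()
    kept = []
    for pair in pairs:
        (seq1, quals1), (seq2, quals2) = pair
        if is_low_quality(quals1) or is_low_quality(quals2):
            continue
        if adapter in seq1 or adapter in seq2:
            continue
        if seq1 in seen or seq2 in seen:
            continue  # PCR duplicate of an earlier kept read
        seen.update((seq1, seq2))
        kept.append(pair)
    return kept
```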
4. Soap alignment, selecting the required sequencing data and calculating the alignment rates
The processed sequencing data are aligned separately to the human genome hg19 and to the HBV genome. Because HBV has multiple subtypes, the HBV genome here comprises the genome sequences of 23 subtypes (AB014381.1, AB032431.1, AB033554.1, AB036910.1, AB064310.1, AF090842.1, AF100309.1, AF160501.1, AF223965.1, AF405706.1, AY090454.1, AY090457.1, AY090460.1, AY123041.1, D00329.1, M32138.1, X02763.1, X04615.1, X51970.1, X65259.1, X69798.1, X75657.1, X85254.1). After the alignment is complete, the sequencing data that may contain viral integration fragments are selected by analyzing the pairing relationship between the two alignment results. The proportion of useful sequences in the raw sequencing data, and the human genome alignment rate and the HBV genome alignment rate among the useful sequences, are also calculated. The Soap alignment parameters are: -m 138 -x 198 -p 8 -l 40 -v 5 -r 1. Table 2 below is the data quality report, Table 3 is the classification of the sequencing data based on the preliminary Soap alignment results, and Table 4 gives the alignment rate statistics.
Table 2. Data quality report
GC content 48.78; 49.08
Proportion of adapter-containing sequencing data 0.07%
Proportion of low-quality sequencing data 6.32%
PCR duplication rate 5.14%
Proportion of valid data 88.47%
Table 3. Nine-grid classification results of the sequencing data
Table 4. Alignment rate statistics
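The selection of candidate read pairs and the alignment rate statistics of step 4 might be sketched as below. The contents of Tables 3 and 4 are not reproduced here, so the nine-grid is assumed to be the 3 x 3 combination of the status of each mate (human, HBV, unmapped), and a pair is assumed to be a candidate whenever its two mates have different statuses. The function names and the `/1`, `/2` read-name suffixes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def mate_status(read_id: str, human_hits: Set[str], hbv_hits: Set[str]) -> str:
    """Classify one mate; a mate hitting both references counts as 'hbv' (simplification)."""
    if read_id in hbv_hits:
        return "hbv"
    if read_id in human_hits:
        return "human"
    return "unmapped"


def classify_pairs(pair_ids: Iterable[str], human_hits: Set[str],
                   hbv_hits: Set[str]) -> Tuple[Counter, List[str]]:
    """Return the 3 x 3 grid of (mate 1 status, mate 2 status) counts and the candidate pairs."""
    grid: Counter = Counter()
    candidates: List[str] = []
    for pid in pair_ids:
        s1 = mate_status(pid + "/1", human_hits, hbv_hits)
        s2 = mate_status(pid + "/2", human_hits, hbv_hits)
        grid[(s1, s2)] += 1
        # Assumed rule: mates that behave differently may flank an integration junction.
        if s1 != s2:
            candidates.append(pid)
    return grid, candidates


def alignment_rates(grid: Counter) -> Dict[str, float]:
    """Fraction of all mates aligned to the human genome and to the HBV genome."""
    total_mates = 2 * sum(grid.values())
    if total_mates == 0:
        return {"human": 0.0, "hbv": 0.0}
    counts: Counter = Counter()
    for (s1, s2), n in grid.items():
        counts[s1] += n
        counts[s2] += n
    return {ref: counts[ref] / total_mates for ref in ("human", "hbv")}
```

In practice the two hit sets would be filled from the SOAP output files for the hg19 run and the HBV run; parsing those files is left out of this sketch.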
5. PER assembly
The sequencing data obtained in the preceding alignment step that may contain viral integration fragments are subjected to PER assembly. PER refers to the assembly of paired-end sequencing data: each PE read pair obtained by paired-end sequencing is assembled according to the overlap between its two reads. The assembly success rate is 94.33%.
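A minimal sketch of overlap-based assembly of one read pair is given below. It is not the PER tool itself: the function names, the minimum overlap of 10 bases and the default of requiring an exact overlap are assumptions made for illustration.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written with A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_rate: float = 0.0) -> Optional[str]:
    """Merge a read pair into one sequence, or return None if no acceptable overlap exists.

    read2 is reverse-complemented so that both reads lie on the same strand, then the
    longest overlap between the end of read1 and the start of that mate is searched.
    """
    mate = reverse_complement(read2)
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-olen:], mate[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * olen:
            return read1 + mate[olen:]
    return None
```

With real data a small mismatch tolerance would probably be needed to absorb sequencing errors in the overlap; pairs for which `merge_pair` returns `None` correspond to the pairs that fail to assemble.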
6. Removing duplicate sequences again
PER assembly yields a set of assembled sequences, which is deduplicated again. The strategy here is the one used for single-end (SE) sequencing data: when a sequence is duplicated, it is removed. As a result, 1.823% of the sequences were removed as duplicates, leaving 850696 usable assembled sequences.
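The single-end style deduplication of the assembled sequences can be illustrated as follows. The function name is hypothetical, and keeping the first copy of each duplicated sequence is an assumption, since the text does not say which copy is retained.

```python
from typing import List, Sequence, Tuple


def deduplicate(sequences: Sequence[str]) -> Tuple[List[str], float]:
    """Return the unique sequences in first-seen order and the fraction removed."""
    seen = set()
    unique: List[str] = []
    for seq in sequences:
        if seq in seen:
            continue  # an identical sequence was already kept
        seen.add(seq)
        unique.append(seq)
    removed = 1 - len(unique) / len(sequences) if sequences else 0.0
    return unique, removed
```

The second return value is the quantity reported in this step as the percentage of duplicate sequences removed.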
7. BWA realignment and extraction of breakpoint information
The deduplication in the preceding step yields a set of sequences. This set is realigned, using the BWA software, to the human genome hg19 and to the HBV genome, respectively. By analyzing the result files of the two alignments, the sequences that align to both the human genome hg19 and the HBV genome are selected; these sequences contain breakpoint information. Their alignments to hg19 and to the HBV genome are analyzed separately to obtain the integration sites of HBV in the human genome and their distribution across the HBV genome. In total 33 breakpoints were found, of which 8 passed the threshold. The results are then deduplicated to give the final result. Table 5 below shows the most significant positions at which HBV was found to be inserted into the human genome hg19.
Table 5. HBV insertion breakpoint information
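How breakpoints might be derived from the two BWA result sets in step 7 is sketched below. The record layout `Hit`, the functions `junction` and `call_breakpoints`, and the use of a minimum number of supporting sequences as the threshold are all assumptions; strand is ignored for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Hit:
    ref: str        # reference name, e.g. "chr1" or an HBV accession
    ref_start: int  # first aligned reference position
    ref_end: int    # last aligned reference position
    q_start: int    # start of the aligned part on the assembled sequence

Junction = Tuple[Tuple[str, int], Tuple[str, int], str]


def junction(human: Hit, virus: Hit) -> Junction:
    """Return (human breakpoint, viral breakpoint, side of the human-aligned part)."""
    if human.q_start <= virus.q_start:
        # Human part comes first: the junction is at its last base and the virus part's first base.
        return (human.ref, human.ref_end), (virus.ref, virus.ref_start), "left"
    return (human.ref, human.ref_start), (virus.ref, virus.ref_end), "right"


def call_breakpoints(human_hits: Dict[str, Hit], virus_hits: Dict[str, Hit],
                     min_support: int = 2) -> Dict[Junction, int]:
    """Count supporting sequences per junction and keep those reaching min_support."""
    support: Counter = Counter()
    # Only sequences aligned to both references carry breakpoint information.
    for name in human_hits.keys() & virus_hits.keys():
        support[junction(human_hits[name], virus_hits[name])] += 1
    return {bp: n for bp, n in support.items() if n >= min_support}
```

Grouping identical junctions also performs the final deduplication of the results, because several supporting sequences collapse into one breakpoint entry with a support count.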
8. Checking for replacement-type variation
By analyzing the relationship between the human genome breakpoints and the HBV genome breakpoints, the presence of replacement-type variation is checked. Specifically, when two breakpoints on the human genome lie within 500 bp of each other, and each of the two breakpoints is supported only by left-end supporting sequencing data or only by right-end supporting sequencing data, a replacement may have occurred between the two breakpoints. The HBV genome alignment information corresponding to the two breakpoints is then looked up to determine the type and position of the replacement. Table 6 below shows the 7 replacement events found by the present invention.
Table 6. Replacement types
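The screening rule of step 8 can be expressed as a short Python sketch. The `Breakpoint` record and the function name are hypothetical, and the condition is read here as "each of the two breakpoints has support from one side only", which is one interpretation of the text.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Breakpoint:
    chrom: str
    pos: int
    left_support: int   # number of left-end supporting sequences
    right_support: int  # number of right-end supporting sequences

    def one_sided(self) -> bool:
        """True if the breakpoint is supported from exactly one side."""
        return (self.left_support > 0) != (self.right_support > 0)


def candidate_replacements(breakpoints: Iterable[Breakpoint],
                           max_distance: int = 500) -> List[Tuple[Breakpoint, Breakpoint]]:
    """Return pairs of one-sided breakpoints on the same chromosome within max_distance bp."""
    ordered = sorted(breakpoints, key=lambda bp: (bp.chrom, bp.pos))
    pairs = []
    for a, b in combinations(ordered, 2):
        close = a.chrom == b.chrom and abs(a.pos - b.pos) <= max_distance
        if close and a.one_sided() and b.one_sided():
            pairs.append((a, b))
    return pairs
```

Each returned pair is only a candidate; as the text states, the corresponding HBV alignments still have to be inspected to decide the type and position of the replacement.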
9. Calculating the length and type of HBV insert fragments
By analyzing the relationship between the human genome breakpoints and the HBV genome breakpoints, the breakpoints that delimit HBV insert fragments can be identified, and the length and type of the HBV insert fragment at each of these breakpoints can be calculated. Specifically, the left-end supporting sequence and the right-end supporting sequence of a viral insert fragment are found, and the alignment positions of these two sequences in the HBV genome are then determined; the sequence between the two positions is the insert fragment. Here, only one insertion breakpoint was found to meet the requirements for determining the insert fragment type. Table 7 below shows the types of insert fragment found by the present invention.
Table 7. HBV insert fragment types
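Extracting the insert fragment between the two HBV alignment positions might look like the sketch below. The function name and the 1-based inclusive coordinates are assumptions, as is the wrap-around branch, which is added because the HBV genome is circular; the source text does not describe that case.

```python
def insert_fragment(hbv_genome: str, left_pos: int, right_pos: int) -> str:
    """Return the HBV sequence from left_pos to right_pos (1-based, inclusive).

    If right_pos is smaller than left_pos, the fragment is assumed to run through
    the end of the linear reference and continue from its first base.
    """
    size = len(hbv_genome)
    if not (1 <= left_pos <= size and 1 <= right_pos <= size):
        raise ValueError("positions must lie within the genome")
    if left_pos <= right_pos:
        return hbv_genome[left_pos - 1:right_pos]
    return hbv_genome[left_pos - 1:] + hbv_genome[:right_pos]
```

The insert length is simply `len(insert_fragment(...))`. How the insert type is assigned is not specified in the text, so it is not modelled here.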
10. Calculating the effective capture efficiency
The effective efficiency of probe capture is calculated by combining the realignment results with the initial sequence information. Specifically, the number of sequences in the BWA realignment that align to both the human genome hg19 and the HBV genome is counted and denoted A, and the number of useful PE read pairs in the raw sequencing data is denoted B. The effective capture rate is then A/B. The effective capture rate calculated here is 0.0001059.
Industrial applicability
The method, system and computer-readable medium of the present invention for determining the integration manner of a foreign gene in the human genome can be used effectively to determine the integration manner in the human genome of a foreign gene, for example a pathogen genome. In this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. The schematic use of these terms in this specification does not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and spirit of the present invention, the scope of which is defined by the claims and their equivalents.

Claims (1)

  1. Claims
    1. A method for determining the integration manner of a foreign gene in the human genome, characterized by comprising: capturing, from a human genomic nucleic acid sample and using a capture probe, DNA fragments that may contain an integrated foreign gene fragment; sequencing the captured DNA fragments to obtain a sequencing result composed of a plurality of sequencing data; performing a first impurity removal on the sequencing result to obtain a sequencing result that has undergone the first impurity removal;
    performing a first alignment of the sequencing result that has undergone the first impurity removal against a known human genome sequence and a foreign gene sequence, to obtain sequencing data that may contain a foreign gene integration fragment;
    assembling the resulting sequencing data that may contain a foreign gene integration fragment to obtain an assembly result, the assembly result being composed of a plurality of assembled data;
    performing a second impurity removal on the assembly result to obtain an assembly result that has undergone the second impurity removal; and
    performing a second alignment of the assembly result that has undergone the second impurity removal against the known human genome sequence and the foreign gene sequence, and determining, based on the result of the second alignment, the integration manner of the foreign gene in the human genome.
    2. The method according to claim 1, characterized in that the foreign gene is a pathogen genome.
    3. The method according to claim 2, characterized in that the pathogen is HBV.
    4. The method according to claim 1, characterized in that the sequencing is carried out on a second-generation sequencing platform.
    5. The method according to claim 1, characterized in that the first impurity removal further comprises at least one of removing PCR duplicates, removing low-quality sequencing data and removing adapter-containing sequencing data.
    6. The method according to claim 1, characterized in that the first alignment is carried out using SOAP.
    7. The method according to claim 1, characterized in that the assembling is carried out based on the overlap between sequencing data.
    8. The method according to claim 1, characterized in that the second impurity removal further comprises removing duplicated assembled data.
    9. The method according to claim 1, characterized in that the second alignment is carried out using BWA.
    10. The method according to claim 9, characterized in that determining, based on the result of the second alignment, the integration manner of the foreign gene in the human genome further comprises:
    selecting assembled data that can be aligned to both the known human genome sequence and the foreign gene sequence, the assembled data containing human genome breakpoint information and foreign gene breakpoint information.
    11. The method according to claim 10, characterized by further comprising:
    judging, based on the human genome breakpoint information and the foreign gene breakpoint information, whether a replacement mutation exists; or determining, based on the human genome breakpoint information and the foreign gene breakpoint information, at least one of the length and the type of the foreign gene insert fragment in the human genome.
    12. A system for determining the integration manner of a foreign gene in the human genome, characterized by comprising: a capture device, the capture device being adapted to capture, from a human genomic nucleic acid sample and using a capture probe, DNA fragments that may contain an integrated foreign gene fragment;
    a sequencing device, the sequencing device being connected to the capture device and adapted to sequence the captured DNA fragments to obtain a sequencing result composed of a plurality of sequencing data;
    a first impurity-removal device, the first impurity-removal device being connected to the sequencing device and adapted to perform a first impurity removal on the sequencing result to obtain a sequencing result that has undergone the first impurity removal;
    a first alignment device, the first alignment device being connected to the first impurity-removal device and adapted to perform a first alignment of the sequencing result that has undergone the first impurity removal against a known human genome sequence and a foreign gene sequence, to obtain sequencing data that may contain a foreign gene integration fragment;
    an assembly device, the assembly device being connected to the first alignment device and adapted to assemble the resulting sequencing data that may contain a foreign gene integration fragment to obtain an assembly result, the assembly result being composed of a plurality of assembled data;
    a second impurity-removal device, the second impurity-removal device being connected to the assembly device and adapted to perform a second impurity removal on the assembly result to obtain an assembly result that has undergone the second impurity removal;
    a second alignment device, the second alignment device being connected to the second impurity-removal device and adapted to perform a second alignment of the assembly result that has undergone the second impurity removal against the known human genome sequence and the foreign gene sequence; and
    an analysis device, the analysis device being adapted to determine, based on the result of the second alignment, the integration manner of the foreign gene in the human genome.
    13. The system according to claim 12, characterized in that the sequencing device is a second-generation sequencing platform.
    14. The system according to claim 12, characterized in that the first impurity-removal device further comprises at least one of the following:
    a unit adapted to remove PCR duplicates;
    a unit adapted to remove low-quality sequencing data; and
    a unit adapted to remove adapter-containing sequencing data.
    15. The system according to claim 12, characterized in that the first alignment device is adapted to perform alignment using SOAP.
    16. The system according to claim 12, characterized in that the assembly device is adapted to perform assembly based on the overlap between sequencing data.
    17. The system according to claim 12, characterized in that the second impurity-removal device further comprises a unit adapted to remove duplicated assembled data.
    18. The system according to claim 12, characterized in that the second alignment device is adapted to perform alignment using BWA.
    19. The system according to claim 18, characterized in that the analysis device is adapted to select assembled data that can be aligned to both the known human genome sequence and the foreign gene sequence, the assembled data containing human genome breakpoint information and foreign gene breakpoint information.
    20. The system according to claim 19, characterized in that the analysis device is adapted to:
    judge, based on the human genome breakpoint information and the foreign gene breakpoint information, whether a replacement mutation exists; or determine, based on the human genome breakpoint information and the foreign gene breakpoint information, at least one of the length and the type of the foreign gene insert fragment in the human genome.
    21. A computer-readable medium, characterized in that instructions are stored on the computer-readable medium, the instructions being adapted to be executed by a processor to determine the integration manner of a foreign gene in the human genome through the following steps:
    performing a first impurity removal on a sequencing result to obtain a sequencing result that has undergone the first impurity removal;
    performing a first alignment of the sequencing result that has undergone the first impurity removal against a known human genome sequence and a foreign gene sequence, to obtain sequencing data that may contain a foreign gene integration fragment;
    assembling the resulting sequencing data that may contain a foreign gene integration fragment to obtain an assembly result, the assembly result being composed of a plurality of assembled data;
    performing a second impurity removal on the assembly result to obtain an assembly result that has undergone the second impurity removal; and
    performing a second alignment of the assembly result that has undergone the second impurity removal against the known human genome sequence and the foreign gene sequence, and determining, based on the result of the second alignment, the integration manner of the foreign gene in the human genome,
    wherein the sequencing result is obtained through the following steps:
    capturing, from a human genomic nucleic acid sample and using a capture probe, DNA fragments that may contain an integrated foreign gene fragment; and sequencing the captured DNA fragments to obtain the sequencing result composed of a plurality of sequencing data.
    22. The computer-readable medium according to claim 21, characterized in that the foreign gene is a pathogen genome.
    23. The computer-readable medium according to claim 22, characterized in that the pathogen is HBV.
    24. The computer-readable medium according to claim 21, characterized in that the sequencing is carried out on a second-generation sequencing platform.
    25. The computer-readable medium according to claim 21, characterized in that the first impurity removal further comprises at least one of removing PCR duplicates, removing low-quality sequencing data and removing adapter-containing sequencing data.
    26. The computer-readable medium according to claim 21, characterized in that the first alignment is carried out using SOAP.
    27. The computer-readable medium according to claim 21, characterized in that the assembling is carried out based on the overlap between sequencing data.
    28. The computer-readable medium according to claim 21, characterized in that the second impurity removal further comprises removing duplicated assembled data.
    29. The computer-readable medium according to claim 21, characterized in that the second alignment is carried out using BWA.
    30. The computer-readable medium according to claim 29, characterized in that determining, based on the result of the second alignment, the integration manner of the foreign gene in the human genome further comprises:
    selecting assembled data that can be aligned to both the known human genome sequence and the foreign gene sequence, the assembled data containing human genome breakpoint information and foreign gene breakpoint information.
    31. The computer-readable medium according to claim 30, characterized by further comprising:
    judging, based on the human genome breakpoint information and the foreign gene breakpoint information, whether a replacement mutation exists; or determining, based on the human genome breakpoint information and the foreign gene breakpoint information, at least one of the length and the type of the foreign gene insert fragment in the human genome.
CN201280074522.5A 2012-07-06 2012-07-06 Method and system for determining integration manner of foreign gene in human genome Pending CN104428423A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/078311 WO2014005329A1 (en) 2012-07-06 2012-07-06 Method and system for determining integration manner of foreign gene in human genome

Publications (1)

Publication Number Publication Date
CN104428423A true CN104428423A (en) 2015-03-18

Family

ID=49881273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201280074522.5A Pending CN104428423A (en) 2012-07-06 2012-07-06 Method and system for determining integration manner of foreign gene in human genome

Country Status (2)

Country Link
CN (1) CN104428423A (en)
WO (1) WO2014005329A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110612351A (en) * 2017-03-20 2019-12-24 Illumina公司 Methods and compositions for preparing nucleic acid libraries
CN111584003A (en) * 2020-04-10 2020-08-25 中国人民解放军海军军医大学 Optimized detection method for virus sequence integration
CN112639987A (en) * 2018-06-29 2021-04-09 格瑞尔公司 Nucleic acid rearrangement and integration analysis

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111199772B (en) * 2019-12-27 2023-05-23 上海派森诺生物科技股份有限公司 PEDV (porcine epidemic diarrhea virus) genome analysis method based on second-generation sequencing
CN113957130B (en) * 2021-09-27 2023-12-22 江汉大学 Method for identifying transgenic event based on high-throughput sequencing and probe enrichment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DURAN USTEK et al.: "A genome-wide analysis of lentivector integration sites using targeted sequence capture and next generation sequencing technology", 《INFECTION, GENETICS AND EVOLUTION》, vol. 12, 14 May 2012 (2012-05-14), pages 1349 - 1354 *
MATTHEW RUFFALO et al.: "Comparative analysis of algorithms for next-generation sequencing read alignment", 《BIOINFORMATICS》 *
ZHAOSHI JIANG et al.: "The effects of hepatitis B virus integration into the genomes of hepatocellular carcinoma patients", 《GENOME RESEARCH》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110612351A (en) * 2017-03-20 2019-12-24 Illumina公司 Methods and compositions for preparing nucleic acid libraries
CN110612351B (en) * 2017-03-20 2023-08-11 Illumina公司 Methods and compositions for preparing nucleic acid libraries
CN112639987A (en) * 2018-06-29 2021-04-09 格瑞尔公司 Nucleic acid rearrangement and integration analysis
CN111584003A (en) * 2020-04-10 2020-08-25 中国人民解放军海军军医大学 Optimized detection method for virus sequence integration
CN111584003B (en) * 2020-04-10 2022-05-10 中国人民解放军海军军医大学 Optimized detection method for virus sequence integration

Also Published As

Publication number Publication date
WO2014005329A1 (en) 2014-01-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150318
