CN104293940B

CN104293940B - Method for constructing a sequencing library and use thereof

Info

Publication number: CN104293940B
Application number: CN201410521540.4A
Authority: CN
Inventors: 管彦芳; 钱朝阳; 吕小星; 常连鹏; 易鑫; 朱红梅; 杨玲; 吴仁花
Original assignee: TIANJIN BGI TECHNOLOGY Co Ltd; BGI Shenzhen Co Ltd
Current assignee: TIANJIN BGI TECHNOLOGY Co Ltd; BGI Shenzhen Co Ltd
Priority date: 2014-09-30
Filing date: 2014-09-30
Publication date: 2017-07-28
Anticipated expiration: 2034-09-30
Also published as: CN104293940A

Abstract

A method for constructing a sequencing library and its applications are disclosed. The method includes: (a) ligating an adapter to each end of a double-stranded DNA fragment to obtain a ligation product; (b) separating the ligation product into single-stranded DNA fragments; (c) screening the single-stranded DNA fragments with a probe; (d) performing a strand extension reaction on the single-stranded DNA fragments with a first primer to obtain a strand extension product; and (e) amplifying the strand extension product to obtain an amplification product, the amplification product constituting the sequencing library. Also disclosed are a sequencing method, a method for determining a nucleic acid sequence, a device for constructing a sequencing library, a sequencing apparatus, and a system for determining a nucleic acid sequence.

Description

Method for constructing a sequencing library and use thereof

Technical field

The present invention relates to the biomedical field. Specifically, the present invention relates to a method for constructing a sequencing library, a sequencing method, a method for determining a nucleic acid sequence, a device for constructing a sequencing library, a sequencing apparatus, and a system for determining a nucleic acid sequence.

Background art

High-throughput sequencing is attracting increasing attention, but its ability to detect low-frequency mutations still needs to be improved.

Summary of the invention

The present invention aims to solve at least one of the technical problems in the prior art. To this end, according to embodiments of the present invention, the present invention provides a method for constructing a sequencing library and means for detecting low-frequency mutations.

In a first aspect, the present invention provides a method for constructing a sequencing library. According to an embodiment of the present invention, the method includes: (a) ligating an adapter to each end of a double-stranded DNA fragment to obtain a ligation product, wherein the adapter includes a first strand and a second strand, the first strand and the second strand are partially complementary so as to define a double-stranded region and two single-stranded tails on the adapter, and the first strand contains a first tag sequence, the first tag being located in the sequence of one of the two single-stranded tails; (b) separating the ligation product into single-stranded DNA fragments; (c) screening the single-stranded DNA fragments with a probe, wherein the probe specifically recognizes a predetermined region, and the predetermined region includes one of the following: (1) at least one of the genes shown in Table 1; (2) the CDS regions of (1); and (3) regions extending at least 10 bp upstream and downstream of (2); (d) performing a strand extension reaction on the single-stranded DNA fragments with a first primer to obtain a strand extension product, wherein the first primer contains a second tag sequence and is suitable for forming a double-stranded structure with the first strand of the adapter, except that there is a mismatch between the first tag sequence and the second tag sequence; and (e) amplifying the strand extension product to obtain an amplification product, the amplification product constituting the sequencing library, wherein the amplification uses primers suitable for amplifying both the first tag sequence and the second tag sequence.

Thus, using the method for structure sequencing library according to embodiments of the present invention, sequencing library can be effectively built, Meanwhile, in constructed sequencing library, for every of identical double chain DNA fragment (also referred herein as " source sequence ") Chain, obtains the amplified production with the first sequence label and the second sequence label respectively, thus, in point of follow-up sequencing result In analysis, mutual correction can be carried out according to the sequencing result of two kinds of labels, improve the reliability of analysis result.

According to an embodiment of the present invention, the double-stranded DNA fragment is obtained through the following steps: performing end repair on a nucleic acid sample to obtain an end-repaired nucleic acid sample; and adding base A to the 5' ends of the nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each end, which constitutes the double-stranded DNA fragment. Adapters can thus be added conveniently to both ends of the double-stranded DNA fragment in subsequent operations, improving the efficiency of library construction.

According to an embodiment of the present invention, the nucleic acid sample is at least part of human genomic DNA, or cell-free nucleic acid. According to an embodiment of the present invention, the human cell-free nucleic acid is extracted from the peripheral blood of a patient. According to an embodiment of the present invention, the patient has colorectal cancer. Thus, the method of the embodiments of the present invention can effectively analyse gene mutations in human patients, and can in turn be used for early diagnosis, personalized treatment, and postoperative monitoring of colorectal cancer.

According to an embodiment of the present invention, the at least part of human genomic DNA is obtained by randomly fragmenting human genomic DNA. Adapters can thus be added conveniently to both ends of the double-stranded DNA fragment in subsequent operations, improving the efficiency of library construction.

According to an embodiment of the present invention, the adapter has a 3' base T sticky end. Adapters can thus be added conveniently to both ends of the double-stranded DNA fragment in subsequent operations, improving the efficiency of library construction.

According to an embodiment of the present invention, the single-stranded DNA fragments are obtained by denaturing the ligation product. Single-stranded DNA fragments can thus be obtained quickly and effectively. According to some embodiments of the present invention, the denaturation may be thermal denaturation or alkaline denaturation.

According to an embodiment of the present invention, the probe is provided in the form of a chip. The efficiency of probe screening can thus be improved.

According to an embodiment of the present invention, the strand extension reaction is carried out in the presence of UDG/FPG enzymes. Damaged DNA can thus be repaired effectively during strand extension, reducing false positives and improving the quality of the constructed sequencing library.

According to an embodiment of the present invention, the first tag sequence and the second tag sequence are each 4 to 10 nt in length. According to an embodiment of the present invention, the first tag sequence and the second tag sequence are each 8 nt in length. According to an embodiment of the present invention, there is a mismatch of at least 2 nt between the first tag sequence and the second tag sequence. The inventors unexpectedly found that this arrangement effectively improves the efficiency of correction using the first tag sequence and the second tag sequence in subsequent analysis.
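As a non-limiting illustration, the length and mismatch constraints on a tag pair can be checked as sketched below. The function names and the example tag sequences are hypothetical and are not the sequences of SEQ ID NO: 3-10; the sketch also assumes that the two tags of a pair have the same length.

```python
def count_mismatches(tag_a: str, tag_b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    if len(tag_a) != len(tag_b):
        raise ValueError("tags must have the same length")
    return sum(a != b for a, b in zip(tag_a, tag_b))


def is_valid_tag_pair(tag1: str, tag2: str, min_len: int = 4,
                      max_len: int = 10, min_mismatch: int = 2) -> bool:
    """Check the 4-10 nt length range and the minimum mismatch of 2 nt."""
    lengths_ok = all(min_len <= len(t) <= max_len for t in (tag1, tag2))
    return (lengths_ok
            and len(tag1) == len(tag2)  # assumption: paired tags are equally long
            and count_mismatches(tag1, tag2) >= min_mismatch)
```

With the hypothetical 8 nt tags "ACGTACGT" and "ACGTACCA", which differ at two positions, the check passes; a pair differing at only one position is rejected.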

According to an embodiment of the present invention, the first strand of the adapter has the sequence shown in SEQ ID NO: 1, the second strand of the adapter has the sequence shown in SEQ ID NO: 2, the first tag has the sequence shown in any one of SEQ ID NO: 3-6, the second tag has the sequence shown in at least one of SEQ ID NO: 7-10, the first primer has the sequence shown in SEQ ID NO: 11, and the primers suitable for amplifying both the first tag sequence and the second tag sequence have the sequences shown in SEQ ID NO: 12 and SEQ ID NO: 13.

Here, "XXXXXXXX" in the sequence of the first strand of the adapter represents the first tag sequence, and "XXXXXXXX" in the sequence of the first primer represents the second tag sequence.

According to an embodiment of the present invention, the tags include but are not limited to the 4 pairs described above; multiple pairs of tags can be designed as needed so that multiple samples can be detected simultaneously.

In a second aspect, the present invention provides a sequencing method. The method includes: constructing a sequencing library according to the method described above; and sequencing the sequencing library.

According to an embodiment of the present invention, the sequencing is carried out on a Hiseq2000 or Hiseq2500. The efficiency of sequencing can thus be improved effectively. In addition, the features and advantages described above for the method for constructing a sequencing library apply equally to this sequencing method and are not repeated here.

In a third aspect, the present invention provides a method for determining a nucleic acid sequence. The method includes: sequencing a nucleic acid sample according to the method described above to obtain a sequencing result consisting of multiple sequencing data; constructing at least one sequencing data subset based on the sequencing result, wherein all sequencing data in each sequencing data subset correspond to the same source sequence on the nucleic acid sample; for each sequencing data subset, determining the sequencing data corresponding to the first tag sequence as plus-strand sequencing data and the sequencing data corresponding to the second tag sequence as minus-strand sequencing data; for each sequencing data subset, correcting the sequencing data based on the plus-strand sequencing data and the minus-strand sequencing data to determine corrected sequencing data; and determining the sequence of the nucleic acid sample based on the corrected sequencing data. Correction can thus be performed effectively on the basis of the plus-strand and minus-strand sequencing data, improving the reliability of the analysis.

According to an embodiment of the present invention, the sequencing is paired-end sequencing, and the sequencing result consists of multiple pairs of paired sequencing data.

According to an embodiment of the present invention, constructing at least one sequencing data subset based on the sequencing result is carried out through the following steps: for each pair of the multiple pairs of paired sequencing data, determining a paired sequencing data index, which consists of the first N bases of each read of the pair, where N is an integer between 10 and 20; constructing at least one preliminary sequencing data subset based on the paired sequencing data indexes, wherein all sequencing data in a preliminary sequencing data subset have the same paired sequencing data index; and subdividing the at least one preliminary sequencing data subset based on the Hamming distance between sequencing data in the preliminary subset, to obtain multiple sequencing data subsets.

According to an embodiment of the present invention, N is 12.
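As a non-limiting illustration, the paired sequencing data index and the preliminary subsets can be sketched as follows. The function names are hypothetical, reads are represented as plain strings, and an in-memory dictionary is used for grouping, which is a simplification for small inputs.

```python
from collections import defaultdict


def pair_index(read1: str, read2: str, n: int = 12) -> str:
    """Join the first n bases of each read of a pair into one index (2n bases)."""
    if len(read1) < n or len(read2) < n:
        raise ValueError("both reads must contain at least n bases")
    return read1[:n] + read2[:n]


def group_by_index(pairs, n: int = 12):
    """Build preliminary subsets: pairs with the same index fall into one group."""
    groups = defaultdict(list)
    for read1, read2 in pairs:
        groups[pair_index(read1, read2, n)].append((read1, read2))
    return dict(groups)
```

With N = 12, each index is 24 bases long, and read pairs whose two reads start with the same 12 bases are placed in the same preliminary subset.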

According to an embodiment of the present invention, in each of the multiple sequencing data subsets, the Hamming distance between any two pairs of paired sequencing data does not exceed 20.

According to an embodiment of the present invention, each of the multiple sequencing data subsets contains at least two plus-strand sequencing data and at least two minus-strand sequencing data.
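As a non-limiting illustration, a preliminary subset can be subdivided by Hamming distance and then checked for strand support as sketched below. The greedy strategy shown (a pair joins a subset only if it is within the distance limit of every member) is one assumed way to guarantee the pairwise limit; the names, the "+"/"-" strand labels, and the tuple layout are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pair_distance(p, q) -> int:
    """Distance between two read pairs: read1 distance plus read2 distance."""
    return hamming(p[0], q[0]) + hamming(p[1], q[1])


def split_subset(pairs, max_distance: int = 20):
    """Split (read1, read2, strand) tuples so that any two pairs in a subset
    are at most max_distance apart."""
    subsets = []
    for pair in pairs:
        for subset in subsets:
            if all(pair_distance(pair, member) <= max_distance for member in subset):
                subset.append(pair)
                break
        else:  # no existing subset accepts this pair
            subsets.append([pair])
    return subsets


def has_strand_support(subset, min_each: int = 2) -> bool:
    """True if the subset holds at least min_each plus and min_each minus pairs."""
    plus = sum(1 for p in subset if p[2] == "+")
    minus = sum(1 for p in subset if p[2] == "-")
    return plus >= min_each and minus >= min_each
```

Subsets that fail the strand-support check would simply be excluded from the subsequent correction.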

According to an embodiment of the present invention, determining the corrected sequencing data based on the plus-strand sequencing data and the minus-strand sequencing data follows this principle: each base in the corrected sequencing data is supported by at least 50% of the plus-strand sequencing data and, at the same time, by at least 50% of the minus-strand sequencing data.

According to an embodiment of the present invention, each base in the corrected sequencing data is supported by at least 80% of the plus-strand sequencing data and, at the same time, by at least 80% of the minus-strand sequencing data.
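As a non-limiting illustration, the support rule can be sketched as follows. The function names are hypothetical; the sketch assumes that all reads in a subset have the same length and that minus-strand reads have already been brought into the same orientation as the plus-strand reads. Positions without a supported base are written as N, following the worked procedure given later in this description.

```python
def support(reads, pos: int, base: str) -> float:
    """Fraction of reads showing the given base at the given position."""
    return sum(r[pos] == base for r in reads) / len(reads)


def consensus(plus_reads, minus_reads, min_fraction: float = 0.8) -> str:
    """Keep a base only if it reaches min_fraction on both strands; else 'N'."""
    if not plus_reads or not minus_reads:
        raise ValueError("both strands need at least one read")
    length = len(plus_reads[0])
    if any(len(r) != length for r in list(plus_reads) + list(minus_reads)):
        raise ValueError("all reads must have the same length")
    corrected = []
    for pos in range(length):
        candidates = [b for b in "ACGT"
                      if support(plus_reads, pos, b) >= min_fraction
                      and support(minus_reads, pos, b) >= min_fraction]
        # exactly one base must qualify; ties or no support give 'N'
        corrected.append(candidates[0] if len(candidates) == 1 else "N")
    return "".join(corrected)
```

Passing min_fraction=0.5 corresponds to the 50% embodiment, and the default of 0.8 corresponds to the 80% embodiment.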

According to an embodiment of the present invention, the method further includes: aligning the corrected sequencing data to a reference sequence, and deleting sequencing data whose alignment quality is below 30.
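As a non-limiting illustration, and assuming that the alignments are available as SAM-format text lines (in which the fifth tab-separated column is the mapping quality), the quality filter can be sketched as follows. The function name and the example records are hypothetical.

```python
def filter_by_mapq(sam_lines, min_mapq: int = 30):
    """Keep header lines and alignments whose mapping quality is >= min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # SAM header lines are passed through
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            kept.append(line)
    return kept
```

A record with a mapping quality of 60 is kept, while a record with a mapping quality of 12 is removed.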

According to an embodiment of the present invention, the method further includes: performing SNV analysis or Indel analysis based on the sequence of the nucleic acid sample.

In a fourth aspect, the present invention provides a device for constructing a sequencing library. According to an embodiment of the present invention, the device includes: a ligation unit, for ligating an adapter to each end of a double-stranded DNA fragment to obtain a ligation product, wherein the adapter includes a first strand and a second strand, the first strand and the second strand are partially complementary so as to define a double-stranded region and two single-stranded tails on the adapter, and the first strand contains a first tag sequence, the first tag being located in the sequence of one of the two single-stranded tails; a separation unit, for separating the ligation product into single-stranded DNA fragments; a screening unit, for screening the single-stranded DNA fragments with a probe before the strand extension is carried out, wherein the probe specifically recognizes a predetermined region, and the predetermined region includes one of the following: (1) at least one of the genes shown in Table 1; (2) the CDS regions of (1); and (3) regions extending at least 10 bp upstream and downstream of (2); a strand extension unit, for performing a strand extension reaction on the single-stranded DNA fragments with a first primer to obtain a strand extension product, wherein the first primer contains a second tag sequence and is suitable for forming a double-stranded structure with the first strand of the adapter, except that there is a mismatch between the first tag sequence and the second tag sequence; and an amplification unit, for amplifying the strand extension product to obtain an amplification product, the amplification product constituting the sequencing library, wherein the amplification uses primers suitable for amplifying both the first tag sequence and the second tag sequence.

According to an embodiment of the present invention, the above device can effectively implement the method for constructing a sequencing library described above and can construct a sequencing library effectively. Moreover, in the constructed library, the two strands of the same double-stranded DNA fragment (also referred to herein as the "source sequence") yield amplification products carrying the first tag sequence and the second tag sequence, respectively. In the subsequent analysis of the sequencing results, the results for the two tags can therefore be used to correct each other, improving the reliability of the analysis.

According to an embodiment of the present invention, the device further includes: an end repair unit, for performing end repair on a nucleic acid sample to obtain an end-repaired nucleic acid sample; and an end modification unit, for adding base A to the 5' ends of the nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each end, which constitutes the double-stranded DNA fragment.

According to an embodiment of the present invention, the probe is provided in the form of a chip.

According to an embodiment of the present invention, the first tag sequence and the second tag sequence are each 4 to 10 nt in length.

According to an embodiment of the present invention, the first tag sequence and the second tag sequence are each 8 nt in length.

According to an embodiment of the present invention, there is a mismatch of at least 2 nt between the first tag sequence and the second tag sequence.

Those skilled in the art will appreciate that the features and advantages described above for the method for constructing a sequencing library apply equally to the device for constructing a sequencing library and are not repeated here.

In a fifth aspect, the present invention provides a sequencing apparatus. According to an embodiment of the present invention, the sequencing apparatus includes: the device for constructing a sequencing library described above; and a sequencing device, for sequencing the sequencing library.

The efficiency of sequencing can thus be improved effectively. In addition, the features and advantages described above for the method and device for constructing a sequencing library apply equally to this sequencing apparatus and are not repeated here.

According to an embodiment of the present invention, the sequencing device is a Hiseq2000 or Hiseq2500.

In a sixth aspect, the present invention provides a system for determining a nucleic acid sequence. According to an embodiment of the present invention, the system includes: the sequencing apparatus described above, for sequencing a nucleic acid sample to obtain a sequencing result consisting of multiple sequencing data; a sequencing data subset construction device, for constructing at least one sequencing data subset based on the sequencing result, wherein all sequencing data in each sequencing data subset correspond to the same source sequence on the nucleic acid sample; a sequencing data classification device, for determining, for each sequencing data subset, the sequencing data corresponding to the first tag sequence as plus-strand sequencing data and the sequencing data corresponding to the second tag sequence as minus-strand sequencing data; a sequencing data correction device, for correcting, for each sequencing data subset, the sequencing data based on the plus-strand sequencing data and the minus-strand sequencing data to determine corrected sequencing data; and a sequence determination device, for determining the sequence of the nucleic acid sample based on the corrected sequencing data. Thus, the system for determining a nucleic acid sequence according to embodiments of the present invention can effectively implement the method for determining a nucleic acid sequence described above, so that correction is performed effectively on the basis of the plus-strand and minus-strand sequencing data and the reliability of the analysis is improved.

According to an embodiment of the present invention, the sequencing data subset construction device includes: a sequencing data index determination device, for determining, for each pair of the multiple pairs of paired sequencing data, a paired sequencing data index, which consists of the first N bases of each read of the pair, where N is an integer between 10 and 20; a preliminary screening device, for constructing at least one preliminary sequencing data subset based on the paired sequencing data indexes, wherein all sequencing data in a preliminary sequencing data subset have the same paired sequencing data index; and a secondary screening device, for subdividing the at least one preliminary sequencing data subset based on the Hamming distance between sequencing data in the preliminary subset, to obtain multiple sequencing data subsets.

According to an embodiment of the present invention, N is 12.

According to an embodiment of the present invention, the system further includes a sequence analysis device, for performing SNV analysis or Indel analysis based on the sequence of the nucleic acid sample.

Those skilled in the art will appreciate that the advantages and features described above for the method for determining a nucleic acid sequence apply equally to the system for determining a nucleic acid sequence and are not repeated here.

Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from that description, or will be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 shows a flow chart of a method for constructing a sequencing library according to an embodiment of the present invention;

Fig. 2 shows the analysis result of reads clusters sharing the same index according to an embodiment of the present invention; and

Fig. 3 shows the mutation spectrum analysis result according to an embodiment of the present invention.

Detailed description

The present invention is described below by way of specific examples. It should be noted that these examples are for illustration only and are not to be construed as limiting the present invention in any way.

General methods

Unless otherwise stated, the following examples are carried out according to the general methods below:

I. Probe design

Based on the human genome HG19, the exon sequences of the relevant genes were retrieved. Considering the size of the capture region and the cost, the final chip covers only the CDS regions of the above genes, extended by 20 bp on each side. The chip carries abundant capture probes, and probe coverage of the target regions reaches 98%, so that target DNA fragments can be enriched from a complex genome and genomic regions can be captured on a single chip with high specificity and high coverage.

II. Library construction and sequencing

Referring to Fig. 1, the steps of library construction and sequencing are as follows:

1. 5 ml of peripheral blood is drawn from the patient and centrifuged to separate plasma and leukocytes. DNA is extracted separately from the plasma sample and the leukocyte sample, and the DNA extracted from the leukocytes serves as the control for detecting somatic mutations.

2. The cell-free circulating DNA extracted from plasma averages 170 bp and is directly subjected to the 3 enzymatic reactions of a conventional library construction procedure: end repair, A-tailing, and ligation of specially treated sequencing adapters (the adapter carries an 8 bp tag, named index1, which not only distinguishes different samples but is later also used to mark the plus strand).

3. The ligation product is subjected to hybridization capture on the Colorectalpan chip. The eluted single-stranded template product is then amplified for 1 round of 1 cycle with a primer carrying the index2 tag, so that the minus strand is tagged. UDG/FPG enzymes are added and incubated during this PCR to eliminate DNA damage carried on the template strand and reduce false positives.

4. After purification, the product in which the plus and minus strands carry the two index tags undergoes a second round of PCR enrichment, completing library preparation.

5. Sequencing is performed on a Hiseq 2000 or Hiseq2500; a suitable sequencing platform can be chosen flexibly according to the sequencing throughput and the number of samples.

The specific steps include:

1. cfDNA extraction

About 2-3 ml of plasma separated from 5 ml of peripheral blood is used for plasma cfDNA extraction according to the instructions of the QIAamp Circulating Nucleic Acid Kit. The extracted DNA is quantified with Qubit (Invitrogen, Quant-iT™ dsDNA HS Assay Kit); the total amount is about 5~50 ng.

2. Sample library preparation:

The cfDNA extracted from plasma is then subjected to the 3 enzymatic reactions according to the library construction instructions of the KAPA LTP Library Preparation Kit.

1) End repair

Afterwards, 120 μL of Agencourt AMPure XP reagent is added for magnetic bead purification; the product is finally re-dissolved in 42 μL ddH₂O, and the next reaction is carried out with the beads retained.

2) A-tailing

Afterwards, 90 μL of PEG/NaCl SPRI solution is added and mixed thoroughly for magnetic bead purification; the product is finally re-dissolved in (35 - adapter volume) μL ddH₂O, and the next reaction is carried out with the beads retained.

3) Adapter ligation

Afterwards, 50 μL of PEG/NaCl SPRI solution is added on each of 2 occasions for 2 rounds of magnetic bead purification, and the product is finally re-dissolved in 25 μL ddH₂O.

3. Chip hybridization capture

The present invention uses the colorectal cancer early screening chip Colorectalpan designed by the inventors, and hybridization capture is performed according to the instructions provided by the chip manufacturer. The final eluate is re-dissolved in 21 μL ddH₂O, with the hybridization elution beads retained.

4. Dual-index tagging of the plus and minus strands, and enrichment:

Two rounds of PCR are performed in total: PCR1 tags the minus strand and repairs template DNA damage, and PCR2 performs amplification and enrichment, completing library preparation.

1) PCR1

PCR1 program:

The hybridization elution beads are removed first; 40 μL of Agencourt AMPure XP reagent is then added for magnetic bead purification, and the product is finally re-dissolved in 20 μL ddH₂O, with the beads retained for the next reaction.

2) PCR2

PCR2 program:

The beads from the previous step are removed first; 50 μL of Agencourt AMPure XP reagent is then added for magnetic bead purification, and the product is finally re-dissolved in 25 μL ddH₂O, followed by QC and loading onto the sequencer.

III. Analysis of sequencing results

1. For each pair of paired reads (paired sequencing data), the first 12 bp of reads1 and the first 12 bp of reads2 (i.e. the breakpoint sequences) are joined into a 24 bp short sequence, which serves as the index of the paired reads, and the plus and minus strands are marked according to their index.

2. The indexes are externally sorted so that copies of the same DNA template are brought together.

3. The reads sharing the same index are subjected to centre clustering: according to the Hamming distance between their sequences, each large cluster sharing an index is divided into several small clusters, in each of which the Hamming distance between any two pairs of paired reads does not exceed 10, so as to distinguish reads that share an index but originate from different DNA templates.

4. The clusters of copies of the same DNA template obtained in step 3 are screened; subsequent analysis is carried out only if the plus-strand reads and the minus-strand reads each number at least 2 pairs.

5. Clusters meeting the condition in step 4 are error-corrected to generate a pair of new, error-free reads. For each sequenced base of the DNA template, if the concordance rate of a base type reaches 80% in the plus-strand reads and also reaches 80% in the minus-strand reads, that base of the new reads is recorded as that base type; otherwise it is recorded as N. This yields new reads representing the original DNA template sequence.

6. The new reads are re-aligned to the genome with the bwa mem algorithm, and reads with an alignment quality below 30 are filtered out.

7. SNV analysis:

1) Statistics are computed on the reads obtained in step 6 to obtain the base-type distribution at each site in the capture region; a base type inconsistent with the mainstream base type (a base type whose proportion exceeds 15%) is a mutant base type. The target region coverage size, average sequencing depth, plus/minus strand pairing rate, low-frequency mutation rate, etc. are also computed.

2) SNPs are annotated using CCDS, the human genome database (NCBI36.3), and dbSNP (v130) information to determine the gene in which the mutation site occurs, its coordinates, mRNA position, amino acid change, and SNP function (missense mutation/nonsense mutation/alternative splice site), and SIFT is used to predict the effect of the SNP on protein function;

3) Somatic mutations are called by comparing the patient sample with the control sample. In addition, SNPs present in dbSNP, HapMap, the 1000 Genomes Project, and other exome sequencing projects are removed from the candidate SNVs, leaving the final disease-related candidate SNVs.
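As a non-limiting illustration, the per-site base-type statistics of item 1) can be sketched as follows. The sketch assumes one reading of the 15% criterion, namely that base types with a proportion above 15% are mainstream and all other observed base types are mutant; the function name and the example counts are hypothetical.

```python
from collections import Counter


def classify_site(bases, mainstream_fraction: float = 0.15):
    """Split the base types seen at one site into mainstream and mutant types.

    bases: iterable of base calls covering the site, one per corrected read.
    Returns (set of mainstream base types, {mutant base type: proportion}).
    """
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("site has no coverage")
    mainstream = {b for b, c in counts.items() if c / total > mainstream_fraction}
    mutant = {b: c / total for b, c in counts.items() if b not in mainstream}
    return mainstream, mutant
```

For a site covered by 97 reads showing A and 3 reads showing T, A is the mainstream base type and T is reported as a mutant base type at a proportion of 3%.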

8. INDEL analysis:

1) Statistics are computed on those reads obtained in step 6 that contain an indel, to obtain all indels; indels supported by 2 or more reads are selected as reliable mutant indels.

2) Indels are annotated using CCDS, the human genome database (NCBI36.3), and dbSNP (v130) information to determine the gene in which the mutation site occurs, its coordinates, mRNA position, the change in the coding sequence, the effect on the amino acids, and the InDel function (amino acid insertion/amino acid deletion/frameshift mutation);

3) Somatic mutations are called by comparing the patient sample with the control sample. In addition, Indels present in dbSNP and other exome sequencing projects are removed from the candidate Indels, leaving the final disease-related candidate Indels.
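As a non-limiting illustration, the read-support criterion of item 1) can be sketched as follows. The function name, the (chromosome, position, reference allele, alternative allele) tuple layout, and the example coordinates are hypothetical.

```python
from collections import Counter


def reliable_indels(read_indels, min_support: int = 2):
    """Keep indels observed in at least min_support reads.

    read_indels: iterable of (chrom, pos, ref, alt) tuples, one entry per
    read that contains the indel.
    """
    counts = Counter(read_indels)
    return {indel: n for indel, n in counts.items() if n >= min_support}
```

An indel seen in two reads is kept as reliable, while an indel seen in a single read is discarded.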

Example 1: Early screening for colorectal cancer

I. Chip design

1) Design of the colorectal cancer early screening chip:

Based on databases such as TCGA, ICGC, and COSMIC and on the relevant literature, the gene chip Colorectalpan for early colorectal cancer screening was designed using an iterative algorithm. The Colorectalpan chip covers colorectal cancer driver genes, frequently mutated genes, and important genes in 12 cancer signalling pathways, 60 genes and 123 kb in total.

The chip design process is divided into 4 steps:

1st, statistics cosmic databases in about colorectal cancer driver gene each exon 1 variation sample number, Variation sample, hottest point variation where sample number, PI values (to assess level of patient's reply frequency on each extron, Accumulative number of patients/extron length of mutation is carried on the every extrons of PI=), and arranged according to PI values descending.Use afterwards Iterative algorithm：Sample using first exon 1 variation counts other all interval and sample datas as sample database The number of storehouse difference sample, the most sample interval of different number of samples is classified as into second, and to screen chip interval, now with The two interval variation samples screened screen the 3rd interval, until sample in the same way as sample database Database includes all samples, to count exon 1 collection, and for not screening any all areas of interval gene Between, then all it is added on chip interval.

2. Based on databases such as TCGA and ICGC, take as candidate intervals those intervals, outside the driver gene intervals, that contain hotspot variants found in at least 5 samples (SNV >= 5), and repeat the iterative calculation of the previous step.

3. Based on databases such as TCGA and ICGC, after removing the intervals already selected, take as candidate intervals first those with PI >= 30 and SNV >= 3, and then those with PI >= 20 and SNV >= 3; select the interval that reduces the number of samples in the single-sample database the most as the first chip interval, and repeat the above process iteratively.

4. Add further intervals, such as gene fusion regions.
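The iterative selection in step 1 can be read as a greedy coverage procedure. The following is a minimal sketch under that reading, not the implementation actually used: the function names, the tie-breaking rule (alphabetical interval name) and the toy sample sets are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def pi_value(patients_with_mutation: int, exon_length: int) -> float:
    """PI = cumulative number of patients carrying mutations on an exon / exon length."""
    return patients_with_mutation / exon_length


def select_intervals(interval_samples: Dict[str, Set[str]], first: str) -> List[str]:
    """Greedy selection: start from `first`, then repeatedly add the interval
    that contributes the most samples not yet in the sample database."""
    covered = set(interval_samples[first])  # the growing "sample database"
    chosen = [first]
    remaining = {k: v for k, v in interval_samples.items() if k != first}
    while remaining:
        # Most new samples wins; ties are broken by name (an assumption).
        best = min(remaining, key=lambda k: (-len(remaining[k] - covered), k))
        if not remaining[best] - covered:
            break  # no interval adds a new sample: every sample is covered
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen


# Hypothetical toy data: interval name -> samples carrying a variant there.
intervals = {
    "exonA": {"s1", "s2", "s3"},
    "exonB": {"s3", "s4"},
    "exonC": {"s2", "s5", "s6"},
    "exonD": {"s1", "s4"},
}
lengths = {"exonA": 100, "exonB": 200, "exonC": 150, "exonD": 100}

# The interval with the highest PI is taken as the starting interval.
start = max(intervals, key=lambda k: pi_value(len(intervals[k]), lengths[k]))
selected = select_intervals(intervals, start)
print(start, selected)
```

In this toy case the loop stops once every sample is covered, so "exonD" is never selected; under step 1, a gene left without any selected interval would then have all of its intervals added to the chip.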

The gene list is detailed in Table 1.

Table 1

KRAS	SRC	TLR3	EP300	TMPRSS13	EPHA5
BRAF	PTEN	MC4R	CYLD	PHF2	EPHA3
APC	AXIN1	MLH1	FBN2	OPRD1	PTPRD
TP53	FLG	AKT1	NF1	LILRB5	NTRK3
PIK3CA	LIG1	CASD1	ASXL1	COL18A1	NTRK1
CTNNB1	MAP2K1	PTCH1	SMAD4	LARP4B	ALK
NRAS	PIK3R1	ADAMTS18	IRF5	DMKN	ROS1
EGFR	ERBB2	MSH2	DOCK3	ROBO2	RET
FBXW7	STK11	BAP1	MYOM1	KCNN3	PDGFRA
ARID1A	IL7R	CTNNA1	NEFH	INHBA	FGFR1

II. Sequencing analysis

Using the present invention, early screening for colorectal cancer was performed on one patient with intestinal polyps according to the steps of the method described above. The results are as follows:

The sequencing data statistics are shown in the following table:

Notes: Forward/reverse strand pairing rate: among clusters containing 3 or more reads, the ratio of the clusters in which both forward-strand and reverse-strand reads are present to the total number of such clusters, used to assess forward/reverse strand pairing in the usable data. Effective data utilization rate: the ratio of the number of reads obtained, after error correction, from clusters satisfying at least 2+/2- to the total number of sequenced reads. Average sequencing depth: the average coverage of the bases in the target region after error correction of the valid data.
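As an illustration of the first definition above, the strand pairing rate could be computed as in the following minimal sketch. It assumes each cluster has already been summarized as a pair of counts (forward-strand reads, reverse-strand reads); this representation, the function names and the toy counts are assumptions and are not taken from the analysis pipeline itself.

```python
from typing import List, Tuple

Cluster = Tuple[int, int]  # (forward-strand reads, reverse-strand reads)


def strand_pairing_rate(clusters: List[Cluster], min_reads: int = 3) -> float:
    """Share of clusters with >= min_reads reads that contain both strands."""
    eligible = [(f, r) for f, r in clusters if f + r >= min_reads]
    if not eligible:
        return 0.0
    paired = sum(1 for f, r in eligible if f > 0 and r > 0)
    return paired / len(eligible)


def meets_2plus_2minus(cluster: Cluster) -> bool:
    """True if the cluster has at least 2 forward and 2 reverse reads."""
    forward, reverse = cluster
    return forward >= 2 and reverse >= 2


# Hypothetical cluster counts, for illustration only.
clusters = [(3, 2), (4, 0), (1, 1), (2, 2), (0, 5)]
rate = strand_pairing_rate(clusters)
usable = [c for c in clusters if meets_2plus_2minus(c)]
print(rate, usable)
```

Here the cluster (1, 1) is excluded because it has fewer than 3 reads, and two of the four remaining clusters contain both strands.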

Cluster analysis:

The analysis of clusters of reads sharing the same index is shown in Fig. 2, in which the abscissa represents the duplication (dup) number of a cluster and the ordinate represents the total number of reads in clusters with a given dup number. The results in Fig. 2 show that the dup number of the vast majority of clusters is around 6, that most clusters satisfy the condition of 2 forward + 2 reverse reads, that the final effective data utilization rate is 5.12%, and that the average sequencing depth is 1033X.

Mutation spectrum analysis:

The results of the mutation spectrum analysis are shown in Fig. 3. For molecules derived from double-stranded DNA, complementary mutation types should in theory have essentially the same mutation frequency. The abscissa represents the type of base mutation and the ordinate represents the number of mutations. The results in Fig. 3 show that the distribution of base mutation types is essentially balanced, and that the mutation frequency (mutations per nucleotide) is 2.2×10^-6.
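The balance between complementary mutation types, and the per-nucleotide mutation frequency, can be illustrated with the following minimal sketch. The "ref>alt" string notation, the helper names and all counts are hypothetical and serve only to show the calculation; they are not the data behind Fig. 3.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complementary(mutation: str) -> str:
    """Return the complementary mutation type, e.g. 'C>T' -> 'G>A'."""
    ref, alt = mutation.split(">")
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"


def spectrum_pairs(mutations: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each mutation type to (its count, count of its complementary type).

    Each complementary pair is reported once, under the alphabetically
    smaller of the two type names.
    """
    counts = Counter(mutations)
    pairs = {}
    for mutation in sorted(counts):
        key = min(mutation, complementary(mutation))
        pairs[key] = (counts[key], counts[complementary(key)])
    return pairs


def mutation_frequency(n_mutations: int, n_bases: int) -> float:
    """Mutations per nucleotide = observed mutations / bases examined."""
    return n_mutations / n_bases


# Hypothetical observations, for illustration only.
observed = ["C>T", "G>A", "C>T", "A>G", "T>C"]
pairs = spectrum_pairs(observed)
freq = mutation_frequency(5, 1_000_000)
print(pairs, freq)
```

A balanced spectrum corresponds to the two counts in each tuple being close to each other.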

Details of the detected variants (counted on the basis of exon regions and nonsynonymous mutations):

Gene	Base mutation	Amino acid mutation	Mutation type	Mutation frequency
SMAD4	c.2119G>A	p.Y301F	Missense mutation	2.8%
ARID1A	c.817C>T	p.A1872T	Missense mutation	2.34%
APC	c.217A>C	p.A426T	Missense mutation	1.80%

Interpretation of results: According to relevant databases and literature such as TCGA, COSMIC, ClinVar and HMGD, the driver mutations SMAD4 p.Y301F and APC p.A426T detected in the patient's plasma indicate that the patient has a relatively high risk of cancer. It is recommended that the patient visit a relevant medical institution for more comprehensive testing and take appropriate intervention measures.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "illustrative examples", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, it should be noted that those skilled in the art will understand that the order of the steps included in the scheme proposed by the present invention may be adjusted, and that such adjustments are also included within the scope of the present invention.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and purpose of the present invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for constructing a sequencing library, characterized by comprising:

(a) ligating an adapter to each of the two ends of a double-stranded DNA fragment to obtain a ligation product, wherein the adapter comprises a first strand and a second strand, the first strand and the second strand are partially paired, and the first strand comprises a first tag sequence, so that a double-stranded region and two single-stranded tails are defined on the adapter, the first tag being contained in the sequence of one of the two single-stranded tails;

(b) splitting the ligation product into single-stranded DNA fragments;

(c) screening the single-stranded DNA fragments using a probe, wherein the probe specifically recognizes a predetermined region, and the predetermined region comprises one of the following:

(1) at least one of the TLR3, TMPRSS13, MC4R, PHF2, OPRD1, FLG, LILRB5, LIG1, CASD1, COL18A1, LARP4B, ADAMTS18, IRF5, DMKN, DOCK3, MYOM1, KCNN3 and NEFH genes;

(2) the CDS region of (1); and

(3) a region of at least 10 bp upstream and downstream of (2);

(d) performing a chain extension reaction on the single-stranded DNA fragments using a first primer to obtain a chain extension product, wherein the first primer comprises a second tag sequence, and the first primer is suitable for forming a double-stranded structure with the first strand of the adapter, except that there is a mismatch between the first tag sequence and the second tag sequence;

(e) amplifying the chain extension product to obtain an amplification product, the amplification product constituting the sequencing library, wherein the amplification uses primers suitable for simultaneously amplifying the first tag sequence and the second tag sequence, the primers being a second primer and a third primer.

2. The method according to claim 1, characterized in that the double-stranded DNA fragment is obtained through the following steps:

subjecting a nucleic acid sample to end repair to obtain an end-repaired nucleic acid sample; and

adding base A at the 5' ends of the nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each of its two ends, the nucleic acid sample having a sticky-end base A at each of its two ends constituting the double-stranded DNA fragment.

3. The method according to claim 2, characterized in that the nucleic acid sample is at least a part of human genomic DNA or cell-free nucleic acid.

4. The method according to claim 3, characterized in that the cell-free nucleic acid is extracted from the peripheral blood of a patient.

5. The method according to claim 4, characterized in that the patient has colorectal cancer.

6. The method according to claim 3, characterized in that the at least a part of human genomic DNA is obtained by randomly fragmenting human genomic DNA.

7. The method according to claim 1, characterized in that the adapter has a 3' base T sticky end.

8. The method according to claim 1, characterized in that the single-stranded DNA fragments are obtained by subjecting the ligation product to denaturation treatment.

9. The method according to claim 1, characterized in that the probe is provided in the form of a chip.

10. The method according to claim 1, characterized in that the chain extension reaction is carried out in the presence of UDG enzyme/FPG enzyme.

11. The method according to claim 1, characterized in that the first tag sequence and the second tag sequence each independently have a length of 4 to 10 nt.

12. The method according to claim 1, characterized in that the first tag sequence and the second tag sequence have a length of 8 nt.

13. The method according to claim 1, characterized in that there is a mismatch of at least 2 nt between the first tag sequence and the second tag sequence.

14. The method according to claim 1, characterized in that the nucleotide sequence of the first strand of the adapter is the sequence shown in SEQ ID NO: 1, the nucleotide sequence of the second strand of the adapter is the sequence shown in SEQ ID NO: 2, the nucleotide sequence of the first tag is a sequence shown in at least one of SEQ ID NO: 3-6, the nucleotide sequence of the second tag is a sequence shown in at least one of SEQ ID NO: 7-10, the nucleotide sequence of the first primer is the sequence shown in SEQ ID NO: 11, the nucleotide sequence of the second primer is the sequence shown in SEQ ID NO: 12, and the nucleotide sequence of the third primer is the sequence shown in SEQ ID NO: 13.

15. A sequencing method, the method being for non-diagnostic purposes, characterized by comprising:

constructing a sequencing library by the method according to any one of claims 1 to 14;

sequencing the sequencing library.

16. The method according to claim 15, characterized in that the sequencing is carried out on a Hiseq2000 or Hiseq2500.

17. A method for determining a nucleotide sequence, the method being for non-diagnostic purposes, characterized by comprising:

sequencing a nucleic acid sample by the method according to claim 15 or 16 to obtain a sequencing result composed of multiple sequencing data;

constructing at least one sequencing data subset based on the sequencing result, wherein all the sequencing data in each sequencing data subset correspond to the same source sequence on the nucleic acid sample;

for each sequencing data subset, determining that the sequencing data corresponding to the first tag sequence are plus-strand sequencing data and that the sequencing data corresponding to the second tag sequence are minus-strand sequencing data;

for each sequencing data subset, correcting the sequencing data based on the plus-strand sequencing data and the minus-strand sequencing data respectively, to determine corrected sequencing data; and

determining the sequence of the nucleic acid sample based on the corrected sequencing data.

18. The method according to claim 17, characterized in that the sequencing is paired-end sequencing, and the sequencing result is composed of multiple pairs of paired sequencing data.

19. The method according to claim 17, characterized in that constructing at least one sequencing data subset based on the sequencing result is carried out through the following steps:

for each pair of the multiple pairs of paired sequencing data, determining a paired sequencing data index, the paired sequencing data index being composed of the first N bases of each read of the paired sequencing data, wherein N is an integer between 10 and 20;

constructing at least one preliminary sequencing data subset based on the paired sequencing data index, wherein every sequencing data in the preliminary sequencing data subset has the same paired sequencing data index; and

subdividing the at least one preliminary sequencing data subset based on the Hamming distance between the sequencing data in the preliminary sequencing data subset, to obtain multiple sequencing data subsets.

20. The method according to claim 19, characterized in that N is 12.

21. The method according to claim 19, characterized in that, in each of the multiple sequencing data subsets, the Hamming distance between any two pairs of paired sequencing data is no more than 20.
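By way of illustration only, and without limiting the claims, the subset construction of claims 19 to 21 might be sketched as follows. The sketch assumes each read pair is given as a tuple of two strings, that the Hamming distance is counted position by position over both reads of a pair, and that subdivision is done greedily so that every two pairs in a subset stay within the threshold; the function names and the toy reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ReadPair = Tuple[str, str]  # (read 1, read 2) of one paired-end fragment


def pair_index(pair: ReadPair, n: int = 12) -> str:
    """Index = first n bases of read 1 followed by first n bases of read 2."""
    return pair[0][:n] + pair[1][:n]


def hamming(a: ReadPair, b: ReadPair) -> int:
    """Mismatches over both reads, position by position (shorter read bounds)."""
    return sum(x != y for s, t in zip(a, b) for x, y in zip(s, t))


def build_subsets(pairs: List[ReadPair], n: int = 12,
                  max_dist: int = 20) -> List[List[ReadPair]]:
    # Preliminary subsets: read pairs sharing the same paired index.
    preliminary: Dict[str, List[ReadPair]] = defaultdict(list)
    for pair in pairs:
        preliminary[pair_index(pair, n)].append(pair)

    # Subdivision: a pair joins a subset only if it is within max_dist
    # of every pair already in that subset; otherwise it starts a new one.
    subsets: List[List[ReadPair]] = []
    for group in preliminary.values():
        refined: List[List[ReadPair]] = []
        for pair in group:
            for subset in refined:
                if all(hamming(pair, other) <= max_dist for other in subset):
                    subset.append(pair)
                    break
            else:
                refined.append([pair])
        subsets.extend(refined)
    return subsets


# Toy reads with a short index (n=4) and a tight threshold, for illustration.
p1 = ("ACGTAAAA", "TTGACCCC")
p2 = ("ACGTAAAT", "TTGACCCC")  # same index as p1, 1 mismatch
p3 = ("ACGTGGGG", "TTGATTTT")  # same index as p1, 8 mismatches
p4 = ("GGGTAAAA", "TTGACCCC")  # different index
subsets = build_subsets([p1, p2, p3, p4], n=4, max_dist=1)
print(subsets)
```

With the values recited in claims 20 and 21, the call would use n=12 and max_dist=20, which are the defaults in this sketch.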

22. The method according to claim 19, characterized in that, in each of the multiple sequencing data subsets, there are at least two plus-strand sequencing data and at least two minus-strand sequencing data.

23. The method according to claim 22, characterized in that determining the corrected sequencing data based on the plus-strand sequencing data and the minus-strand sequencing data is carried out according to the following principle:

each base in the corrected sequencing data is simultaneously supported by at least 50% of the plus-strand sequencing data and at least 50% of the minus-strand sequencing data.

24. The method according to claim 23, characterized in that each base in the corrected sequencing data is simultaneously supported by at least 80% of the plus-strand sequencing data and at least 80% of the minus-strand sequencing data.
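By way of illustration only, and without limiting the claims, the correction principle of claims 22 to 24 might be sketched as follows. The sketch assumes that the minus-strand reads have already been converted to the same orientation as the plus-strand reads and that all reads are aligned position by position; masking an unsupported position with "N" and returning None for a subset with too few reads are choices made for this example, not features recited in the claims.

```python
from collections import Counter
from typing import List, Optional


def duplex_consensus(plus: List[str], minus: List[str],
                     min_support: float = 0.5,
                     min_reads: int = 2) -> Optional[str]:
    """Keep a base only if it is supported by at least `min_support` of the
    plus-strand reads and at least `min_support` of the minus-strand reads."""
    if len(plus) < min_reads or len(minus) < min_reads:
        return None  # subset lacks the required reads on one strand
    length = min(len(read) for read in plus + minus)
    consensus = []
    for i in range(length):
        plus_counts = Counter(read[i] for read in plus)
        minus_counts = Counter(read[i] for read in minus)
        base = "N"  # placeholder for a position without two-strand support
        for candidate, n_plus in plus_counts.most_common():
            plus_ok = n_plus / len(plus) >= min_support
            minus_ok = minus_counts[candidate] / len(minus) >= min_support
            if plus_ok and minus_ok:
                base = candidate
                break
        consensus.append(base)
    return "".join(consensus)


# Hypothetical reads from one sequencing data subset.
plus_reads = ["ACGT", "ACGT", "ACTT"]
minus_reads = ["ACGT", "ACGA"]
relaxed = duplex_consensus(plus_reads, minus_reads, min_support=0.5)
strict = duplex_consensus(plus_reads, minus_reads, min_support=0.8)
print(relaxed, strict)
```

With the 50% threshold all four positions are retained, whereas with the 80% threshold the last two positions lack sufficient support on one strand and are masked.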

25. The method according to claim 23, characterized by further comprising:

aligning the corrected sequencing data to a reference sequence, and deleting the sequencing data whose alignment quality is below 30.

26. The method according to claim 17, characterized in that SNV analysis or indel analysis is performed based on the sequence of the nucleic acid sample.

27. A device for constructing a sequencing library, characterized by comprising:

a ligation unit, for ligating an adapter to each of the two ends of a double-stranded DNA fragment to obtain a ligation product, wherein the adapter comprises a first strand and a second strand, the first strand and the second strand are partially paired, and the first strand comprises a first tag sequence, so that a double-stranded region and two single-stranded tails are defined on the adapter, the first tag being contained in the sequence of one of the two single-stranded tails;

a splitting unit, for splitting the ligation product into single-stranded DNA fragments;

a screening unit, for screening the single-stranded DNA fragments using a probe before chain extension is carried out, wherein the probe specifically recognizes a predetermined region, and the predetermined region comprises one of the following:

(2) the CDS region of (1); and

(3) a region of at least 10 bp upstream and downstream of (2);

a chain extension unit, for performing a chain extension reaction on the single-stranded DNA fragments using a first primer to obtain a chain extension product, wherein the first primer comprises a second tag sequence, and the first primer is suitable for forming a double-stranded structure with the first strand of the adapter, except that there is a mismatch between the first tag sequence and the second tag sequence;

an amplification unit, for amplifying the chain extension product to obtain an amplification product, the amplification product constituting the sequencing library, wherein the amplification uses a second primer and a third primer, the second primer recognizes the second strand of the adapter, and the third primer is arranged to be suitable for simultaneously amplifying the first tag sequence and the second tag sequence.

28. The device according to claim 27, characterized by further comprising:

an end repair unit, for subjecting a nucleic acid sample to end repair to obtain an end-repaired nucleic acid sample; and

an end modification unit, for adding base A at the 5' ends of the nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each of its two ends, the nucleic acid sample having a sticky-end base A at each of its two ends constituting the double-stranded DNA fragment.

29. The device according to claim 27, characterized in that the probe is provided in the form of a chip.

30. The device according to claim 27, characterized in that the chain extension reaction is carried out in the presence of UDG enzyme/FPG enzyme.

31. The device according to claim 27, characterized in that the first tag sequence and the second tag sequence each independently have a length of 4 to 10 nt.

32. The device according to claim 27, characterized in that the first tag sequence and the second tag sequence have a length of 8 nt.

33. The device according to claim 27, characterized in that there is a mismatch of at least 2 nt between the first tag sequence and the second tag sequence.

34. The device according to claim 27, characterized in that the nucleotide sequence of the first strand of the adapter is the sequence shown in SEQ ID NO: 1, the nucleotide sequence of the second strand of the adapter is the sequence shown in SEQ ID NO: 2, the nucleotide sequence of the first tag is a sequence shown in at least one of SEQ ID NO: 3-6, the nucleotide sequence of the second tag is a sequence shown in at least one of SEQ ID NO: 7-10, the nucleotide sequence of the first primer is the sequence shown in SEQ ID NO: 11, the nucleotide sequence of the second primer is the sequence shown in SEQ ID NO: 12, and the nucleotide sequence of the third primer is the sequence shown in SEQ ID NO: 13.

35. A sequencing apparatus, characterized by comprising:

the device for constructing a sequencing library according to any one of claims 27 to 34;

a sequencing device, for sequencing the sequencing library.

36. The sequencing apparatus according to claim 35, characterized in that the sequencing device is a Hiseq2000 or Hiseq2500.

37. A system for determining a nucleotide sequence, characterized by comprising:

the sequencing apparatus according to claim 35 or 36, for sequencing a nucleic acid sample to obtain a sequencing result composed of multiple sequencing data;

a sequencing data subset construction apparatus, for constructing at least one sequencing data subset based on the sequencing result, wherein all the sequencing data in each sequencing data subset correspond to the same source sequence on the nucleic acid sample;

a sequencing data classification apparatus, for determining, for each sequencing data subset, that the sequencing data corresponding to the first tag sequence are plus-strand sequencing data and that the sequencing data corresponding to the second tag sequence are minus-strand sequencing data;

a sequencing data correction apparatus, for correcting, for each sequencing data subset, the sequencing data based on the plus-strand sequencing data and the minus-strand sequencing data respectively, to determine corrected sequencing data; and

a sequence determination apparatus, for determining the sequence of the nucleic acid sample based on the corrected sequencing data.

38. The system according to claim 37, characterized in that the sequencing is paired-end sequencing, and the sequencing result is composed of multiple pairs of paired sequencing data.

39. The system according to claim 37, characterized in that the sequencing data subset construction apparatus comprises:

a sequencing data index determination apparatus, for determining, for each pair of the multiple pairs of paired sequencing data, a paired sequencing data index, the paired sequencing data index being composed of the first N bases of each read of the paired sequencing data, wherein N is an integer between 10 and 20;

a preliminary screening device, for constructing at least one preliminary sequencing data subset based on the paired sequencing data index, wherein every sequencing data in the preliminary sequencing data subset has the same paired sequencing data index; and

a secondary screening device, for subdividing the at least one preliminary sequencing data subset based on the Hamming distance between the sequencing data in the preliminary sequencing data subset, to obtain multiple sequencing data subsets.

40. The system according to claim 39, characterized in that N is 12.

41. The system according to claim 39, characterized in that, in each of the multiple sequencing data subsets, the Hamming distance between any two pairs of paired sequencing data is no more than 20.

42. The system according to claim 39, characterized in that, in each of the multiple sequencing data subsets, there are at least two plus-strand sequencing data and at least two minus-strand sequencing data.

43. The system according to claim 42, characterized in that determining the corrected sequencing data based on the plus-strand sequencing data and the minus-strand sequencing data is carried out according to the following principle:

44. The system according to claim 43, characterized in that each base in the corrected sequencing data is simultaneously supported by at least 80% of the plus-strand sequencing data and at least 80% of the minus-strand sequencing data.

45. The system according to claim 43, characterized by further comprising:

46. The system according to claim 37, characterized by further comprising a sequence analysis device, the sequence analysis device being used for performing SNV analysis or indel analysis based on the sequence of the nucleic acid sample.