Tags, tag primers, kit and uses thereof
Technical field
The present invention relates to the processing of mixed data, and in particular to tags and related sequences that allow mixed sequencing data to be assigned to the source samples from which they were derived.
Background art
Sanger sequencing is the gold standard for genotyping. Time-of-flight mass spectrometry can genotype specific sites; for example, a product released by BGI (Shenzhen) performs mass spectrometry detection on 20 sites in four genes commonly mutated in deafness, and these 20 sites account for a major share of the pathogenic factors in the Chinese deaf population. Whole-genome sequencing is a third option. Each of the three methods has its limitations: Sanger sequencing and mass spectrometry have low throughput and high cost, while whole-genome sequencing cannot make effective use of all of its sequencing data.
Congenital deafness is a common disease. Its incidence in Chinese neonates is higher than 1‰, and more than 60% of cases are caused by genetic factors. Therefore, in addition to routine medical diagnosis, genotyping the relevant genes to determine whether a mutation has occurred can assist in diagnosing whether a neonate has deafness.
According to molecular epidemiological investigations of deafness gene mutations in the Chinese population carried out by Chinese researchers, mutations in GJB2, GJB3, SLC26A4 and 12S rRNA are the most common, with a mutation proportion in the population of up to 40%. The mutation sites in these four genes are the common mutations that cause hereditary hearing loss.
Summary of the invention
One aspect of the present invention provides a set of tags comprising at least 2 of the sequences shown in SEQ ID NO: 27~124.
Another aspect of the present invention provides a set of tag primers comprising 1 pair of tag primers. The structural formula of the tag primers is [structural formula], wherein X and X' are the same sequence or different sequences selected from the tags provided by the first aspect of the invention, and Y is any one of the sequences shown in SEQ ID NO: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 and 25. Y' corresponds to Y, meaning that when Y is SEQ ID NO: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 or 25, Y' is SEQ ID NO: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24 or 26, respectively.
A further aspect of the present invention provides a kit comprising the set of tags provided by the first aspect of the invention and/or the tag primers provided by the second aspect.
Yet another aspect of the present invention provides the use of the aforementioned kit in labeling mixed nucleic acid samples, and/or in the mixed sequencing of multiple nucleic acid samples, and/or in determining the sample source of data in mixed sequencing data, and/or in detecting mutations in deafness-related genes.
Labeling samples with tags introduced by the tag primers of the invention, or using the kit of the invention, makes it possible to distinguish a number of samples greater than or equal to the number of tags. For example, a sample can be labeled with one primer pair whose upstream and downstream primers both carry the same tag, so that the nucleic acid amplification products of the sample carry one specific tag sequence that distinguishes it from other samples. As another example, a sample can be labeled with one primer pair whose upstream and downstream primers carry different tags, so that its amplification products carry two tag sequences; the sample can then be distinguished from the others as long as either of the two tags differs from those of the other samples. As yet another example, a sample can be labeled with multiple primer pairs, in which the upstream and downstream primers of each pair carry the same or different tags and the tags are the same or different between pairs, so that the amplification products of the sample carry one or more tags in a particular arrangement; a difference in any tag or tag position is then enough to distinguish different samples.
The set of tags or tag primers provided by the invention was obtained by the inventors through repeated experiments and creative work. The design takes into account the base composition of each sequence, the proportions of the various bases, and the relationships between sequences, such as between tags, between tags and primers, between primers, and between different tag primers once the tags are attached to the primers, so that all or any part of these sequences can function together in the same reaction system. Using all 13 primer pairs of the tag primers provided by the invention, the deafness-related genes common in the Chinese population can be captured and amplified together, and mutations in these genes detected, to assist the clinical diagnosis of deafness.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawing, in which:
Fig. 1 is a schematic diagram of library construction in a specific embodiment of the present invention.
Detailed description of embodiments
According to one embodiment of the present invention, a set of tags is provided comprising at least 2 of the sequences shown in SEQ ID NO: 27~124.
According to specific embodiments of the present invention, the tags used include at least 5, at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80 or at least 90 of the sequences shown in SEQ ID NO: 27~124, or all 98 of them. The sequences SEQ ID NO: 27~124 are shown in Table 1.
Table 1
The set of tags of this embodiment was obtained by designing a large number of sequences with regard to sequence length, base composition, the proportion of bases at each position and the relationship to the bases of the other tags, and then screening them in repeated experiments. Some or all of the tags of the invention can be placed in the same reaction system without interfering with one another, and without interfering with other reactants or reactions in conventional systems; for example, they do not affect the reaction systems and reactions of library construction, or the fixed sequences on the sequencing chip.
According to another embodiment of the present invention, a set of tag primers is provided comprising 1 pair of tag primers. The structural formula of the tag primers is [structural formula], wherein X and X' are selected from the tags provided by the first aspect of the invention, and Y is any one of the sequences shown in SEQ ID NO: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 and 25. Y' corresponds to Y, as follows:
when Y is SEQ ID NO: 1, Y' is SEQ ID NO: 2;
when Y is SEQ ID NO: 3, Y' is SEQ ID NO: 4;
when Y is SEQ ID NO: 5, Y' is SEQ ID NO: 6;
when Y is SEQ ID NO: 7, Y' is SEQ ID NO: 8;
when Y is SEQ ID NO: 9, Y' is SEQ ID NO: 10;
when Y is SEQ ID NO: 11, Y' is SEQ ID NO: 12;
when Y is SEQ ID NO: 13, Y' is SEQ ID NO: 14;
when Y is SEQ ID NO: 15, Y' is SEQ ID NO: 16;
when Y is SEQ ID NO: 17, Y' is SEQ ID NO: 18;
when Y is SEQ ID NO: 19, Y' is SEQ ID NO: 20;
when Y is SEQ ID NO: 21, Y' is SEQ ID NO: 22;
when Y is SEQ ID NO: 23, Y' is SEQ ID NO: 24;
when Y is SEQ ID NO: 25, Y' is SEQ ID NO: 26.
According to a specific embodiment of the present invention, the set of tag primers includes 2 pairs of tag primers, whose Y sequences are any 2 of the sequences shown in SEQ ID NO: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 and 25.
According to a specific embodiment of the present invention, the set of tag primers includes 5 pairs of tag primers, whose Y sequences are any 5 of the sequences shown in SEQ ID NO: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 and 25.
According to a specific embodiment of the present invention, the set of tag primers includes 10 pairs of tag primers, whose Y sequences are any 10 of the sequences shown in SEQ ID NO: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 and 25.
According to a specific embodiment of the present invention, the set of tag primers includes 13 pairs of tag primers, whose Y sequences are, respectively, the sequences shown in SEQ ID NO: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 and 25. The sequences SEQ ID NO: 1~26 are shown in Table 2.
Table 2
Primer number | Primer sequence
F1 | TCTTTTCCAGAGCAAACCGC (SEQ ID NO: 1)
F2 | ACGTGCATGGCCACTAGGAG (SEQ ID NO: 3)
F3 | TGCAGCTGATCTTCGTGTCC (SEQ ID NO: 5)
F4 | ATGGTGAGTACGATGCAGAC (SEQ ID NO: 7)
F5 | GCCTTTGGTGTGCTAAAGAC (SEQ ID NO: 9)
F6 | GGGTTCCAGGAAATTACTTTG (SEQ ID NO: 11)
F7 | AAATGATCGGTTTAGACAC (SEQ ID NO: 13)
F8 | AGGATCGTTGTCATCCAGTC (SEQ ID NO: 15)
F9 | TAGGGCCTATTCCTGATTGG (SEQ ID NO: 17)
F10 | CCAAAGCTCCAAATGTATA (SEQ ID NO: 19)
F11 | AGAAAAGCTGGAGCAATGCG (SEQ ID NO: 21)
F12 | ACACACAATAGCTAAGACCC (SEQ ID NO: 23)
F13 | GAGTGCTTAGTTGAACAGGG (SEQ ID NO: 25)
R1 | GGGTGTTGCAGACAAAGTCG (SEQ ID NO: 2)
R2 | TTGTGGCTGCAAAGGAGGTG (SEQ ID NO: 4)
R3 | ACCACAGGGAGCCTTCGATG (SEQ ID NO: 6)
R4 | CAAGCTCATCATTGAGTTCC (SEQ ID NO: 8)
R5 | GGAGAAGTGTTAAACTCCTG (SEQ ID NO: 10)
R6 | ACAGCTAGAGTCCTGATTGC (SEQ ID NO: 12)
R7 | TTTCCAGGTTGGCTCCATAT (SEQ ID NO: 14)
R8 | AAGGCTGTTGTTCCTACCTG (SEQ ID NO: 16)
R9 | CCAGTCCTATTTTCTATGGC (SEQ ID NO: 18)
R10 | GTGGATTGGAACTCTGAGC (SEQ ID NO: 20)
R11 | GATACATCTGTAGAAAGGTTG (SEQ ID NO: 22)
R12 | GATTACAGAACAGGCTCCTC (SEQ ID NO: 24)
R13 | AAGCTACACTCTGGTTCGTC (SEQ ID NO: 26)
With the set of tag primers of this embodiment, the corresponding nucleic acid regions of each sample can be amplified so that the amplification products of each sample carry one or more corresponding tags in a defined order. This establishes a correspondence between samples and tags. Using that correspondence, mixed sample nucleic acid sequence data, typically a very large volume of mixed data, can be assigned to the correct source samples and the nucleic acid information of each sample analyzed.
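The assembly of a tag primer pair can be sketched as follows. This is a minimal illustration, assuming the tag is simply prepended to the 5' end of the primer, as described for the previously prepared tag primers in Embodiment 1. The primer sequences are F1 and R1 from Table 2; the 7-nt tags and sample names are hypothetical placeholders, not sequences from Table 1.

```python
# Primer pair 1 from Table 2 (SEQ ID NO: 1 and SEQ ID NO: 2).
F1 = "TCTTTTCCAGAGCAAACCGC"
R1 = "GGGTGTTGCAGACAAAGTCG"

# Hypothetical 7-nt tags for illustration only; NOT taken from Table 1.
HYPOTHETICAL_TAGS = {"sample_01": "ACGTACG", "sample_02": "TGCATGC"}


def make_tag_primer_pair(tag_fwd, tag_rev, primer_fwd, primer_rev):
    """Return (X + Y, X' + Y'): each tag joined to the 5' end of its primer.

    All sequences are written 5' -> 3'.
    """
    for seq in (tag_fwd, tag_rev, primer_fwd, primer_rev):
        if not seq or set(seq) - set("ACGT"):
            raise ValueError(f"not a plain DNA sequence: {seq!r}")
    return tag_fwd + primer_fwd, tag_rev + primer_rev


# Here the same tag is used on both primers of a sample (X == X').
pairs = {
    sample: make_tag_primer_pair(tag, tag, F1, R1)
    for sample, tag in HYPOTHETICAL_TAGS.items()
}

for sample, (fwd, rev) in pairs.items():
    print(sample, fwd, rev)
```

Recording the `sample -> tag` dictionary at this stage is what later allows reads to be traced back to their source sample.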
According to a further embodiment of the present invention, a kit is provided comprising the tags provided by the first aspect of the invention. According to a specific embodiment of the present invention, the kit also comprises the tag primers provided by the second aspect.
According to yet another embodiment of the present invention, there is provided the use of the aforementioned kit in labeling mixed nucleic acid samples, and/or in the mixed sequencing of multiple nucleic acid samples, and/or in determining the sample source of data in mixed sequencing data.
Using the sequences, kit or method provided by the invention, a number of samples greater than or equal to, or much larger than, the number of tags can be labeled and distinguished, so that the data of multiple samples can be processed together and the mixed data finally separated by the tag correspondence and assigned to each sample.
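A rough count shows why the number of distinguishable samples can exceed the number of tags. The sketch below assumes a single primer pair per sample and assumes that the tag on the upstream primer and the tag on the downstream primer can be told apart by their position in the product; the function name is hypothetical.

```python
def distinguishable_samples(n_tags, same_tag_both_ends):
    """Count tag assignments for one primer pair per sample.

    same_tag_both_ends=True : X == X', one tag identifies one sample.
    same_tag_both_ends=False: any ordered pair (X, X') identifies a sample.
    """
    if n_tags < 1:
        raise ValueError("n_tags must be positive")
    return n_tags if same_tag_both_ends else n_tags * n_tags


# 98 tags corresponds to SEQ ID NO: 27~124.
print(distinguishable_samples(98, same_tag_both_ends=True))
print(distinguishable_samples(98, same_tag_both_ends=False))
```

With several primer pairs per sample carrying different tags, the number of possible arrangements grows further.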
Embodiment 1: Library construction and sequencing of mixed samples
Blood samples from 96 individuals were obtained from Tianjin Healthcare Hospital for Women & Children. Reagents and instruments were conventional commercial products, for example purchased from Life Technologies.
After DNA was extracted from the blood samples, the DNA template of each of the 96 samples was placed in a separate well of a 96-well plate, and a PCR labeling reaction was performed on all samples in the plate. 13 pairs of tag primers were added to each well. The tag primers were prepared in advance by attaching a tag sequence to the 5' end of each primer, and the correspondence between each sample and the PCR tag (barcode or index) introduced was recorded. The plate was then placed in a PCR instrument for amplification to obtain the target sequence amplification products. The components of the multiplex PCR reaction system and their amounts or ratios can follow a known multiplex PCR reaction system, for example the one described in Hayden MJ, Nguyen TM, Waterman A, Chalmers KJ. 2008. Multiplex-ready PCR: a new method for multiplexed SSR and SNP genotyping. BMC Genomics; doi:10.1186/1471-2164-9-80. The tags and primers used to synthesize the tag primers were selected from Table 1 and Table 2, respectively. The labeled amplification products were then mixed and purified, and a mixed library was constructed according to the library construction instructions of the sequencing platform used, here the Proton library construction manual, including end repair and adapter ligation. One library thus contained 96 samples. Library fragment size and concentration were then checked with an Agilent Bioanalyzer 2100 to obtain a library qualified for sequencing, which was sequenced following the run workflow of the Ion Proton semiconductor chip sequencing platform.
Because the amplified regions in this embodiment are small, the data volume of one sample is far below that of one lane of the Proton sequencer. The Illumina multiplex sequencing technique (Meyer M, Kircher M. 2010. Illumina Sequencing Library Preparation for Highly Multiplexed Target Capture and Sequencing. Cold Spring Harb. Protoc.; doi:10.1101/pdb.prot5448) was therefore taken as a reference and transplanted to the Proton platform. For example, during library construction, multiple mixed libraries are labeled by ligating tagged adapters or by an amplification that introduces new tags, so that multiple mixed libraries can be combined and sequenced together. Fig. 1 illustrates labeling each sample by amplification with the tag sequences of the invention and mixing the sample nucleic acids for library construction to obtain one or more libraries. If multiple mixed-sample libraries are built, new tags are introduced at adapter ligation or in an amplification after adapter ligation. The tags introduced at this point can come from a publicly available tag set, or can be selected from the sequences in Table 1 other than those used to label the samples, in order to distinguish the multiple mixed libraries.
Embodiment 2: Sorting of mixed data
After the raw sequencing data are obtained, reads that are too short are screened out by read length, for example reads shorter than 50 bp are filtered out, and the remaining reads are then sorted according to the correspondence between tags and samples.
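The length filter can be sketched as below. This is a minimal illustration, assuming reads are held in memory as `(name, sequence, quality)` tuples; that representation and the function name are assumptions, and only the 50 bp threshold comes from the text.

```python
MIN_READ_LENGTH = 50  # bp; the example threshold given in the text


def filter_short_reads(reads, min_len=MIN_READ_LENGTH):
    """Keep reads whose sequence is at least min_len bases long.

    reads: iterable of (name, sequence, quality) tuples (assumed layout).
    """
    return [read for read in reads if len(read[1]) >= min_len]


reads = [
    ("read_a", "A" * 49, "I" * 49),   # too short, removed
    ("read_b", "C" * 50, "I" * 50),   # exactly 50 bp, kept
    ("read_c", "G" * 120, "I" * 120),  # kept
]
kept = filter_short_reads(reads)
print([name for name, _, _ in kept])
```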
A dedicated reference sequence is built. Here the reference sequence consists of the target sequences (target amplified regions); the regions to extract are determined from the positions in the reference genome to which the primers of Embodiment 1 align. The mixed data can be sorted by the traditional method: from soft-clipped reads (soft clip reads) of which at least one end aligns to the reference sequence, the 7 bp at the 5' end, i.e. the tag length, is extracted; the extracted 7 bp is matched against the tag sequences that label the samples, and the read is assigned to the matching sample.
In general, soft clip reads are reads that are split into two segments matching different regions when aligned back to the genome. They typically arise from a deletion of a genome segment or from splicing in the transcriptome, where the read spans the deleted segment or the splice site. Here, soft clip reads are reads of which only a part aligns to the dedicated reference sequence; the segment that does not align is called the soft clip.
From the raw data of this semiconductor chip sequencing platform, the inventors found the following when reads are aligned directly to a reference composed of the designed amplified (captured) region sequences and the alignment positions of the reads on this reference are examined. When the alignment starts at the 8th base of a read, the read carries a complete barcode of exactly 7 bp at its 5' end; similarly, when the alignment ends 7 bp before the 3' end of the read, the read carries a complete barcode at its 3' end. Statistically, 64% of reads have a complete barcode at both the 5' and 3' ends, 14% have a complete barcode only at the 5' end, and 12% have a complete barcode only at the 3' end.
Based on this finding, the inventors propose another method of sorting mixed data, which can improve the accuracy of the assignment by at least 12%. The method assigns reads to their source samples as follows. First, the reads are aligned with tmap, the alignment software from Life Technologies, using default parameters, to the reference composed of the target sequences (amplified regions). Then, using the soft clip information in the alignment result, each soft clip read whose 5' end or 3' end has a soft clip of 7 bp is deemed to carry a complete barcode sequence at that end. If a read has complete barcode sequences at both the 5' and 3' ends, the two are compared, and the read is discarded if they differ. If a read has a barcode sequence only at the 5' end or only at the 3' end, that sequence is taken as the barcode of the read. From the previously recorded correspondence between barcodes and samples, the sample corresponding to the barcode is determined; the part corresponding to the barcode is trimmed from the sequence and quality values of the read, and the remainder is added to the data of that sample. In this step, data utilization can reach 90%.
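The assignment rule can be sketched as follows. This is an illustration under stated assumptions rather than the inventors' implementation: the read is taken to be aligned in forward orientation, so the left soft clip of the CIGAR string is the 5' end; a 3' barcode is assumed to appear as the reverse complement of the tag and is reverse-complemented before comparison; and the function names and the barcode table are hypothetical. Only the 7 bp barcode length and the decision rules come from the text.

```python
import re

BARCODE_LEN = 7  # bp, the tag length given in the text
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def end_soft_clips(cigar):
    """Return (left, right) soft-clip lengths of a CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # hard clips are not in SEQ
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def assign_read(seq, qual, cigar, barcode_to_sample):
    """Return (sample, trimmed_seq, trimmed_qual), or None if unassignable."""
    left, right = end_soft_clips(cigar)
    bc5 = seq[:BARCODE_LEN] if left == BARCODE_LEN else None
    # Assumption: the 3' clip is the reverse complement of the tag.
    bc3 = revcomp(seq[-BARCODE_LEN:]) if right == BARCODE_LEN else None
    if bc5 and bc3 and bc5 != bc3:
        return None  # barcodes at the two ends disagree: discard the read
    sample = barcode_to_sample.get(bc5 or bc3)
    if sample is None:
        return None  # no complete barcode, or barcode not recorded
    start = BARCODE_LEN if bc5 else 0
    end = len(seq) - BARCODE_LEN if bc3 else len(seq)
    return sample, seq[start:end], qual[start:end]


# Hypothetical barcode table; "ACGTACG" is not a sequence from Table 1.
table = {"ACGTACG": "sample_01"}
read = "ACGTACG" + "T" * 20 + "CGTACGT"
print(assign_read(read, "I" * len(read), "7S20M7S", table))
```

Because reads with a barcode at only one end are kept, the three read categories reported above (64%, 14% and 12%) together account for the stated 90% data utilization.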
Embodiment 3: Variant detection in sample nucleic acids
After the mixed sequencing data have been correctly sorted, variant detection and analysis are performed for each sample individually, using the alignment software tmap and the variant detection software tvc provided by Life Technologies. Both programs can be run on the server supplied with the Proton sequencer; their default parameters are set for the human genome and need no adjustment. Following the documentation of tmap and Torrent Variant Caller (tvc), alignment and variant detection can be completed using the amplified region information and the correctly sorted reads. The core principle of variant detection is Bayesian inference (Shoemaker JS, Painter IS, Weir BS. 1999. Bayesian statistics in genetics: a guide for the uninitiated. Trends Genet 15:354–358). By combining the alignment result with the base quality values of the reads, the algorithm derives the genotype at each site and thereby determines whether the gene is mutated.
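The idea of Bayesian genotyping from base calls and quality values can be sketched with a textbook diploid model. This is not the algorithm implemented in tvc; the error model, the prior values and the function name are assumptions chosen for illustration.

```python
import math


def genotype_posteriors(bases, quals, ref, alt, prior_het=0.001):
    """Posterior probability of each diploid genotype at one site.

    bases: observed base calls covering the site, one per read.
    quals: Phred quality of each call (error = 10 ** (-Q / 10)).
    Priors are illustrative assumptions, not values from tvc.
    """
    priors = {
        (ref, ref): 1.0 - 1.5 * prior_het,
        (ref, alt): prior_het,
        (alt, alt): prior_het / 2.0,
    }
    log_post = {}
    for (a1, a2), prior in priors.items():
        total = math.log(prior)
        for base, q in zip(bases, quals):
            err = min(10 ** (-q / 10), 0.75)  # cap keeps probabilities > 0
            p1 = 1 - err if base == a1 else err / 3
            p2 = 1 - err if base == a2 else err / 3
            total += math.log(0.5 * p1 + 0.5 * p2)  # read from either allele
        log_post[(a1, a2)] = total
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}


post = genotype_posteriors("G" * 20, [30] * 20, ref="G", alt="A")
best = max(post, key=post.get)
print(".".join(best), round(post[best], 4))
```

The most probable genotype is written here as two letters separated by a dot to match the notation used for SNP results in Table 3.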
Simultaneous amplification with the 13 pairs of tag primers enables variant detection in the four deafness-related genes GJB2, GJB3, SLC26A4 and 12S rRNA. In a test of 300 samples in total, including the 96 samples above, concordance with Sanger sequencing results was 100%. The results of this method for some of the samples are shown in Table 3. Because humans are diploid, each result in Table 3 is given as two letters. A SNP is reported directly as bases; for example, the result for the point mutation "9G>A" in sample "14HL078963" is "G.G", which means that neither allele at this site is mutated, i.e. no SNP is present at the site. For insertion and deletion (indel) mutations, R denotes unmutated and V denotes mutated. Sanger sequencing of the mutations in these samples gave the same results as Table 3. In addition, this method generally takes 2 to 3 days, 2 days less than mass spectrometry, and its sequencing cost is only 1% of that of whole-genome sequencing. A single sequencing run can detect 500 or more samples at once, which greatly increases detection throughput.
Table 3