A copy number variation detection method based on next-generation sequencing
Technical field
The invention belongs to the field of high-throughput sequencing of DNA molecules, and in particular relates to a copy number variation detection method based on next-generation sequencing.
Background technology
Copy number variation (CNV) is an important phenomenon in cancer genomes. It manifests mainly as two states, amplification and deletion of copy number, and is closely tied to the occurrence and development of cancer cells. Detecting CNVs that recur in the same region across multiple cancer samples, integrating the analysis of how CNVs affect genome-wide expression, and identifying the cancer genes whose expression is affected by CNVs are of great significance for studying the occurrence and metastasis of cancer. Although CNV detection methods based on a single sample are increasingly mature, their sensitivity and accuracy still cannot meet the needs of detecting CNV regions shared by multiple samples. Systematic analysis of CNVs therefore provides an important route to studying the pathogenesis of cancer at the molecular level, and its most fundamental and central problem is how to detect, in multiple cancer samples, the CNVs associated with tumor-related genes.
Next-generation sequencing (NGS) is a high-throughput sequencing technology that obtains a million or even several million short sequences in a single run, with the advantages of high speed, high resolution, low cost, and good repeatability. Detecting CNVs from NGS data therefore greatly improves speed and accuracy while also reducing cost.
Numerous studies show that CNV functional patterns are often implied in regions of consistent variation across cancer genome samples, and that the number of reads aligned to each genomic region in NGS is proportional to the copy number of that region. Establishing a statistically grounded computational method that detects the significance level of CNVs that are common (concurrent) across multiple cancer samples therefore provides a direct and feasible technical means of identifying CNV functional patterns and discovering potential cancer genes, and in turn provides important information for biologists and physicians in predicting and diagnosing cancer. Establishing a reasonable and effective statistical test model is therefore essential.
The density of CNV sites in high-throughput genome-wide data and the complexity of their structure pose great challenges to building a statistical test model and detecting CNV significance, mainly in the following two respects. First, the difficulties of the problem itself: a) the number of sites can exceed 1.8 million while the number of samples is often small, giving a high-dimensional, small-sample data pattern; b) sequencing platforms and differences in sequencing level introduce systematic error, so samples at different sequencing levels must be normalized; c) the read signal (read depth, RD) corresponding to a genomic site is easily affected by noise such as sequencing errors and alignment errors; d) CNV sites are strongly correlated rather than independent, so the detection factors interact; e) detecting copy number amplification or deletion must take into account two features, the number of reads at each site and the correlation between sites, which requires a mechanism that reasonably balances the two. Second, the theoretical and methodological challenges of solving the problem: a) the data are large, so effectively controlling the time and space complexity of the computation is a challenge; b) how to take full account of the correlation between CNV sites and reduce the conservativeness of CNV significance estimates is a difficulty; c) how to build a null distribution consistent with the statistic, and thereby strengthen the statistical meaning of the significance estimate, is a key problem that has not yet been solved.
In terms of sample size, existing copy number variation detection methods fall broadly into methods based on single-sample analysis and methods based on multiple samples. In terms of technology, they mainly include detection based on fluorescence in situ hybridization, array-based comparative genomic hybridization, and copy number detection based on next-generation sequencing. The first two have low resolution and have difficulty detecting short CNVs, whereas NGS-based methods stand out because of their high throughput. NGS-based CNV detection follows two main technical routes: PEM (paired-end mapping) signatures and DOC (depth of coverage). PEM-based methods can detect small CNVs but have difficulty detecting large insertions (copy number amplifications) and CNVs in complex regions (such as SDs). DOC-based methods can detect large CNVs. Some methods therefore combine the two; CNVer, for example, improves the breakpoint accuracy of CNV regions by integrating DOC and PEM signatures. DOC-based methods are currently the more favored.
Segmentation-based DOC detection models mainly involve different segmentation methods, such as CBS and LASSO, and different segmentation methods produce different detection results. ReadDepth, for example, uses the CBS segmentation algorithm to identify the boundaries of copy number variation more accurately and retains high sensitivity and specificity on low-coverage data. FREEC does not require a control sample and uses LASSO regression to obtain accurate CNV boundaries, but it ignores local variation in read counts, which easily leads to false detections; its GC-content normalization may also be affected by subclones, which in turn affects CNV detection. Segseq and rSW-seq compare directly against a control sample and can therefore detect and identify CNV regions quickly and accurately, but they do not consider the local features of multiple samples, which leads to large errors in the results. Owing to the sequencing technology and the local features of the genome, segmentation algorithms can raise the false positive rate of the results. SeqCNA likewise does not require a control sample; it uses LOESS or polymorphic matching and is suited to detecting small local CNVs, but it is not suited to cancer sample data.
DOC statistical significance models based on hypothesis testing mainly involve two elements, the test statistic and the null distribution, whose design directly affects the validity of the significance estimate and the ability to identify CNV functional patterns. EWT fits a Gaussian probability distribution to the RD of consecutive segments (windows) and tests for CNVs with a one-sided Z-test; it can detect large copy number variation regions, but it does not consider the correlation between sites, cannot accurately locate insertions (CNVs), and is insensitive to small CNVs. CNV-seq fits a Poisson distribution to the RD ratio (relative to a reference sample) of non-overlapping segments (windows), computes the significance of the Z-score, and introduces a segmentation algorithm to detect CNVs; this improves sensitivity on low-coverage data but tends to raise the false positive rate. CNA-seg, which builds on the HMM methods of segseq and JointSLM, additionally introduces the chi-square (χ²) statistic to detect CNVs.
DOC-based methods for detecting CNVs common to multiple samples are still immature; they mainly include CMDS [17], cn.MOPS, JointSLM, and a method based on a penalized sparse regression model. CMDS builds a correlation diagonal matrix for each single site across multiple samples and computes its significance to detect CNVs; it is more accurate than single-sample detection and also improves the trade-off between time and space complexity. cn.MOPS reduces the influence of technical and biological noise and is suited to detecting CNVs whose amplitude in the same region differs between samples, but it is insensitive to CNVs of consistent amplitude. JointSLM extends EWT to multi-sample detection and introduces a hidden Markov model (HMM) to detect CNVs, but it fails when the common CNV is present in only some of the samples. The penalized regression method fits a penalized regression model to the RD signal of multiple samples, converts common CNV (cCNV) boundary detection into a change point detection problem, and applies significance testing, thereby improving accuracy and reducing the false discovery rate. Its accuracy declines, however, when the samples differ in ancestry.
A comparative analysis of these existing DOC-based models [3,7,9-27] shows that most methods produce a very high false discovery rate, especially when no reference sample is available. Existing NGS-based significance models all take the CNV structural fragment as the detection primitive when designing the statistic, and use the frequency and amplitude of the CNV and the correlation between CNV sites when quantifying it. Most methods construct the null distribution by a random permutation strategy.
From the standpoint of the biological characteristics of CNV data, CNV sites are not independent: adjacent CNV sites form an organic whole. Using a single site as the detection primitive therefore makes it hard to estimate the significance level of a CNV objectively, while using a structural fragment as the detection primitive tends to ignore the correlation between the sites inside the structure. Moreover, although several methods consider both the read count of a CNV and the correlation between sites when computing the statistic, they do not weigh the two features reasonably, so CNVs are easily detected falsely.
Existing methods for detecting the CNV significance level mainly have the following deficiencies:
(1) A statistic that takes a single CNV site as its primitive easily makes the significance estimate conservative; taking the CNV structural fragment as the primitive preserves the inherent structure of the copy number to some extent but ignores the correlation between internal sites, making it hard to estimate the significance level of the CNV statistic objectively;
(2) The frequency of the CNV and the correlation between variant sites are not reasonably balanced, so the biological manifestation linking CNVs to cancer is hard to pinpoint;
(3) When methods based on single-sample detection are used to detect the cCNVs of multiple samples, systematic error or platform error is a serious problem;
(4) Multiple samples from different sequencing platforms or sequencing levels cannot be integrated automatically, which greatly limits the detection of CNV functional patterns concurrent across multiple samples;
(5) The methods are insensitive to low-coverage sample data and perform poorly on it.
Summary of the invention
The object of the invention is to provide a copy number variation detection method based on next-generation sequencing that applies different normalization measures to data of different coverage, making the data easier to work with and reducing systematic error; that integrates multiple samples and proposes a theory and method for detecting the significance level with the CNV structural unit as the primitive; and that, guided by a supervised learning mechanism, builds a null distribution consistent with the statistic so as to improve the accuracy of the significance estimate.
The invention is realized as follows: a copy number variation detection method based on next-generation sequencing, comprising the following steps:
Preprocessing of the copy number variation data: to address batch effects in the CNV signal and quality problems in the alignment process, filter out reads with relatively low alignment quality; adjust the read count at each site of the data samples by normalizing for GC content; normalize the sequencing levels of the multiple samples to data at the same sequencing level; for data samples with low coverage depth, normalize the data directly to the same level; for data samples with high coverage depth, first define the copy number amplification and deletion states according to the features of the data's frequency histogram;
Construction of the sliding window: integrate the normalized samples to obtain a high-dimensional matrix; construct a sliding window that, starting from the first position, calculates the frequency of each site and simultaneously uses the Pearson formula to calculate the correlation between the sites in each window, and slide the window step by step until every site has been covered;
Calculation of the statistic: for each sliding window, calculate a statistic for each site that reflects the amplification or deletion state of the copy number variation; construct a training set from known copy number variation patterns and learn the weights of the frequency and the correlation coefficient, w1 and w2, to calculate the statistic:
S_test = w1*f + w2*a
where f, a, and S_test denote, respectively, the frequency of the copy number variation pattern, the correlation, and the value of the statistic in the training set;
Implementation of the permutation strategy and construction of the null distribution: for the normalized samples, calculate the detection statistic corresponding to each site across the whole genome and construct the null distribution T; then randomly permute the sample data: for each sample, randomly permute the positions at which its values occur in the genome, until all s samples have been permuted, forming one overall permuted sample set; for each permuted sample set, calculate the statistic for concurrent copy number variation; finally, calculate the significance level of the detection statistic;
Estimation of the CNV significance level: CNV regions are evaluated by the p values obtained for all sites of the samples; if a p value is below a set threshold (such as 0.05), the CNV is considered to have biological meaning or a function in cancer. For each CNV structural unit, null distributions are built separately for the amplification and deletion states so that the significance level of each state is tested separately.
Performance evaluation of the algorithm: evaluate whether the algorithm achieves a high true positive rate (TPR) while the false positive rate (FPR) is controlled; evaluate whether the algorithm estimates p values accurately (Type I error rate); evaluate its ability to detect the boundaries of copy number variation; and analyze the computational complexity of the algorithm.
Further, in filtering out reads with relatively low alignment quality to address batch effects in the CNV signal and quality problems in the alignment process, the reads filtered out are those with quality below Q30.
Further, the high-dimensional matrix obtained by integrating the normalized samples has size s × N, where s is the number of samples and N is the number of sites per sample; because copy number variation appears as a contiguous region, the correlation between adjacent copy number variant sites is strong, up to 0.985, while the correlation between distant sites is weak.
Further, for each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation; for low-coverage samples, the read count frequency of each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S); for high-coverage samples, the frequency histogram is used to separate accurately the amplification and deletion states of the copy number, which have different biological functions and manifestations, and the statistic (S) is calculated for each state separately.
Further, in the calculation of the statistic, S_test for the training set is assigned a relative value according to the relationship between known copy number variation patterns in public databases and gene expression levels.
Further, in calculating the detection statistic for each genome-wide site of the normalized samples, constructing the null distribution T, and then randomly permuting the sample data, the sample data form a data matrix in which each row represents a sample and each column represents a site on the genome.
Further, in the estimation of the significance level based on the null distribution designed according to CNV length, if the p value is below the set threshold of 0.05, the CNV has biological meaning or a function in cancer; the amplification and deletion states of the CNV have different biological functions and manifestations.
Further, in the performance evaluation of the algorithm, evaluating whether the algorithm estimates p values accurately means evaluating whether the statistical model of the algorithm has strong statistical meaning.
The invention solves the problem that the prior art tends to be conservative when estimating the significance of copy number variation. The invention automatically integrates multiple samples to detect regions in which they share copy number variation, avoiding the detection errors of prior art that detects copy number variation regions only in a single sample or in paired samples, and allowing the relationship between copy number variation and cancer to be studied at the level of a patient population. The invention solves the detection errors caused by differences in sequencing platform and sequencing level, making the results more accurate. For next-generation sequencing data, the invention normalizes the data using the features of the multimodal frequency histogram so as to divide normal regions from copy number variation regions accurately. The prior art considers only the read count at copy number variant sites, and when the statistic is designed the read count of the variation and the correlation between adjacent variant sites are treated inconsistently; the invention addresses this by considering the combined effect of the read count and the correlation between variant sites, building a new model that resolves the inconsistency and estimates the significance level of copy number variation objectively.
When detecting multi-sample cCNVs, the invention integrates multiple samples, which reduces the systematic error or sequencing platform error that arises when samples are tested one at a time by single-sample methods, and greatly improves detection.
In the initial normalization (standardization) of the data, the invention uses different processing for data at different sequencing levels. Whereas the prior art is insensitive on low-coverage data, the invention has high sensitivity regardless of sequencing coverage, which lays the foundation for subsequently improving the accuracy of copy number variation detection.
In detecting copy number variation in regions common to multiple samples, one must consider not only that the regions showing copy number variation in multiple samples present the same amplification or deletion signal, but also that the correlation between adjacent sites is biologically important for detection. A statistic and statistical test model built on both of these structural features therefore help estimate the significance level of copy number variation in common regions more objectively. The prior art often emphasizes only the amplitude of the copy number variation region and ignores the correlation between sites. The invention considers both features in building the statistical test model and balances them through a supervised learning strategy so that the statistic is calculated reasonably; this makes the hypothesis test model consistent with the statistic and strengthens both the statistical and the biological meaning of the significance estimate.
In data processing, the invention applies different normalization methods to data of different coverage levels; for high-coverage data in particular, it first defines the copy number amplification and deletion states from the features of the data's frequency histogram, separating a data set containing only normal (0) and amplification (1) values from a data set containing only normal (0) and deletion (-1) values. When designing the statistic, the invention takes the single site as the detection primitive and, when quantifying the statistic, combines the read count at each CNV site with the correlation between sites, which fundamentally improves the accuracy of the significance estimate. The invention integrates multiple samples and uses supervised learning to balance the two features, the read count (amplitude) at genome-wide sites and the correlation between sites, so that the statistic is quantified reasonably, and it constructs a hypothesis test model consistent with the statistic, thereby strengthening the statistical meaning of the significance estimate.
On simulated data comprising 5 samples with 18 concurrent copy number variations (cCNVs), the invention detects 17 cCNV regions, whereas prior art such as FREEC, applied sample by sample and then compared globally, detects only 15 cCNV regions. Extensive experiments also show that, compared with FREEC, the invention detects the boundaries of copy number variation regions more accurately.
Brief description of the drawings
Fig. 1 is a flow chart of the copy number variation detection method based on next-generation sequencing provided by an embodiment of the invention.
Detailed description of the invention
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The application principle of the invention is further described below with reference to the accompanying drawing.
A copy number variation detection method based on next-generation sequencing comprises the following steps:
S101: preprocessing of the copy number variation data: to address batch effects in the CNV signal and quality problems in the alignment process, filter out reads with relatively low alignment quality; adjust the read count at each site of the data samples by normalizing for GC content; normalize the sequencing levels of the multiple samples to data at the same sequencing level; for data samples with low coverage depth, normalize the data directly to the same level; for data samples with high coverage depth, first define the copy number amplification and deletion states according to the features of the data's frequency histogram;
S102: construction of the sliding window: integrate the normalized samples to obtain a high-dimensional matrix; construct a sliding window that, starting from the first position, calculates the frequency of each site and simultaneously uses the Pearson formula to calculate the correlation between the sites in each window, and slide the window step by step until every site has been covered;
S103: calculation of the statistic: for each sliding window, calculate a statistic that reflects the amplification or deletion state of the copy number variation; construct a training set from known copy number variation patterns and learn the weights of the frequency and the correlation coefficient, w1 and w2, to calculate the statistic:
S_test = w1*f + w2*a
where f, a, and S_test denote, respectively, the frequency of the copy number variation pattern, the correlation, and the value of the statistic in the training set;
S104: implementation of the permutation strategy and construction of the null distribution: for the normalized samples, calculate the detection statistic corresponding to each site across the whole genome and construct the null distribution T; then randomly permute the sample data: for each sample, randomly permute the positions at which its values occur in the genome, until all s samples have been permuted, forming one overall permuted sample set; for each permuted sample set, calculate the statistic for concurrent copy number variation; finally, calculate the significance level of the detection statistic as follows:
Here p-value denotes the p value corresponding to each site of the samples, K is the number of random permutations, and T is the statistic used with the null distribution. For the i-th permutation, if the statistic obtained from that permutation is greater than T, a counter is incremented by one; the p value is obtained from the final count. (The p-value, the permutation statistic, and T are all vectors.)
S105: estimation of the CNV significance level: CNV regions are evaluated by the p values obtained for all sites of the samples; if a p value is below a set threshold (such as 0.05), the CNV is considered to have biological meaning or a function in cancer. For each CNV structural unit, null distributions are built separately for the amplification and deletion states so that the significance level of each state is tested separately.
S106: performance evaluation of the algorithm: evaluate whether the algorithm achieves a high true positive rate (TPR) while the false positive rate (FPR) is controlled; evaluate whether the algorithm estimates p values accurately (Type I error rate); evaluate its ability to detect the boundaries of copy number variation; and analyze the computational complexity of the algorithm.
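The TPR and FPR named in S106 can be illustrated on simulated data where the true cCNV sites are known. The following is a minimal sketch only: the function name tpr_fpr, the boolean truth vector, and the default threshold alpha are assumptions introduced for illustration and are not specified by the source.

```python
import numpy as np

def tpr_fpr(p_values, truth, alpha=0.05):
    """Return (TPR, FPR) when sites with p < alpha are called as CNV.

    p_values : per-site p values (hypothetical input).
    truth    : per-site booleans, True where a CNV was simulated.
    """
    called = np.asarray(p_values, dtype=float) < alpha
    truth = np.asarray(truth, dtype=bool)
    n_pos = int(truth.sum())
    n_neg = int((~truth).sum())
    tp = int(np.sum(called & truth))    # true CNV sites that were called
    fp = int(np.sum(called & ~truth))   # normal sites that were called
    tpr = tp / n_pos if n_pos else float("nan")
    fpr = fp / n_neg if n_neg else float("nan")
    return tpr, fpr
```

Sweeping alpha over a range of thresholds and recording the resulting pairs gives one way to check whether a high TPR is reached while the FPR stays controlled.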
In filtering out reads with relatively low alignment quality to address batch effects in the CNV signal and quality problems in the alignment process, the reads filtered out are those with quality below Q30.
The high-dimensional matrix obtained by integrating the normalized samples has size s × N, where s is the number of samples and N is the number of sites per sample; because copy number variation appears as a contiguous region, the correlation between adjacent copy number variant sites is strong, up to 0.985, while the correlation between distant sites is weak.
For each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation; for low-coverage samples, the read count frequency of each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S); for high-coverage samples, the frequency histogram is used to separate accurately the amplification and deletion states of the copy number, which have different biological functions and manifestations, and the statistic (S) is calculated for each state separately.
In the calculation of the statistic, S_test for the training set is assigned a relative value according to the relationship between known copy number variation patterns in public databases and gene expression levels.
In calculating the detection statistic for each genome-wide site of the normalized samples, constructing the null distribution T, and then randomly permuting the sample data, the sample data form a data matrix in which each row represents a sample and each column represents a site on the genome.
In the estimation of the CNV significance level, if the p value is below the set threshold of 0.05, the CNV has biological meaning or a function in cancer; the amplification and deletion states of the CNV have different biological functions and manifestations.
In the performance evaluation of the algorithm, evaluating whether the algorithm estimates p values accurately means evaluating whether the statistical model of the algorithm has strong statistical meaning.
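The permutation step of S104 can be sketched on such a matrix (rows are samples, columns are sites). This is an illustrative sketch under stated assumptions, not the patented implementation: the name permutation_p_values and the caller-supplied stat_fn are hypothetical, and dividing the exceedance count by K (n_perm) is an assumed normalization, since the source states only that the count yields the p value.

```python
import numpy as np

def permutation_p_values(rd, stat_fn, n_perm=1000, seed=0):
    """Per-site permutation p values for an s x N read-depth matrix.

    rd      : rows are samples, columns are genome-wide sites.
    stat_fn : hypothetical callable mapping an s x N matrix to a
              length-N vector of per-site statistics.
    """
    rng = np.random.default_rng(seed)
    rd = np.asarray(rd, dtype=float)
    observed = np.asarray(stat_fn(rd), dtype=float)  # vector T, one entry per site
    exceed = np.zeros(observed.shape, dtype=float)
    for _ in range(n_perm):
        # Shuffle site positions independently within every sample (row).
        permuted = np.stack([rng.permutation(row) for row in rd])
        # Count the sites where the permuted statistic exceeds T.
        exceed += np.asarray(stat_fn(permuted)) > observed
    # Assumption: the p value is the exceedance count divided by K.
    return exceed / n_perm
```

For example, with stat_fn = lambda m: m.mean(axis=0), a site that is elevated in every sample is rarely exceeded by a permuted statistic and so receives a small p value.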
The invention is further described below in terms of its application principle.
On the basis of a thorough study of the biological characteristics of copy number and of statistical theory, a statistical test model is built and a CNV significance level detection algorithm is designed; the algorithm is tested repeatedly on a large amount of simulated data, and its performance is analyzed and evaluated from several angles.
(1) Preprocessing of the copy number variation data
Appropriate preprocessing of the copy number variation sample data is important for detecting the significance of copy number variation. a) To address batch effects in the CNV signal and quality problems in the alignment process, reads with relatively low alignment quality (< Q30) are filtered out. b) In data measured by next-generation sequencing, the sequencing coverage is affected by GC content, which in turn affects copy number variation detection; the read count at each site of the data samples is therefore adjusted by normalizing for GC content. c) Because the sequencing levels of multiple samples may differ, the subsequent statistic cannot be calculated directly, and the data are meaningful only after being normalized to the same sequencing level. Data samples with low coverage depth can be normalized directly to the same level; for data samples with high coverage depth, the copy number amplification and deletion states can first be defined from the features of the data's frequency histogram.
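Steps b) and c) can be sketched as follows. The source does not specify the normalization formulas, so this sketch assumes two common choices: scaling each sample to a shared median read depth, and rescaling each GC bin to the sample-wide median. The names normalize_depth and gc_correct and the parameter n_bins are hypothetical.

```python
import numpy as np

def normalize_depth(rd, target=None):
    """Scale each sample (row) so its median read depth equals a common target.

    Assumes every sample has a non-zero median read depth.
    """
    rd = np.asarray(rd, dtype=float)
    medians = np.median(rd, axis=1, keepdims=True)
    if target is None:
        target = float(np.median(medians))  # assumed choice of common level
    return rd * (target / medians)

def gc_correct(rd_sample, gc, n_bins=10):
    """Rescale one sample so every GC bin shares the sample-wide median depth.

    gc holds the GC fraction (0 to 1) of each site.
    """
    rd_sample = np.asarray(rd_sample, dtype=float)
    gc = np.asarray(gc, dtype=float)
    overall = np.median(rd_sample)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(gc, edges) - 1, 0, n_bins - 1)
    corrected = rd_sample.copy()
    for b in np.unique(bins):
        mask = bins == b
        bin_median = np.median(rd_sample[mask])
        if bin_median > 0:  # leave bins with zero median untouched
            corrected[mask] = rd_sample[mask] * overall / bin_median
    return corrected
```

Median-based scaling is used here because it is less sensitive than the mean to the amplified or deleted sites the method is trying to detect; other normalizations would fit the same step.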
(2) Construction of the sliding window
Integrating the normalized samples yields a high-dimensional matrix (s × N, the number of samples s by the number of sites N per sample). Because copy number variation appears as a contiguous region, the correlation between adjacent copy number variant sites is usually strong, up to 0.985, while the correlation between distant sites is weak enough to be ignored. To calculate the correlation between sites more accurately, a sliding window is constructed that, starting from the first position, calculates the frequency of each site and simultaneously uses the Pearson formula to calculate the correlation between the sites in each window; the window is slid step by step until every site has been covered. The choice of window size has little effect on the result; it is provisionally set to 10 here, and its effect will be examined experimentally later.
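The windowed Pearson calculation described above can be sketched as follows. This is an illustration under assumptions: the name window_correlations, the step parameter, and the choice to summarize each site by its mean correlation with the other sites in the window are not taken from the source, and the frequency calculation is omitted.

```python
import numpy as np

def window_correlations(rd, window=10, step=1):
    """Slide a window over the sites of an s x N matrix.

    Returns a list of (start, mean_corr) pairs, where mean_corr[j] is the
    mean Pearson correlation (computed across samples) between site
    start + j and the other sites in the same window.
    """
    rd = np.asarray(rd, dtype=float)
    n_sites = rd.shape[1]
    if window < 2 or n_sites < window:
        raise ValueError("need window >= 2 and at least `window` sites")
    results = []
    for start in range(0, n_sites - window + 1, step):
        block = rd[:, start:start + window]
        centered = block - block.mean(axis=0)
        norms = np.sqrt((centered ** 2).sum(axis=0))
        norms[norms == 0] = np.inf        # constant sites get correlation 0
        unit = centered / norms
        corr = unit.T @ unit              # window x window Pearson matrix
        mean_corr = (corr.sum(axis=1) - np.diag(corr)) / (window - 1)
        results.append((start, mean_corr))
    return results
```

With the provisional window size of 10 and step 1, a matrix with N sites produces N - 9 windows, so every site is covered at least once.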
(3) Calculation of the statistic
For each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation. Because new-generation sequencing data are affected by sequencing coverage depth, the statistic is calculated separately for low-coverage and high-coverage samples, which greatly broadens the applicability of the present invention. For low-coverage samples, the read-count frequency at each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For high-coverage samples, the frequency histogram is used to separate the copy number amplification and deletion states, which exhibit different biological functions, and the statistic (S) is calculated for each state separately, which helps to better detect the significance level of the copy number variation. The difficulty here is how to weigh the frequency against the correlation coefficient reasonably. To this end, a training set is constructed from known copy number variation patterns, and the weights of the frequency and the correlation coefficient, w1 and w2, are learned in order to calculate the statistic:
S_test = w1 * f + w2 * a
where f, a and S_test denote, respectively, the frequency, the correlation, and the value of the statistic of a copy number variation pattern in the training set. Because S_test is not given explicitly in the training set, it is assigned a relative value based on the relationship between known copy number variation patterns in public databases and gene expression levels.
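One possible way to learn the two weights is an ordinary least-squares fit without intercept over the training patterns. The text does not specify the learning procedure, so the choice of least squares and the function names fit_weights and statistic below are assumptions made for illustration.

```python
def fit_weights(freqs, corrs, targets):
    """Least-squares fit of S_test = w1 * f + w2 * a (no intercept).

    freqs, corrs and targets hold f, a and the assigned S_test of each training pattern.
    """
    sff = sum(f * f for f in freqs)
    saa = sum(a * a for a in corrs)
    sfa = sum(f * a for f, a in zip(freqs, corrs))
    sfs = sum(f * s for f, s in zip(freqs, targets))
    sas = sum(a * s for a, s in zip(corrs, targets))
    det = sff * saa - sfa * sfa
    if abs(det) < 1e-12:
        raise ValueError("f and a are collinear; the weights are not identifiable")
    # Solve the 2x2 normal equations by Cramer's rule.
    w1 = (sfs * saa - sas * sfa) / det
    w2 = (sas * sff - sfs * sfa) / det
    return w1, w2


def statistic(f, a, w1, w2):
    """Weighted combination of frequency and correlation for one site."""
    return w1 * f + w2 * a
```

Once w1 and w2 are fitted, statistic(f, a, w1, w2) gives the value S for a site from its frequency and correlation.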
(4) Implementation of the permutation strategy and construction of the null distribution
For the standardized samples, the detection statistic corresponding to each site on the whole genome is calculated, and the null distribution T is constructed. Random permutation is then applied to the sample data (each row of the data matrix represents one sample, and each column represents one site on the whole genome). The detailed procedure is as follows: a) for each sample, the positions at which its values occur on the whole genome are randomly permuted, until all s samples have been permuted, forming one complete permuted sample set; b) for each permuted sample set, the statistic for the occurrence of copy number variation is calculated; c) finally, the significance level of the detection statistic is calculated.
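The permutation procedure can be sketched as below. The text does not give the significance formula, so the empirical p value used here, (1 + number of null values at least as large as the observed value) / (1 + size of the null set), is a standard choice adopted as an assumption. The per-site statistic column_means is only a placeholder for the statistic S, and all names are hypothetical.

```python
import random


def column_means(matrix):
    """Placeholder per-site statistic: mean signal across samples."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]


def permutation_p_values(matrix, statistic=column_means, n_perm=200, seed=0):
    """Empirical p value per site from row-wise random permutation.

    matrix[i][j] is sample i at site j. In every round each row is shuffled
    independently, the statistic is recomputed, and all values are pooled
    into one null set.
    """
    rng = random.Random(seed)
    observed = statistic(matrix)
    null = []
    for _ in range(n_perm):
        permuted = []
        for row in matrix:
            shuffled = list(row)
            rng.shuffle(shuffled)
            permuted.append(shuffled)
        null.extend(statistic(permuted))
    return [
        (1 + sum(value >= obs for value in null)) / (1 + len(null))
        for obs in observed
    ]
```

A site at which all samples carry the same strong signal receives a small p value, because independent shuffling of the rows rarely aligns the signal at one position by chance.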
(5) Design of the null distribution based on CNV length and estimation of the significance level
Each detected CNV region is evaluated using the p values corresponding to all of the sites in the samples. If the p value is below a set threshold (for example, 0.05), the CNV is considered to have biological significance or a function in cancer. In addition, because the amplification and deletion states of a CNV have different biological functions and manifestations, separate null distributions are established for the amplification and deletion states of each CNV structural unit, so that the significance levels of amplification and deletion are detected separately.
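Thresholding the per-site p values and grouping the significant sites into regions can be sketched as follows. Merging consecutive significant sites into one region is an assumption made for illustration, and significant_regions is a hypothetical name; the function would be applied once to the amplification p values and once to the deletion p values.

```python
def significant_regions(p_values, alpha=0.05):
    """Return inclusive (start, end) index pairs of runs of sites with p < alpha."""
    regions, start = [], None
    for i, p in enumerate(p_values):
        if p < alpha:
            if start is None:
                start = i  # a new significant run begins here
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(p_values) - 1))
    return regions
```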
(6) Performance evaluation of the algorithm
The present invention evaluates the performance of the algorithm in the following aspects: a) whether the algorithm can achieve a high true positive rate (TPR) while the false positive rate (FPR) is controlled; b) whether the algorithm can accurately estimate the p value (Type I error rate), that is, whether its statistical model has strong statistical significance; c) the ability to detect the boundaries of copy number variations; d) the computational complexity of the algorithm.
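For aspect a), the TPR and FPR can be computed per site from simulated ground truth and the calls of the algorithm. The sketch below assumes one boolean per site in each list; the function name tpr_fpr is hypothetical.

```python
def tpr_fpr(truth, called):
    """Return (TPR, FPR) from two equal-length lists of per-site booleans.

    truth[i] is True if site i really lies in a CNV; called[i] is True if
    the algorithm reported site i as significant.
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```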
Using as background the normal-cell copy numbers detected with 1000 Affymetrix Genome-Wide SNP 6.0 arrays, and taking into account the characteristics of NGS technology and data, a Markov CNV simulation model is built on probability theory and a non-stationary model to simulate large-scale NGS-based CNV data and test the performance of the method of the present invention. Preliminary simulation experiments indicate that the algorithm has a strong boundary detection capability while maintaining a high TPR.
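A heavily simplified illustration of Markov-based CNV simulation is given below. It uses a homogeneous three-state chain with Gaussian noise on the read counts, not the non-stationary model or the SNP array background described above. The transition probabilities, depth, noise level and all names are hypothetical values chosen only to make the sketch runnable.

```python
import random

STATES = ("deletion", "neutral", "amplification")
COPY_NUMBER = {"deletion": 1, "neutral": 2, "amplification": 3}
# Hypothetical transition probabilities, in the order of STATES.
TRANSITIONS = {
    "deletion": (0.90, 0.10, 0.00),
    "neutral": (0.01, 0.98, 0.01),
    "amplification": (0.00, 0.10, 0.90),
}


def simulate_sample(n_sites, depth=30.0, noise_sd=3.0, seed=0):
    """Simulate per-site CNV states and noisy read counts for one sample.

    The state of each site depends only on the state of the previous site.
    The expected read count is depth * copy number / 2.
    """
    rng = random.Random(seed)
    state = "neutral"
    states, counts = [], []
    for _ in range(n_sites):
        state = rng.choices(STATES, weights=TRANSITIONS[state])[0]
        expected = depth * COPY_NUMBER[state] / 2
        counts.append(max(0.0, rng.gauss(expected, noise_sd)))
        states.append(state)
    return states, counts
```

Because the simulated states are known, the true CNV boundaries are available for measuring TPR, FPR and boundary detection on the simulated counts.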
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.