CN105760712A

CN105760712A - Copy number variation detection method based on next generation sequencing

Info

Publication number: CN105760712A
Application number: CN201610114354.8A
Authority: CN
Inventors: 李垚垚; 袁细国; 张军英; 杨利英; 白俊
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2016-03-01
Filing date: 2016-03-01
Publication date: 2016-07-13
Anticipated expiration: 2036-03-01
Also published as: CN105760712B

Abstract

The invention discloses a copy number variation detection method based on next-generation sequencing. The method comprises: preprocessing the copy number variation data; constructing a sliding window; calculating a statistic; applying a permutation strategy and constructing a null distribution; and evaluating the performance of the algorithm. The performance evaluation judges whether the algorithm achieves a high true positive rate while the false positive rate is kept under control, whether it estimates p-values accurately, and how well it detects the boundaries of copy number variation; it also analyzes the computational complexity of the algorithm. The invention addresses the copy number variation detection errors caused by differences in sequencing platform and sequencing depth, making the results more accurate; it normalizes the data using the features of the multimodal frequency histogram so that normal regions and copy number variation regions are separated accurately; and it considers the combined effect of the variant read count and the correlation between variant sites, establishing a new model that resolves the inconsistency between the two and objectively estimates the significance level of copy number variation.

Description

A copy number variation detection method based on next-generation sequencing

Technical field

The invention belongs to the field of high-throughput sequencing of DNA molecules, and in particular relates to a copy number variation detection method based on next-generation sequencing.

Background technology

Copy number variation (CNV) is an important phenomenon in cancer genomes. It manifests mainly as two states, amplification and deletion of copy number, and is closely linked to the onset and progression of cancer cells. Detecting CNVs that recur in the same region across multiple cancer samples, and jointly analyzing their effect on genome-wide expression to identify the cancer genes whose expression is affected by CNV, is of great significance for studying the onset and metastasis of cancer. Although CNV detection methods based on a single sample are increasingly mature, their sensitivity and accuracy are still insufficient for detecting CNV regions shared by multiple samples. Systematic analysis of CNV therefore offers an important route to studying the pathogenesis of cancer at the molecular level, and its most fundamental, central problem is how to detect the CNVs associated with tumor-related genes in multiple cancer samples.

Next-generation sequencing (NGS) is a high-throughput sequencing technology that obtains up to a million or even several million short sequences in a single run, with the advantages of high speed, high resolution, low cost, and high repeatability. Detecting CNVs from NGS data therefore greatly improves speed and accuracy while also reducing cost.

Numerous studies show that CNV functional patterns are often embedded in the regions of consistent variation across cancer genome samples, and that in NGS the number of reads aligned to each region of the genome is proportional to the copy number of that region. Establishing statistically grounded computational methods that detect the significance level of CNVs recurring (common) across multiple cancer samples therefore provides a direct and feasible technical means of identifying CNV functional patterns and discovering potential cancer genes, and in turn supplies important information for biologists and physicians in predicting and diagnosing cancer. Establishing a reasonable and effective statistical test model is thus essential.

The density of genome-wide CNV sites in high-throughput data and the complexity of their structure pose great challenges to building a statistical test model and detecting CNV significance, mainly in the following two respects. First, the difficulties inherent in the problem: a) the number of sites exceeds 1.8 million while the number of samples is usually small, giving a high-dimensional, small-sample data layout; b) differences in sequencing platform and sequencing depth introduce systematic errors, so samples sequenced at different depths must be normalized; c) the read-depth (RD) signal at a genomic site is susceptible to noise such as sequencing errors and alignment errors; d) CNV sites are strongly correlated rather than independent, so the factors being tested interact with one another; e) detecting the amplification or deletion state of copy number must take two features into account, namely the read count at each site and the correlation between sites, which requires a mechanism that balances the two reasonably. Second, the theoretical and methodological challenges in solving the problem: a) the data are large, so keeping the time and space complexity of the computation under control is a challenge; b) how to take full account of the correlation between CNV sites, and thereby reduce the conservativeness of the CNV significance estimate, is a difficult point; c) how to build a null distribution that is consistent with the statistic, and so strengthen the statistical meaning of the significance estimate, is a key problem that has not yet been solved.

In terms of sample size, existing copy number variation detection methods fall broadly into methods based on single-sample analysis and methods based on multiple samples. In terms of technology, the main approaches are detection based on fluorescence in situ hybridization, microarray-based comparative genomic hybridization, and detection based on next-generation sequencing. The first two have very low resolution and have difficulty detecting short CNVs, whereas NGS-based methods stand out because of their high throughput. NGS-based CNV detection follows two main technical routes: paired-end mapping (PEM) signatures and depth of coverage (DOC). PEM-based methods can detect small CNVs but have difficulty detecting large insertions (copy number amplification) and CNVs in complex regions (such as SDs). DOC-based methods can detect large CNVs. Some methods therefore combine the two; CNVer, for example, improves the breakpoint accuracy of CNV regions by integrating DOC and PEM signatures. DOC-based methods are currently the more favored.

Segmentation-based DOC detection models differ mainly in the segmentation method they use, such as CBS or LASSO, and different segmentation methods produce different detection results. ReadDepth, for example, uses the CBS segmentation algorithm to identify the boundaries of copy number variation more accurately and retains high sensitivity and specificity on low-coverage data. The FREEC method does not require a control sample and uses LASSO regression to locate CNV boundaries accurately, but it ignores local variation in read count and is therefore prone to false detections; its GC-content normalization may also be affected by subclones, which in turn affects CNV detection. The Segseq and rSW-seq methods compare directly against a control sample and can detect and identify CNV regions quickly and accurately, but they do not consider the local features of multiple samples, which leads to large errors in the results. Because of the sequencing technology and the local features of the genome, segmentation algorithms tend to produce a high false positive rate. SeqCNA likewise does not require a control sample and uses LOESS or polymorphic fitting; it is suited to detecting small local CNVs but is not suited to cancer sample data.

DOC statistical significance models based on hypothesis testing involve two key elements, the test statistic and the null distribution, and the quality of their design directly affects the validity of the significance estimate and the ability to identify CNV functional patterns. The EWT method fits a Gaussian probability distribution model to the RD of consecutive segments (windows) and tests for CNV with a one-sided Z-test; it can detect large copy number variation regions, but it does not consider the correlation between sites, cannot locate insertions (CNVs) accurately, and is insensitive to small CNVs. The CNV-seq method fits a Poisson distribution model to the RD ratio (relative to a reference sample) of non-overlapping segments (windows), computes the significance of the Z-score, and introduces a segmentation algorithm to detect CNVs; this improves sensitivity on low-coverage data but tends to raise the false positive rate. CNA-seg, an HMM method based on Segseq and JointSLM, additionally introduces the chi-square (χ²) statistic to detect CNVs.

DOC-based methods for detecting CNVs common to multiple samples are still not mature. The main ones are the CMDS method [17], the cn.MOPS method, the JointSLM method, and methods based on penalized sparse regression models. The CMDS method builds a diagonal correlation matrix for each single site across multiple samples and computes its significance to detect CNVs; it is more accurate than single-sample detection and also improves time and space efficiency. The cn.MOPS method reduces the influence of technical and biological noise; it is suited to detecting CNVs whose amplitude differs between samples in the same region, but is insensitive to CNVs of consistent amplitude. The JointSLM method extends EWT to multi-sample detection and introduces a hidden Markov model (HMM) to detect CNVs, but it fails when a common CNV is present in only some of the samples. Methods based on penalized regression fit a penalized regression model to the RD signals of multiple samples, convert common CNV (cCNV) boundary detection into a change-point detection problem, and apply a significance test, which improves accuracy and lowers the false discovery rate. Their accuracy declines, however, when the samples come from different ancestries.

A comparative analysis of these existing DOC-based models [3,7,9-27] shows that most methods produce a very high false discovery rate, particularly when no reference sample is available. When designing the statistic, the existing NGS-based significance models all take the CNV structural segment as the detection unit, and when quantifying the statistic they use the frequency and amplitude of the CNV together with the correlation between CNV sites. As for the null distribution, most methods construct it by a random permutation strategy.

From the biological characteristics of CNV data, CNV sites are not independent of one another: neighboring CNV sites form an organic whole. Taking a single site as the detection unit therefore makes it difficult to estimate the significance level of a CNV objectively, while taking a structural segment as the detection unit tends to ignore the correlation between the sites inside the structure. Moreover, although several methods consider both the read count of a CNV and the correlation between sites when computing the statistic, they do not weigh these two features reasonably and are therefore prone to false CNV calls.

Existing methods for detecting the CNV significance level mainly have the following deficiencies:

(1) A statistic that takes a single CNV site as its unit easily makes the significance estimate conservative; a statistic that takes the CNV structural segment as its unit preserves the inherent structure of the copy number to some extent but ignores the correlation between internal sites, making it difficult to estimate the significance level of a CNV objectively.

(2) The frequency of the CNV and the correlation between variant sites are not weighed reasonably, so the biological behavior linking CNVs to cancer is difficult to localize.

(3) When methods based on single-sample detection are used to detect cCNVs in multiple samples, systematic errors and platform errors are severe.

(4) Samples from different sequencing platforms or sequencing depths are not integrated automatically, which is a major limitation when detecting CNV functional patterns that recur across multiple samples.

(5) They are insensitive to low-coverage sample data and detect it poorly.

Summary of the invention

The object of the invention is to provide a copy number variation detection method based on next-generation sequencing that applies different normalization measures to data of different coverage, making the data easier to work with and reducing systematic error; that integrates multiple samples and proposes a theory and method of significance-level detection taking the CNV structural unit as its primitive; and that, guided by a supervised learning mechanism, builds a null distribution consistent with the statistic so as to improve the accuracy of the significance estimate.

The invention is realized as a copy number variation detection method based on next-generation sequencing, comprising the following steps:

Preprocessing of the copy number variation data: to address batch effects in the CNV signal and quality problems in the alignment process, filter out reads with low alignment quality; normalize for GC content to adjust the read count at each site of each sample; normalize the sequencing depth of the multiple samples so that all data correspond to the same sequencing level; for samples with low coverage depth, normalize the data to the same level directly; for samples with high coverage depth, first define the copy number amplification and deletion states from the features of the data's frequency histogram;

Construction of the sliding window: integrate the normalized samples into a high-dimensional matrix; construct a sliding window that, starting from the first position, computes the frequency of each site and at the same time uses the Pearson formula to compute the correlation between the sites in each window; slide the window step by step until every site has been covered;

Calculation of the statistic: for each site in each sliding window, compute a statistic that reflects the amplification or deletion state of the copy number variation. Use known copy number variation patterns to construct a training set and learn the weights of the frequency and of the correlation coefficient, w₁ and w₂, to compute the statistic,

S_test=w₁*f+w₂*a

where f, a, and S_test denote, respectively, the frequency, the correlation, and the value of the statistic for the copy number variation patterns in the training set;

Application of the permutation strategy and construction of the null distribution: for the normalized samples, compute the detection statistic T at each site across the whole genome as the basis for constructing the null distribution; then apply random permutation to the sample data, randomly permuting, for each sample, the positions at which its values occur in the whole genome, until all s samples have been permuted, which yields one permuted sample set; for each permuted sample set, compute the statistic for the occurrence of copy number variation; finally compute the significance level of the detection statistic:

p\text{-value} = \frac{\sum_{i=1}^{K} I\left(T_{i}^{*} \geq T\right)}{K};

Estimation of the CNV significance level: evaluate the CNV regions that arise using the p-values obtained for all sites of the samples; if a p-value is below a set threshold (for example 0.05), the CNV is considered to have biological meaning or a function in cancer. For each CNV structural unit, build separate null distributions for the amplification and deletion states so that the significance levels of amplification and deletion are detected separately.

Performance evaluation of the algorithm: assess whether the algorithm achieves a high true positive rate (TPR) while the false positive rate (FPR) is kept under control; assess whether the algorithm estimates p-values accurately (type I error rate); assess its ability to detect the boundaries of copy number variation; and analyze the computational complexity of the algorithm.

Further, the reads with low alignment quality that are filtered out to address batch effects in the CNV signal and quality problems in the alignment process are those with quality below Q30.

Further, the high-dimensional matrix obtained by integrating the normalized samples has dimensions s × N, where s is the number of samples and N is the number of sites per sample. Because copy number variation appears as a contiguous region, the correlation between neighboring copy number variation sites is strong, reaching 0.985, whereas the correlation between distant sites is weak.

Further, for each sliding window, a statistic is computed to reflect the amplification or deletion state of the copy number variation. For low-coverage samples, the read-count frequency at each site and the correlation coefficients between that site and the other sites in the window are computed directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For high-coverage samples, the frequency histogram is used to separate accurately the two states, amplification and deletion of copy number, which have different biological functions and manifestations, and the statistic (S) is computed for each state separately.

Further, in the calculation of the statistic, the S_test values of the training set are assigned as relative values according to the relationship between known copy number variation patterns in public databases and gene expression levels.

Further, in the step of computing the genome-wide per-site detection statistic for the normalized samples, constructing the null distribution T, and randomly permuting the sample data, the sample data are arranged so that each row of the data matrix represents one sample and each column represents one site on the whole genome.

Further, in the estimation of the CNV significance level, if the p-value is below the set threshold of 0.05, the CNV has biological meaning or a function in cancer; the amplification and deletion states of the CNV have different biological functions and manifestations.

Further, in the performance evaluation of the algorithm, assessing whether the algorithm estimates p-values accurately means assessing whether its statistical model has strong statistical meaning.

The invention solves the problem that the prior art tends to be conservative when estimating the significance of copy number variation. It automatically integrates multiple samples and detects the regions in which they share copy number variation, avoiding the detection errors of prior-art methods that detect copy number variation regions only in a single sample or in paired samples, and it studies the relationship between copy number variation and cancer at the level of the patient population. It addresses the copy number variation detection errors caused by differences in sequencing platform and sequencing depth, making the results more accurate. For next-generation sequencing data, it normalizes the data using the features of the multimodal frequency histogram so that normal regions and copy number variation regions are separated accurately. The prior art considers only the read count at copy number variation sites, and the variant read count and the correlation between neighboring variant sites are treated inconsistently when the statistic is designed; the invention addresses this by considering the combined effect of the variant read count and the correlation between variant sites and establishing a new model that resolves the inconsistency, so that the significance level of copy number variation is estimated objectively.

When detecting multi-sample cCNVs, the invention integrates the samples, which reduces the systematic errors and sequencing platform errors that arise when single-sample methods are applied to the samples one by one, and substantially improves detection.

In the initial normalization (standardization) of the data, the invention applies different processing methods to data of different sequencing depths. Whereas the prior art is insensitive on low-coverage data, the invention is highly sensitive regardless of the sequencing coverage level, which lays the foundation for improving the accuracy of the subsequent copy number variation detection.

When detecting copy number variation in regions common to multiple samples, it is not enough to consider whether the region shows the same amplification or deletion signal across samples; the correlation between neighboring sites also has important biological meaning for detecting copy number variation. A statistic and statistical test model built on both features is therefore better able to estimate objectively the significance level of copy number variation in common regions. The prior art often emphasizes only the amplitude of the copy number variation region and ignores the correlation between sites. The invention considers both features in building its statistical test model and uses a supervised learning strategy to balance them so that the statistic is computed reasonably; this not only makes the hypothesis test model consistent with the statistic but also strengthens both the statistical and the biological meaning of the significance estimate.

During data processing the invention applies different normalization methods to data of different coverage. For high-coverage data in particular, it first defines the copy number amplification and deletion states from the features of the data's frequency histogram, separating the data into a normal (0) / amplification (1) data set and a normal (0) / deletion (-1) data set. When designing the statistic the invention takes a single site as the detection unit, and when quantifying the statistic it combines the read count at each CNV site with the correlation between sites, which fundamentally improves the accuracy of the significance estimate. The invention integrates multiple samples, uses a supervised learning method to weigh the two features, namely the read count (amplitude) at genome-wide sites and the correlation between sites, so that the statistic is quantified reasonably, and constructs a hypothesis test model consistent with the statistic, thereby improving the statistical meaning of the significance estimate.

On simulated data comprising 5 samples with 18 common copy number variations (cCNVs), the invention detects 17 cCNV regions, whereas a prior-art method such as FREEC, which detects each sample individually and then compares the results overall, detects only 15 cCNV regions. Extensive experiments also show that, compared with FREEC, the invention is more accurate in detecting the boundaries of the variation regions.

Brief description of the drawings

Fig. 1 is a flow chart of the copy number variation detection method based on next-generation sequencing provided by an embodiment of the invention.

Detailed description of the invention

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

The application principle of the invention is described further below with reference to the accompanying drawing.

A copy number variation detection method based on next-generation sequencing comprises the following steps:

S101: Preprocessing of the copy number variation data: to address batch effects in the CNV signal and quality problems in the alignment process, filter out reads with low alignment quality; normalize for GC content to adjust the read count at each site of each sample; normalize the sequencing depth of the multiple samples so that all data correspond to the same sequencing level; for samples with low coverage depth, normalize the data to the same level directly; for samples with high coverage depth, first define the copy number amplification and deletion states from the features of the data's frequency histogram;

S102: Construction of the sliding window: integrate the normalized samples into a high-dimensional matrix; construct a sliding window that, starting from the first position, computes the frequency of each site and at the same time uses the Pearson formula to compute the correlation between the sites in each window; slide the window step by step until every site has been covered;

S103: Calculation of the statistic: for each sliding window, compute a statistic that reflects the amplification or deletion state of the copy number variation. Use known copy number variation patterns to construct a training set and learn the weights of the frequency and of the correlation coefficient, w₁ and w₂, to compute the statistic,

S_test=w₁*f+w₂*a

S104: Application of the permutation strategy and construction of the null distribution: for the normalized samples, compute the detection statistic T at each site across the whole genome as the basis for constructing the null distribution; then apply random permutation to the sample data, randomly permuting, for each sample, the positions at which its values occur in the whole genome, until all s samples have been permuted, which yields one permuted sample set; for each permuted sample set, compute the statistic for the occurrence of copy number variation; finally compute the significance level of the detection statistic:

p\text{-value} = \frac{\sum_{i=1}^{K} I\left(T_{i}^{*} \geq T\right)}{K}

Here p-value denotes the p-value corresponding to each site of the samples, K is the number of random permutations, T is the detection statistic against which the null distribution is compared, and T_{i}^{*} is the statistic obtained from the i-th permutation. Each time T_{i}^{*} reaches or exceeds T, the count is incremented by one; dividing the final count by K gives the p-value. (Here p-value, T_{i}^{*}, and T are vectors.)
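The permutation test of step S104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `column_means` and `permutation_pvalues` are hypothetical, the per-site statistic is a plain column mean standing in for the weighted statistic S, and each sample (row) is shuffled independently along the genome.

```python
import random


def column_means(matrix):
    """Stand-in per-site statistic: mean normalized depth across samples."""
    s = len(matrix)
    return [sum(col) / s for col in zip(*matrix)]


def permutation_pvalues(matrix, statistic=column_means, K=1000, seed=0):
    """Per-site p-value = (number of permutations with T*_i >= T) / K.

    matrix: s rows (samples) x N columns (genomic sites).
    statistic: maps such a matrix to a list of N per-site values.
    """
    rng = random.Random(seed)
    observed = statistic(matrix)          # T, one value per site
    counts = [0] * len(observed)
    for _ in range(K):
        permuted = []
        for row in matrix:                # permute positions within each sample
            shuffled = list(row)
            rng.shuffle(shuffled)
            permuted.append(shuffled)
        t_star = statistic(permuted)      # T*_i for this permutation
        for j, (ts, t) in enumerate(zip(t_star, observed)):
            if ts >= t:
                counts[j] += 1
    return [c / K for c in counts]


# Three samples, 20 sites; site 5 is amplified in every sample.
data = [[1.0] * 20 for _ in range(3)]
for row in data:
    row[5] = 3.0
pvals = permutation_pvalues(data, K=200, seed=1)
```

Because the indicator uses "greater than or equal to", a site whose observed statistic equals the background level receives a p-value of 1, while a site amplified in all samples is almost never matched by a random permutation.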

S105: Estimation of the CNV significance level: evaluate the CNV regions that arise using the p-values obtained for all sites of the samples; if a p-value is below a set threshold (for example 0.05), the CNV is considered to have biological meaning or a function in cancer. For each CNV structural unit, build separate null distributions for the amplification and deletion states so that the significance levels of amplification and deletion are detected separately.
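One simple way to turn per-site p-values into candidate CNV regions is to merge consecutive significant sites into intervals. The sketch below assumes this merging rule, which the text does not spell out; `call_regions` is a hypothetical helper, and in practice it would be applied once to the amplification p-values and once to the deletion p-values.

```python
def call_regions(pvalues, alpha=0.05):
    """Return (start, end) index pairs of maximal runs of sites with p < alpha."""
    regions = []
    start = None
    for j, p in enumerate(pvalues):
        if p < alpha:
            if start is None:
                start = j                      # open a new region
        elif start is not None:
            regions.append((start, j - 1))     # close the current region
            start = None
    if start is not None:                      # region runs to the last site
        regions.append((start, len(pvalues) - 1))
    return regions


regions = call_regions([0.5, 0.01, 0.02, 0.3, 0.04])
```

The end index is inclusive, so a single significant site yields a region whose start and end coincide.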

S106: Performance evaluation of the algorithm: assess whether the algorithm achieves a high true positive rate (TPR) while the false positive rate (FPR) is kept under control; assess whether the algorithm estimates p-values accurately (type I error rate); assess its ability to detect the boundaries of copy number variation; and analyze the computational complexity of the algorithm.
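On simulated data, where the true CNV sites are known, TPR and FPR can be computed per site. The sketch below assumes site-level (rather than region-level) scoring, which is one of several reasonable choices; `tpr_fpr` is a hypothetical helper name.

```python
def tpr_fpr(truth, called):
    """Site-level true positive rate and false positive rate.

    truth, called: equal-length sequences of 0/1 flags per site
    (1 = site lies in a CNV region).
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


tpr, fpr = tpr_fpr([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

Sweeping the p-value threshold and recording the resulting (FPR, TPR) pairs shows whether a high TPR is reached while FPR stays controlled.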

The reads with low alignment quality that are filtered out to address batch effects in the CNV signal and quality problems in the alignment process are those with quality below Q30.

The high-dimensional matrix obtained by integrating the normalized samples has dimensions s × N, where s is the number of samples and N is the number of sites per sample. Because copy number variation appears as a contiguous region, the correlation between neighboring copy number variation sites is strong, reaching 0.985, whereas the correlation between distant sites is weak.

For each sliding window, a statistic is computed to reflect the amplification or deletion state of the copy number variation. For low-coverage samples, the read-count frequency at each site and the correlation coefficients between that site and the other sites in the window are computed directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For high-coverage samples, the frequency histogram is used to separate accurately the two states, amplification and deletion of copy number, which have different biological functions and manifestations, and the statistic (S) is computed for each state separately.

In the calculation of the statistic, the S_test values of the training set are assigned as relative values according to the relationship between known copy number variation patterns in public databases and gene expression levels.

In the step of computing the genome-wide per-site detection statistic for the normalized samples, constructing the null distribution T, and randomly permuting the sample data, the sample data are arranged so that each row of the data matrix represents one sample and each column represents one site on the whole genome.

In the estimation of the CNV significance level, if the p-value is below the set threshold of 0.05, the CNV has biological meaning or a function in cancer; the amplification and deletion states of the CNV have different biological functions and manifestations.

In the performance evaluation of the algorithm, assessing whether the algorithm estimates p-values accurately means assessing whether its statistical model has strong statistical meaning.

The invention is described further below in terms of its application principle.

On the basis of a thorough study of the biological characteristics of copy number and of statistical theory, a statistical test model is established and a CNV significance-level detection algorithm is designed; the algorithm is tested repeatedly on a large amount of simulated data, and its performance is analyzed and evaluated from several angles.

(1) Preprocessing of the copy number variation data

Appropriate preprocessing of the copy number variation sample data is important for detecting the significance of copy number variation. a) To address batch effects in the CNV signal and quality problems in the alignment process, reads with low alignment quality (< Q30) are filtered out. b) In data produced by next-generation sequencing, sequencing coverage is affected by GC content, which in turn affects copy number variation detection. The read count at each site of each sample is therefore adjusted by normalizing for GC content. c) Because the sequencing depths of the samples may differ considerably, the subsequent statistic cannot be computed on them directly; the data are meaningful only after being normalized to the same sequencing level. For samples with low coverage depth, the data can be normalized to the same level directly; for samples with high coverage depth, the copy number amplification and deletion states can first be defined from the features of the data's frequency histogram.
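Steps a) to c) can be illustrated with the sketch below. The text does not give the normalization formulas, so the code assumes a widely used median-ratio GC correction and a median-based depth scaling; the function names, the `(position, quality)` read representation, and the integer GC percentages are all hypothetical choices made for the example.

```python
from statistics import median


def filter_reads(reads, min_quality=30):
    """a) Drop reads whose quality score is below Q30.

    reads: list of (position, quality) tuples.
    """
    return [r for r in reads if r[1] >= min_quality]


def gc_normalize(read_counts, gc_percent):
    """b) Scale each site's count by global median / median of its GC class."""
    global_med = median(read_counts)
    by_gc = {}
    for rc, gc in zip(read_counts, gc_percent):
        by_gc.setdefault(gc, []).append(rc)
    gc_med = {gc: median(vals) for gc, vals in by_gc.items()}
    return [
        rc * global_med / gc_med[gc] if gc_med[gc] > 0 else rc
        for rc, gc in zip(read_counts, gc_percent)
    ]


def depth_normalize(samples, target=1.0):
    """c) Rescale every sample so that its median depth equals target."""
    out = []
    for sample in samples:
        m = median(sample)
        if m == 0:
            raise ValueError("sample has zero median depth")
        out.append([x * target / m for x in sample])
    return out


kept = filter_reads([(100, 42), (150, 12), (200, 30)])
gc_fixed = gc_normalize([10, 20, 10, 20], [40, 60, 40, 60])
same_level = depth_normalize([[2, 4, 6], [10, 20, 30]])
```

After depth normalization the two example samples, which differ fivefold in raw depth, become identical, so their sites can be compared directly in the later steps.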

(2) Construction of the sliding window

Combining the multiple samples after standardization yields a high-dimensional matrix (number of samples s × number of sites N per sample). Because copy number variation appears as a region, the correlation between adjacent copy number variation sites is generally strong and can reach 0.985, whereas the correlation between distant sites is weak enough to be ignored. To calculate the correlation between sites more accurately, a sliding window is constructed: starting from the initial position, the frequency of each site is calculated and, at the same time, the Pearson formula is used to calculate the correlation between the sites in each window; the window is then slid step by step until every site has been covered. The choice of window size has little effect on the result; it is provisionally set to 10 here, and its influence will be examined later by experiment.
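The sliding-window pass can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the "frequency" of a site is taken to be its mean normalized value across samples, the window starts at each site and extends `width` sites to the right, and the correlation reported for a site is the mean Pearson correlation with the other sites in its window. The function name `window_stats` is hypothetical.

```python
import numpy as np

def window_stats(X, width=10):
    """Per-site frequency and mean within-window Pearson correlation.

    X is an s x N matrix (samples x sites).
    """
    X = np.asarray(X, dtype=float)
    n_sites = X.shape[1]
    freq = X.mean(axis=0)  # assumption: frequency = mean value across samples
    corr = np.zeros(n_sites)
    for start in range(n_sites):
        stop = min(start + width, n_sites)
        if stop - start < 2:
            continue  # the last site has no neighbours in its window
        with np.errstate(invalid="ignore", divide="ignore"):
            r = np.corrcoef(X[:, start:stop], rowvar=False)
        neighbours = r[0, 1:]
        neighbours = neighbours[~np.isnan(neighbours)]  # constant columns give NaN
        if neighbours.size:
            corr[start] = neighbours.mean()
    return freq, corr
```

With `width=10` this matches the provisional window size given in the text; the parameter can be varied to examine its influence on the result.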

(3) Calculation of the statistic

For each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation. Because next-generation sequencing data are affected by the sequencing coverage depth, the statistic is calculated separately for low-coverage and high-coverage samples, which greatly broadens the applicability of the invention. For low-coverage samples, the read-count frequency of each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For high-coverage samples, the frequency histogram is used to separate accurately the two states of copy number amplification and deletion, which have different biological functions and manifestations, and the statistic (S) of each state is calculated separately, which helps to detect the significance level of copy number variation. The difficulty here is how to weigh frequency against correlation coefficient in a reasonable way. To this end, a training set is constructed from known copy number variation patterns, and the weights of frequency and correlation coefficient, w₁ and w₂, are learned in order to calculate the statistic:

S_test=w₁*f+w₂*a

Here f, a, and S_test denote, respectively, the frequency, the correlation, and the value of the statistic of a copy number variation pattern in the training set. Because S_test is not given explicitly in the training set, it is assigned a relative value according to the relationship between known copy number variation patterns in public databases and gene expression levels.
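The description does not say how w₁ and w₂ are learned. As a minimal sketch, assuming an ordinary least-squares fit without intercept (an assumption, not the stated method), the weights can be estimated from training triples (f, a, S_test) as shown below. The function names `learn_weights` and `statistic` are hypothetical.

```python
import numpy as np

def learn_weights(f, a, s_train):
    """Fit S_test = w1*f + w2*a by least squares (assumed learning method)."""
    A = np.column_stack([np.asarray(f, dtype=float), np.asarray(a, dtype=float)])
    w, *_ = np.linalg.lstsq(A, np.asarray(s_train, dtype=float), rcond=None)
    return w[0], w[1]

def statistic(f, a, w1, w2):
    """Evaluate S = w1*f + w2*a for every site or window."""
    return w1 * np.asarray(f, dtype=float) + w2 * np.asarray(a, dtype=float)
```

Once the weights are fixed, `statistic` can be applied to the frequency and correlation vectors of any new sample.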

(4) Implementation of the permutation strategy and construction of the null distribution

For the multiple samples after standardization, the detection statistic corresponding to each site on the whole genome is calculated and the null distribution T is constructed. Random permutation is then applied to the sample data (each row of the data matrix represents a sample, and each column represents a site on the whole genome). The procedure is as follows: for each sample, the positions at which its values occur in the whole genome are randomly permuted, until all s samples have been permuted, which forms one complete permuted sample set; for each permuted sample set, the statistic of randomly occurring copy number variation is calculated; finally, the significance level of the detection statistic is calculated:

p\text{-value} = \frac{\sum_{i=1}^{K} I(T_{i}^{*} \geq T)}{K}
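The permutation procedure and the p-value formula can be sketched in Python as follows. The per-site statistic used here (`site_mean`, the mean across samples) is a hypothetical stand-in for the statistic S of step (3), and the names `permutation_p_values`, `K` and `seed` are illustrative assumptions.

```python
import numpy as np

def site_mean(M):
    # Hypothetical per-site statistic: mean value across samples.
    return M.mean(axis=0)

def permutation_p_values(X, statistic=site_mean, K=1000, seed=0):
    """Per-site p-value: the fraction of K permutations with T_i* >= T."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    T = statistic(X)  # observed statistic, one value per site
    count = np.zeros(T.shape)
    for _ in range(K):
        # Shuffle the genomic positions independently within every sample (row).
        Xp = np.array([rng.permutation(row) for row in X])
        count += statistic(Xp) >= T
    return count / K
```

Sites whose observed statistic is rarely matched after shuffling receive small p-values, while sites that look like the genome-wide background receive p-values close to 1.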

(5) Null distribution design based on CNV length and estimation of the significance level

The regions where CNVs occur are evaluated using the p values obtained for all sites of the samples. If the p value is less than a set threshold (for example 0.05), the CNV is considered to have biological significance or cancer function. In addition, because the amplification and deletion states of a CNV have different biological functions and manifestations, null distributions of the amplification state and the deletion state are established separately for each CNV structural unit, so that the significance levels of amplification and deletion are detected separately.

(6) Performance evaluation of the algorithm

The performance of the algorithm is evaluated from the following aspects: a) whether the algorithm can achieve a high true positive rate (TPR) while the false positive rate (FPR) is controlled; b) whether the algorithm can accurately estimate the p value (Type I error rate), that is, whether the statistical model of the algorithm has strong statistical significance; c) the boundary detection ability for copy number variation; d) the computational complexity of the algorithm.
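Aspect a) can be made concrete with a short Python sketch that computes TPR and FPR from per-site p-values and a ground-truth mask, as would be available on simulated data. The function name `tpr_fpr` and the per-site definition of a positive call (p below `alpha`) are assumptions made for illustration.

```python
import numpy as np

def tpr_fpr(p_values, truth, alpha=0.05):
    """True and false positive rates of the calls p < alpha against a truth mask."""
    p = np.asarray(p_values, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    called = p < alpha
    tp = np.sum(called & truth)
    fp = np.sum(called & ~truth)
    tpr = tp / truth.sum() if truth.any() else float("nan")
    fpr = fp / (~truth).sum() if (~truth).any() else float("nan")
    return tpr, fpr
```

Under a well-calibrated null distribution, the FPR measured this way on CNV-free sites should stay close to `alpha`, which is also one way of checking aspect b).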

Taking as background the normal-cell copy numbers detected with 1000 Affymetrix Genome-Wide SNP 6.0 chips, and considering the characteristics of NGS technology and its data, a Markov CNV simulation method is built on the basis of probability theory and a nonstationary model; it simulates large-scale NGS-based CNV data with which the performance of the method of the invention is tested. Preliminary simulation experiments indicate that the algorithm has good boundary detection ability while maintaining a high TPR.

The above is only a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
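The simulator itself is not specified in detail. Purely as an illustration of the idea of Markov-generated CNV segments, the following Python sketch uses a simplified homogeneous three-state chain (deletion, normal, amplification) with Poisson read counts. It is not the nonstationary model referred to above, it does not use the SNP 6.0 background, and all names and parameter values (`simulate_cnv`, `depth`, `p_enter`, `p_stay`) are hypothetical.

```python
import numpy as np

def simulate_cnv(n_sites, depth=30.0, p_enter=0.01, p_stay=0.95, seed=0):
    """Simulate per-site CNV states and read counts with a simple Markov chain.

    States: 0 = deletion, 1 = normal, 2 = amplification.
    """
    rng = np.random.default_rng(seed)
    copy_number = np.array([1, 2, 3])  # copies carried in each state
    states = np.empty(n_sites, dtype=int)
    state = 1  # start in the normal state
    for i in range(n_sites):
        u = rng.random()
        if state == 1:
            if u < p_enter:
                state = 0  # enter a deletion segment
            elif u < 2 * p_enter:
                state = 2  # enter an amplification segment
        elif u >= p_stay:
            state = 1  # leave the CNV segment
        states[i] = state
    # Read counts scale with copy number relative to the diploid level.
    reads = rng.poisson(depth * copy_number[states] / 2.0)
    return states, reads
```

The returned `states` vector gives the ground truth needed to measure TPR, FPR and boundary detection on the simulated reads.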

Claims

1. A copy number variation detection method based on next-generation sequencing, characterized in that, the copy number variation detection method based on next-generation sequencing comprises the following steps:

Preprocessing of copy number variation data: to address the batch effect of CNV signals and quality problems in the alignment process, filter out reads with relatively low alignment quality; adjust the number of reads corresponding to the data sample sites by standardizing the GC content; normalize the sequencing levels of multiple samples into data corresponding to the same sequencing level; for data samples with low coverage depth, the data are directly normalized to the same level; for data samples with high coverage depth, the copy number amplification and deletion states are first defined according to the features of the data frequency histogram;

Sliding window construction: combine the multiple samples after normalization to obtain a high-dimensional matrix; construct a sliding window that, starting from the initial position, calculates the frequency of the sites and uses the Pearson formula to calculate the correlation between the sites in each window, and slide the window step by step until every site has been covered;

Calculation of the statistic: calculate the statistic of each sliding window to reflect the amplification or deletion state of the copy number variation; use known copy number variation functional patterns to construct a training set, and learn the weights of frequency and correlation coefficient, w₁ and w₂, to compute the statistic,

S_test=w₁*f+w₂*a

where f, a, and S_test denote, respectively, the frequency, the correlation, and the value of the statistic of the copy number variation functional pattern in the training set;

Implementation of the permutation strategy and construction of the null distribution: for the multiple samples after normalization, calculate the detection statistic corresponding to each site on the whole genome and construct the null distribution T; then perform random permutation on the sample data: for each sample, randomly permute the positions at which its values occur in the whole genome, until all s samples have been permuted, forming one complete permuted sample set; for each permuted sample set, calculate the statistic of randomly occurring copy number variation; finally, calculate the significance level of the detection statistic:

p\text{-value} = \frac{\sum_{i=1}^{K} I(T_{i}^{*} \geq T)}{K};

where p-value denotes the p value corresponding to each site of the sample, K is the number of random permutations, T is the statistic of the null distribution, and T_{i}^{*} is the i-th permutation statistic; whenever T_{i}^{*} is greater than or equal to T the count is increased by one, and the p value is obtained at the end (p-value and T are vectors);

Estimation of the significance level based on CNV: the regions where CNVs occur are evaluated by the p values obtained for all sites of the samples; if the p value is less than a set threshold (such as 0.05), the CNV is considered to have biological significance or cancer function; for each CNV structural unit, the null distributions of the amplification and deletion states are established separately to detect the significance levels of the amplification and deletion states respectively;

Performance evaluation of the algorithm: judge whether the algorithm can obtain a high true positive rate while the false positive rate is controlled; evaluate whether the algorithm can accurately estimate the p value; evaluate the boundary detection ability for copy number variation; analyze the computational complexity of the algorithm.

2. The copy number variation detection method based on next-generation sequencing as claimed in claim 1, wherein, for the batch effect of the CNV signal and quality problems in the alignment process, the reads with relatively low alignment quality that are filtered out are reads below Q30.

3. The copy number variation detection method based on next-generation sequencing as claimed in claim 1, characterized in that the high-dimensional matrix obtained by combining the multiple samples after normalization has dimensions number of samples s × number of sites N per sample; copy number variation appears as a region, the correlation between adjacent copy number variation sites is relatively strong, up to 0.985, and the correlation between distant sites is relatively weak.

4. The copy number variation detection method based on next-generation sequencing as claimed in claim 1, characterized in that, for each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation; for low-coverage samples, the read-count frequency of each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S); for high-coverage samples, the frequency histogram is used to distinguish accurately the two states of copy number amplification and deletion, which have different biological functions and manifestations, and the statistic (S) of each of the two states is calculated separately.

5. The copy number variation detection method based on next-generation sequencing as claimed in claim 1, characterized in that, in the calculation of the statistic, S_test is assigned a relative value according to the relationship between known copy number variation functional patterns in public databases and gene expression levels.

6. The copy number variation detection method based on next-generation sequencing as claimed in claim 1, wherein the detection statistic corresponding to each site on the whole genome is calculated for the multiple samples after normalization, the null distribution T is constructed, and random permutation is then performed on the sample data, each row of the data matrix representing a sample and each column representing a site on the whole genome.

7. The copy number variation detection method based on next-generation sequencing as claimed in claim 1, wherein, in the null distribution design and significance level estimation based on CNV length, if the p value is less than the set threshold of 0.05, the CNV has biological significance or cancer function, and the amplification and deletion states of the CNV have different biological functions and manifestations.

8. The copy number variation detection method based on next-generation sequencing as claimed in claim 1, wherein, in the performance evaluation of the algorithm, it is evaluated whether the algorithm can accurately estimate the p value, that is, whether the statistical model of the algorithm has strong statistical significance.