
CN105760712A - Copy number variation detection method based on next generation sequencing - Google Patents

Publication number
CN105760712A
Authority
CN
China
Prior art keywords
copy number
number variation
cnv
sample
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610114354.8A
Other languages
Chinese (zh)
Other versions
CN105760712B (en)
Inventor
李垚垚
袁细国
张军英
杨利英
白俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201610114354.8A
Publication of CN105760712A
Application granted
Publication of CN105760712B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Chemical & Material Sciences (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention discloses a copy number variation (CNV) detection method based on next-generation sequencing. The method comprises preprocessing of the copy number variation data, construction of a sliding window, calculation of a statistic, implementation of a permutation strategy and construction of the null distribution, and performance evaluation of the algorithm. The performance evaluation judges whether the algorithm achieves a high true positive rate while the false positive rate is kept under control, whether it estimates p-values accurately, and how well it detects the boundaries of copy number variations; it also analyzes the computational complexity of the algorithm. The invention addresses the CNV detection errors caused by differences in sequencing platform and sequencing level, making the results more accurate. It normalizes the data using the characteristics of the multi-peak frequency histogram so that normal regions and CNV regions are divided accurately. It also considers the combined effect of the variant read count and the correlation between variant sites, building a new model that resolves their inconsistency and objectively estimates the significance level of copy number variation.

Description

A copy number variation detection method based on next-generation sequencing
Technical field
The invention belongs to the field of high-throughput sequencing of DNA molecules, and in particular relates to a copy number variation detection method based on next-generation sequencing.
Background
Copy number variation (CNV) is an important phenomenon in the cancer genome. It appears mainly in two states, amplification and deletion of copy number, and is closely tied to the onset and development of cancer cells. Detecting CNVs that occur in the same region across multiple cancer samples, analyzing in an integrated way how CNVs affect genome-wide expression, and identifying the cancer genes whose expression is affected by CNVs are of great importance for studying the onset and metastasis of cancer. Single-sample CNV detection methods are increasingly mature, but their sensitivity and accuracy still fall short when the task is to detect CNV regions shared by multiple samples. Systematic analysis of CNVs therefore offers an important route to studying the pathogenesis of cancer at the molecular level, and its most fundamental, central problem is how to detect CNVs associated with tumor-related genes in multiple cancer samples.
Next-generation sequencing (NGS) is a high-throughput sequencing technology that obtains a million or even several million short sequences in a single run, and offers high speed, high resolution, low cost, and good repeatability. Detecting CNVs from NGS data therefore greatly improves speed and accuracy while reducing cost.
Numerous studies show that CNV functional patterns are often hidden in regions of consistent variation across cancer genome samples, and that in NGS the number of reads aligned to each region of the genome is proportional to the copy number of that region. Statistically grounded computational methods that estimate the significance level of CNVs common to multiple cancer samples therefore provide a direct and feasible technical means of identifying CNV functional patterns and discovering potential cancer genes, and in turn supply important information for the prediction and diagnosis of cancer by biologists and physicians. Building a sound and effective statistical test model is thus essential.
The density of CNV sites in high-throughput whole-genome data and the complexity of their structure pose great challenges for building a statistical test model and testing CNV significance, mainly in two respects. First, difficulties inherent in the problem: a) the number of sites exceeds 1.8 million while the number of samples is usually small, giving high-dimensional, small-sample data; b) differences in sequencing platform and sequencing level introduce systematic errors, and samples at different sequencing levels must be normalized; c) the read signal (read depth, RD) at a genomic site is easily affected by noise such as sequencing errors and alignment errors; d) CNV sites are strongly correlated rather than independent, so the factors being tested interact; e) detecting the amplification or deletion state of copy number requires two features, the read count at each site and the correlation between sites, which calls for a mechanism that balances them sensibly. Second, challenges for the theory and methods: a) the data are large, so controlling time and space complexity is a challenge; b) taking full account of the correlation between CNV sites, and thereby reducing the conservativeness of the significance estimate, is difficult; c) building a null distribution consistent with the statistic, and thereby strengthening the statistical meaning of the significance estimate, is a key problem that has not yet been solved.
From a technical standpoint, and in terms of sample size, existing CNV detection methods fall into methods based on single-sample analysis and methods based on multiple samples. In terms of technology, the main approaches are fluorescence in situ hybridization, array-based comparative genomic hybridization, and next-generation sequencing. The first two have low resolution and have difficulty detecting short CNVs, whereas NGS-based methods stand out because of their high throughput. NGS-based CNV detection follows two main technical routes: paired-end mapping (PEM) signatures and depth of coverage (DOC). PEM-based methods can detect small CNVs but struggle with large insertions (copy number amplification) and CNVs in complex regions (such as SDs). DOC-based methods can detect large CNVs. Some methods combine the two; CNVer, for example, improves the breakpoint accuracy of CNV regions by integrating DOC and PEM signatures. DOC-based methods are currently favored.
Segmentation-based DOC models rely on different segmentation methods, such as CBS and LASSO, and different segmentation methods yield different results. ReadDepth uses the CBS segmentation algorithm, identifies CNV boundaries fairly accurately, and retains high sensitivity and specificity on low-coverage data. FREEC does not require a control sample and uses LASSO regression to locate CNV boundaries precisely, but it ignores local variation in read counts, which easily leads to false detections; subclones may also affect its GC-content normalization and thereby its CNV detection. Segseq and rSW-seq compare directly with a control sample and can therefore detect and locate CNV regions quickly and accurately, but they do not consider the local features of multiple samples, which leads to large errors. Because of the sequencing technology and the local features of the genome, segmentation algorithms tend to produce a high false positive rate. SeqCNA also does not require a control sample; it uses LOESS or polynomial fitting and is suited to detecting small local CNVs, but not to cancer sample data.
DOC statistical significance models based on hypothesis testing involve two key elements, the test statistic and the null distribution; the quality of their design directly affects the validity of the significance estimate and the ability to identify CNV functional patterns. The EWT method fits a Gaussian distribution to the RD of consecutive segments (windows) and tests for CNVs with a one-sided Z-test. It can detect large CNV regions, but it does not consider the correlation between sites, cannot locate insertions (CNVs) precisely, and is insensitive to small CNVs. CNV-seq fits a Poisson model to the RD ratio (against a reference sample) of non-overlapping segments (windows), computes the significance of the Z-score, and adds a segmentation algorithm to detect CNVs; this improves sensitivity on low-coverage data but tends to raise the false positive rate. CNA-seg, an HMM method based on segseq and JointSLM, introduces a chi-square (χ²) statistic to detect CNVs.
DOC-based detection of CNVs common to multiple samples is still immature. The main methods are CMDS [17], cn.MOPS, JointSLM, and methods based on penalized sparse regression models. CMDS builds a correlation diagonal matrix for each site across the samples and computes its significance to detect CNVs; it is more accurate than single-sample detection and has better time and space efficiency. cn.MOPS reduces the influence of noise from technical and biological variation and is suited to CNVs whose amplitude differs across samples in the same region, but it is insensitive to CNVs of uniform amplitude. JointSLM extends EWT to multiple samples and adds a hidden Markov model (HMM), but it fails when the common CNV is present in only some of the samples. The penalized regression approach fits a penalized regression model to the RD signal of multiple samples, converts detection of common CNV (cCNV) boundaries into a change-point detection problem, and applies a significance test, which improves accuracy and lowers the false discovery rate. Its accuracy drops, however, when the samples come from different ancestries.
A comparative analysis of the existing DOC-based models [3,7,9-27] shows that most methods produce a high false discovery rate, especially when no reference sample is available. Existing NGS-based significance models all use CNV structural segments as the detection unit when designing the statistic, and quantify the statistic from the frequency and amplitude of CNVs and the correlation between CNV sites. Most methods construct the null distribution through a random permutation strategy.
In terms of the biological characteristics of CNV data, CNV sites are not independent: neighboring CNV sites form an organic whole. Using single sites as the detection unit therefore makes it hard to estimate CNV significance objectively, while using structural segments as the detection unit tends to ignore the correlation between sites inside the structure. Moreover, although several methods consider both the read count of a CNV and the correlation between sites when computing the statistic, they do not weigh the two features sensibly and easily produce false CNV calls.
Existing methods for estimating CNV significance have the following main deficiencies:
(1) Statistics that use single CNV sites as the unit easily make the significance estimate conservative; statistics that use CNV structural segments preserve the inherent structure of the copy number to some extent but ignore the correlation between internal sites, making it hard to estimate the significance of a CNV objectively.
(2) The frequency of CNVs and the correlation between variant sites are not weighed sensibly, making it hard to pinpoint the biological association between CNVs and cancer.
(3) Single-sample methods suffer from serious systematic or platform errors when used to detect cCNVs across multiple samples.
(4) Samples from different sequencing platforms or sequencing levels cannot be integrated automatically, which severely limits detection of CNV functional patterns common to multiple samples.
(5) Detection is insensitive and performs poorly on low-coverage sample data.
Summary of the invention
The object of the invention is to provide a copy number variation detection method based on next-generation sequencing that applies different normalization measures to data of different coverage, making the data easier to work with and reducing systematic error; that integrates multiple samples and proposes a theory and method of significance detection with the CNV structural unit as the detection unit; and that, guided by a supervised learning mechanism, builds a null distribution consistent with the statistic so as to improve the accuracy of the significance estimate.
The invention is realized as a copy number variation detection method based on next-generation sequencing comprising the following steps:
Preprocessing of the copy number variation data: filter out batch effects in the CNV signal and reads with low alignment quality; adjust the read count at each site of each sample by normalizing for GC content; normalize the sequencing levels of the multiple samples to a common sequencing level. For samples with low coverage depth, normalize the data directly to the same level; for samples with high coverage depth, first define the copy number amplification and deletion states from the characteristics of the data frequency histogram;
Construction of the sliding window: integrate the normalized samples into a high-dimensional matrix; construct a sliding window that, starting from the first position, computes the frequency of the sites and, using the Pearson formula, the correlation between the sites within each window; slide the window step by step until every site has been covered;
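The sliding-window step can be sketched in Python as follows. This is only an illustration under assumptions the text does not fix: the names `pearson` and `window_features`, the window width and step, the use of the mean normalized read depth as the "frequency" f, and the mean pairwise Pearson correlation between site columns as the correlation a are all hypothetical choices.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def window_features(matrix, width, step=1):
    """matrix: s rows (samples) x N columns (sites) of normalized read depth.

    Returns a list of (start, f, a) per window, where f is the mean read depth
    in the window (assumed meaning of "frequency") and a is the mean Pearson
    correlation over all pairs of site columns in the window. Requires width >= 2.
    """
    n_sites = len(matrix[0])
    # One vector per site, holding that site's value in every sample.
    columns = [[row[j] for row in matrix] for j in range(n_sites)]
    features = []
    for start in range(0, n_sites - width + 1, step):
        cols = columns[start:start + width]
        f = sum(sum(c) for c in cols) / (width * len(matrix))
        pairs = [(i, j) for i in range(width) for j in range(i + 1, width)]
        a = sum(pearson(cols[i], cols[j]) for i, j in pairs) / len(pairs)
        features.append((start, f, a))
    return features
```

With s samples and N sites this visits every window once; the pairwise correlation inside a window costs on the order of width² per window, which is one reason a small fixed width is assumed here.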
Calculation of the statistic: for each site in each sliding window, compute a statistic reflecting the amplification or deletion state of the copy number variation. A training set is built from known copy number variation patterns and used to learn the weights $w_1$ and $w_2$ of the frequency and the correlation coefficient, giving the statistic
$$S_{test} = w_1 f + w_2 a$$
where $f$, $a$, and $S_{test}$ denote, respectively, the frequency, the correlation, and the value of the statistic of the copy number variation patterns in the training set;
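One way to learn the two weights from a training set is ordinary least squares without an intercept. The patent text does not name the learning procedure, so the least-squares choice and the function name `fit_weights` below are assumptions made for illustration.

```python
def fit_weights(f, a, s):
    """Least-squares fit of s ~ w1*f + w2*a (no intercept).

    f, a, s are equal-length sequences taken from a training set of known
    copy number variation patterns. Solves the 2x2 normal equations directly.
    """
    sff = sum(x * x for x in f)
    saa = sum(y * y for y in a)
    sfa = sum(x * y for x, y in zip(f, a))
    sfs = sum(x * z for x, z in zip(f, s))
    sas = sum(y * z for y, z in zip(a, s))
    det = sff * saa - sfa * sfa
    if det == 0:
        raise ValueError("f and a are collinear; the weights are not identifiable")
    w1 = (sfs * saa - sas * sfa) / det
    w2 = (sas * sff - sfs * sfa) / det
    return w1, w2


def statistic(f, a, w1, w2):
    """S_test = w1*f + w2*a for a single site or window."""
    return w1 * f + w2 * a
```

Once `w1` and `w2` are fitted, `statistic` can be applied to the frequency and correlation of every window in the test data.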
Implementation of the permutation strategy and construction of the null distribution: for the normalized samples, compute the detection statistic $T$ at every site of the whole genome; then permute the sample data at random: for each sample, randomly permute the positions at which its values occur along the genome, until all s samples have been permuted, which yields one permuted sample set. For each permuted sample set, compute the statistic for consecutive copy number variation. Finally, compute the significance level of the detection statistic:
$$p\text{-value} = \frac{\sum_{i=1}^{K} I(T_i^* \ge T)}{K};$$
Estimation of CNV significance: the CNV regions that occur are evaluated from the p-values obtained for all sites of the samples; if the p-value is below a set threshold (such as 0.05), the CNV is considered to have biological significance or a function in cancer. For each CNV structural unit, separate null distributions are built for the amplification and deletion states so that their significance levels are tested separately.
Performance evaluation of the algorithm: judge whether the algorithm achieves a high true positive rate (TPR) while the false positive rate (FPR) is controlled; evaluate whether the algorithm estimates p-values accurately (type I error rate); assess its ability to detect the boundaries of copy number variations; and analyze the computational complexity of the algorithm.
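The evaluation quantities named above can be computed per site against a simulated ground truth. The sketch below assumes site-level 0/1 labels and a list of p-values from sites known to carry no CNV; the function names are hypothetical.

```python
def tpr_fpr(truth, called):
    """True and false positive rates from per-site 0/1 labels.

    truth[i] is 1 if site i really lies in a CNV, called[i] is 1 if the
    method reported it. Rates are 0.0 when the denominator is empty.
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def type_one_error_rate(null_p_values, alpha=0.05):
    """Fraction of true-null sites whose p-value falls below alpha.

    If p-values are estimated accurately this should be close to alpha.
    """
    return sum(1 for p in null_p_values if p < alpha) / len(null_p_values)
```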
Further, the reads with low alignment quality that are filtered out together with the batch effects in the CNV signal are reads below Q30.
Further, the high-dimensional matrix obtained by integrating the normalized samples has size s × N, where s is the number of samples and N is the number of sites per sample. Within a region that presents a copy number variation, neighboring variant sites are strongly correlated, with correlation up to 0.985, while distant sites are weakly correlated.
Further, for each sliding window a statistic is computed to reflect the amplification or deletion state of the copy number variation. For low-coverage samples, the read-count frequency at each site and the correlation coefficients between that site and the other sites in the window are computed directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For high-coverage samples, the frequency histogram is used to separate accurately the amplification and deletion states of copy number, which have different biological functions and manifestations, and the statistic (S) is computed for each state separately.
Further, in the calculation of the statistic, the values of $S_{test}$ in the training set are assigned relative values according to the relationship between known copy number variation patterns in public databases and gene expression levels.
Further, in the step of computing the detection statistic at each genome-wide site for the normalized samples, constructing the null distribution, and randomly permuting the sample data, the sample data form a matrix in which each row represents a sample and each column represents a site on the genome.
Further, in the estimation of significance based on the null distribution designed for CNV length, a CNV whose p-value is below the set threshold of 0.05 is considered to have biological significance or a function in cancer; the amplification and deletion states of a CNV have different biological functions and manifestations.
Further, in the performance evaluation of the algorithm, evaluating whether the algorithm estimates p-values accurately means evaluating whether its statistical model has strong statistical validity.
The invention solves the problem that prior art tends to be conservative when estimating the significance of copy number variation. It automatically integrates multiple samples and detects the regions in which they share copy number variations, avoiding the detection errors of prior art that handles only single or paired samples, and allows the relationship between copy number variation and cancer to be studied at the level of patient populations. It solves the detection errors caused by differences in sequencing platform and sequencing level, making the results more accurate. For next-generation sequencing data it normalizes the data using the characteristics of the multi-peak frequency histogram, so that normal regions and CNV regions are divided accurately. Prior art considers only the read count at variant sites, and when the statistic is designed the variant read count and the correlation between neighboring variant sites are inconsistent; the invention addresses this by considering the combined effect of the two, building a new model that resolves the inconsistency and estimates the significance of copy number variation objectively.
When detecting multi-sample cCNVs, the invention integrates the samples, which reduces the systematic errors or sequencing-platform errors produced when single-sample methods are applied to each sample in turn, and substantially improves detection.
In the initial normalization (standardization) of the data, the invention applies different processing to data of different sequencing levels. Whereas prior art is insensitive on low-coverage data, the invention maintains high sensitivity regardless of coverage level, which lays the foundation for more accurate CNV detection downstream.
Detecting copy number variations in regions common to multiple samples requires considering not only that the samples show the same amplification or deletion signal in the region; the correlation between neighboring sites is also biologically meaningful for detection. A statistic and statistical test model built on both features therefore gives a more objective estimate of the significance of copy number variation in common regions. Prior art tends to emphasize only the amplitude of CNV regions and ignores the correlation between sites. The invention considers both features in its statistical test model and balances them through a supervised learning strategy when computing the statistic; this makes the hypothesis testing model consistent with the statistic and strengthens both the statistical and the biological meaning of the significance estimate.
In data processing, the invention applies different normalization methods to data of different coverage levels. For high-coverage data in particular, it first defines the amplification and deletion states from the characteristics of the data frequency histogram and separates out a data set containing only normal (0) and amplification (1) states and a data set containing only normal (0) and deletion (-1) states. When designing the statistic, the invention uses single sites as the detection unit and quantifies the statistic by combining the read count at each CNV site with the correlation between sites, which fundamentally improves the accuracy of the significance estimate. The invention integrates multiple samples, weighs the two features, read count (amplitude) at genome-wide sites and correlation between sites, by a supervised learning method so that the statistic is quantified sensibly, and constructs a hypothesis testing model consistent with the statistic, thereby improving the statistical validity of the significance estimate.
On simulated data comprising 5 samples with 18 common copy number variations (cCNVs), the invention detects 17 cCNV regions, whereas prior art such as FREEC, applied to each sample individually with the results then compared overall, detects only 15. Extensive experiments also show that, compared with FREEC, the invention locates the boundaries of variant regions more accurately.
Brief description of the drawings
Fig. 1 is a flow chart of the copy number variation detection method based on next-generation sequencing provided by an embodiment of the invention.
Detailed description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. The specific embodiments described here serve only to explain the invention and do not limit it.
The application principle of the invention is further described below with reference to the drawing.
A copy number variation detection method based on next-generation sequencing comprises the following steps:
S101: preprocessing of the copy number variation data: filter out batch effects in the CNV signal and reads with low alignment quality; adjust the read count at each site of each sample by normalizing for GC content; normalize the sequencing levels of the multiple samples to a common sequencing level. For samples with low coverage depth, normalize the data directly to the same level; for samples with high coverage depth, first define the copy number amplification and deletion states from the characteristics of the data frequency histogram;
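The preprocessing in S101 can be illustrated with a short Python sketch. The text does not give the normalization formulas, so the choices here are assumptions: reads are represented as hypothetical `(site, quality)` pairs, GC correction uses the common median-ratio approach (scale each site by the overall median divided by the median of sites with the same GC percentage), and samples are brought to a common level by matching their medians.

```python
from collections import defaultdict
from statistics import median


def filter_reads(reads, min_quality=30):
    """Keep reads whose alignment quality is at least min_quality (Q30 by default)."""
    return [(site, q) for site, q in reads if q >= min_quality]


def gc_correct(depth, gc_percent):
    """Median-ratio GC correction of per-site read depth.

    depth[i] is the read count at site i and gc_percent[i] its integer GC
    percentage. Sites whose GC bin has median 0 are left unchanged.
    """
    overall = median(depth)
    by_gc = defaultdict(list)
    for d, g in zip(depth, gc_percent):
        by_gc[g].append(d)
    bin_median = {g: median(v) for g, v in by_gc.items()}
    return [d * overall / bin_median[g] if bin_median[g] > 0 else d
            for d, g in zip(depth, gc_percent)]


def scale_to_common_level(samples):
    """Rescale every sample so its median depth equals the median of sample medians."""
    medians = [median(row) for row in samples]
    target = median(medians)
    return [[d * target / m if m > 0 else d for d in row]
            for row, m in zip(samples, medians)]
```

Matching medians rather than means is a design choice made here to keep the scaling robust to the amplified or deleted sites themselves.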
S102: construction of the sliding window: integrate the normalized samples into a high-dimensional matrix; construct a sliding window that, starting from the first position, computes the frequency of the sites and, using the Pearson formula, the correlation between the sites within each window; slide the window step by step until every site has been covered;
S103: calculation of the statistic: for each sliding window, compute a statistic reflecting the amplification or deletion state of the copy number variation. A training set is built from known copy number variation patterns and used to learn the weights $w_1$ and $w_2$ of the frequency and the correlation coefficient, giving the statistic
$$S_{test} = w_1 f + w_2 a$$
where $f$, $a$, and $S_{test}$ denote, respectively, the frequency, the correlation, and the value of the statistic of the copy number variation patterns in the training set;
S104: the enforcement of Replacement Strategy and the structure of zero cloth: the multiple samples after standardization are calculated the detection statistic that on full-length genome, each site is corresponding, structure zero cloth T, then sample data is implemented random permutation, to each sample, its position occurred in full-length genome of random permutation, until s sample standard deviation is replaced, constitute a total replacement sample set;To each displacement sample set, calculate the statistic that tandem copies number variation occurs;Finally calculate the significance level of detection statistic:
$$p\text{-value} = \frac{\sum_{i=1}^{K} I\left(T_i^{*} \geq T\right)}{K}$$
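Here p-value denotes the p-value at each site of the sample, $K$ is the number of random permutations, $T$ is the detection statistic calculated from the normalized data before permutation, and $T_i^{*}$ is the statistic from the i-th permutation; $I(\cdot)$ is the indicator function, so each permutation in which $T_i^{*}$ is not less than $T$ adds one to the count, and dividing the count by $K$ gives the p-value (p-value and $T$ are both vectors, with one entry per site).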
S105: Estimation based on the CNV significance level: the regions where CNVs occur are evaluated from the p-values at all sites of the sample; if a p-value is below a set threshold (for example 0.05), the CNV is considered to have biological significance or a function in cancer. For each CNV structural unit, separate null distributions are established for the amplification and deletion states, so that the significance level of each state is tested separately.
S106: Performance evaluation of the algorithm: determine whether the algorithm achieves a high true positive rate (TPR) while the false positive rate (FPR) is controlled; evaluate whether the algorithm estimates p-values accurately (type I error rate); assess its ability to detect the boundaries of copy number variations; and analyze its computational complexity.
In the step of filtering out batch effects in the CNV signal and reads with relatively low alignment quality, the reads filtered out are those with quality below Q30.
In the step of combining the normalized samples into a high-dimensional matrix, the matrix has dimensions s × N, where s is the number of samples and N is the number of sites per sample; because a copy number variation spans a region, the correlation between neighboring copy number variation sites is strong, up to 0.985, whereas the correlation between distant sites is weak.
For each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation. For low-coverage samples, the read-count frequency at each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For high-coverage samples, the frequency histogram is used to separate the amplification and deletion states of the copy number, which have different biological functions, and the statistic (S) is calculated for each state separately.
In the calculation of the statistic, the values of $S_{test}$ in the training set are assigned as relative values derived from the relationship between known functional patterns of copy number variation in public databases and gene expression levels.
In the step of calculating the detection statistic $T$ at each site across the whole genome for the normalized samples and then randomly permuting the sample data to construct the null distribution, the sample data form a data matrix in which each row represents a sample and each column represents a site on the whole genome.
In the estimation based on the CNV significance level, if the p-value is below the set threshold of 0.05, the CNV has biological significance or a function in cancer; the amplification and deletion states of the CNV have different biological functions and manifestations.
In the performance evaluation of the algorithm, evaluating whether the algorithm estimates p-values accurately means evaluating whether the statistical model of the algorithm is statistically meaningful.
The invention is further described below in conjunction with its principle of application.
Based on a thorough study of the biological characteristics of copy number and of statistical theory, a statistical test model is established and a CNV significance-level detection algorithm is designed; the algorithm is tested repeatedly on a large amount of simulated data, and its performance is analyzed and evaluated from several angles.
(1) Preprocessing of copy number variation data
Appropriate preprocessing of the copy number variation sample data is important for detecting the significance of copy number variation. a) To address batch effects in the CNV signal and quality problems in the alignment process, reads with relatively low alignment quality (below Q30) are filtered out. b) The sequencing coverage of next-generation sequencing data is affected by GC content, which in turn affects copy number variation detection; the read count at each site of a data sample is therefore adjusted by normalizing for GC content. c) The sequencing levels of multiple samples may differ considerably, so the subsequent statistics cannot be calculated directly; the data are meaningful only after being normalized to a common sequencing level. For data samples with low coverage depth, the data can be normalized directly to the common level; for data samples with high coverage depth, the copy number amplification and deletion states can first be defined from the characteristics of the sample's frequency histogram.
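The three preprocessing steps can be illustrated with a minimal Python sketch. The patent does not give formulas for the GC correction or the depth normalization, so everything here is an assumption made for illustration: the function names (`filter_reads`, `gc_normalize`, `depth_normalize`), the `(site, quality)` read layout, the median-ratio correction per GC bin, and scaling each sample to a common mean.

```python
from statistics import median

def filter_reads(reads, min_quality=30):
    """Keep reads whose quality is at least min_quality (Q30 in the text).
    Each read is a (site, quality) tuple; this layout is an assumption."""
    return [r for r in reads if r[1] >= min_quality]

def gc_normalize(counts, gc, n_bins=10):
    """Scale each site's read count by (overall median / median of its GC bin).
    The median-ratio rule is an assumed stand-in for the unspecified correction."""
    overall = median(counts)
    bins = [min(int(g * n_bins), n_bins - 1) for g in gc]
    by_bin = {}
    for b, c in zip(bins, counts):
        by_bin.setdefault(b, []).append(c)
    bin_median = {b: median(v) for b, v in by_bin.items()}
    return [c * overall / bin_median[b] if bin_median[b] > 0 else c
            for b, c in zip(bins, counts)]

def depth_normalize(samples, target=1.0):
    """Rescale every sample so that its mean read count equals target."""
    out = []
    for counts in samples:
        mean = sum(counts) / len(counts)
        out.append([c * target / mean for c in counts] if mean > 0 else list(counts))
    return out
```

With this sketch, two samples sequenced at different depths, such as `[1, 3]` and `[10, 30]`, are mapped to the same profile, which is the property step c) asks for.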
(2) Construction of the sliding window
Combining the normalized samples yields a high-dimensional matrix (s × N, where s is the number of samples and N is the number of sites per sample). Because a copy number variation spans a region, the correlation between neighboring copy number variation sites is usually strong, up to 0.985, whereas the correlation between distant sites is weak enough to be ignored. To calculate the correlation between sites more accurately, a sliding window is constructed that, starting from the first position, computes the frequency of each site and uses the Pearson formula to compute the correlation between sites within each window; the window is slid step by step until every site has been covered. The choice of window size has little effect on the result; it is provisionally set to 10 here, and its effect will be examined experimentally later.
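A minimal sketch of the window computation follows. The patent does not define the per-site frequency or how the window is positioned, so this sketch assumes that the frequency is the mean normalized read count of a site across samples, that the window is roughly centered on the site, and that the site's correlation is the average Pearson coefficient against the other sites in its window. The name `window_features` is hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return sxy / sqrt(sxx * syy)

def window_features(matrix, window=10):
    """For each site (column) of an s x N matrix, return (frequency, correlation).
    frequency: mean value of the site across samples (assumed definition).
    correlation: mean Pearson coefficient with the other sites in its window."""
    n_sites = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_sites)]
    feats = []
    for j in range(n_sites):
        # Clamp the window so it stays inside the genome at both ends.
        start = max(0, min(j - window // 2, n_sites - window))
        stop = min(n_sites, start + window)
        others = [k for k in range(start, stop) if k != j]
        freq = sum(cols[j]) / len(cols[j])
        corr = (sum(pearson(cols[j], cols[k]) for k in others) / len(others)
                if others else 0.0)
        feats.append((freq, corr))
    return feats
```

The default `window=10` matches the provisional window size given in the text.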
(3) Calculation of the statistic
For each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation. Because next-generation sequencing data are affected by sequencing coverage depth, the statistic is calculated separately for low-coverage and high-coverage samples, which greatly broadens the applicability of the invention. For low-coverage samples, the read-count frequency at each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For high-coverage samples, the frequency histogram is used to separate the amplification and deletion states of the copy number, which have different biological functions, and the statistic (S) is calculated for each state separately, which helps detect the significance level of copy number variation. The difficulty lies in balancing the frequency against the correlation coefficient; to this end, a training set is constructed from known functional patterns of copy number variation, and the weights of the frequency and the correlation coefficient, $w_1$ and $w_2$, are learned in order to calculate the statistic:
$$S_{test} = w_1 f + w_2 a$$
where $f$, $a$ and $S_{test}$ denote, respectively, the frequency of the copy number variation functional pattern, its correlation, and the value of the statistic in the training set. Because $S_{test}$ is not given explicitly in the training set, it is assigned a relative value based on the relationship between known functional patterns of copy number variation in public databases and gene expression levels.
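The patent does not say how the weights are learned. One simple possibility, shown here purely as an assumption, is an ordinary least-squares fit without intercept, solved through the 2 × 2 normal equations. The names `fit_weights` and `statistic` are hypothetical.

```python
def fit_weights(f, a, s):
    """Least-squares fit of s ~ w1*f + w2*a (no intercept).
    f, a, s are equal-length lists taken from the training set.
    Least squares is an assumed learning rule, not stated in the source."""
    sff = sum(x * x for x in f)
    saa = sum(y * y for y in a)
    sfa = sum(x * y for x, y in zip(f, a))
    sfs = sum(x * z for x, z in zip(f, s))
    sas = sum(y * z for y, z in zip(a, s))
    det = sff * saa - sfa * sfa
    if abs(det) < 1e-12:
        raise ValueError("f and a are collinear; the weights are not identifiable")
    w1 = (sfs * saa - sas * sfa) / det
    w2 = (sas * sff - sfs * sfa) / det
    return w1, w2

def statistic(f, a, w1, w2):
    """Evaluate S = w1*f + w2*a for one site or window."""
    return w1 * f + w2 * a
```

When the training values are generated exactly by the linear rule, the fit recovers the weights; with noisy relative values assigned from expression data, it returns the best linear compromise in the squared-error sense.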
(4) Implementation of the permutation strategy and construction of the null distribution
For the normalized samples, the detection statistic $T$ is calculated at each site across the whole genome. The sample data (a data matrix in which each row represents a sample and each column represents a site on the whole genome) are then randomly permuted to construct the null distribution, as follows: for each sample, the positions at which its values occur in the whole genome are randomly permuted, until all s samples have been permuted, which yields one fully permuted sample set; for each permuted sample set, the statistic of copy number variation occurring at random is calculated; finally, the significance level of the detection statistic is calculated:
$$p\text{-value} = \frac{\sum_{i=1}^{K} I\left(T_i^{*} \geq T\right)}{K}$$
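The permutation procedure and the p-value formula can be sketched as follows. The per-site statistic used here (the column mean) is only a placeholder assumption standing in for the weighted statistic of step (3); the names `column_means` and `permutation_p_values` and the seeded random generator are likewise illustrative.

```python
import random

def column_means(matrix):
    """Placeholder per-site statistic: mean of each column across samples."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]

def permutation_p_values(matrix, stat=column_means, k=1000, seed=0):
    """Per-site p-values: the fraction of K permutations with T_i* >= T.
    Each row (sample) is shuffled independently along the genome."""
    rng = random.Random(seed)
    observed = stat(matrix)          # T, one entry per site
    counts = [0] * len(observed)
    for _ in range(k):
        permuted = []
        for row in matrix:
            shuffled = list(row)
            rng.shuffle(shuffled)    # permute positions within one sample
            permuted.append(shuffled)
        t_star = stat(permuted)      # T_i*, one entry per site
        for j, (ts, t) in enumerate(zip(t_star, observed)):
            if ts >= t:
                counts[j] += 1
    return [c / k for c in counts]
```

A site that is elevated in every sample rarely keeps that elevation after independent shuffles, so its p-value is small, whereas a site at background level is matched or exceeded in every permutation.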
(5) Null distribution design based on CNV length and estimation of the significance level
The regions where CNVs occur are evaluated from the p-values at all sites of the sample; if a p-value is below a set threshold (for example 0.05), the CNV is considered to have biological significance or a function in cancer. In addition, because the amplification and deletion states of a CNV have different biological functions and manifestations, separate null distributions are established for the amplification and deletion states of each CNV structural unit, so that the significance level of each state is tested separately.
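As a hedged illustration of how per-site p-values might be turned into CNV regions, the sketch below thresholds the p-values at 0.05 and merges runs of consecutive significant sites, handling amplification and deletion separately. Merging consecutive sites into a region is an assumption; the patent only states that regions are evaluated from the per-site p-values. The names `significant_regions` and `call_cnv` are hypothetical.

```python
def significant_regions(p_values, alpha=0.05):
    """Return inclusive (start, end) index pairs of consecutive sites with p < alpha."""
    regions, start = [], None
    for j, p in enumerate(p_values):
        if p < alpha:
            if start is None:
                start = j            # open a new region
        elif start is not None:
            regions.append((start, j - 1))
            start = None             # close the current region
    if start is not None:
        regions.append((start, len(p_values) - 1))
    return regions

def call_cnv(p_amp, p_del, alpha=0.05):
    """Call amplified and deleted regions from their separate p-value vectors."""
    return {
        "amplification": significant_regions(p_amp, alpha),
        "deletion": significant_regions(p_del, alpha),
    }
```

Keeping two p-value vectors mirrors the separate null distributions for the amplification and deletion states described above.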
(6) Performance evaluation of the algorithm
The performance of the algorithm is to be evaluated in the following four respects: a) whether the algorithm achieves a high true positive rate (TPR) while the false positive rate (FPR) is controlled; b) whether the algorithm estimates p-values accurately (type I error rate), that is, whether its statistical model is statistically meaningful; c) its ability to detect the boundaries of copy number variations; d) its computational complexity.
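Criteria a) and b) reduce to simple counts once per-site ground truth is available, as it is for simulated data. The sketch below assumes per-site boolean labels and calls; the function names are hypothetical, and the empirical type I error rate is taken as the fraction of p-values below the threshold among sites known to carry no CNV.

```python
def tpr_fpr(truth, called):
    """True positive rate and false positive rate from per-site labels.
    truth and called are equal-length sequences of truthy/falsy values."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def type_one_error_rate(null_p_values, alpha=0.05):
    """Fraction of p-values below alpha among sites with no true CNV."""
    if not null_p_values:
        return 0.0
    return sum(1 for p in null_p_values if p < alpha) / len(null_p_values)
```

If the statistical model is well calibrated, the empirical type I error rate should be close to the chosen threshold.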
With the normal-cell copy numbers detected by 1000 Affymetrix genome-wide SNP 6.0 arrays as background, and taking the characteristics of NGS technology and data into account, a Markov CNV simulation method is to be built on probability theory and a nonstationary model; it simulates large-scale NGS-based CNV data on which the performance of the method of the invention is tested. The simulation experiments completed so far show that the algorithm has a strong boundary detection capability while maintaining a high TPR.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
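The patent gives no details of the Markov simulation. The following is a deliberately simplified sketch under stated assumptions: a three-state chain (deletion, normal, amplification) with a fixed probability of staying in the current state, and Gaussian noise around a depth proportional to copy number. A fixed transition rule makes this chain stationary, unlike the nonstationary model mentioned in the text, and all parameter values are illustrative.

```python
import random

# Copy number assigned to each hidden state (assumed values).
STATES = {"deletion": 1, "normal": 2, "amplification": 3}

def simulate_copy_numbers(n_sites, stay=0.98, seed=0):
    """Simulate a copy number path along the genome with a simple Markov chain."""
    rng = random.Random(seed)
    names = list(STATES)
    state = "normal"
    path = []
    for _ in range(n_sites):
        if rng.random() >= stay:
            # Leave the current state and pick one of the other two at random.
            state = rng.choice([s for s in names if s != state])
        path.append(STATES[state])
    return path

def simulate_read_counts(copy_numbers, depth=30.0, noise=0.1, seed=0):
    """Noisy read depth proportional to copy number (diploid baseline = depth)."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(depth * c / 2, noise * depth)) for c in copy_numbers]
```

Because the true state at every site is known, data simulated this way provide the ground truth needed to compute TPR, FPR and boundary accuracy.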

Claims (8)

1. A copy number variation detection method based on next-generation sequencing, characterized in that the method comprises the following steps:
Preprocessing of copy number variation data: filter out batch effects in the CNV signal and reads with relatively low alignment quality from the alignment process; normalize for GC content to adjust the read count at each site of a data sample; normalize the sequencing levels of multiple samples to a common sequencing level; for data samples with low coverage depth, normalize the data directly to the common level; for data samples with high coverage depth, first define the copy number amplification and deletion states from the characteristics of the sample's frequency histogram;
Construction of the sliding window: combine the normalized samples into one high-dimensional matrix; construct a sliding window that, starting from the first position, computes the frequency of each site and uses the Pearson formula to compute the correlation between sites within each window; slide the window step by step until every site has been covered;
Calculation of the statistic: for each sliding window, calculate a statistic that reflects the amplification or deletion state of the copy number variation; construct a training set from known functional patterns of copy number variation and learn the weights of the frequency and the correlation coefficient, $w_1$ and $w_2$, in order to calculate the statistic:
$$S_{test} = w_1 f + w_2 a$$
where $f$, $a$ and $S_{test}$ denote, respectively, the frequency of the copy number variation functional pattern, its correlation, and the value of the statistic in the training set;
Implementation of the permutation strategy and construction of the null distribution: for the normalized samples, calculate the detection statistic $T$ at each site across the whole genome; then construct the null distribution by randomly permuting the sample data: for each sample, randomly permute the positions at which its values occur in the whole genome, until all s samples have been permuted, which yields one fully permuted sample set; for each permuted sample set, calculate the statistic of copy number variation occurring at random; finally, calculate the significance level of the detection statistic:
$$p\text{-value} = \frac{\sum_{i=1}^{K} I\left(T_i^{*} \geq T\right)}{K}$$
where p-value denotes the p-value at each site of the sample, $K$ is the number of random permutations, $T$ is the detection statistic calculated from the normalized data before permutation, and $T_i^{*}$ is the statistic from the i-th permutation; each permutation in which $T_i^{*}$ is not less than $T$ adds one to the count, and dividing the count by $K$ gives the p-value (p-value and $T$ are both vectors);
Estimation based on the CNV significance level: the regions where CNVs occur are evaluated from the p-values at all sites of the sample; if a p-value is below a set threshold (for example 0.05), the CNV is considered to have biological significance or a function in cancer; for each CNV structural unit, separate null distributions are established for the amplification and deletion states, so that the significance level of each state is tested separately;
Performance evaluation of the algorithm: determine whether the algorithm achieves a high true positive rate while the false positive rate is controlled; evaluate whether the algorithm estimates p-values accurately; assess its ability to detect the boundaries of copy number variations; and analyze its computational complexity.
2. The copy number variation detection method based on next-generation sequencing according to claim 1, characterized in that, in the step of filtering out batch effects in the CNV signal and reads with relatively low alignment quality, the reads filtered out are those with quality below Q30.
3. The copy number variation detection method based on next-generation sequencing according to claim 1, characterized in that, in the step of combining the normalized samples into a high-dimensional matrix, the matrix has dimensions s × N, where s is the number of samples and N is the number of sites per sample; the correlation between neighboring copy number variation sites of a copy number variation spanning a region is strong, up to 0.985, and the correlation between distant sites is weak.
4. The copy number variation detection method based on next-generation sequencing according to claim 1, characterized in that, for each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation; for low-coverage samples, the read-count frequency at each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S); for high-coverage samples, the frequency histogram is used to separate the amplification and deletion states of the copy number, which have different biological functions, and the statistic (S) is calculated for each state separately.
5. The copy number variation detection method based on next-generation sequencing according to claim 1, characterized in that, in the calculation of the statistic, the values of $S_{test}$ in the training set are assigned as relative values derived from the relationship between known functional patterns of copy number variation in public databases and gene expression levels.
6. The copy number variation detection method based on next-generation sequencing according to claim 1, characterized in that, in the step of calculating the detection statistic at each site across the whole genome for the normalized samples, constructing the null distribution and then randomly permuting the sample data, the sample data form a data matrix in which each row represents a sample and each column represents a site on the whole genome.
7. The copy number variation detection method based on next-generation sequencing according to claim 1, characterized in that, in the null distribution design based on CNV length and the estimation of the significance level, if the p-value is below the set threshold of 0.05, the CNV has biological significance or a function in cancer, and the amplification and deletion states of the CNV have different biological functions and manifestations.
8. The copy number variation detection method based on next-generation sequencing according to claim 1, characterized in that, in the performance evaluation of the algorithm, evaluating whether the algorithm estimates p-values accurately means evaluating whether the statistical model of the algorithm is statistically meaningful.
CN201610114354.8A 2016-03-01 2016-03-01 Copy number variation detection method based on next generation sequencing Active CN105760712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610114354.8A CN105760712B (en) 2016-03-01 2016-03-01 Copy number variation detection method based on next generation sequencing


Publications (2)

Publication Number Publication Date
CN105760712A true CN105760712A (en) 2016-07-13
CN105760712B CN105760712B (en) 2019-03-26

Family

ID=56331603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610114354.8A Active CN105760712B (en) 2016-03-01 2016-03-01 Copy number variation detection method based on next generation sequencing

Country Status (1)

Country Link
CN (1) CN105760712B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050064476A1 (en) * 2002-11-11 2005-03-24 Affymetrix, Inc. Methods for identifying DNA copy number changes
CN103778350A (en) * 2014-01-09 2014-05-07 西安电子科技大学 Somatic copy number alteration obviousness detection method based on two-dimension statistic model
CN104221022A (en) * 2012-04-05 2014-12-17 深圳华大基因医学有限公司 Method and system for detecting copy number variation
CN104603284A (en) * 2012-09-12 2015-05-06 深圳华大基因研究院 Method for detecting copy number variations by genome sequencing fragments
CN104694384A (en) * 2015-03-20 2015-06-10 上海美吉生物医药科技有限公司 Mitochondrial DNA copy index variability detecting device


Cited By (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372459A (en) * 2016-08-30 2017-02-01 天津诺禾致源生物信息科技有限公司 Method and device for detecting copy number variation based on amplicon next generation sequencing
CN106372459B (en) * 2016-08-30 2019-03-15 天津诺禾致源生物信息科技有限公司 A kind of method and device based on amplification second filial sequencing copy number variation detection
CN110024035A (en) * 2016-09-22 2019-07-16 Illumina公司 The variation detection of body cell copy number
CN110024035B (en) * 2016-09-22 2023-11-14 Illumina公司 Somatic cell copy number variation detection
CN108073790B (en) * 2016-11-10 2022-03-01 安诺优达基因科技(北京)有限公司 Chromosome variation detection device
CN108073790A (en) * 2016-11-10 2018-05-25 安诺优达基因科技(北京)有限公司 A kind of chromosomal variation detection device
CN106682450A (en) * 2016-11-24 2017-05-17 西安电子科技大学 New generation sequencing copy number variation simulation method based on state transition model
CN106682455A (en) * 2016-11-24 2017-05-17 西安电子科技大学 Statistical testing method of copy number consistency variation region in multiple samples
CN106682450B (en) * 2016-11-24 2019-05-07 西安电子科技大学 A state transition model-based simulation method for copy number variation in next-generation sequencing
CN106682455B (en) * 2016-11-24 2019-03-26 西安电子科技大学 A kind of Statistical Identifying Method of multisample copy number consistency variable region
CN108256292A (en) * 2016-12-29 2018-07-06 安诺优达基因科技(北京)有限公司 A kind of copy number variation detection device
CN106845154B (en) * 2016-12-29 2022-04-08 浙江安诺优达生物科技有限公司 A device for FFPE sample copy number variation detects
CN106650312B (en) * 2016-12-29 2022-05-17 浙江安诺优达生物科技有限公司 Device for detecting copy number variation of circulating tumor DNA
CN106845154A (en) * 2016-12-29 2017-06-13 安诺优达基因科技(北京)有限公司 A kind of device for the copy number variation detection of FFPE samples
CN108256292B (en) * 2016-12-29 2021-11-02 浙江安诺优达生物科技有限公司 Copy number variation detection device
CN106650312A (en) * 2016-12-29 2017-05-10 安诺优达基因科技(北京)有限公司 Device for detecting DNA copy number variation of circulating tumor
CN106778072A (en) * 2016-12-30 2017-05-31 西安交通大学 For the flow bearing calibration of second generation Oncogenome high-flux sequence data
CN106778072B (en) * 2016-12-30 2019-05-21 西安交通大学 For the process bearing calibration of second generation Oncogenome high-flux sequence data
CN106676178B (en) * 2017-01-19 2020-03-24 北京吉因加科技有限公司 Method and system for evaluating tumor heterogeneity
CN106676178A (en) * 2017-01-19 2017-05-17 北京吉因加科技有限公司 System and method for tumor heterogeneity assessment
WO2018214010A1 (en) * 2017-05-23 2018-11-29 深圳华大基因研究院 Method, device, and storage medium for detecting mutation on the basis of sequencing data
CN107229839A (en) * 2017-05-25 2017-10-03 西安电子科技大学 A kind of Indel detection methods based on new-generation sequencing data
US12154664B2 (en) 2017-11-16 2024-11-26 Illumina, Inc. Systems and methods for determining microsatellite instability
CN108563923A (en) * 2017-12-05 2018-09-21 华南理工大学 A kind of genetic mutation data distribution formula storage method and framework
CN108563923B (en) * 2017-12-05 2020-08-18 华南理工大学 Distributed storage method and system for genetic variation data
CN108197428B (en) * 2017-12-25 2020-06-19 西安交通大学 Copy number variation detection method for next generation sequencing technology based on parallel dynamic programming
CN108197428A (en) * 2017-12-25 2018-06-22 西安交通大学 A kind of next-generation sequencing technologies copy number mutation detection method of parallel Dynamic Programming
CN112365927B (en) * 2017-12-28 2023-08-25 安诺优达基因科技(北京)有限公司 CNV detection device
CN112365927A (en) * 2017-12-28 2021-02-12 安诺优达基因科技(北京)有限公司 CNV detection device
WO2019157791A1 (en) * 2018-02-14 2019-08-22 南京世和基因生物技术有限公司 Detection method and device of copy number variations, and computer readable medium
CN109658983A (en) * 2018-12-20 2019-04-19 深圳市海普洛斯生物科技有限公司 A kind of method and apparatus identifying and eliminate false positive in variance detection
CN109887546A (en) * 2019-01-15 2019-06-14 明码(上海)生物科技有限公司 A single-gene or multi-gene copy number detection system and method based on next-generation sequencing technology
CN110310704A (en) * 2019-05-08 2019-10-08 西安电子科技大学 A copy number variation detection method based on local outlier factors
CN112885406A (en) * 2020-04-16 2021-06-01 深圳裕策生物科技有限公司 Method and system for detecting HLA heterozygosity loss
CN111508559A (en) * 2020-04-21 2020-08-07 北京橡鑫生物科技有限公司 Method and device for detecting target area CNV
CN111429966A (en) * 2020-04-23 2020-07-17 长沙金域医学检验实验室有限公司 Chromosome copy number variation discrimination method and device based on robust linear regression
CN111627498A (en) * 2020-05-21 2020-09-04 北京吉因加医学检验实验室有限公司 Method and device for correcting GC bias of sequencing data
CN111627498B (en) * 2020-05-21 2022-10-04 北京吉因加医学检验实验室有限公司 Method and device for correcting GC bias of sequencing data
CN111863124A (en) * 2020-06-06 2020-10-30 聊城大学 A copy number variation detection method, system, storage medium, and computer equipment
CN111863124B (en) * 2020-06-06 2024-01-30 聊城大学 Copy number variation detection method, system, storage medium and computer equipment
CN113270141A (en) * 2021-06-10 2021-08-17 哈尔滨因极科技有限公司 Genome copy number variation detection integration algorithm
CN113270141B (en) * 2021-06-10 2023-02-21 哈尔滨因极科技有限公司 Genome copy number variation detection integration algorithm
CN113284558A (en) * 2021-07-02 2021-08-20 赛福解码(北京)基因科技有限公司 Method for distinguishing gene expression difference and long copy number variation in RNA sequencing data
CN113284558B (en) * 2021-07-02 2024-03-12 赛福解码(北京)基因科技有限公司 Method for distinguishing gene expression difference and long copy number variation in RNA sequencing data
CN114496300A (en) * 2021-12-20 2022-05-13 北京优迅医学检验实验室有限公司 Method and device for clinical annotation of copy number variation pathogenicity
CN114758720A (en) * 2022-06-14 2022-07-15 北京贝瑞和康生物技术有限公司 Methods, apparatus, and media for detecting copy number variation
CN115064210A (en) * 2022-07-27 2022-09-16 北京大学第三医院(北京大学第三临床医学院) Method for identifying chromosome crossover positions in diploid embryonic cells and application thereof
CN117409856A (en) * 2023-10-25 2024-01-16 北京博奥医学检验所有限公司 Mutation detection method, system, and storage medium based on second-generation sequencing data of targeted gene regions from a single test sample
CN117409856B (en) * 2023-10-25 2024-03-29 北京博奥医学检验所有限公司 Mutation detection method, system, and storage medium based on second-generation sequencing data of targeted gene regions from a single test sample
CN118016150A (en) * 2023-11-30 2024-05-10 东莞博奥木华基因科技有限公司 Model construction for detecting copy number variation of genetic sequence and application thereof

Also Published As

Publication number Publication date
CN105760712B (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN105760712A (en) Copy number variation detection method based on next generation sequencing
Li et al. FDR-control in multiscale change-point segmentation
Wang et al. A novel approach combined transfer learning and deep learning to predict TMB from histology image
CN109887546B (en) Single-gene or multi-gene copy number detection system and method based on next-generation sequencing
CN110517790A (en) Method for early prediction of compound hepatotoxicity based on deep learning and gene expression data
CN117594243A (en) Ovarian cancer prognosis prediction method based on cross-modal view association discovery network
CN101996284A (en) Screening method of characteristic gene of certain disease
CN111400713B (en) Malicious software population classification method based on operation code adjacency graph characteristics
CN114184599A (en) Single-cell Raman spectrum acquisition number estimation method, data processing method and device
CN114036531A (en) Multi-scale code measurement-based software security vulnerability detection method
CN116864011A (en) Colorectal cancer molecular marker identification method and system based on multi-omics data
CN115620812A (en) Resampling-based feature selection method and device, electronic equipment and storage medium
CN103778350B (en) Method for detecting the significance of somatic copy number variation based on a two-dimensional statistical model
CN115277151A (en) Network intrusion detection method based on whale lifting algorithm
CN102586418A (en) Pathway-based specific combined medicine target detection method
CN117992913A (en) Multimode data classification method based on bimodal attention fusion network
CN118035118A (en) A method for defect localization in deep learning programs based on mutation
Schroeder et al. Enricherator: A Bayesian method for inferring regularized genome-wide enrichments from sequencing count data
CN115684363A (en) Evaluation Method of Concrete Performance Degradation Based on Acoustic Emission Signal Processing
CN101565747B (en) Method for extracting characteristic expression patterns of multiple gene sets
CN103218543B (en) Method and system for distinguishing protein-coding genes from non-coding genes
CN114062305A (en) Method and system for single-grain variety identification based on near-infrared spectroscopy and 1D-In-Resnet network
CN111598184A (en) DenseNet-based image noise identification method and device
Bidaut et al. WaveRead: automatic measurement of relative gene expression levels from microarrays using wavelet analysis
CN115274124B (en) Dynamic optimization method of tumor early screening targeting Panel and classification model based on data driving

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant