NGS-trio-based uniparental diploid detection method and application thereof
Technical Field
The invention relates to the technical field of bioinformatics analysis, in particular to an NGS-trio-based uniparental diploid detection method and application thereof.
Background
Genomic imprinting (also called genetic imprinting) is a genetic process that marks the parental origin of a gene or genomic domain by biochemical means. Such genes are called imprinted genes; whether they are expressed depends on the parental origin (paternal or maternal) of the chromosome carrying them and on whether the gene is silenced on that chromosome (the silencing mechanism is primarily methylation). Some imprinted genes are expressed only from the maternal chromosome, and some only from the paternal chromosome.
In a normal diploid, the two homologous chromosomes of a pair are derived from the father and the mother respectively. Uniparental diploid (uniparental disomy, UPD for short) refers to a pair of homologous chromosomes (or partial chromosome segments) derived from the same parent; if the segments contain imprinted genes, gene expression disorders can result. The current method for diagnosing UPD is to determine whether the methylation level is consistent between the same segments of a pair of homologous chromosomes.
In most cases, UPD arises because two homologous chromosomes fail to separate during meiosis, producing a gamete with an abnormal chromosome copy number: 2 or 0 copies instead of the 1 copy of a normal gamete. Fertilization then yields a zygote with an abnormal copy number (trisomy or monosomy). Finally, through trisomy rescue, as shown in FIG. 1, one chromosome is randomly lost; or through monosomy rescue, as shown in FIG. 2, the single chromosome is duplicated to restore euploidy. Trisomy rescue results in UPD with a probability of 1/3, whereas monosomy rescue always results in UPD.
For UPD generated by monosomy rescue, the entire chromosome is homozygous, so it can be inferred indirectly by detecting LOH (loss of heterozygosity). For UPD generated by trisomy rescue, local LOH arises only occasionally, through recombination during meiosis; moreover, local LOH more often has other causes (such as consanguineous marriage), so UPD cannot be determined with 100% certainty from LOH.
Moreover, the conventional methylation-based methods for detecting UPD can only process small local chromosome segments, and different experiments must be designed for different regions, so they are inefficient, slow, and unsuitable for genome-wide screening.
Methods based on SNP chips are more costly, and because their target probes are polymorphic sites, they cannot simultaneously detect other pathogenic small mutations (point mutations, small insertions and deletions, and the like).
Whole-exome sequencing is currently the most common method for detecting genetic diseases; it can detect pathogenic point mutations, small insertions and deletions, copy number variations and the like, and is the first choice for most patients. However, based on the sequencing data of a single sample, UPD can only be inferred indirectly from LOH, as disclosed in CN 110211630A.
Disclosure of Invention
In view of the above, it is necessary to provide an NGS-trio-based uniparental diploid detection method that can directly infer the parental origin of the proband's chromosomes, and thereby directly determine whether UPD is present (instead of inferring UPD indirectly from LOH), improving the diagnostic positive rate without any additional cost.
An NGS-trio-based uniparental diploid detection method comprises the following steps:
data acquisition: acquiring NGS sequencing data of the same group of trio samples;
mutation site screening: in each sample, selecting the mutation sites that meet preset conditions and defining them as qualified mutation sites of the sample, and defining the mutation sites that are screened out as unqualified mutation sites of the sample;
site data merging: taking the union of the unqualified mutation sites of all samples in the same group of trio samples, collecting the chromosome coordinates of each unqualified mutation site, and removing from the qualified sites of each sample any mutation site whose coordinates coincide with an unqualified site; then, based on the remaining qualified mutation sites in the group, filling in the genotype of any sample that has no mutation at a given position as homozygous for the reference sequence;
genetic pattern classification: classifying the genetic pattern of the trio combination at each mutation site, dividing the mutation sites into: sites conforming to biparental inheritance, sites conforming only to uniparental inheritance, and sites not conforming to the genetic rule;
kinship judgment: if the number of sites that do not conform to the genetic rule is smaller than a preset value, performing subsequent analysis; if it is greater than or equal to the preset value, judging the sample set to be unqualified;
uniparental fragment judgment: if the coverage range of consecutive sites conforming only to uniparental paternal inheritance exceeds a preset value, judging the fragment to be of uniparental paternal origin; if the coverage range of consecutive sites conforming only to uniparental maternal inheritance exceeds a preset value, judging the fragment to be of uniparental maternal origin;
UPD judgment: analyzing the sequencing coverage depth of each fragment judged to be uniparental; if the depth indicates that the segment is single-copy, judging it to be a fragment deletion; otherwise, judging it to be a UPD segment;
pathogenic UPD screening: checking whether the UPD segment covers an imprinted gene or the corresponding chromosomal band; if not, judging the UPD segment to be benign; if so, indicating a risk of pathogenic UPD.
With the reduction of sequencing cost, more and more whole-exome sequencing schemes test the proband and both parents together. Based on such trio family data, the method can directly infer the parental origin of the proband's chromosomes, thereby directly judging whether UPD is present and improving the diagnostic positive rate without any additional cost.
It will be appreciated that the NGS sequencing data described above may be either whole-exome sequencing data or whole-genome sequencing data.
In one embodiment, in the step of screening for a mutation site, the mutation site is selected as follows:
1) screening high-quality mutation sites in NGS sequencing data;
2) removing the mutation site located on the Y chromosome;
3) retaining point mutation sites located within genes;
4) excluding suspected false positive sites according to Hardy-Weinberg equilibrium;
5) for heterozygous sites, removing those with mutation frequency higher than 70%; for homozygous sites, removing those with mutation frequency lower than 85%;
6) typing the mutations at each position, and removing positions with more than 2 types;
7) the rest sites are mutation sites meeting the preset conditions.
In mutation analysis, since humans are diploid, one position carries at most 2 alleles; more than two usually indicates a sequencing error. For example, chr1:69849G>A, Het is typed as chr1:69849[A/G], and chr1:69849G>A, Hom is typed as chr1:69849[A/A]. If both chr1:69849G>A, Het and chr1:69849G>T, Het are present, the typing is chr1:69849[A/G/T], i.e., more than 2 types, and this site needs to be removed.
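The typing rule above can be illustrated with a minimal Python sketch. The function name `type_positions` and the tuple layout `(chrom, pos, ref, alt, zygosity)` are assumptions made for illustration only and are not part of the disclosed method.

```python
from collections import defaultdict


def type_positions(calls):
    """Collapse variant calls into one typing per position.

    calls: iterable of (chrom, pos, ref, alt, zygosity), zygosity "Het" or "Hom"
    (hypothetical input layout). Returns (kept, removed): kept maps
    (chrom, pos) to a typing string such as "A/G"; removed is the set of
    positions carrying more than 2 alleles.
    """
    alleles = defaultdict(set)
    for chrom, pos, ref, alt, zygosity in calls:
        observed = alleles[(chrom, pos)]
        observed.add(alt)
        if zygosity == "Het":
            # A heterozygous call also carries the reference allele.
            observed.add(ref)

    kept, removed = {}, set()
    for key, observed in alleles.items():
        if len(observed) > 2:
            removed.add(key)  # more than 2 alleles: treated as a sequencing error
        else:
            pair = sorted(observed)
            if len(pair) == 1:
                pair = pair * 2  # homozygous: the same allele twice
            kept[key] = "/".join(pair)
    return kept, removed
```

With the example of the text, a single Het G>A call is typed A/G, a Hom G>A call is typed A/A, and two Het calls G>A and G>T at the same position cause the position to be removed.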
It will be appreciated that a qualified mutation site must satisfy all of the screening conditions and none of the removal conditions.
It can be understood that, according to the Hardy-Weinberg law of equilibrium, the genotype frequencies and allele frequencies at a locus remain unchanged and in genetic equilibrium in a population that is infinitely large and has random mating, no mutation, no selection and no genetic drift. False positive sites can therefore be excluded by a chi-square test. The genotype frequencies AA-AB-BB at a locus follow a regular pattern: for example, with 10,000 persons in a local population database and allele frequencies of 0.4 for A and 0.6 for B, the theoretical number of persons is 1600 with genotype AA, 3600 with BB and 4800 with AB. A chi-square test comparing the actual and theoretical numbers in the population database excludes loci whose actual numbers deviate too much from the theoretical numbers (i.e., likely false positive loci).
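As an illustration of the chi-square test described above, the following Python sketch computes the theoretical genotype counts and the test statistic. It assumes that the allele frequency is estimated from the observed genotype counts and that the test has 1 degree of freedom; the function names and any rejection cut-off are assumptions, since the text does not specify them.

```python
import math


def hwe_expected(n, p):
    """Theoretical counts (AA, AB, BB) for n persons and allele frequency p of A."""
    q = 1.0 - p
    return n * p * p, 2 * n * p * q, n * q * q


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic and p-value (1 degree of freedom) for HWE deviation."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # allele frequency of A from observed counts
    observed = (n_aa, n_ab, n_bb)
    expected = hwe_expected(n, p)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For the example in the text (10,000 persons, allele frequency 0.4 for A), `hwe_expected(10000, 0.4)` gives approximately 1600, 4800 and 3600 for AA, AB and BB. A locus whose observed counts give a very small p-value would be a candidate for exclusion.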
Conventional NGS results contain a large number of poor-quality sites, which greatly interfere with the subsequent UPD judgment; the detection performance is poor if all sites are used. Selecting mutation sites by the above method therefore improves the accuracy of the analysis result.
In one embodiment, in the mutation site screening step:
the high-quality mutation sites are mutation sites meeting the following criteria: GATK-VQSR quality control PASS, total coverage > 20X, and mutation frequency > 25%.
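The per-call thresholds stated above (quality criteria, removal of Y-chromosome sites, and the mutation frequency bounds for heterozygous and homozygous sites) can be sketched in Python as follows. The `Call` record and its field names are hypothetical; how the values are read from a variant file is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Call:
    # Hypothetical record layout for one variant call in one sample.
    chrom: str
    pos: int
    vqsr_filter: str      # e.g. "PASS"
    depth: int            # total coverage at the site
    alt_fraction: float   # mutation frequency, 0.0 to 1.0
    zygosity: str         # "Het" or "Hom"


def is_qualified(call):
    """Return True if the call passes the per-call thresholds of the text."""
    # High-quality criteria: VQSR PASS, coverage > 20X, mutation frequency > 25%.
    if call.vqsr_filter != "PASS" or call.depth <= 20 or call.alt_fraction <= 0.25:
        return False
    # Sites on the Y chromosome are removed.
    if call.chrom in ("chrY", "Y"):
        return False
    # Heterozygous sites above 70% and homozygous sites below 85% are removed.
    if call.zygosity == "Het" and call.alt_fraction > 0.70:
        return False
    if call.zygosity == "Hom" and call.alt_fraction < 0.85:
        return False
    return True
```

The Hardy-Weinberg and typing filters are not included in this sketch, because they depend on population data and on all calls at a position rather than on a single call.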
In one embodiment, in the data acquiring step, the same set of trio samples includes a paternal sample, a maternal sample and a proband sample;
in the site data merging step, mutation site data with consistent coordinates are arranged according to the sequence of proband-father-mother.
The detection method of the invention must include samples of the proband and of both parents; none of them can be omitted.
In one embodiment, in the genetic pattern classification step, the sites that conform to biparental inheritance are divided into:
type 1: sites that conform only to biparental inheritance;
type 0: sites that conform to both biparental inheritance and uniparental inheritance;
the sites that conform only to uniparental inheritance are divided into:
type 3F: sites that can only be produced by paternal monosomy rescue;
type 2F: sites that can be produced by either paternal monosomy rescue or paternal trisomy rescue;
type 3M: sites that can only be produced by maternal monosomy rescue;
type 2M: sites that can be produced by either maternal monosomy rescue or maternal trisomy rescue;
the sites that do not conform to the genetic rule are divided into:
type -1: one parent does not conform to the genetic rule;
type -2: both parents do not conform to the genetic rule.
It is understood that the above sites conforming to biparental inheritance are sites at which both alleles of the proband can be traced to the parents, including sites that conform only to biparental inheritance (i.e., type 1, such as Aa-AA-aa) and sites that conform to both biparental and uniparental inheritance (i.e., type 0).
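The classification into types 1, 0, 3F, 2F, 3M, 2M, -1 and -2 can be sketched in Python as below. This is a reconstruction from the type definitions, assuming biallelic point mutation sites and genotypes written as two-character strings in the order proband, father, mother; the function name `classify_site` and the exact genotype rules are assumptions and may differ from the implementation actually used.

```python
def classify_site(proband, father, mother):
    """Classify one trio genotype combination, e.g. ("AG", "AA", "GG").

    Assumes at most 2 distinct alleles across the trio (biallelic site).
    """
    p, f, m = set(proband), set(father), set(mother)
    if len(p | f | m) > 2:
        raise ValueError("sketch handles biallelic sites only")

    if len(p) == 1:
        # Proband homozygous for allele x: both copies of x need a source.
        (x,) = p
        in_father, in_mother = x in f, x in m
        if in_father and in_mother:
            return "0"                            # biparental or uniparental
        if in_father:
            return "2F" if len(f) == 1 else "3F"  # father homozygous vs heterozygous
        if in_mother:
            return "2M" if len(m) == 1 else "3M"
        return "-2"                               # neither parent carries x

    # Proband heterozygous.
    if len(f) == 1 and len(m) == 1:
        if f != m:
            return "1"   # parents homozygous for opposite alleles: biparental only
        return "-1"      # one proband allele is absent from both parents
    return "0"           # a heterozygous parent could explain both alleles
```

For example, `classify_site("AA", "AG", "GG")` returns "3F" under these assumptions, because the mother carries no A and the heterozygous father can only supply two copies of A through duplication of one chromosome.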
In one embodiment, in the uniparental fragment judgment step, if more than 8 consecutive type 2F or 3F sites are reached and their coverage range exceeds 1 Mbp, the fragment is judged to be of uniparental paternal origin; if more than 8 consecutive type 2M or 3M sites are reached and their coverage range exceeds 1 Mbp, the fragment is judged to be of uniparental maternal origin.
It is understood that consecutive sites here means sites not interrupted by a type 1 site, i.e., more than 8 type 2F or 3F sites (or more than 8 type 2M or 3M sites) with no type 1 site between them.
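The run-based judgment can be sketched in Python as follows. The text only states that a run is interrupted by a type 1 site; that a run is also ended by a site of the opposite parental origin or by a change of chromosome, and that "more than 8" is applied as a strict inequality, are assumptions of this sketch, as is the function name.

```python
PATERNAL = {"2F", "3F"}
MATERNAL = {"2M", "3M"}


def find_uniparental_segments(sites, min_sites=8, min_span=1_000_000):
    """sites: list of (chrom, pos, site_type) sorted by chromosome and position.

    Returns a list of (chrom, start, end, origin, n_sites) for runs with more
    than min_sites uniparental sites spanning more than min_span base pairs.
    """
    segments = []
    run = None

    def flush():
        nonlocal run
        if run and run["count"] > min_sites and run["end"] - run["start"] > min_span:
            segments.append(
                (run["chrom"], run["start"], run["end"], run["origin"], run["count"])
            )
        run = None

    for chrom, pos, site_type in sites:
        if site_type in PATERNAL:
            origin = "paternal"
        elif site_type in MATERNAL:
            origin = "maternal"
        else:
            origin = None
        # A run ends at a type 1 site, a new chromosome, or (assumed) a site
        # of the opposite parental origin; type 0, -1 and -2 sites are skipped.
        if run and (
            chrom != run["chrom"]
            or site_type == "1"
            or (origin is not None and origin != run["origin"])
        ):
            flush()
        if origin is not None:
            if run is None:
                run = {"chrom": chrom, "origin": origin,
                       "start": pos, "end": pos, "count": 0}
            run["end"] = pos
            run["count"] += 1
    flush()
    return segments
```

Ten type 2F sites spaced 200 kbp apart on one chromosome yield a single paternal segment; a type 1 site in the middle splits them into two runs of five, and neither run qualifies.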
In one embodiment, in the UPD judgment step, the fragments judged to be uniparental are checked against the copy number analysis results of whole-exome sequencing; if the copy number analysis indicates that the segment is single-copy, it is judged to be a fragment deletion; otherwise, it is judged to be UPD.
The invention also discloses application of the NGS-trio-based uniparental diploid detection method in the research, development or preparation of a device for screening pathogenic UPD.
The invention also discloses an NGS-trio-based uniparental diploid screening device, which comprises: a data acquisition module, a data analysis module and a UPD judgment module;
the data acquisition module is used for acquiring NGS sequencing data of the same group of trio samples;
the data analysis module is used for analyzing the sequencing data, and dividing mutation sites into: sites conforming to the inheritance of parents, sites conforming to the inheritance of only a single parent and sites not conforming to the genetic rule;
the UPD judgment module is used for carrying out UPD judgment on the mutation sites according to a preset rule to obtain a judgment result;
the data analysis module performs analysis according to the following steps:
mutation site screening: in each sample, selecting the mutation sites that meet preset conditions and defining them as qualified mutation sites of the sample, and defining the mutation sites that are screened out as unqualified mutation sites of the sample;
site data merging: taking the union of the unqualified mutation sites of all samples in the same group of trio samples, collecting the chromosome coordinates of each unqualified mutation site, and removing from the qualified sites of each sample any mutation site whose coordinates coincide with an unqualified site; then, based on the remaining qualified mutation sites in the group, filling in the genotype of any sample that has no mutation at a given position as homozygous for the reference sequence;
genetic pattern classification: classifying the genetic pattern of the trio combination at each mutation site, dividing the mutation sites into: sites conforming to biparental inheritance, sites conforming only to uniparental inheritance, and sites not conforming to the genetic rule;
the UPD judging module analyzes according to the following steps:
kinship judgment: if the number of sites that do not conform to the genetic rule is smaller than a preset value, performing subsequent analysis; if it is greater than or equal to the preset value, judging the sample set to be unqualified;
uniparental fragment judgment: if the coverage range of consecutive sites conforming only to uniparental paternal inheritance exceeds a preset value, judging the fragment to be of uniparental paternal origin; if the coverage range of consecutive sites conforming only to uniparental maternal inheritance exceeds a preset value, judging the fragment to be of uniparental maternal origin;
UPD judgment: analyzing the sequencing coverage depth of each fragment judged to be uniparental; if the depth indicates that the segment is single-copy, judging it to be a fragment deletion; otherwise, judging it to be a UPD segment;
pathogenic UPD screening: checking whether the UPD segment covers an imprinted gene or the corresponding chromosomal band; if not, judging the UPD segment to be benign; if so, indicating a risk of pathogenic UPD.
In one embodiment, in the step of screening for a mutation site, the mutation site is selected as follows:
1) screening high-quality mutation sites in NGS sequencing data;
2) removing the mutation site located on the Y chromosome;
3) retaining point mutation sites located within genes;
4) excluding suspected false positive sites according to Hardy-Weinberg equilibrium;
5) for heterozygous sites, removing those with mutation frequency higher than 70%; for homozygous sites, removing those with mutation frequency lower than 85%;
6) typing the mutations at each position, and removing positions with more than 2 types;
7) the rest sites are mutation sites meeting the preset conditions.
In one embodiment, in the mutation site screening step:
the high-quality mutation sites are mutation sites meeting the following standards: the GATK-VQSR quality control PASS, the total coverage is >20X, and the mutation frequency is > 25%.
In one embodiment, in the data acquisition module, the same set of trio samples includes a paternal sample, a maternal sample and a proband sample;
in the site data merging step, mutation site data with consistent coordinates are arranged according to the sequence of proband-father-mother.
In one embodiment, in the genetic pattern classification step, the sites that conform to biparental inheritance are divided into:
type 1: sites that conform only to biparental inheritance;
type 0: sites that conform to both biparental inheritance and uniparental inheritance;
the sites that conform only to uniparental inheritance are divided into:
type 3F: sites that can only be produced by paternal monosomy rescue;
type 2F: sites that can be produced by either paternal monosomy rescue or paternal trisomy rescue;
type 3M: sites that can only be produced by maternal monosomy rescue;
type 2M: sites that can be produced by either maternal monosomy rescue or maternal trisomy rescue;
the sites that do not conform to the genetic rule are divided into:
type -1: one parent does not conform to the genetic rule;
type -2: both parents do not conform to the genetic rule.
It is understood that the above sites conforming to biparental inheritance are sites at which both alleles of the proband can be traced to the parents, including sites that conform only to biparental inheritance (i.e., type 1, such as Aa-AA-aa) and sites that conform to both biparental and uniparental inheritance (i.e., type 0).
In one embodiment, in the uniparental fragment judgment step, if more than 8 consecutive type 2F or 3F sites are reached and their coverage range exceeds 1 Mbp, the fragment is judged to be of uniparental paternal origin; if more than 8 consecutive type 2M or 3M sites are reached and their coverage range exceeds 1 Mbp, the fragment is judged to be of uniparental maternal origin.
In one embodiment, in the UPD judgment step, the fragments judged to be uniparental are checked against the copy number analysis results of whole-exome sequencing; if the copy number analysis indicates that the segment is single-copy, it is judged to be a fragment deletion; otherwise, it is judged to be UPD.
The invention also discloses a storage medium comprising a stored program, wherein the program, when run, implements the functions of the above modules.
The invention also discloses a processor for running a program, wherein the program, when run, implements the functions of the above modules.
Compared with the prior art, the invention has the following beneficial effects:
According to the NGS-trio-based uniparental diploid detection method, on the basis of whole-exome/whole-genome trio sequencing data, the occurrence of UPD, and of UPD in high-risk imprinted regions, can be judged at the same time as the routine screening for pathogenic mutations, with no additional experiments or labor cost.
In addition, the method can assist in judging large heterozygous deletions; depending on the density of mutation sites, the resolution can reach 1 Mbp, giving excellent detection performance.
Drawings
FIG. 1 is a schematic diagram of trisomy rescue in the background art;
FIG. 2 is a schematic diagram of monosomy rescue in the background art;
FIG. 3 is a flow chart of the NGS-trio-based uniparental diploid detection method in example 1;
FIG. 4 is a schematic view of a screening apparatus module in example 2;
FIG. 5 is a schematic view of a normal sample in example 3;
FIG. 6 is a schematic representation of the analysis of the trio sample set NP21S0557-NP21S0558-NP21S0549 in example 4;
FIG. 7 is an enlarged view of the framed portion of FIG. 6;
FIG. 8 is a schematic diagram of the analysis of the trio sample set NP19E0911-NP19E0910-NP19E0912 in example 4;
FIG. 9 is an enlarged view of the framed portion of FIG. 8;
FIG. 10 is a schematic representation of the analysis of the trio sample set NP20E957-NP20E956-NP20E958 of example 4;
FIG. 11 is an enlarged view of the framed portion of FIG. 10;
FIG. 12 is a schematic representation of the analysis of the trio sample set NP21F6166-NP21F6167-NP21F6168 in example 5;
FIG. 13 is an enlarged view of the framed portion of FIG. 12;
FIG. 14 is a schematic diagram of the analysis of the trio sample set NP19F0315-NP19F0313-NP19F0314 in example 5;
FIG. 15 is an enlarged view of the framed portion of FIG. 14;
FIG. 16 is a schematic representation of the analysis of the trio sample set NP21F3536-NP21F3567-NP21F3537 in example 5;
FIG. 17 is an enlarged view of the framed portion of FIG. 16;
FIG. 18 is a schematic diagram of the analysis of the trio sample set NP19E1380-NP19E1381-NP19E1382 in example 6;
FIG. 19 is an enlarged view of the framed portion of FIG. 18;
FIG. 20 is a schematic diagram showing the analysis of the trio sample set NP19E0056-NP9E0057-NP9E0055 in example 6;
FIG. 21 is an enlarged view of the framed portion of FIG. 20;
wherein: in FIGS. 5, 6, 8, 10, 12, 14, 16, 18 and 20, the abscissa is the chromosome number, the lower half of each figure shows the proportion of consecutive homozygous fragments relative to the whole chromosome length, and the upper half shows the distribution of mutation sites on each chromosome;
in the enlarged views of FIGS. 7, 9, 11, 13, 15, 17, 19 and 21, the legend entries for the different site types on each chromosome are, from left to right: the cross unInherit_2 denotes type -2 sites, the dot unInherit_1 denotes type -1 sites, the diamond Norm denotes normal sites, the solid line exome_bed denotes the whole-exome sequencing coverage, imprint location denotes imprinted sections, imprint gene denotes the range of imprinted genes, the inverted triangle Mather denotes uniparental maternal sites (3M and 2M), and the upright triangle Father denotes uniparental paternal sites (3F and 2F).
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example 1
An NGS-trio-based uniparental diploid detection method, whose flow chart is shown in FIG. 3, comprises the following steps:
Step one: data acquisition.
NGS sequencing data were obtained for the same set of trio samples. It is understood that the NGS sequencing data may be whole exome sequencing data or whole genome sequencing data.
The sample set must be complete: the proband sample, the paternal sample and the maternal sample are all required.
Step two: mutation site screening.
For a group of trio samples, the mutation sites meeting preset conditions are selected in each sample and defined as qualified mutation sites of the sample, and the mutation sites that are screened out are defined as unqualified mutation sites of the sample. Screening is performed as follows:
1. screening high-quality mutation sites (GATK-VQSR quality control PASS, total coverage >20X, mutation frequency > 25%) in whole exome sequencing;
2. removing the mutation site located on the Y chromosome;
3. retaining point mutation sites located within genes;
4. excluding possible false positive sites according to Hardy-Weinberg equilibrium in the local population frequency database;
5. for heterozygous sites, removing those with mutation frequency higher than 70%; for homozygous sites, removing those with mutation frequency lower than 85%;
6. typing the mutations at each position and removing positions with more than 2 types (humans are diploid, so one position carries at most 2 alleles; more than two usually indicates a sequencing error). For example, chr1:69849G>A, Het is typed as chr1:69849[A/G], and chr1:69849G>A, Hom is typed as chr1:69849[A/A]. If both chr1:69849G>A, Het and chr1:69849G>T, Het are present, the typing is chr1:69849[A/G/T], i.e., more than 2 types, and this site needs to be removed.
7. summarizing and recording the qualified sites and the unqualified sites separately.
A qualified site must satisfy all of the screening conditions and none of the removal conditions.
Step three: site data merging.
1. Taking the union of the unqualified mutation sites of the three samples (proband, father and mother) in the same group of trio samples, collecting the chromosome coordinates of each unqualified mutation site, and removing from the qualified sites of each sample any mutation site whose coordinates coincide with an unqualified site; that is, as long as a site fails quality control in one sample, it is also rejected in the other two samples.
2. Based on the remaining qualified mutation sites in the group, filling in the genotype of any sample that has no mutation at a given position as homozygous for the reference sequence. For example, if the proband is chr1:69849[A/G], the father is chr1:69849[A/A] and the mother has no mutation at this position, then, since the reference base at this position is G, the mother is typed as chr1:69849[G/G].
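The merging step can be sketched in Python as follows. The dictionary layouts, the sample keys "proband", "father" and "mother", the two-character genotype strings and the `ref_base` lookup are all assumptions made for illustration.

```python
def merge_trio(qualified, unqualified, ref_base):
    """Merge per-sample sites into proband-father-mother genotype triples.

    qualified:   {sample: {(chrom, pos): genotype}}, genotype such as "AG"
    unqualified: {sample: set of (chrom, pos)}
    ref_base:    {(chrom, pos): reference base}
    """
    order = ("proband", "father", "mother")
    # A site that fails in any one sample is rejected in all three samples.
    bad = set().union(*unqualified.values())

    positions = set()
    for sample in order:
        positions |= set(qualified[sample]) - bad

    merged = {}
    for key in sorted(positions):
        # A sample without a mutation at this position is filled in as
        # homozygous for the reference base.
        ref_hom = ref_base[key] * 2
        merged[key] = tuple(qualified[s].get(key, ref_hom) for s in order)
    return merged
```

With the example of the text, a proband typed A/G, a father typed A/A and a mother without a call at chr1:69849 (reference base G) give the triple ("AG", "AA", "GG").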
Through the above processing, about 50,000 qualified mutation site trio combinations can generally be obtained from whole-exome sequencing data. The genotypes in each trio combination are ordered proband-father-mother; for example, Aa-AA-Aa means that the proband is Aa, the father is AA and the mother is Aa.
Step four: genetic pattern classification.
The genetic pattern of the trio combination at each mutation site is classified, dividing the mutation sites into: sites conforming to biparental inheritance, sites conforming only to uniparental inheritance, and sites not conforming to the genetic rule. Specifically:
1. Sites conforming to biparental inheritance: both alleles of the proband can be traced to the parents. The Aa-AA-aa type is necessarily inherited from both parents; such sites are marked as type 1 (sites conforming only to biparental inheritance). Other types, such as Aa-Aa-Aa and AA-AA-Aa, conform to biparental inheritance but also to uniparental inheritance; such sites cannot serve as the basis for any judgment and are marked as type 0 (sites conforming to both biparental and uniparental inheritance).
2. Sites conforming only to uniparental inheritance: both alleles of the proband can only be inherited from one of the parents. Taking inheritance from the father as an example, there are two cases, AA-Aa-aa and AA-AA-aa: AA-Aa-aa can only be produced by monosomy rescue and is marked as type 3F, while AA-AA-aa can be produced by either monosomy rescue or trisomy rescue and is marked as type 2F. Similarly, the corresponding types inherited from the mother are marked as 3M and 2M.
3. The remaining sites do not conform to the genetic rule: a few sporadic sites may be caused by gene mutation, sequencing error and the like, whereas a large number suggests that the parents may not be the biological parents. There are two cases: the AA-aa-aa type, where neither parent is consistent with parentage, marked as type -2; and the Aa-aa-aa type, where one parent is not consistent with parentage, marked as type -1.
Step five: kinship judgment.
If the number of sites that do not conform to the genetic rule is smaller than the preset value, subsequent analysis is performed; if it is greater than or equal to the preset value, the sample set is judged unqualified.
Normally, owing to gene mutation and sequencing errors, there may be a few sporadic type -1 and type -2 sites, typically no more than 100; in a non-parental case, even if only one parent is non-parental, there are thousands of type -1 sites.
In summary, a trio with more than 800 type -1 and type -2 sites is judged to be non-parental; that is, in this embodiment, the preset value (threshold) for sites that do not conform to the genetic rule is set to 800.
If the trio is judged to be non-parental, subsequent analysis cannot be performed. If the trio passes the kinship judgment, the subsequent procedure is entered.
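The kinship judgment reduces to counting type -1 and type -2 sites against the threshold, as in the following Python sketch. The function name is hypothetical, and the comparison follows the general rule of the text (a count greater than or equal to the preset value fails).

```python
from collections import Counter


def kinship_check(site_types, max_violations=800):
    """Return (passed, n_violations) for a list of site type labels.

    Types "-1" and "-2" are the sites that do not conform to the genetic rule;
    the trio passes only if their number is below the preset value.
    """
    counts = Counter(site_types)
    violations = counts["-1"] + counts["-2"]
    return violations < max_violations, violations
```

A trio with a few dozen sporadic violations passes, whereas a trio with thousands of type -1 sites fails and is excluded from the subsequent steps.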
Step six: uniparental fragment judgment.
If the coverage range of consecutive sites conforming only to uniparental paternal inheritance exceeds a preset value, the fragment is judged to be of uniparental paternal origin; if the coverage range of consecutive sites conforming only to uniparental maternal inheritance exceeds a preset value, the fragment is judged to be of uniparental maternal origin.
Specifically, in this embodiment, uniparental paternal/maternal fragments are judged as follows: if more than 8 consecutive type 2F or 3F sites are reached (not interrupted by a type 1 site) and the coverage range exceeds 1 Mbp, the fragment is judged to be of uniparental paternal origin; similarly, if more than 8 consecutive type 2M or 3M sites are reached (not interrupted by a type 1 site) and the coverage range exceeds 1 Mbp, the fragment is judged to be of uniparental maternal origin.
Step seven: UPD judgment.
The sequencing coverage depth of each fragment judged to be uniparental is analyzed; if the depth indicates that the segment is single-copy, it is judged to be a fragment deletion; otherwise, it is judged to be a UPD segment.
Specifically, the judgment is combined with the copy number variation (CNV) analysis results of whole-exome sequencing, i.e., the sequencing coverage depth of the uniparental paternal/maternal fragment is compared with that of other samples in the same batch: if the CNV analysis indicates that the segment is single-copy, it is judged to be a fragment deletion; otherwise, it is judged to be UPD. In particular, deletion of a large segment is generally lethal; if the fragment spans more than half of a chromosome, or even the whole chromosome, and the sample is not of embryonic origin, fragment deletion can essentially be excluded.
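The comparison with other samples of the same batch can be sketched in Python as a simple depth ratio. This is only a stand-in for a full CNV analysis: it assumes that the depths have already been normalized for sequencing yield, and the cut-off of 0.75 (midway between the ratios 0.5 and 1.0 expected for one and two copies) is an illustrative assumption, not a value given in the text.

```python
from statistics import median


def judge_segment(sample_depth, batch_depths, single_copy_ratio=0.75):
    """Judge a uniparental fragment as "deletion" or "UPD" from coverage depth.

    sample_depth: mean normalized depth of the fragment in the proband
    batch_depths: mean normalized depths of the same fragment in other
                  samples of the same batch (assumed to carry two copies)
    """
    ratio = sample_depth / median(batch_depths)
    # A ratio near 0.5 suggests a single copy, i.e. a fragment deletion.
    return "deletion" if ratio < single_copy_ratio else "UPD"
```

A fragment covered at about half the batch depth is judged to be a deletion, while a fragment covered at about the batch depth is judged to be UPD.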
Step eight: pathogenic UPD screening.
Checking whether the UPD segment covers an imprinted gene or the corresponding chromosomal band; if not, the UPD segment is judged to be benign; if so, a risk of pathogenic UPD is indicated.
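The check against imprinted genes or bands is an interval overlap test, sketched below in Python. The region list used in the usage example is a hypothetical placeholder with made-up coordinates; a real analysis would need a curated list of imprinted genes and bands, which the text does not enumerate.

```python
def screen_upd(segment, imprinted_regions):
    """Return ("pathogenic risk", names) or ("benign", []) for one UPD segment.

    segment:           (chrom, start, end)
    imprinted_regions: iterable of (chrom, start, end, name)
    """
    chrom, start, end = segment
    hits = [
        name
        for r_chrom, r_start, r_end, name in imprinted_regions
        # Two closed intervals overlap when each starts before the other ends.
        if r_chrom == chrom and start <= r_end and r_start <= end
    ]
    return ("pathogenic risk", hits) if hits else ("benign", [])


# Hypothetical placeholder regions, not real imprinted loci.
EXAMPLE_REGIONS = [
    ("chr1", 5_000_000, 6_000_000, "REGION_A"),
    ("chr2", 10_000_000, 12_000_000, "REGION_B"),
]
```

For instance, a UPD segment ("chr1", 5_500_000, 9_000_000) overlaps REGION_A in the placeholder list and is flagged, whereas a segment on a chromosome without listed regions is judged benign.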
Example 2
A screening device for NGS-trio-based uniparental diploids, as shown in fig. 4, comprising: a data acquisition module, a data analysis module and a UPD judgment module.
The data acquisition module is used for acquiring NGS sequencing data of the same group of trio samples.
The data analysis module is used for analyzing the sequencing data and dividing the mutation sites into: sites conforming to biparental inheritance, sites conforming only to uniparental inheritance, and sites not conforming to the laws of inheritance; the data analysis module performs the analysis according to steps two through four of Example 1.
The UPD judgment module is used for carrying out UPD judgment on the mutation sites according to a preset rule to obtain a judgment result; the UPD judgment module performs the judgment according to steps five through eight of Example 1.
Example 3
NGS-trio-based uniparental diploid screening was carried out on one group of clinical samples (NP19E1936-NP19E1937-NP19F0086) using the screening device of Example 2.
As shown in FIG. 3, the sample contains almost exclusively Norm (normal) sites; sites of other types appear only sporadically and are attributable to sequencing errors or to new mutations arising during inheritance. The result indicates a normal sample.
Example 4
NGS-trio-based uniparental diploid screening was carried out on 3 groups of clinical samples, taken as examples, using the screening device of Example 2.
1. Trio sample group: NP21S0557-NP21S0558-NP21S0549.
As shown in figs. 4-5, the sample contains sites conforming to biparental inheritance, sites conforming only to uniparental inheritance and sites not conforming to the laws of inheritance, all evenly distributed; in addition, the -1 and -2 sites total 11443, exceeding 800. The result is judged to be unqualified: the parents are not the biological parents or there is a sample error, and subsequent judgment cannot be carried out.
2. Trio sample group: NP19E0911-NP19E0910-NP19E0912.
As shown in figs. 6-7, the sample contains sites conforming to biparental inheritance, sites conforming only to uniparental maternal inheritance and sites not conforming to the laws of inheritance, all evenly distributed, while uniparental paternal-type sites are lacking (almost no 2F or 3F sites); in addition, the -1 and -2 sites total 5878, exceeding 800. The result is judged to be unqualified: the father is not the biological father or there is a sample error, and subsequent judgment cannot be carried out.
3. Trio sample group: NP20E957-NP20E956-NP20E958.
As shown in figs. 8-9, the sample contains sites conforming to biparental inheritance, sites conforming only to uniparental paternal inheritance and sites not conforming to the laws of inheritance, all evenly distributed, while uniparental maternal-type sites are lacking (almost no 2M or 3M sites); in addition, the -1 and -2 sites total 6044, exceeding 800. The result is judged to be unqualified: the mother is not the biological mother or there is a sample error, and subsequent judgment cannot be carried out.
Analysis shows that these samples do not meet the requirements of a trio sample: a valid paternal and/or maternal sample is missing, and subsequent analysis cannot be continued.
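The qualification check applied in this example can be sketched as a simple count of sites that violate the laws of inheritance. The threshold of 800 is the value used in the three sample groups above; the function name and the label encoding ("-1", "-2") are assumptions made for illustration.

```python
from typing import Iterable


def trio_passes_qc(site_labels: Iterable[str], max_inconsistent: int = 800) -> bool:
    """Return False when the number of -1 and -2 sites exceeds the threshold.

    A failing trio suggests a non-biological parent or a sample mix-up,
    so UPD judgment should not proceed.
    """
    n_inconsistent = sum(1 for label in site_labels if label in {"-1", "-2"})
    return n_inconsistent <= max_inconsistent


# 801 inconsistent sites fail; exactly 800 still pass.
failing_trio = trio_passes_qc(["-1"] * 500 + ["-2"] * 301 + ["1"] * 1000)
passing_trio = trio_passes_qc(["-1"] * 500 + ["-2"] * 300 + ["1"] * 1000)
```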
Example 5
NGS-trio-based uniparental diploid screening was carried out on 3 groups of clinical samples, taken as examples, using the screening device of Example 2.
1. Trio sample group: NP21F6166--NP21F6167--NP21F6168.
As shown in FIGS. 10-11, chr15 in the sample contains only sites conforming to uniparental maternal inheritance, while the remaining autosomes contain almost exclusively sites conforming to biparental inheritance, evenly distributed; sites conforming only to uniparental paternal inheritance and sites not conforming to the laws of inheritance are lacking (almost no 2F, 3F, -1 or -2 sites). Because chr15 has 180 consecutive 2M or 3M sites covering about 72 Mbp, and the CNV result shows no abnormality, the result is judged to be maternal UPD of chr15. The UPD segment covers multiple imprinted regions, indicating high-risk pathogenic UPD.
2. Trio sample group: NP19F0315--NP19F0313--NP19F0314.
As shown in FIGS. 12-13, chr6 in the sample contains only sites conforming to uniparental paternal inheritance, while the remaining autosomes contain almost exclusively sites conforming to biparental inheritance, evenly distributed; sites conforming only to uniparental maternal inheritance and sites not conforming to the laws of inheritance are lacking (almost no 2M, 3M, -1 or -2 sites). Because chr6 has 813 consecutive 2F or 3F sites covering about 169 Mbp, and the CNV result shows no abnormality, the result is judged to be paternal UPD of chr6. The UPD segment covers multiple imprinted regions, indicating high-risk pathogenic UPD.
3. Trio sample group: NP21F3536--NP21F3567--NP21F3537.
As shown in FIGS. 14-15, chr20 in the sample contains only sites conforming to uniparental maternal inheritance, while the remaining autosomes contain almost exclusively sites conforming to biparental inheritance, evenly distributed; sites conforming only to uniparental paternal inheritance and sites not conforming to the laws of inheritance are lacking (almost no 2F, 3F, -1 or -2 sites). Because chr20 has 197 consecutive 2M or 3M sites covering about 63 Mbp, and the CNV result shows no abnormality, the result is judged to be maternal UPD of chr20. The UPD segment covers multiple imprinted regions, indicating high-risk pathogenic UPD.
Analysis indicates that these samples are at risk of pathogenic UPD.
Example 6
NGS-trio-based uniparental diploid screening was carried out on 2 groups of clinical samples, taken as examples, using the screening device of Example 2.
1. Trio sample group: NP19E1380--NP19E1381--NP19E1382.
As shown in figs. 16-17, a small local segment of chr15 in the sample contains only sites conforming to uniparental paternal inheritance, while the rest of chr15 and the remaining autosomes contain almost exclusively sites conforming to biparental inheritance, evenly distributed; sites conforming only to uniparental maternal inheritance and sites not conforming to the laws of inheritance are lacking (almost no 2M, 3M, -1 or -2 sites). Because chr15 has 16 consecutive 2F or 3F sites covering about 4 Mbp, and the CNV result indicates a heterozygous deletion of about 4 Mbp in the same region of chr15, the result is judged to be a local maternal deletion of chr15, that is, only one copy, the paternal fragment, is present locally (the clinical effect is similar to that of paternal UPD). The segment covers multiple imprinted regions, indicating a high-risk pathogenic maternal heterozygous deletion.
2. Trio sample group: NP19E0056--NP9E0057--NP9E0055.
As shown in figs. 18-19, a small local segment of chr8 in the sample contains only sites conforming to uniparental maternal inheritance (one isolated paternal-type site within it may be due to a sequencing error or another cause and does not affect the overall analysis), while the rest of chr8 and the remaining autosomes contain almost exclusively sites conforming to biparental inheritance, evenly distributed; sites conforming only to uniparental paternal inheritance and sites not conforming to the laws of inheritance are lacking (almost no 2F, 3F, -1 or -2 sites). Because chr8 has 69 consecutive 2M or 3M sites covering about 11 Mbp, and the CNV result indicates a heterozygous deletion of about 11 Mbp in the same region of chr8, the result is judged to be a local paternal deletion of chr8, that is, only one copy, the maternal fragment, is present locally (the clinical effect is similar to that of maternal UPD). The segment covers multiple imprinted regions, indicating a high-risk pathogenic paternal heterozygous deletion.
Analysis indicates that these samples carry high-risk pathogenic heterozygous deletions, whose clinical impact is similar to that of UPD of the origin opposite to the deleted copy (e.g., a paternal heterozygous deletion has a clinical impact similar to that of maternal UPD).
Example 7
UPD screening was performed on 792 cases of whole-exome trio sequencing at this testing center using the screening device of Example 2; the results are shown in the following table.
Table 1. UPD screening results for 792 cases of whole-exome trio sequencing
Note: "detection of a uniparental origin" means that UPD (14 groups) or a heterozygous deletion (32 groups) was detected;
"PWS-AS" refers to the pathogenic conditions caused by chr15-UPD, in which maternal UPD causes PWS (Prader-Willi syndrome) and paternal UPD causes AS (Angelman syndrome).
chr15-UPD is a common pathogenic condition for which methylation detection methods are currently available on the market. The 7 cases of chr15-UPD screened in this example were verified by methylation detection, and the results were concordant in all cases, showing that the detection results of the method are highly accurate.
The technical features of the embodiments described above may be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered to fall within the scope of this specification.
The embodiments described above express only several implementations of the present invention, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.