CN105069322B

CN105069322B - Disease susceptibility risk prediction device

Info

Publication number: CN105069322B
Application number: CN201510442836.1A
Authority: CN
Inventors: 曹鑫恺; 王立山; 臧卫东; 宋伟
Original assignee: Shanghai Eryun Information Technology Co Ltd
Current assignee: Shanghai Eryun Information Technology Co Ltd
Priority date: 2015-07-24
Filing date: 2015-07-24
Publication date: 2018-01-12
Anticipated expiration: 2035-07-24
Also published as: CN105069322A

Abstract

The present invention relates to bioinformatics and provides a disease susceptibility risk prediction method and device. The disease susceptibility risk prediction method of the invention includes: providing a database containing disease incidence data, SNP locus genotype frequency data, and OR value data for the risk allele homozygous genotype and the heterozygous genotype of each disease-related SNP locus; receiving the information of the individual to be tested; calculating a comprehensive disease susceptibility risk array for each disease of interest to the individual to be tested; and generating a dynamic curve of the individual's comprehensive disease susceptibility risk over a specified age range. The invention considers both individual genetic factors and environmental factors when calculating individual disease susceptibility risk, so the result better reflects objective reality. The resulting curve of disease susceptibility risk against age lets the individual learn a more accurate current disease susceptibility risk and also follow how susceptibility to various diseases changes with increasing age.

Description

Disease susceptibility risk prediction device

Technical Field

The present invention relates to bioinformatics, and more particularly, to a method and apparatus for predicting risk of susceptibility to disease.

Background

Chromosomal loci that are associated with the occurrence of disease, and that correspond to a genetic constitution unfavorable to health, are called disease susceptibility loci. Disease susceptibility means a genetically determined predisposition to a certain disease or class of diseases; people with disease susceptibility carry specific genetic characteristics, namely the susceptibility genotypes for that disease. Over the last ten years of medical statistics research, a large number of chromosomal loci have been found to be closely related to susceptibility to tumors, cardiovascular and cerebrovascular diseases, chronic diseases and the like, and these findings have been repeatedly verified by mutually independent studies. For a particular disease, the susceptibility of each individual can be understood by counting the disease-associated loci carried on that individual's chromosomes and calculating the risk exposure levels of those loci.

However, modern medical research has shown that most diseases are caused by the combined action of environmental factors and personal genetic constitution. The currently known methods for calculating individual disease susceptibility risk are based only on single nucleotide polymorphisms at individual chromosomal loci. They over-emphasize the importance of genetic information for disease occurrence and ignore the objective fact that disease is caused by the combined action of environmental factors and individual genetic constitution, so their predictions have low reference value.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, the present invention provides a method and device for predicting risk of susceptibility to diseases.

The invention considers the influence of both personal genetic factors and environmental factors on the occurrence of disease. First, unlike genetic characteristics, which are stable and invariant, environmental factors accumulate in the human body as age increases and continuously influence the health of the individual; they have a cumulative effect, and the age of the individual is a good measure of the magnitude of that cumulative effect. Second, people of different nationalities, regions and ethnic backgrounds are inevitably affected by the dietary customs, social customs and other daily living habits inherent to their society, so the environmental factors acting on different groups differ. Therefore, when calculating disease susceptibility risk, the invention also distinguishes the environmental factors experienced by the populations of regions with different living habits and social cultures. Third, humans are a species with two sexes, and because of the natural differences in constitution between males and females, an individual's preferences regarding their environment are strongly related to sex. Therefore, even among individuals of the same regional population, the influence of environmental factors differs slightly between the sexes, which is one of the factors to be considered when calculating disease susceptibility accurately.

In addition, unlike the known methods for calculating individual disease susceptibility, and in view of the dynamic nature of the cumulative effect of environmental factors on individual disease susceptibility, the invention provides a dynamic susceptibility curve that takes age as the independent variable and the disease susceptibility risk value as the dependent variable. Provided that the individual's region and living habits do not change, the curve shows how the individual's disease susceptibility risk value changes with increasing age, thereby helping the individual manage their health more effectively and prevent disease.

In summary, starting from individual genetic characteristics, the invention further takes the age, region and sex of the individual as three factors that shape the individual's environmental exposure, and integrates these four items of individual information to draw a curve of disease susceptibility risk against age for each individual. The individual thus obtains a more accurate current disease susceptibility risk and can also follow how disease susceptibility changes with increasing age.

The invention first provides a disease susceptibility risk prediction method, which comprises the following steps:

Step S101, providing a database S containing disease incidence data, SNP locus genotype frequency data, and OR value data for the risk allele homozygous genotype and the heterozygous genotype of each disease-related SNP locus; wherein the incidence data of the same disease are distinguished by different combinations of region, sex and age segment, the OR value data of each genotype of the SNP loci related to the same disease are distinguished by different combinations of region, sex composition and age segment, and the genotype frequency data of the SNP loci are distinguished by region.

Step S102, receiving the region information and sex information of the individual to be tested and the measured genotype information of the SNP loci of interest.

Step S103, for each disease of interest to the individual to be tested, extracting the following data from the database S according to the region information, the sex information and the measured SNP genotype information of the individual: the incidence data of each disease of interest for the corresponding region and sex in each age segment; the genotype frequency data of the SNP loci corresponding to each disease of interest for the corresponding region; and the OR value data of the SNP loci corresponding to each disease of interest for the corresponding region and sex in each age segment; and calculating from these data a comprehensive disease susceptibility risk array for each disease of interest, wherein the array comprises, for each age segment, the comprehensive disease susceptibility risk value of an individual having the same genotype composition, region and sex composition as the individual to be tested.

Step S104, for each disease of interest, fitting a comprehensive disease susceptibility risk function to the discrete risk array by LOESS regression, and generating from that function a dynamic curve of the individual's comprehensive disease susceptibility risk over a specified age range.
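As an illustration of step S104, the following is a minimal sketch of LOESS (locally weighted linear regression with a tricube kernel) written in plain Python. The function names, the `frac` window parameter, and the idea of representing each age segment by a single age value are assumptions made for this example and are not specified in the source; a library routine such as the LOWESS smoother in statsmodels could be used instead.

```python
def tricube(u):
    """Tricube kernel: 1 at u = 0, falling to 0 for |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_at(x0, xs, ys, frac=0.6):
    """Evaluate a locally weighted linear fit of (xs, ys) at x0.

    frac is the share of points in the local window (an assumed default).
    """
    k = min(len(xs), max(3, round(frac * len(xs))))
    # Bandwidth: distance from x0 to its k-th nearest data point.
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(xs, ys):
        w = tricube((x - x0) / h)
        dx = x - x0  # centre on x0 so that the intercept is the fitted value
        sw += w
        swx += w * dx
        swy += w * y
        swxx += w * dx * dx
        swxy += w * dx * y
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:  # degenerate window: fall back to weighted mean
        return swy / sw
    slope = (sw * swxy - swx * swy) / denom
    return (swy - slope * swx) / sw


def risk_curve(ages, risks, age_from, age_to, frac=0.6):
    """Return (age, fitted risk) pairs for every integer age in the range."""
    return [(a, loess_at(a, ages, risks, frac))
            for a in range(age_from, age_to + 1)]
```

Calling `risk_curve(segment_ages, risk_array, 30, 70)` with one representative age per age segment and the corresponding risk array would yield the points of the dynamic curve; how each segment is mapped to a representative age is left open here.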

The invention also provides a disease susceptibility risk prediction device, comprising:

a database unit for providing a database S containing disease incidence data, SNP locus genotype frequency data, and OR value data for the risk allele homozygous genotype and the heterozygous genotype of each disease-related SNP locus; wherein the incidence data of the same disease are distinguished by different combinations of region, sex and age segment, the OR value data of each genotype of the SNP loci related to the same disease are distinguished by different combinations of region, sex composition and age segment, and the genotype frequency data of the SNP loci are distinguished by region;

an individual information unit for providing the region information and sex information of the individual to be tested and the measured genotype information of the SNP loci of interest;

a comprehensive disease susceptibility risk array calculation unit, connected to the database unit and the individual information unit, for extracting, for each disease of interest to the individual to be tested, the following data from the database S according to the region information, the sex information and the measured SNP genotype information of the individual: the incidence data of each disease of interest for the corresponding region and sex in each age segment; the genotype frequency data of the SNP loci corresponding to each disease of interest for the corresponding region; and the OR value data of the SNP loci corresponding to each disease of interest for the corresponding region and sex in each age segment; and for calculating from these data a comprehensive disease susceptibility risk array for each disease of interest, wherein the array comprises, for each age segment, the comprehensive disease susceptibility risk value of an individual having the same genotype composition, region and sex composition as the individual to be tested; and

a comprehensive disease susceptibility risk dynamic curve unit, connected to the comprehensive disease susceptibility risk array calculation unit, for fitting, for each disease of interest, a comprehensive disease susceptibility risk function to the discrete risk array by LOESS regression, and for generating from that function a dynamic curve of the individual's comprehensive disease susceptibility risk over a specified age range.

The invention has the advantages that:

1) The invention considers both individual genetic factors and environmental factors when calculating individual disease susceptibility risk, so the result better reflects objective reality. The environmental factors include the region of the individual, the sex of the individual, and the cumulative effect of environmental factors as reflected by the age of the individual.

2) The invention ultimately draws a curve of individual disease susceptibility risk against age, so that the individual not only learns a more accurate current disease susceptibility risk but can also follow how susceptibility to various diseases changes with increasing age, which provides a long-term, effective health reminder.

Drawings

FIG. 1 is a flow chart of a method of an embodiment of the present invention.

FIG. 2 is an example of a dynamic curve of an individual's comprehensive disease susceptibility risk obtained by a processing method according to an embodiment of the invention. In the figure, the horizontal axis represents the age of the individual, and the vertical axis represents the comprehensive disease susceptibility risk at each age. The upper curve represents the individual's susceptibility risk for the disease at different ages, and the lower curve represents the average susceptibility risk for the disease in the population. In practice, the curve for the individual and the curve for the population average may be drawn in different colors.

FIG. 3A is a schematic diagram of an apparatus according to an embodiment of the invention.

FIG. 3B is a schematic diagram of an apparatus according to a preferred embodiment of the present invention.

FIG. 4A is a schematic view of an apparatus according to another preferred embodiment of the present invention.

FIG. 4B is a schematic diagram of a calibration module of the apparatus according to another preferred embodiment of the present invention.

FIG. 5 is a schematic view of an apparatus according to still another preferred embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.

Furthermore, it is to be understood that one or more method steps mentioned in the present invention does not exclude that other method steps may also be present before or after the combined steps or that other method steps may also be inserted between these explicitly mentioned steps, unless otherwise indicated; it is also to be understood that a combined connection between one or more devices/apparatus as referred to in the present application does not exclude that further devices/apparatus may be present before or after the combined device/apparatus or that further devices/apparatus may be interposed between two devices/apparatus explicitly referred to, unless otherwise indicated. Moreover, unless otherwise indicated, the numbering of the various method steps is merely a convenient tool for identifying the various method steps, and is not intended to limit the order in which the method steps are arranged or the scope of the invention in which the invention may be practiced, and changes or modifications in the relative relationship may be made without substantially changing the technical content.

Starting from individual genetic characteristics, the invention further takes the age, region and sex of the individual as three factors that shape the individual's environmental exposure, and integrates these four items of individual information to draw a curve of disease susceptibility risk against age for each individual, so that the individual obtains a more accurate current disease susceptibility risk and can also follow how susceptibility to various diseases changes with increasing age.

The invention provides a disease susceptibility risk prediction method. In one embodiment, as shown in FIG. 1, the method comprises the following steps:

Step S101, providing a database S containing disease incidence data, SNP locus genotype frequency data, and OR value data for the risk allele homozygous genotype and the heterozygous genotype of each disease-related SNP locus; wherein the incidence data of the same disease are distinguished by different combinations of region, sex composition and age segment, the OR value data of each genotype of the SNP loci related to the same disease are distinguished by different combinations of region, sex composition and age segment, and the genotype frequency data of the SNP loci are distinguished by region.

Step S102, receiving the region information, sex information and measured SNP locus genotype information of the individual to be tested.

Step S103, for each disease of interest to the individual to be tested, extracting the following data from the database S according to the region information, the sex information and the measured SNP genotype information of the individual: the incidence data of the disease of interest in each age segment for the corresponding region and sex composition; the genotype frequency data of the SNP loci corresponding to the disease of interest for the corresponding region; and the OR value data of the SNP loci corresponding to the disease of interest in each age segment for the corresponding region and sex composition; and calculating from these data a comprehensive disease susceptibility risk array for each disease of interest, wherein the array comprises, for each age segment, the comprehensive disease susceptibility risk value of an individual of the corresponding region and sex composition having the same genotype composition as the individual to be tested.
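The stratification of database S and the extraction in step S103 can be pictured with a small in-memory sketch. The `Stratum` key, the dictionary layout, the function name `extract` and all numeric values below are hypothetical placeholders chosen for illustration; the source does not prescribe a storage format.

```python
from typing import NamedTuple


class Stratum(NamedTuple):
    disease: str
    region: str       # "European" or "EastAsian"
    sex: str          # "Male", "Female" or "Mixed"
    age_segment: str  # e.g. "40-44"


# Placeholder values for illustration only, not real epidemiological data.
INCIDENCE = {  # Pr(D) per stratum
    Stratum("lung_cancer", "EastAsian", "Male", "40-44"): 0.0002,
    Stratum("lung_cancer", "EastAsian", "Male", "45-49"): 0.0004,
    Stratum("lung_cancer", "EastAsian", "Female", "40-44"): 0.0001,
}
ODDS_RATIOS = {  # (stratum, rsID) -> (OR1, OR2); OR0 is always 1
    (Stratum("lung_cancer", "EastAsian", "Male", "40-44"), "rs1"): (1.2, 1.5),
    (Stratum("lung_cancer", "EastAsian", "Male", "45-49"), "rs1"): (1.3, 1.7),
}
GENOTYPE_FREQ = {  # (region, rsID) -> (Pr(G0), Pr(G1), Pr(G2))
    ("EastAsian", "rs1"): (0.25, 0.50, 0.25),
}


def extract(disease, region, sex, rsids):
    """Collect, per age segment, the data that step S103 pulls from S."""
    result = {}
    for stratum, pr_d in INCIDENCE.items():
        if (stratum.disease, stratum.region, stratum.sex) != (disease, region, sex):
            continue
        result[stratum.age_segment] = {
            "incidence": pr_d,
            "odds_ratios": {r: ODDS_RATIOS[(stratum, r)] for r in rsids
                            if (stratum, r) in ODDS_RATIOS},
            "genotype_freq": {r: GENOTYPE_FREQ[(region, r)] for r in rsids
                              if (region, r) in GENOTYPE_FREQ},
        }
    return result
```

Keying incidence and OR values by region, sex and age segment, while keying genotype frequency by region only, mirrors the distinctions drawn in step S101.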

The method calculates a two-factor (genetic and environmental) comprehensive disease susceptibility risk from the database information, the information of the individual to be tested and the individual's genotype data, and then fits the resulting risk values into a curve of comprehensive disease susceptibility risk against age.

Specifically, for step S101:

The incidence of a disease, Pr(D), refers to the incidence of a particular disease in a particular population; in epidemiology it is the rate at which new cases of a disease occur in a particular population over a period of time. Disease incidence can be used to determine the risk of onset. The data are obtained mainly from the databases of national health statistics departments.

In one embodiment, the disease incidence information is obtained by accessing an existing database and either capturing the incidence information directly or calculating it from captured related information, and is then entered into the database S. Taking the incidence of different tumor types in the Chinese population as an example, the main page of the GLOBOCAN database is accessed first, then the Cancer bypass sub-page, from which the incidence information of the various tumor types for populations of different countries and regions is acquired.

An SNP (single nucleotide polymorphism) is a polymorphism in a nucleic acid sequence caused by a change in a single nucleotide base. Each SNP locus has two allelic bases: the base that occurs more frequently is defined as the major allele, and the base that occurs less frequently is defined as the minor allele. Suppose statistics show that an SNP locus X consists of adenine (A) and cytosine (C), with A occurring at a frequency of 59% and C at 41%; then A is the major allele and C is the minor allele. Because human nuclear chromosomes are diploid, the major and minor alleles combine into three different genotypes: a homozygous genotype (AA) in which both bases are the major allele, a homozygous genotype (CC) in which both bases are the minor allele, and a heterozygous genotype (AC) consisting of one major allele and one minor allele.
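The major/minor allele rule and the three resulting genotypes can be written as a short sketch. The function name `genotypes` and the dictionary input format are assumptions made for this example.

```python
def genotypes(allele_freq):
    """Return (major allele, minor allele, three genotypes) for a biallelic SNP.

    allele_freq maps each of the two bases to its frequency in the population.
    """
    if len(allele_freq) != 2:
        raise ValueError("expected exactly two alleles")
    # The more frequent base is the major allele, the other the minor allele.
    major, minor = sorted(allele_freq, key=allele_freq.get, reverse=True)
    return major, minor, [major + major, major + minor, minor + minor]
```

For the locus X described above, `genotypes({"A": 0.59, "C": 0.41})` gives A as the major allele, C as the minor allele, and the genotypes AA, AC and CC.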

Genotype frequency, Pr(G_i), refers to the frequency with which a particular genotype at a given SNP locus occurs in a particular population. In one embodiment, the genotype frequency data are obtained from the HapMap database and stored in the database S: the FTP site provided by the HapMap database is accessed, and the genotype frequency information of different populations is downloaded from it.

OR stands for odds ratio, also called the risk exposure ratio. For disease incidence, the OR value of a genotype is an estimate of the relative risk of disease conferred by that genotype. Specifically, an OR value of 1 indicates that the factor does not contribute to the occurrence of the disease; an OR value greater than 1 indicates that the factor is a risk factor; and an OR value less than 1 indicates that the factor is a protective factor.

In the present invention, the formula representing the OR value is defined as follows:

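Equation 1:

$$OR_i = \frac{\Pr(D \mid G_i) \,/\, \bigl(1 - \Pr(D \mid G_i)\bigr)}{\Pr(D \mid G_0) \,/\, \bigl(1 - \Pr(D \mid G_0)\bigr)}$$

wherein,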

G_i represents the genotype, where i is selected from 0, 1 and 2: G₀ represents the non-risk allele homozygous genotype, G₁ represents the heterozygous genotype, and G₂ represents the risk allele homozygous genotype. A non-risk allele is a base that occurs less frequently in the diseased population than in the control population (the control population being a random population without the disease). A risk allele is a base that occurs more frequently in the diseased population than in the control population.

Pr(D|G_i) represents the disease incidence of genotype G_i. Specifically, Pr(D|G₀) represents the disease incidence of the non-risk allele homozygous genotype, Pr(D|G₁) represents the disease incidence of the heterozygous genotype, and Pr(D|G₂) represents the disease incidence of the risk allele homozygous genotype.

OR_i represents the OR value of genotype G_i. Specifically, OR₀ represents the OR value of the non-risk allele homozygous genotype, OR₁ represents the OR value of the heterozygous genotype, and OR₂ represents the OR value of the risk allele homozygous genotype. By this calculation, OR₀ is 1 for every SNP in the invention. The OR₀ value of each SNP therefore need not be recorded in the database; any calculation involving OR₀ simply substitutes 1.
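To make the relationship between Pr(D|G_i) and OR_i concrete, the sketch below assumes the standard odds-ratio form taken relative to G₀, which is consistent with the statement that OR₀ is always 1. That form, the function names and the example incidences are assumptions made for illustration and are not quoted from the source.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def odds_ratios(pr_d_given_g):
    """Map [Pr(D|G0), Pr(D|G1), Pr(D|G2)] to [OR0, OR1, OR2].

    Assumes each OR_i is the odds of disease for G_i divided by the
    odds of disease for the non-risk homozygous genotype G0.
    """
    baseline = odds(pr_d_given_g[0])
    return [odds(p) / baseline for p in pr_d_given_g]


def interpret(or_value):
    """Classify an OR value as described in the text."""
    if or_value > 1:
        return "risk factor"
    if or_value < 1:
        return "protective factor"
    return "no contribution"
```

With hypothetical genotype incidences such as `[0.01, 0.02, 0.04]`, the first entry returned by `odds_ratios` is 1 and the other two exceed 1, so both the heterozygous and the risk homozygous genotype would be classed as risk factors.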

For example, suppose that an SNP locus is reported to be closely related to lung cancer susceptibility and that, based on a large-scale sample survey, the locus mainly has two base types, adenine (A) and cytosine (C). In the lung cancer group the frequency of A is 70% and that of C is 30%; in the normal control group the frequency of A is 55% and that of C is 45%. Then, for this SNP locus, CC is the non-risk allele homozygous genotype (corresponding to G₀), AC is the heterozygous genotype (corresponding to G₁), and AA is the risk allele homozygous genotype (corresponding to G₂); Pr(D|G₀) represents the lung cancer incidence of the CC genotype at this SNP locus, Pr(D|G₁) represents the lung cancer incidence of the AC genotype, and Pr(D|G₂) represents the lung cancer incidence of the AA genotype; OR₀ represents the lung cancer OR value of the CC genotype at this SNP locus, OR₁ represents the lung cancer OR value of the AC genotype, and OR₂ represents the lung cancer OR value of the AA genotype.
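The assignment of G₀, G₁ and G₂ in this example follows mechanically from comparing allele frequencies in cases and controls. A minimal sketch, in which the function name `label_genotypes` and the input format are assumptions:

```python
def label_genotypes(case_freq, control_freq):
    """Label the genotypes of a biallelic SNP as G0, G1 and G2.

    case_freq / control_freq map each allele to its frequency in the
    diseased group and in the control group respectively.
    """
    alleles = list(case_freq)
    # The risk allele is the base enriched in cases relative to controls.
    risk = max(alleles, key=lambda a: case_freq[a] - control_freq[a])
    non_risk = next(a for a in alleles if a != risk)
    heterozygous = "".join(sorted(risk + non_risk))
    return {"G0": non_risk * 2, "G1": heterozygous, "G2": risk * 2}
```

Applied to the frequencies above, `label_genotypes({"A": 0.70, "C": 0.30}, {"A": 0.55, "C": 0.45})` labels CC as G₀, AC as G₁ and AA as G₂.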

In the database S of the invention, the OR value of the risk allele homozygous genotype and the OR value of the heterozygous genotype (i.e., the OR₁ and OR₂ values of each SNP locus) are recorded for each disease-related SNP locus. The OR₁ and OR₂ values can be taken directly from the literature. When the literature does not give the OR₁ and OR₂ values directly but records other related information, such as the incidence of each genotype, the values can be calculated from that information according to the formula.

The data in the database S may be updated continuously as more relevant literature is published.

Distinguishing by region: populations are divided into European and American populations (European) and East Asian populations (EastAsian); other populations are not counted because research data sets for them are very scarce.

Distinguishing by sex composition: samples are divided into three compositions, namely male only (Male), female only (Female) and mixed male and female (Mixed).

Distinguishing by age segment: different age segments are distinguished.

In one embodiment, the age segments are divided into 10 groups, in order: 0-14, 15-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, and 75 and over. The segment boundaries may be designed differently as required.
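Mapping an age to one of these ten segments is a simple boundary lookup. The sketch below uses the standard-library `bisect` module; the label strings (including "75+" for the open-ended last group) and the function name are choices made for this example.

```python
from bisect import bisect_right

# Lower bound of each age segment, in ascending order.
LOWER_BOUNDS = [0, 15, 40, 45, 50, 55, 60, 65, 70, 75]
LABELS = ["0-14", "15-39", "40-44", "45-49", "50-54",
          "55-59", "60-64", "65-69", "70-74", "75+"]


def age_segment(age):
    """Return the label of the age segment that contains the given age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts the lower bounds that are <= age.
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]
```

Because the boundaries live in one list, redesigning the segments as the text allows only requires editing `LOWER_BOUNDS` and `LABELS`.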

In a preferred embodiment, to further improve the reference value of the predicted risk, the OR value data are checked before being entered into the database S. The checks comprise:

i) Checking whether the bases recorded for each SNP locus are consistent with the bases at that locus on the positive strand of the chromosome, and correcting them if they are not.

The purpose of this step is to verify whether the literature describes the SNP locus using bases on the positive strand. If so, no correction is needed; if the locus is described using bases on the negative strand, it must be converted to the positive-strand representation according to the principle of base complementarity.
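Check i) can be sketched as a comparison against the alleles known for the locus on the positive strand. How the reference alleles are obtained is not described in the source; the `reference` argument and the function name are assumptions made for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_positive_strand(recorded, reference):
    """Return the recorded alleles expressed on the positive strand.

    recorded  -- alleles as written in the literature, e.g. ("T", "G")
    reference -- alleles known for the locus on the positive strand
    """
    if set(recorded) == set(reference):
        return tuple(recorded)  # already given on the positive strand
    flipped = tuple(COMPLEMENT[base] for base in recorded)
    if set(flipped) == set(reference):
        return flipped  # literature used the negative strand
    raise ValueError("alleles match neither strand of the reference")
```

A locus whose two alleles are complementary to each other (A/T or C/G) looks the same on both strands, so a set comparison like this one cannot detect a strand mismatch for it; such records would need additional information.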

ii) Checking whether the total number of valid samples equals the sum of the valid samples of the genotype groups; if not, the corresponding data record is removed and fed back to the database administrator for correction.

The purpose of this step is to catch input errors made when data are entered manually. When the total number of valid samples equals the sum of the valid samples of the genotype groups, the entry is considered error-free; when it does not, the entry is considered erroneous and must be corrected.
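Check ii) amounts to a sum comparison per record. In the sketch below the field names (`n_total`, `n_risk_hom`, `n_het`, `n_non_risk_hom`) and the two-list return value are hypothetical; the source only states that inconsistent records are removed and returned to the administrator.

```python
def counts_consistent(record):
    """True when the total equals the sum of the three genotype groups."""
    genotype_sum = (record["n_risk_hom"] + record["n_het"]
                    + record["n_non_risk_hom"])
    return record["n_total"] == genotype_sum


def split_by_consistency(records):
    """Separate records to keep from records to send back for correction."""
    accepted, flagged = [], []
    for record in records:
        (accepted if counts_consistent(record) else flagged).append(record)
    return accepted, flagged
```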

iii) Standardizing the fields of the entered information: a uniform format is required, and entries in a non-uniform format are corrected to the specified format. In one embodiment, the letter case of English text in the entered information is made uniform, and where a phrase appears, a connecting symbol is inserted by default between adjacent words so that a phrase consisting of several words is represented as a single field; the connecting symbol may be any of various symbols, such as an underscore.

The purpose of this step is to facilitate automatic processing, information identification and information matching by subsequent programs.
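Check iii) can be illustrated with a one-line normalizer. The choice of lower case as the uniform case and the function name are assumptions; the source only requires that case be uniform and that adjacent words be joined by a connecting symbol such as an underscore.

```python
def normalize_field(value, joiner="_"):
    """Lower-case a field and join its words with a connecting symbol."""
    # split() with no argument also collapses repeated whitespace.
    return joiner.join(value.strip().lower().split())
```

For instance, `normalize_field("Lung Cancer")` and `normalize_field("  lung   cancer ")` both produce the same single field, which is what makes later matching between records reliable.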

iv) Verifying the risk allele and the non-risk allele.

The purpose of this step is to catch input errors made when data are entered manually.

In one embodiment, it is determined whether the OR value of the homozygous genotype corresponding to the entered risk allele is greater than 1. If it is greater than 1, the entry is correct; if it is less than 1, the risk allele must be corrected, and the corresponding non-risk allele corrected with it. For example, if the entered risk allele is G and the non-risk allele is A, the OR value of the GG genotype is examined: if it is 3.37, which is greater than 1, the entry is correct and needs no correction; if it is 0.79, which is less than 1, the entry is wrong, and the risk allele must be corrected to A and the non-risk allele to G.
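The G/A example can be expressed as a small correction function. The field names are hypothetical, and because the source does not say whether the recorded OR values are recomputed after the swap, this sketch leaves them untouched.

```python
def correct_risk_allele(record):
    """Swap risk and non-risk alleles when the risk homozygote OR is below 1.

    Returns a corrected copy; the input record is not modified.
    OR values are deliberately left as entered (see the note above).
    """
    fixed = dict(record)
    if fixed["or_risk_homozygous"] < 1:
        fixed["risk_allele"], fixed["non_risk_allele"] = (
            fixed["non_risk_allele"], fixed["risk_allele"])
    return fixed
```

With risk allele G, non-risk allele A and a GG OR value of 3.37 the record comes back unchanged; with an OR value of 0.79 the returned record has risk allele A and non-risk allele G.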

v) Removing redundant duplicate SNP locus records. When two information records share the same SNP locus number, sample sex, sample population region and studied disease, only the record with the most significant statistical association (i.e., the smallest P value) is retained.
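Check v) is a keep-the-minimum grouping over four fields. In this sketch the field names (`rsid`, `sex`, `region`, `disease`, `p_value`) are hypothetical.

```python
def deduplicate(records):
    """Keep, for each (rsID, sex, region, disease), the smallest-P record."""
    best = {}
    for record in records:
        key = (record["rsid"], record["sex"],
               record["region"], record["disease"])
        if key not in best or record["p_value"] < best[key]["p_value"]:
            best[key] = record
    return list(best.values())
```

Records that differ in any of the four key fields, for example the same locus studied in males and in females, are treated as distinct and are all kept.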

vi) Rating the credibility of information records and screening them. The credibility value gives the evidence grade of the research literature relating the SNP locus to disease susceptibility. The grade is built from four criteria: the sample size of the study, the completeness of the information, the significance of the statistical test, and the influence of the publication. Each criterion that the document meets raises the credibility by one level. If the credibility is level 1 or lower, that is, at most one of the four conditions is satisfied, the record is rejected.

In one embodiment, the credibility of an information record is rated as follows:

The credibility of a single record is evaluated on four aspects: the completeness of the information mined from the literature, the sample size of the study, the statistical significance of the association between the SNP locus and the disease, and the influence of the source publication.

a) Completeness of the information mined from the literature

In one embodiment, the information mined from the document includes at least:

the rsID number of the SNP locus, the studied disease (i.e., the disease name), disease description, population (by country), region (East Asian population or European and American population), age segment, sex composition (male only, female only, or mixed male and female), total number of valid samples, total number of valid cases, non-risk allele, P value of the association between the SNP locus and the disease, gene related to the SNP locus, OR value of the risk allele homozygous genotype, OR value of the heterozygous genotype, total number of valid samples of the risk allele homozygous genotype, total number of valid samples of the heterozygous genotype, total number of valid samples of the non-risk allele homozygous genotype, total number of valid samples of the non-risk allele, number of valid samples of the non-risk homozygous genotype, document number, and source publication.

Table 1 gives the information records of three SNP sites (rs1, rs2, rs3), one record per site.

TABLE 1

If a single record has a specific value or a specific description in every column of the file, with no "None" value or other vacancy, the record is considered to have complete information.

b) Sample size for literature studies

Based on the total number of valid samples in the mined information in the literature, when the value is greater than or equal to 2000, the sample size of the record is considered to meet the requirement.

c) Statistical testing for significance

Based on the information of the SNP locus and the disease-related P value in the information file mined in the literature, when the value is less than 0.00001, the statistical test significance of the record is considered to meet the requirement.

d) Influence of literature sources publications

Based on the source publication information in the information mined from the literature, when the impact factor of the recorded source publication is greater than or equal to 5, the influence of the source publication of the record is considered to meet the requirement.

For a single SNP site information record, the credibility is raised by one level for each of the above conditions that is met. The credibility level of a single record therefore ranges from 0 to 4.

In one embodiment, the following method is used for SNP site information record screening:

Based on the credibility rating result, a record is rejected if its credibility level is not higher than level 1, that is, if only one or none of the 4 conditions is met.

Finally, the screened OR value data, together with the age distribution of the corresponding samples, the regional distribution of the study population and the sex composition information, are recorded in database S.
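The rating and screening rules above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the `Record` class, its field names and the function names are hypothetical, and the record is reduced to the four quantities that the conditions a) to d) actually test.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SAMPLES = 2000        # condition b): total valid samples >= 2000
MAX_P_VALUE = 1e-5        # condition c): P value < 0.00001
MIN_IMPACT_FACTOR = 5.0   # condition d): impact factor >= 5


@dataclass
class Record:
    # Hypothetical, reduced record; a real record carries many more fields.
    fields_complete: bool            # condition a): no "None" or empty column
    total_samples: Optional[int]
    p_value: Optional[float]
    impact_factor: Optional[float]


def credibility_level(rec: Record) -> int:
    """Return the credibility level (0-4): one level per satisfied condition."""
    checks = [
        rec.fields_complete,
        rec.total_samples is not None and rec.total_samples >= MIN_SAMPLES,
        rec.p_value is not None and rec.p_value < MAX_P_VALUE,
        rec.impact_factor is not None and rec.impact_factor >= MIN_IMPACT_FACTOR,
    ]
    return sum(checks)


def screen(records):
    """Keep records meeting at least two conditions (level 0 or 1 is rejected)."""
    return [r for r in records if credibility_level(r) > 1]
```

For instance, a complete record with 5000 samples, P = 1e-8 and an impact factor of 9.2 reaches level 4 and is kept, whereas a record that satisfies only the P value condition stays at level 1 and is rejected.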

With regard to step S103:

The data of the corresponding region is the data of the region to which the individual's region information belongs. For example, if the individual's region information is Shanghai and the database distinguishes between European and American populations and East Asian populations, the data of the East Asian population is extracted.

The data of the corresponding sex composition is the data whose sex composition matches the sex of the individual or is of the mixed male-female type; data matching the individual's sex is preferred. For example, if the individual is female, the female-only data is extracted; if such data is missing from the database, the mixed male-female data is extracted instead.

The data of the corresponding sex composition of the corresponding region is the data that satisfies both conditions at once. For example, if the individual is a female from Shanghai, the database distinguishes regions as European and American populations and East Asian populations, and sex compositions as male only, female only and mixed, then the female-only data of the East Asian population is extracted.
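The extraction rule just described (match the region group, prefer data of the individual's own sex, fall back to mixed data) might be expressed as in the sketch below. The lookup table `REGION_GROUP`, the row keys `region` and `gender`, and the label strings are assumptions made for illustration; the description only states the Shanghai to East Asian mapping and the three sex compositions.

```python
# Hypothetical lookup from the individual's region to a population group.
# Only the Shanghai -> East Asian mapping is taken from the description.
REGION_GROUP = {"Shanghai": "EastAsian"}


def select_rows(rows, individual_region, individual_sex):
    """Pick database rows for the individual's region group and sex.

    Rows of the individual's own sex are preferred; mixed-sex rows are used
    only when no single-sex rows exist for that region group.
    """
    group = REGION_GROUP[individual_region]
    in_region = [r for r in rows if r["region"] == group]
    same_sex = [r for r in in_region if r["gender"] == individual_sex]
    if same_sex:
        return same_sex
    return [r for r in in_region if r["gender"] == "Mixed"]
```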

The following steps are adopted to calculate the comprehensive susceptibility risk of the disease:

Step one: using the extracted incidence data, SNP site genotype frequency data and OR value data together with formulas 2 and 3, calculate, for each SNP site of each disease of interest, the disease incidence Pr(D | G_i, Region, Gender, Age) of each genotype of the single SNP site under the corresponding region, the corresponding sex composition and each age distribution segment, i.e. the disease susceptibility risk value of the independent SNP site.

Equation 2

Equation 3

(i ∈ {1, 2}; Region = X, Gender = Y, Age = Z)

In the above formulas, Region denotes the region condition, Gender the sex composition condition, and Age the age distribution segment condition. Region = X, Gender = Y, Age = Z means that the Region, Gender and Age conditions are X, Y and Z respectively. The OR values extracted from the database should correspond to the same Region, Gender and Age conditions.

Pr(D | Region = X, Gender = Y, Age = Z) denotes the disease incidence when the Region, Gender and Age conditions are X, Y and Z respectively. This value is extracted from the database and is known at the time of calculation.

Pr(G_i | Region) is the genotype frequency of genotype G_i at a single SNP site under a specific Region condition. This value is extracted from the database and is known at the time of calculation.

OR_i has the same meaning as in formula 1. OR_1 and OR_2 are extracted from the database and are known at the time of calculation.

Pr(D | G_i, Region, Gender, Age) denotes the disease incidence of genotype G_i at a single SNP site under a specific region, a specific sex composition and a specific age distribution segment. G_i has the same meaning as in formula 1. When i is 0, it is Pr(D | G_0, Region, Gender, Age).

In formula 2, i on the right-hand side takes the values 0, 1 and 2 in the summation. In formula 3, i takes the values 1 and 2. Therefore, for a specific region, a specific sex composition and a specific age distribution segment, formulas 2 and 3 yield a system of three non-linear equations in three unknowns, which can be solved, under the conditions Region = X, Gender = Y, Age = Z, for Pr(D | G_0, Region, Gender, Age), Pr(D | G_1, Region, Gender, Age) and Pr(D | G_2, Region, Gender, Age).
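As an illustration of how such a three-equation system could be solved numerically, the sketch below assumes one reading of formulas 2 and 3 that is consistent with the definitions above: formula 2 as the total-probability relation Pr(D | Region, Gender, Age) = Σ_i Pr(D | G_i, Region, Gender, Age) × Pr(G_i | Region) for i = 0, 1, 2, and formula 3 as the odds ratio of genotype G_i relative to G_0. Both forms are assumptions, as are the function name and the use of bisection; the description does not prescribe a solution method.

```python
def genotype_incidences(incidence, freqs, or1, or2, tol=1e-12):
    """Solve for (p0, p1, p2) = Pr(D | G_i, Region, Gender, Age).

    incidence: Pr(D | Region, Gender, Age), with 0 < incidence < 1
    freqs:     (Pr(G_0), Pr(G_1), Pr(G_2)) for the region, summing to 1
    or1, or2:  odds ratios of G_1 and G_2 relative to G_0
    """
    def from_p0(p0):
        odds0 = p0 / (1.0 - p0)
        # Assumed formula 3: odds_i = OR_i * odds_0; convert odds to probability.
        p1 = or1 * odds0 / (1.0 + or1 * odds0)
        p2 = or2 * odds0 / (1.0 + or2 * odds0)
        return p0, p1, p2

    def residual(p0):
        # Assumed formula 2: the frequency-weighted genotype incidences
        # must add up to the population incidence.
        return sum(f * p for f, p in zip(freqs, from_p0(p0))) - incidence

    # The residual increases monotonically with p0, so bisection converges.
    lo, hi = 0.0, 1.0 - 1e-15
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if residual(mid) > 0:
            hi = mid
        else:
            lo = mid
    return from_p0((lo + hi) / 2.0)
```

Because p1 and p2 are increasing functions of p0, the system reduces to a single unknown, which is why a one-dimensional root search suffices here.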

Step two: using the extracted data, the results of step one and the following formulas, calculate the comprehensive disease susceptibility risk array of each disease of interest for the individual to be tested. This step takes into account the fact that the susceptibility to a single disease is related to multiple SNP sites.

The step is divided into the following two substeps:

Substep one: for the genotype of the individual to be tested at each single SNP site, use formulas 4-7 to calculate OR* and then OR_composite*.

Based on the genotype information, the region information and the sex information provided by the individual, aiming at the specific disease type, all independent SNP locus disease susceptibility risk calculation results matched with the three conditions of the individual are extracted. The final extracted result will contain all the individual SNP site disease susceptibility risk values closely related to the disease in different age groups.

Equation 4

Odds(D | G, Region, Gender, Age) = Pr(D | G, Region, Gender, Age) / (1 - Pr(D | G, Region, Gender, Age))

Equation 5

Odds(D | Region, Gender, Age) = Pr(D | Region, Gender, Age) / (1 - Pr(D | Region, Gender, Age))

Equation 6

OR* = Odds(D | G, Region, Gender, Age) / Odds(D | Region, Gender, Age)

(Region = X, Gender = Y, Age = Z)

Equation 7

In the above formulas,

Pr(D | G, Region, Gender, Age) denotes the disease incidence of the G genotype at a single SNP site under a specific region, a specific sex composition and a specific age distribution segment. G is the genotype of the individual to be tested at the corresponding SNP site and is one of the G_i genotypes.

Pr(D | Region, Gender, Age) denotes the disease incidence under a specific region, a specific sex composition and a specific age distribution segment. This value is extracted from the database and is known at the time of calculation.

Odds(D | G, Region, Gender, Age) denotes the ratio of the probability of disease to the probability of no disease for the G genotype under a specific region, a specific sex composition and a specific age distribution segment.

Odds(D | Region, Gender, Age) denotes the ratio of the probability of disease to the probability of no disease under a specific region, a specific sex composition and a specific age distribution segment.

OR* is an approximate odds ratio, namely the ratio of Odds(D | G, Region, Gender, Age) to Odds(D | Region, Gender, Age). A disease usually corresponds to several different SNP sites. If a disease corresponds to m different SNP sites (m ranging over all SNP sites associated with the disease), OR* is calculated separately for the genotype of the individual to be tested at each of these sites, and the values are denoted OR_1*, OR_2*, OR_3*, ..., OR_m*.

OR_composite* is the composite approximate odds ratio of the disease, given by formula 7 as the product of the OR* values of the different SNP site genotypes of the same disease.

Substep two: use formulas 8 and 9 and an inverse function calculation to obtain the individual comprehensive disease susceptibility risk value, which is taken as the final calculated comprehensive disease susceptibility risk of the individual.

Through the above steps, an individual comprehensive disease susceptibility risk array is obtained for each disease of interest. It contains the risk values for the different age distribution segments, for the same genotype composition as the individual, under the region and sex composition corresponding to the individual.

Equation 8

(Region = X, Gender = Y, Age = Z)

Equation 9

Odds(D | G_1,2,3,...,m, Region, Gender, Age) is the product of OR_composite* and Odds(D | Region, Gender, Age).

Pr(D | G_1,2,3,...,m, Region, Gender, Age) is the individual comprehensive disease susceptibility risk value, that is, the risk value obtained by jointly considering the genotypes of the individual to be tested at the m SNP sites of the same disease under a specific region, a specific sex composition and a specific age distribution segment.
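The two substeps can be sketched for a single region, sex composition and age distribution segment as follows. Formulas 4 to 6 are used as written; the product for OR_composite* and the composite odds follow the verbal descriptions of formulas 7 and 8; the last line inverts odds = p / (1 - p), which is an assumed reading of the "inverse function calculation". The function names are hypothetical.

```python
from math import prod


def odds(p):
    """Formulas 4 and 5: odds = p / (1 - p)."""
    return p / (1.0 - p)


def composite_risk(population_incidence, genotype_incidences):
    """Composite susceptibility risk for one region/sex/age-segment condition.

    population_incidence: Pr(D | Region, Gender, Age)
    genotype_incidences:  Pr(D | G, Region, Gender, Age) for the individual's
                          genotype at each of the m disease-associated SNP sites
    """
    base_odds = odds(population_incidence)
    # Formula 6: approximate odds ratio OR* for each SNP site.
    or_star = [odds(p) / base_odds for p in genotype_incidences]
    # Formula 7 (as described): OR_composite* is the product of the OR* values.
    or_composite = prod(or_star)
    # Formula 8 (as described): composite odds = OR_composite* x population odds.
    composite_odds = or_composite * base_odds
    # Assumed inversion of odds = p / (1 - p) to recover the probability.
    return composite_odds / (1.0 + composite_odds)
```

Calling `composite_risk` once per age distribution segment, with the incidences of that segment, yields the entries of the individual comprehensive disease susceptibility risk array.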

With regard to step S104:

Based on the genetic characteristic information, region information and sex information of the individual, this step plots the dynamic change curve of the individual's comprehensive disease susceptibility risk across ages, so as to reflect the influence of the cumulative effect of environmental factors on the individual's comprehensive disease susceptibility.

The individual comprehensive disease susceptibility risk array obtained in step S103 is taken as input data. With age as the independent variable and the age-corresponding individual comprehensive disease susceptibility risk value as the dependent variable, LOESS regression fitting is applied according to formula 10 to obtain the comprehensive disease susceptibility risk calculation function (Risk_loess) corresponding to the discrete array.

The age is a specific age, for example 1, 2, 3, 4, ..., 100 years old.

The age-corresponding individual comprehensive disease susceptibility risk value is the individual comprehensive disease susceptibility risk value of the age distribution segment to which that age belongs. For example, if the age distribution segments in database S are 0-14, 15-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74 and 75 or older, the individual comprehensive disease susceptibility risk value of each of these segments can be obtained by calculation. When calculating the individual comprehensive disease susceptibility risk calculation function (Risk_loess), ages 1, 2, 3, ..., 14 belong to the 0-14 segment, so the risk value of the 0-14 segment is used as the risk value for each of these ages; likewise, the risk value of the 15-39 segment is used for each of ages 15, 16, 17, ..., 39; and so on, up to the risk value of the 75-or-older segment, which is used for each of ages 75, 76, ..., 100. In this way the individual comprehensive disease susceptibility risk value corresponding to every age is obtained.

Equation 10

Risk_loess(Age) = LOESS_REGRESSION(Age, Risk_Age)

In formula 10,

Age represents the age;

Risk_Age represents the age-corresponding individual comprehensive disease susceptibility risk value, i.e. the individual comprehensive disease susceptibility risk value of the age distribution segment to which the age belongs (Pr(D | G_1,2,3,...,m, Region, Gender, Age) in formula 9);

Risk_loess is the comprehensive disease susceptibility risk calculation function to be solved for.

Based on the function, a dynamic change curve of the individual disease comprehensive susceptibility risk in a specified age range can be generated.
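The expansion of segment values to individual ages and the subsequent smoothing might look like the sketch below. It is not the LOESS_REGRESSION routine of formula 10, whose parameters are not given: `loess` here is a basic local linear regression with tricube weights, without robustness iterations, and the `span` value, the function names and the band representation are all assumptions.

```python
def expand_bands(band_risk, max_age=100):
    """Map each integer age 1..max_age to the risk value of its age segment.

    band_risk: list of (low, high, risk); high=None means "low or older".
    """
    ages, risks = [], []
    for age in range(1, max_age + 1):
        for low, high, risk in band_risk:
            if age >= low and (high is None or age <= high):
                ages.append(age)
                risks.append(risk)
                break
    return ages, risks


def loess(xs, ys, x0, span=0.3):
    """Local linear regression at x0 with tricube weights (a basic LOESS)."""
    k = max(2, int(round(span * len(xs))))
    # Bandwidth: distance from x0 to its k-th nearest neighbour.
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1] or 1.0
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    # Value of the locally fitted weighted least-squares line at x0.
    return my + slope * (x0 - mx)
```

Evaluating `loess(ages, risks, a)` for every age `a` in the specified range gives the points of a smooth curve in place of the step function produced by the age segments.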

Further, step S104 also includes the following. The incidence data, in each age distribution segment, of a disease for the region and sex composition corresponding to the individual to be tested is taken as the average susceptibility risk array of the disease and used as input data. With age as the independent variable and the age-corresponding average disease susceptibility risk value as the dependent variable, Risk_Age in formula 10 is replaced by the average disease susceptibility risk value, and LOESS regression fitting is applied to obtain the average disease susceptibility risk calculation function corresponding to the discrete array. Based on this function, a dynamic change curve of the population-average disease susceptibility risk in a specified age range is generated, which can serve as a reference.

The age-corresponding average disease susceptibility risk value is the disease incidence of the age distribution segment to which the age belongs. For example, suppose the age distribution segments extracted from database S for the male-only East Asian population are 0-14, 15-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74 and 75 or older, and the corresponding lung cancer incidences are 2, 4, 5, 6, 7, 10, 22, 34, 70 and 100 per 100,000 respectively. Ages 1, 2, 3, ..., 14 belong to the 0-14 segment, so the average lung cancer susceptibility risk value for each of these ages is 2 per 100,000 (the lung cancer incidence of the 0-14 segment); likewise, the value for each of ages 15, 16, 17, ..., 39 is 4 per 100,000; and so on, up to ages 75, 76, ..., 100, for which the value is 100 per 100,000. In this way the average lung cancer susceptibility risk value for every age is obtained.

In one embodiment of the present invention, as an example, the dynamic change curve of the individual comprehensive disease susceptibility risk and the dynamic change curve of the population-average disease susceptibility risk over the age range 0-100 are given for the lung cancer susceptibility risk of a male individual to be tested, of a specific genotype, in the Shanghai region; the final output is shown in fig. 2.

Those skilled in the art will appreciate that the computing processes described above may be implemented using computers, integrated circuit modules, programmable logic devices, other hardware, or existing software modules known in the art.

Fig. 3 is a schematic structural diagram of one embodiment of the disease susceptibility risk prediction apparatus of the present invention.

As shown in the drawings, the disease susceptibility risk prediction device of the present invention includes:

a database unit 100 for providing a database comprising disease incidence data, SNP site genotype frequency data, and OR value data of the risk-allele homozygous genotype and of the heterozygous genotype for each disease-associated SNP site; wherein the incidence data of the same disease are distinguished by combination of region, sex composition and age distribution segment, the OR value data of each genotype of the SNP sites associated with the same disease are distinguished by combination of region, sex composition and age distribution segment, and the SNP site genotype frequency data are distinguished by region;

a to-be-tested individual information unit 200 for providing the region information, the sex information and the measured SNP site genotype information of the individual to be tested;

a disease comprehensive susceptibility risk array calculation unit 300, connected to the database unit 100 and the to-be-tested individual information unit 200, for extracting from the database unit 100, according to the region information, the sex information and the measured SNP site genotype information of the individual to be tested and for each disease of interest of that individual, the following data: the incidence data of the disease of interest in each age distribution segment for the corresponding region and the corresponding sex composition, the genotype frequency data of the SNP sites associated with the disease of interest for the corresponding region, and the OR value data of those SNP sites in each age distribution segment for the corresponding region and the corresponding sex composition; and for calculating from these data the disease comprehensive susceptibility risk array of each disease of interest of the individual to be tested, the array comprising the individual comprehensive disease susceptibility risk value, in each age distribution segment, for the same genotype composition as the individual to be tested under the corresponding region and the corresponding sex composition;

and a disease comprehensive susceptibility risk dynamic change curve unit 400 connected to the disease comprehensive susceptibility risk array calculation unit 300, for fitting a disease comprehensive susceptibility risk calculation function corresponding to the discrete array by using LOESS regression according to the individual disease comprehensive susceptibility risk array of each disease of interest, and generating an individual disease comprehensive susceptibility risk dynamic change curve in a specified age range based on the function.

The device of the invention can calculate a two-factor (genetic and environmental) comprehensive disease susceptibility risk from the database information, the information of the individual to be tested and the genotype data of that individual, and fit the resulting risk values into a curve of comprehensive disease susceptibility risk against age.

Specifically, with respect to the database unit 100,

In one embodiment, the disease incidence information is obtained by accessing an existing database and either capturing the incidence information directly or capturing related information from which it is calculated, and is then entered into the database unit 100. Taking the incidence of different tumor types in the Chinese population as an example, the GLOBOCAN database homepage is accessed first, followed by the Cancer by publication sub-page, from which the incidence information for various tumors in the populations of different countries is obtained.

In one embodiment, the genotype frequency data is obtained from the HapMap database and stored in the database unit 100: the FTP site provided by the HapMap database is accessed, and the genotype frequency information of the different populations is downloaded from that site.

In the apparatus of the present invention, the formula definition representing the OR value corresponds to the above formula 1.

The database unit 100 of the present invention records, for each disease-associated SNP site, the OR value of the risk-allele homozygous genotype and the OR value of the heterozygous genotype (i.e. the OR_1 value and the OR_2 value of each SNP site). The OR_1 and OR_2 values can be taken directly from the literature. When the literature does not give the OR_1 and OR_2 values directly but records other related information, for example the incidence of each genotype, the values can be calculated from that information according to a formula.

The data within the database unit 100 may be continually updated as literature-related information is revealed.

The distinction according to the sex composition means that the composition is divided into three compositions of single Male (Male), single Female (Female) and Male-Female Mixed type (Mixed).

As shown in fig. 4A, in a preferred embodiment, in order to further improve the reference value of the predicted risk, the disease susceptibility risk prediction apparatus of the present invention further includes:

a collation unit 500, connected to the database unit 100, for collating the candidate OR value data and supplying the collated data to the database unit 100.

Further, as shown in fig. 4B, the collation unit 500 includes:

Document mining information entry module 510: for entering OR value related information mined from the literature. In one embodiment, the OR value related information includes at least: the rsID number of the SNP site, the disease studied (i.e. the disease name), disease description, population (by country), region (East Asian population or European and American population), age distribution segment, sex composition (male only, female only, or mixed male and female), total number of valid samples, total number of valid cases, non-risk allele, P value of the association between the SNP site and the disease, gene associated with the SNP site, OR value of the risk-allele homozygous genotype, OR value of the heterozygous genotype, total number of valid samples of the risk-allele homozygous genotype, total number of valid samples of the heterozygous genotype, total number of valid samples of the non-risk-allele homozygous genotype, total number of valid samples of the non-risk allele, number of valid samples of the non-risk homozygous genotype, document number, and source publication.

SNP site proofreading module 520: connected to the document mining information entry module 510, for checking whether the base recorded for each SNP site is consistent with the base at that site on the plus strand of the chromosome, and correcting it if not. The purpose of this module is to verify whether the literature describes a SNP site using the base on the plus strand: if so, no correction is needed; if the base on the minus strand was used, it is corrected to the plus-strand base according to the principle of base complementarity.
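The base-complementarity correction can be illustrated with the short sketch below. How the module determines which strand the literature used is not described, so the flag `reported_on_minus_strand` is an assumption, as are the function and table names.

```python
# Watson-Crick base pairing used for the complementarity correction.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_plus_strand(alleles, reported_on_minus_strand):
    """Return the alleles expressed on the plus strand.

    alleles: iterable of single-base alleles as recorded from the literature
    reported_on_minus_strand: True if the literature used the minus strand
    """
    if not reported_on_minus_strand:
        return list(alleles)
    return [COMPLEMENT[a.upper()] for a in alleles]
```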

Sample number proofreading module 530: connected to the document mining information entry module 510, for checking whether the total number of valid samples equals the sum of the valid samples of the genotype groups; if not, the corresponding data record is removed and fed back to the database administrator for correction. This module guards against input errors made during manual data entry: when the total number of valid samples equals the sum over the genotype groups, the entry is considered free of error; when it does not, the entry is considered erroneous and must be corrected.
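A minimal sketch of this consistency check follows. The record keys `total` and `genotype_counts` and the function names are hypothetical; the flagged list stands in for the feedback to the database administrator.

```python
def sample_counts_consistent(total, genotype_counts):
    """True when the total valid sample count equals the sum over genotype groups."""
    return total == sum(genotype_counts)


def split_by_consistency(records):
    """Separate records into (kept, flagged_for_administrator)."""
    kept, flagged = [], []
    for rec in records:
        ok = sample_counts_consistent(rec["total"], rec["genotype_counts"])
        (kept if ok else flagged).append(rec)
    return kept, flagged
```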

Field specification module 540: connected to the document mining information entry module 510, for standardizing the fields of the entered information, enforcing a uniform format and correcting non-uniform entries to the specified format. In one embodiment, the field specification module unifies the upper and lower case of English letters in the entered information; where a phrase appears, a connecting symbol is added by default between adjacent words, so that a phrase of several words can be represented as one field. Various symbols, such as the underscore, can serve as the connecting symbol. The purpose of this module is to facilitate automatic processing, information identification and information matching by subsequent programs.
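One possible form of this normalization is shown below. The choice of lower case as the unified format is an assumption (the description only requires that case be unified), and the function name is hypothetical.

```python
def normalize_field(value, joiner="_"):
    """Unify letter case and join the words of a phrase with a connecting symbol."""
    # split() with no argument also collapses repeated and surrounding whitespace.
    return joiner.join(value.strip().lower().split())
```

With this convention, "Lung Cancer" and "lung  cancer" both become `lung_cancer`, so later matching steps can compare fields by simple string equality.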

Risk allele and non-risk allele proofreading module 550: connected to the document mining information entry module 510, for correcting input errors in the risk allele and the non-risk allele. In one embodiment, the risk allele and non-risk allele proofreading module 550 further comprises:

OR value judgment submodule 551: connected to the document mining information entry module 510, for judging whether the entered OR value of the homozygous genotype of the risk allele is greater than 1;

Correction submodule 552: connected to the OR value judgment submodule 551, for correcting the risk allele and the non-risk allele when the OR value judgment submodule 551 judges that the entered OR value of the homozygous genotype of the risk allele is less than 1. For example, suppose the entered risk allele is G and the non-risk allele is A. If the OR value of the GG genotype is 3.37, the OR value judgment submodule 551 judges it to be greater than 1, indicating that the entry is correct and needs no correction; if the OR value of the GG genotype is entered as 0.79, the OR value judgment submodule 551 judges it to be less than 1, indicating an entry error, and the correction submodule 552 corrects the risk allele to A and the non-risk allele to G.
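The judgment and correction can be sketched as below, following the G/A example. The record keys and function name are hypothetical. The sketch only swaps the allele labels, as in the example; whether the recorded OR values are also recalculated after the swap is not stated, and a value exactly equal to 1 is left unchanged here as an assumption.

```python
def correct_risk_allele(record):
    """Swap risk and non-risk alleles when the risk-homozygous OR is below 1."""
    if record["or_risk_homozygous"] < 1:
        fixed = dict(record)  # leave the caller's record untouched
        fixed["risk_allele"], fixed["non_risk_allele"] = (
            record["non_risk_allele"],
            record["risk_allele"],
        )
        return fixed
    return record
```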

De-redundancy module 560 for duplicate SNP site records: connected to the document mining information entry module 510, for keeping only the record with the most significant statistical association (i.e. the smallest P value) when two information records have the same SNP site number, sample sex, region of the sample population and disease studied.
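The de-redundancy rule amounts to grouping records by a four-part key and keeping the minimum P value per group, as in this sketch. The record keys and function name are hypothetical.

```python
def deduplicate(records):
    """Keep, per (rsID, sex, region, disease) key, the record with the smallest P value."""
    best = {}
    for rec in records:
        key = (rec["rsid"], rec["gender"], rec["region"], rec["disease"])
        if key not in best or rec["p_value"] < best[key]["p_value"]:
            best[key] = rec
    return list(best.values())
```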

Information record credibility determination and information record screening module 570: connected to the document mining information entry module 510, for determining the credibility of the information records and screening the information records.

In one embodiment, the credibility value grades the strength of the research literature relating the SNP site to disease susceptibility. There are four grades, reflecting four conditions on the literature study: sample size, information completeness, significance of the statistical test, and influence of the publication. Each condition that the document satisfies raises the credibility by one level. If the credibility level is not higher than level 1, that is, at most one of the 4 conditions is satisfied, the record is rejected.

Further, in one embodiment, as shown in FIG. 4B, the record confidence level determination and record screening module 570 comprises:

Information integrity judgment submodule 571: connected to the document mining information entry module 510, for judging the integrity of the information; if the information is complete, the information credibility is raised by 1 level.

In one embodiment, the information mined in the document includes at least the items listed above for the document mining information entry module 510.

Sample scale judgment submodule 572: connected to the document mining information entry module 510, for judging whether the sample size meets the requirement; if so, the information credibility is raised by 1 level. In one embodiment, based on the total number of valid samples in the information mined from the document, the sample size of the record is considered to meet the requirement when the value is greater than or equal to 2000.

Statistical test significance judgment submodule 573: connected to the document mining information entry module 510, for judging whether the P value of the association between the SNP site and the disease meets the requirement; if so, the information credibility is raised by 1 level. In one embodiment, based on the P value of the association between the SNP site and the disease in the information mined from the literature, the statistical test significance of the record is considered to meet the requirement when the value is less than 0.00001.

Literature source publication influence submodule 574: connected to the document mining information entry module 510, for judging whether the impact factor of the source publication of the record meets the requirement; if so, the information credibility is raised by 1 level. In one embodiment, based on the source publication information in the information mined from the document, the influence of the source publication of the record is considered to meet the requirement when its impact factor is greater than or equal to 5.

SNP site information record screening submodule 575: connected to the information integrity judgment submodule 571, the sample scale judgment submodule 572, the statistical test significance judgment submodule 573 and the literature source publication influence submodule 574, for obtaining the information record credibility rating result and rejecting information records whose result does not meet the requirement. In one embodiment, a record is rejected if its credibility level is not higher than level 1, i.e. if only one or none of the 4 conditions described above is met.

In the preferred embodiment, each entry record is collated by modules 520, 530, 540, 550, 560, 570.

Finally, the OR value data and the age distribution of the corresponding sample, the regional distribution of the research population, and the gender composition information screened by the collation unit 500 are recorded into the database unit 100.

For the disease composite susceptibility risk array calculation unit 300,

in one embodiment, the disease composite susceptibility risk array calculating unit 300 further comprises:

the data extraction module 310: connected to the database unit 100 and the individual information unit 200 to be tested, and configured to extract the following data from the database unit 100: incidence data of the interested diseases under the age distribution zones of the corresponding region and the corresponding gender composition, genotype frequency data of SNP loci corresponding to the interested diseases of the corresponding region and OR value data of the SNP loci corresponding to the interested diseases under the age distribution zones of the corresponding region and the corresponding gender composition;

independent SNP locus disease susceptibility risk value calculation module 320: connected to the data extraction module 310 and used to calculate, from the extracted incidence data, SNP locus genotype frequency data and OR value data in combination with formulas 2 and 3, the disease incidence Pr(D|G_i, Region, Gender, Age) of the population carrying each genotype of each single SNP locus of each disease of interest, for the corresponding region, the corresponding gender composition and each age distribution section, i.e., the independent SNP locus disease susceptibility risk value.

The disease comprehensive susceptibility risk array calculation module 330 for each disease of interest of the individual to be tested: connected to the data extraction module 310 and the independent SNP locus disease susceptibility risk value calculation module 320, and used to calculate the disease comprehensive susceptibility risk array of each disease of interest of the individual to be tested from the extracted data and the independent SNP locus disease susceptibility risk values.

The disease comprehensive susceptibility risk array calculation module 330 for each disease of interest of the individual to be tested is specifically divided into the following sub-modules:

OR_composite* value calculation submodule 331: connected to the data extraction module 310 and the independent SNP locus disease susceptibility risk value calculation module 320, and used to take the genotype of the individual to be tested at each single SNP locus, complete the OR* calculation using the foregoing formulas 4 to 7, and thereby calculate OR_composite*.

The submodule extracts all independent SNP locus disease susceptibility risk calculation results matched with the three conditions of the individual aiming at a specific disease type based on genotype information, region information and sex information provided by the individual. The final extracted result will contain all the individual SNP site disease susceptibility risk values closely related to the disease in different age groups.

The individual disease comprehensive susceptibility risk calculation submodule 332: connected to the OR_composite* value calculation submodule 331 and used to calculate the individual disease comprehensive susceptibility risk value using the foregoing formulas 8 and 9 and the inverse function calculation method; this value is taken as the finally calculated individual disease comprehensive susceptibility risk.

Through this module, an individual disease comprehensive susceptibility risk array is obtained for each disease of interest; the array covers the different age distribution sections for the same genotype composition as the individual to be tested and for the corresponding region and gender composition.
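The calculation chain of modules 320 and 330 can be sketched as below for a single stratum (one region, gender composition and age distribution section). This is an illustrative sketch under stated assumptions, not the patented implementation. Formula 2 is taken as the incidence being the frequency-weighted sum of the genotype-specific incidences. Formula 3 is assumed to be the usual odds ratio of genotype G_i relative to G_0, and formulas 4 to 9 are assumed to follow the verbal definitions: odds are incidence over non-incidence, OR* is the genotype odds divided by the population odds, OR_composite* is the product of the OR* values, and the final risk is obtained by inverting the odds. All function names and the numerical inputs are hypothetical.

```python
def odds(p):
    """Odds of a probability: incidence divided by non-incidence."""
    return p / (1.0 - p)


def prob(o):
    """Inverse of odds(): convert odds back to a probability."""
    return o / (1.0 + o)


def genotype_risks(incidence, freqs, ors, iters=100):
    """Solve formula 2 for Pr(D|G_0), Pr(D|G_1), Pr(D|G_2) in one stratum.

    incidence: Pr(D | Region, Gender, Age), strictly between 0 and 1
    freqs:     genotype frequencies (f0, f1, f2) for the region, summing to 1
    ors:       (1.0, OR_1, OR_2), odds ratios relative to genotype G_0
    """
    def risks(p0):
        # ASSUMPTION: formula 3 is OR_i = odds(Pr(D|G_i)) / odds(Pr(D|G_0)).
        return [prob(o * odds(p0)) for o in ors]

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        # Bisection on Pr(D|G_0): the weighted sum grows monotonically with it.
        mid = (lo + hi) / 2.0
        total = sum(f * p for f, p in zip(freqs, risks(mid)))
        if total < incidence:
            lo = mid
        else:
            hi = mid
    return risks((lo + hi) / 2.0)


def composite_risk(incidence, snp_risks):
    """Combine the individual's per-SNP risks into one comprehensive risk.

    snp_risks: Pr(D|G, ...) for the genotype the individual carries at each SNP.
    """
    base = odds(incidence)
    or_composite = 1.0
    for p in snp_risks:
        or_composite *= odds(p) / base      # OR* of one SNP (formula 6)
    return prob(or_composite * base)        # back from odds to a probability


# Hypothetical inputs for one stratum, for illustration only.
K = 0.01
p0, p1, p2 = genotype_risks(K, (0.49, 0.42, 0.09), (1.0, 1.5, 2.0))
risk = composite_risk(K, [p2, p1])          # individual carries G_2 and G_1
```

Repeating the two calls for every age distribution section of the individual's region and gender composition would yield the risk array described above.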

For the disease comprehensive susceptibility risk dynamic change curve unit 400, as shown in fig. 3A, the unit at least includes:

the individual disease comprehensive susceptibility risk dynamic change curve module 410: connected to the disease comprehensive susceptibility risk array calculation unit 300, more specifically to its disease comprehensive susceptibility risk array calculation module 330 for each disease of interest of the individual to be tested, and more specifically still to the individual disease comprehensive susceptibility risk calculation submodule 332 of that module, and used to generate an individual disease comprehensive susceptibility risk dynamic change curve over a specified age range. In one embodiment, the module takes the disease comprehensive susceptibility risk array obtained by the disease comprehensive susceptibility risk array calculation unit as input data, takes age and the age-corresponding individual disease comprehensive susceptibility risk value as the independent variable and the dependent variable respectively, and, in combination with formula 10, uses LOESS regression to fit the disease comprehensive susceptibility risk calculation function (Risk_loess) corresponding to the discrete array; the dynamic change curve of the individual disease comprehensive susceptibility risk over the specified age range is then generated from this function.

The module finishes the drawing of dynamic change curves of the comprehensive susceptibility risks of the individual diseases in different age groups based on the genetic characteristic information, the region information and the sex information of the individual so as to reflect the influence of the cumulative effect of environmental factors on the comprehensive susceptibility of the individual diseases.
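The curve fitting of formula 10 can be illustrated with a small self-contained LOESS sketch. It is one simple variant (degree-1 local regression with tricube weights and a nearest-neighbour bandwidth) and is not claimed to match the LOESS settings actually used by the device; the function names, the `frac` parameter and the sample risk values are hypothetical and serve only as illustration.

```python
def loess_point(xs, ys, x0, frac=0.5):
    """Locally weighted linear fit at x0 (tricube kernel, degree 1)."""
    n = len(xs)
    k = min(n, max(2, round(frac * n)))      # neighbours in the local window
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth = distance to the k-th neighbour, widened by 1% so that this
    # neighbour keeps a small positive weight (avoids an all-zero window).
    h = max(dists[k - 1], 1e-12) * 1.01
    ws = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return my + slope * (x0 - mx)


def risk_curve(ages, risks, start=0, stop=100, frac=0.5):
    """Evaluate Risk_loess(Age) at every integer age in [start, stop]."""
    return [(a, loess_point(ages, risks, a, frac)) for a in range(start, stop + 1)]


# Made-up risk array (midpoint age of each section, risk value); not patent data.
ages = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
risks = [0.0001, 0.0002, 0.0005, 0.0012, 0.0030,
         0.0065, 0.0120, 0.0190, 0.0250, 0.0280]
curve = risk_curve(ages, risks)   # list of (age, fitted risk) pairs for plotting
```

The same two functions can be reused for a population reference curve by passing the age-specific incidence data in place of the individual risk array.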

In a preferred embodiment, as shown in fig. 3B, the disease comprehensive susceptibility risk dynamic change curve unit 400 further includes:

population average disease susceptibility risk dynamic change curve module 420: connected to the disease comprehensive susceptibility risk array calculation unit 300 and used to generate a population average disease susceptibility risk dynamic change curve. In one embodiment, the module takes the incidence data of a disease in each age distribution section, for the region and gender composition corresponding to the individual to be tested, as the disease average susceptibility risk array and uses it as input data; it takes age and the age-corresponding disease average susceptibility risk value as the independent variable and the dependent variable respectively, replaces Risk_Age in formula 10 with the disease average susceptibility risk array, and uses LOESS regression to fit the disease average susceptibility risk calculation function corresponding to the discrete array. The population average disease susceptibility risk dynamic change curve over the specified age range is generated from this function and serves as a reference for the disease comprehensive susceptibility risk dynamic change curve.

Further, as shown in fig. 5, in a preferred embodiment, the apparatus of the present invention may further include a result output unit 600, connected to the disease comprehensive susceptibility risk dynamic change curve unit 400, for outputting a disease comprehensive susceptibility risk dynamic change curve. The result output unit may be a display, a printing device, or the like.

In one embodiment of the present invention, as an example, the device provides, for the lung cancer susceptibility risk of a male individual to be tested in the Shanghai region with a specific genotype, the individual disease comprehensive susceptibility risk dynamic change curve and the population average disease susceptibility risk dynamic change curve over the age range of 0-100 years; the final output is shown in fig. 2.

The above examples are intended to illustrate the disclosed embodiments of the invention and are not to be construed as limiting the invention. In addition, various modifications of the methods and apparatus set forth herein, as well as variations of the invention, will be apparent to those skilled in the art upon reference to the description without departing from the scope and spirit of the invention. While the invention has been specifically described in connection with various specific preferred embodiments thereof, it should be understood that the invention should not be unduly limited to such specific embodiments. Indeed, various modifications of the above-described embodiments which are obvious to those skilled in the art to which the invention pertains are intended to be covered by the scope of the present invention.

Claims

1. A disease susceptibility risk prediction device, comprising:

database unit (100): used for providing a database comprising disease incidence data, SNP locus genotype frequency data, and OR value data of the risk allele homozygous genotype and the heterozygous genotype for each disease-associated SNP locus; wherein the incidence data of the same disease are distinguished by different combinations of region, gender composition and age distribution section, the OR value data of each genotype of the SNP loci associated with the same disease are distinguished by different combinations of region, gender composition and age distribution section, and the SNP locus genotype frequency data are distinguished by region;

individual information unit to be tested (200): used for providing region information, gender information and measured SNP locus genotype information of the individual to be tested;

a disease comprehensive susceptibility risk array calculation unit (300): connected to the database unit (100) and the individual information unit to be tested (200), and used for extracting, for each disease of interest of the individual to be tested, the following data from the database unit (100) according to the region information, gender information and measured SNP locus genotype information of the individual to be tested: incidence data of the disease of interest in each age distribution section for the corresponding region and the corresponding gender composition, genotype frequency data of the SNP loci corresponding to the disease of interest for the corresponding region, and OR value data of the SNP loci corresponding to the disease of interest in each age distribution section for the corresponding region and the corresponding gender composition; and for calculating from these data a disease comprehensive susceptibility risk array for each disease of interest of the individual to be tested, wherein the disease comprehensive susceptibility risk array comprises: the individual disease comprehensive susceptibility risk value, in each age distribution section, for the same genotype composition as the individual to be tested and for the corresponding region and the corresponding gender composition;

disease comprehensive susceptibility risk dynamic change curve unit (400): connected to the disease comprehensive susceptibility risk array calculation unit (300) and used for fitting by LOESS regression, from the individual disease comprehensive susceptibility risk array of each disease of interest, the disease comprehensive susceptibility risk calculation function corresponding to that array, and for generating from this function an individual disease comprehensive susceptibility risk dynamic change curve over a specified age range.

2. The disease susceptibility risk prediction device of claim 1, further comprising: a collation unit (500), connected to the database unit (100) and used for collating candidate OR value data and supplying the collated data to the database unit (100).

3. The disease susceptibility risk prediction device of claim 2, wherein the collation unit (500) comprises:

document mining information entry module (510): used for recording OR value related information mined from the literature;

SNP site proofreading module (520): connected to the document mining information entry module (510) and used for checking whether the base recorded for each SNP site is consistent with the base at that site on the chromosome plus strand, and correcting it if it is not;

a sample number proofreading module (530): connected to the document mining information entry module (510) and used for checking whether the total number of effective samples equals the sum of the effective samples of the genotype groups; if not, the corresponding data record is removed and fed back to a database administrator for data correction;

field specification module (540): connected to the document mining information entry module (510) and used for standardizing the fields of the entered information, requiring a uniform format and correcting non-uniform formats into the specified format;

a risk allele and non-risk allele collation module (550): connected to the document mining information entry module (510) and used for correcting entry errors of risk alleles and non-risk alleles;

a redundancy elimination module (560) for duplicate SNP site records: connected to the document mining information entry module (510) and used for keeping only one record when the SNP site number, the sample gender, the region of the sample population and the disease studied are the same in two information records;

an information record credibility determination and information record screening module (570): connected to the document mining information entry module (510) and used for determining the credibility of the information records and screening the information records.

4. The disease susceptibility risk prediction device of claim 3, wherein the information record confidence level determination and information record screening module (570) comprises:

information integrity determination submodule (571): connected to the document mining information entry module (510) and used for judging the information integrity; if the information is complete, the information credibility is increased by 1 level;

sample size determination submodule (572): connected to the document mining information entry module (510) and used for judging whether the sample size meets the requirement; if it does, the information credibility is increased by 1 level;

statistical test significance determination submodule (573): connected to the document mining information entry module (510) and used for judging whether the P value of the association between the SNP locus and the disease meets the requirement; if it does, the information credibility is increased by 1 level;

literature source publication influence submodule (574): connected to the document mining information entry module (510) and used for judging whether the impact factor of the record's source publication meets the requirement; if it does, the information credibility is increased by 1 level;

SNP site information record screening submodule (575): connected to the information integrity determination submodule (571), the sample size determination submodule (572), the statistical test significance determination submodule (573) and the literature source publication influence submodule (574), and used for obtaining the credibility rating of each information record and rejecting information records whose rating does not meet the requirement.

5. The disease susceptibility risk prediction apparatus of claim 1, wherein the disease composite susceptibility risk array calculation unit (300) comprises:

data extraction module (310): connected to the database unit (100) and the individual information unit to be tested (200) and used for extracting the following data from the database unit (100): incidence data of the disease of interest in each age distribution section for the corresponding region and the corresponding gender composition, genotype frequency data of the SNP loci corresponding to the disease of interest for the corresponding region, and OR value data of the SNP loci corresponding to the disease of interest in each age distribution section for the corresponding region and the corresponding gender composition;

an independent SNP locus disease susceptibility risk value calculation module (320): connected to the data extraction module (310) and used for calculating, from the extracted incidence data, SNP locus genotype frequency data and OR value data in combination with formulas 2 and 3, the disease incidence Pr(D|G_i, Region, Gender, Age) of the population carrying each genotype of each single SNP locus of each disease of interest, for the corresponding region, the corresponding gender composition and each age distribution section, i.e., the independent SNP locus disease susceptibility risk value;

Equation 2

$$\Pr(D \mid \mathrm{Region}=X, \mathrm{Gender}=Y, \mathrm{Age}=Z) = \sum_{\substack{i=0,1,2 \\ \mathrm{Region}=X,\ \mathrm{Gender}=Y,\ \mathrm{Age}=Z}} \Pr(D \mid G_i, \mathrm{Region}, \mathrm{Gender}, \mathrm{Age}) \cdot \Pr(G_i \mid \mathrm{Region})$$

Equation 3

i＝{1,2},Region＝X,Gender＝Y,Age＝Z

In the formula, Region represents regional conditions, Gender represents sex composition conditions, and Age represents Age distribution section conditions;

Region = X, Gender = Y, Age = Z indicate that the region condition is X, the gender composition condition is Y and the age distribution section condition is Z;

Pr(D | Region = X, Gender = Y, Age = Z) represents the disease incidence when the region, gender and age conditions are X, Y and Z respectively;

G_i represents the genotype, where i is selected from 0, 1, 2: G₀ represents the non-risk allele homozygous genotype, G₁ represents the heterozygous genotype, and G₂ represents the risk allele homozygous genotype;

Pr(G_i | Region) is the genotype frequency of genotype G_i of a single SNP site under a specific region condition;

OR_i represents the OR value of genotype G_i;

Pr(D | G_i, Region, Gender, Age) represents the disease incidence of genotype G_i of a single SNP site under the conditions of a specific region, a specific gender composition and a specific age distribution section; when i is 0, it is Pr(D | G₀, Region, Gender, Age);

In formula 2, i on the right-hand side takes the values 0, 1 and 2 in turn and the resulting terms are summed; in formula 3, i takes the values 1 and 2 respectively;

a disease comprehensive susceptibility risk array calculation module (330) for each disease of interest of the individual to be tested: connected to the data extraction module (310) and the independent SNP locus disease susceptibility risk value calculation module (320) and used for calculating the disease comprehensive susceptibility risk array of each disease of interest of the individual to be tested from the extracted data and the independent SNP locus disease susceptibility risk values.

6. The disease susceptibility risk prediction device of claim 5, wherein the disease composite susceptibility risk array calculation module (330) for each disease of interest of the subject comprises the following sub-modules:

OR_composite* value calculation submodule (331): connected to the data extraction module (310) and the independent SNP locus disease susceptibility risk value calculation module (320) and used for taking the genotype of the individual to be tested at each single SNP locus, completing the OR* calculation using formulas 4 to 7, and thereby calculating OR_composite*;

Equation 4

Equation 5

Equation 6

OR* = Odds(D|G, Region, Gender, Age) / Odds(D|Region, Gender, Age)

(Region = X, Gender = Y, Age = Z)

Equation 7

In the above-mentioned formula,

Pr(D|G, Region, Gender, Age) represents the disease incidence of genotype G of a single SNP locus under the conditions of a specific region, a specific gender composition and a specific age distribution section;

Pr(D|Region, Gender, Age) represents the disease incidence under the conditions of a specific region, a specific gender composition and a specific age distribution section;

Odds(D|G, Region, Gender, Age) represents the ratio of incidence to non-incidence of a disease for genotype G under the conditions of a specific region, a specific gender composition and a specific age distribution section;

Odds(D|Region, Gender, Age) represents the ratio of incidence to non-incidence of a disease under the conditions of a specific region, a specific gender composition and a specific age distribution section;

OR* is an approximate odds ratio, namely the ratio of Odds(D|G, Region, Gender, Age) to Odds(D|Region, Gender, Age). Suppose a disease corresponds to m different SNP loci, m ∈ {all SNP loci associated with the disease}; then for the genotypes of the individual to be tested at the different SNP loci of the same disease, the OR* values need to be calculated separately and are denoted OR₁*, OR₂*, OR₃*, ..., OR_m*;

OR_composite* represents the comprehensive approximate odds ratio of the disease; as given by formula 7, it is calculated as the product of the OR* values of the genotypes at the different SNP loci of the same disease;

an individual disease comprehensive susceptibility risk calculation submodule (332): connected to the OR_composite* value calculation submodule (331) and used for calculating the individual disease comprehensive susceptibility risk value using formulas 8 and 9 and the inverse function calculation method, this value being taken as the finally calculated individual disease comprehensive susceptibility risk;

Equation 8

(Region = X, Gender = Y, Age = Z)

Equation 9

Odds(D|G_{1,2,3,...,m}, Region, Gender, Age) is the product of OR_composite* and Odds(D|Region, Gender, Age);

Pr(D|G_{1,2,3,...,m}, Region, Gender, Age) is the individual disease comprehensive susceptibility risk value, i.e., the disease susceptibility risk that takes into account the genotypes of the individual to be tested at the m SNP loci under the conditions of a specific region, a specific gender composition and a specific age distribution section.

7. The disease susceptibility risk prediction device of claim 1, wherein the disease comprehensive susceptibility risk dynamic change curve unit (400) comprises at least:

an individual disease comprehensive susceptibility risk dynamic change curve module (410): connected to the disease comprehensive susceptibility risk array calculation unit (300) and used for generating the individual disease comprehensive susceptibility risk dynamic change curve over the specified age range.

8. The disease susceptibility risk prediction device of claim 7, wherein the individual disease comprehensive susceptibility risk dynamic change curve module (410) takes the disease comprehensive susceptibility risk array obtained by the disease comprehensive susceptibility risk array calculation unit as input data, takes age and the age-corresponding individual disease comprehensive susceptibility risk value as the independent variable and the dependent variable respectively, and, in combination with formula 10, uses LOESS regression to fit the disease comprehensive susceptibility risk calculation function (Risk_loess) corresponding to the individual disease comprehensive susceptibility risk array of each disease of interest, and generates from this function the individual disease comprehensive susceptibility risk dynamic change curve over a specified age range;

Equation 10

Risk_loess(Age)＝LOESS_REGRESSION(Age,Risk_Age)

In formula 10,

Age represents age,

Risk_Age represents the age-corresponding individual disease comprehensive susceptibility risk value;

Risk_loess is the disease comprehensive susceptibility risk calculation function.

9. The disease susceptibility risk prediction device of claim 7, wherein the disease comprehensive susceptibility risk dynamic change curve unit (400) further comprises:

a population average disease susceptibility risk dynamic change curve module (420): is connected with the disease comprehensive susceptibility risk array calculation unit (300) and is used for generating a population average disease susceptibility risk dynamic change curve.

10. The disease susceptibility risk prediction device of claim 9, wherein the population average disease susceptibility risk dynamic change curve module (420) takes the incidence data of a disease in each age distribution section, for the region and gender composition corresponding to the individual to be tested, as the disease average susceptibility risk array and uses it as input data, takes age and the age-corresponding disease average susceptibility risk value as the independent variable and the dependent variable respectively, replaces Risk_Age in formula 10 with the disease average susceptibility risk array, uses LOESS regression to fit the disease average susceptibility risk calculation function corresponding to the individual disease comprehensive susceptibility risk array of each disease of interest, and generates from this function a population average disease susceptibility risk dynamic change curve over a specified age range to serve as a reference for the disease comprehensive susceptibility risk dynamic change curve;

Equation 10

Risk_loess(Age)＝LOESS_REGRESSION(Age,Risk_Age)

In formula 10,

Age represents age,