CN116863998A - Genetic algorithm-based whole genome prediction method and application thereof - Google Patents
- Publication number
- CN116863998A (application number CN202310741264.1A)
- Authority
- CN
- China
- Prior art keywords
- molecular marker
- model
- genetic algorithm
- subset
- genome
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G16B20/50: Mutagenesis (under G16B20/00, ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations)
- G06N3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming
- G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
- Y02A40/10: Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture
Abstract
The invention belongs to the field of bioinformatics and relates to a genetic algorithm-based whole genome prediction method and its application, in which a genomic best linear unbiased prediction (GBLUP) model is used to predict the breeding value of an individual. The method comprises the following steps: obtaining the molecular markers of the crop to be predicted; repeatedly selecting, at random, subsets containing a certain proportion of the molecular markers to initialize a genetic algorithm; constructing a genome prediction model for each subset and calculating the fitness of the different molecular marker subsets; retaining the subsets with higher fitness; applying mutation, pairing and crossover to the retained subsets at a certain rate to generate new molecular marker subsets; recalculating the fitness of the different subsets and retaining those with higher fitness until the maximum number of iterations or convergence is reached, thereby obtaining a final molecular marker subset; and constructing a GBLUP model from that subset for whole genome prediction. The method can improve the accuracy of whole genome selection for hybrids and can provide important technical support for precision breeding of hybrids.
Description
Technical Field
The invention belongs to the field of bioinformatics and relates to a genetic algorithm-based whole genome prediction method and its application.
Background
Breeding high-quality, high-yield, green and efficient crop varieties is a top priority in current crop genetics and breeding. Traditional crop breeding relies on phenotypic selection: breeders examine the phenotypes of crop lines in the field and in the laboratory and, drawing on their breeding experience, select lines carrying the traits of interest for further evaluation. However, many traits related to crop yield and quality are quantitative traits. They are controlled by a large number of minor-effect quantitative trait loci and are easily influenced by the environment, so selection on phenotype alone is unreliable. Advances in molecular biology have made marker-assisted selection possible, but marker-assisted selection is only applicable to traits controlled by a few major quantitative trait loci and cannot be used to select for traits such as crop yield and quality. Whole genome selection builds a statistical model from high-density molecular markers covering the whole genome and from crop phenotypes, and uses it to predict the performance of materials whose genotype is known but whose phenotype is not. Because whole genome selection includes the effects of all markers on the genome in the model regardless of their significance level, it is particularly suited to quantitative traits such as crop yield and quality, which are controlled by many minor-effect genes. The genomic best linear unbiased prediction (GBLUP) model is the most robust and general of the whole genome selection models. However, the GBLUP model assumes that all molecular markers contribute equally to the target trait, which contradicts the findings of modern molecular genetics and limits further improvement in the prediction accuracy of the GBLUP method.
Disclosure of Invention
The invention aims to provide the use of GA-GBLUP, a genetic algorithm-based genomic best linear unbiased prediction algorithm, in predicting agronomic traits of hybrids. The GA-GBLUP algorithm can effectively improve the predictive ability for agronomic traits such as the yield of rice and maize hybrids. The invention can therefore be used to improve the accuracy of whole genome selection for hybrids, is of great significance for exploiting heterosis in rice and maize, and can provide important technical support for precision breeding of hybrids.
The aim of the invention is achieved by the following technical solution:
A genetic algorithm-based whole genome prediction method, in which a genetic algorithm is used to select an optimal set of molecular markers and, on that basis, a genomic best linear unbiased prediction model is used to predict the breeding value of individuals, comprising the following steps:
obtaining the molecular markers of the crop to be predicted;
randomly selecting a certain proportion of the molecular markers to initialize the genetic algorithm, constructing a genome prediction model, calculating the fitness of the different molecular marker subsets, retaining the subsets with higher fitness, and applying mutation, pairing and crossover to the retained subsets at a certain rate to generate new molecular marker subsets;
and recalculating the fitness of the different molecular marker subsets and retaining those with higher fitness until the maximum number of iterations or convergence is reached, thereby obtaining a final molecular marker subset, and constructing a genomic best linear unbiased prediction model.
Further, the method for constructing the genome prediction model comprises the following steps:
y is an n×1 vector representing a quantitative trait, and the mixed linear model containing m markers is expressed as:
$$y = X\beta + \sum_{k=1}^{m} Z_k \gamma_k + \varepsilon$$
where X is an n×q design matrix for the fixed effects; β is a q×1 vector giving the magnitudes of the fixed effects; $Z_k$ is an n×1 vector of the genotypes of all individuals at the k-th marker; ε is the residual vector, which follows $N(0, I\sigma^2)$; m is the total number of markers; n is the number of samples; q is the number of fixed effects; and $\gamma_k$ is the magnitude of the k-th marker effect. The mixed linear model is solved by restricted maximum likelihood (REML) to estimate the fixed effects β and the random effects γ. Predicted values for the test set are obtained from the estimated parameters, and cross-validation is then performed to evaluate the predictive ability of the model.
Further, the step of randomly selecting comprises: encoding all m molecular markers in 0/1 form to obtain a vector $\delta = [\delta_1\ \delta_2\ \ldots\ \delta_m]$, where $\delta_k = 0$ means that the k-th marker is excluded and $\delta_k = 1$ means that it is retained; the random selection is repeated 100 times, yielding 100 different δ vectors for initializing the GA.
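The role of the δ vector can be made explicit by writing it as a mask on the marker terms of the mixed linear model. The masked form below is an interpretation of the text, which does not state it explicitly; it uses only the symbols defined for the model (y, X, β, the marker genotype vectors, the marker effects and the residual ε):

$$y = X\beta + \sum_{k=1}^{m} \delta_k Z_k \gamma_k + \varepsilon, \qquad \delta_k \in \{0, 1\}, \qquad \varepsilon \sim N(0, I\sigma^2)$$

Under this reading, a chromosome with $\sum_{k=1}^{m} \delta_k = m_s$ fits a model with only $m_s$ marker terms, and setting every $\delta_k = 1$ recovers the model with all m markers.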
Further, the fitness of the different molecular marker subsets is calculated by any one of the following methods:
Akaike information criterion:
AIC = 2m - 2 ln(L)
where m is the number of estimated parameters and L is the likelihood of the model; AIC is the fitness value calculated using the Akaike information criterion;
Bayesian information criterion:
BIC = m ln(n) - 2 ln(L)
where m is the number of estimated parameters, L is the likelihood of the model, and n is the sample size; BIC is the fitness value calculated using the Bayesian information criterion;
FIT function:
FIT = 1 - SSE/SST
where SST is the total sum of squares of the phenotypic values and SSE is the residual sum of squares; FIT is the fitness value calculated using the FIT function;
HAT function:
HAT = 1 - PRESS/SST
where PRESS is the sum of squares of the prediction residuals of the mixed linear model and SST is the total sum of squares of the phenotypic values; HAT is the fitness value calculated using the HAT function.
Further, applying mutation, pairing and crossover to the retained molecular marker subsets at a certain rate to generate new molecular marker subsets comprises:
mutating each site of the retained molecular marker vectors from 1 to 0 or from 0 to 1 with a probability of 0.1 per site; randomly selecting a pair of δ vectors at a time and performing crossover between the two paired δ vectors, so that the information at several sites or across large regions of the two vectors is recombined; pairing and crossover together produce new molecular marker vectors.
Further, the predictive ability of the model is evaluated by 10-fold cross-validation.
The invention also provides a method for predicting the agronomic traits of crop hybrids using the above method.
Further, the crop is rice or maize.
Further, the agronomic trait is a quantitative trait controlled by minor-effect polygenes.
Further, the agronomic traits include crop yield and quality traits.
Further, the agronomic traits include yield, tiller number per plant, panicle or ear weight, thousand-grain weight and plant height.
The invention thus provides the use of GA-GBLUP, a genetic algorithm-based genomic best linear unbiased prediction algorithm, in predicting agronomic traits of hybrids.
In the embodiments of the invention, the plants are specifically rice and maize, both members of the grass family. The algorithms are specifically a genetic algorithm and the genomic best linear unbiased prediction algorithm.
The method comprises the following steps. First, 1% of all molecular markers supplied by the user are selected at random, and this is repeated 100 times to obtain 100 different chromosomes for initializing the GA. A genome prediction model is then constructed from the markers selected by each chromosome, and the fitness of each chromosome is calculated. The 5 chromosomes with the highest fitness are kept and the rest are eliminated. The 5 retained chromosomes then undergo mutation at a certain rate (a marker at a given position changes from selected to unselected, or from unselected to selected), pairing (chromosomes are paired two by two to generate new chromosomes) and crossover, finally producing 100 new chromosomes. Markers are again selected according to these chromosomes to construct the relationship (kinship) matrix, and the above process is repeated until the maximum number of iterations or convergence is reached. A genomic best linear unbiased prediction model is constructed from the molecular marker subset finally selected by the algorithm, the phenotypes of the training set are predicted and the predictive ability of the model is evaluated. On this basis, the phenotypes of all potential hybrids are predicted, and hybrids with better target traits are selected for field evaluation.
The algorithm provided by the invention is named as a whole genome prediction method based on a genetic algorithm.
Advantageous effects
By adopting a genetic algorithm-based whole genome prediction method, the invention can improve the accuracy of whole genome selection for hybrids. Compared with the conventional GBLUP method, the GA-GBLUP method can effectively improve the predictive ability for agronomic traits of rice and maize hybrids such as yield, grain weight and plant height. It is of great significance for rice and maize hybrid breeding and provides an effective tool for improving the efficiency of crop variety breeding.
The invention combines a genetic algorithm with the conventional genomic best linear unbiased prediction method to form the GA-GBLUP algorithm. The algorithm can effectively improve the predictive ability of whole genome selection for agronomic traits of rice and maize hybrids, improve the accuracy of whole genome selection for hybrids, and provide an accurate and reliable quantitative reference for breeding new crop varieties, thereby raising the level of breeding research and the efficiency of breeding.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 shows the performance of the invention on the rice hybrid dataset.
FIG. 3 shows the performance of the invention on the maize hybrid dataset.
Detailed Description
The technical solution provided by the invention is described in detail below with reference to examples, but the invention is not limited to them.
The genotype and phenotype data of the rice IMF2 population (Hua, J.P., Xing, Y.Z., Wu, W.R., Xu, C.G., Sun, X.L., Yu, S.B., & Zhang, Q.F. (2003). Single-locus heterotic effects and dominance by dominance interactions can adequately explain the genetic basis of heterosis in an elite rice hybrid. Proceedings of the National Academy of Sciences of the United States of America, 100(5), 2574-2579) and of the maize population of 305 hybrids (Wang, X., Zhang, Z., Xu, Y., Li, P., Zhang, X., & Xu, C. (2020). Using genomic data to improve the estimation of general combining ability based on sparse partial diallel cross designs in maize. The Crop Journal, 8(5), 819-829) used in the following examples are all publicly available.
Example 1
Implementation of GA-GBLUP algorithm
The GA-GBLUP algorithm uses a genomic best linear unbiased prediction model to predict individual breeding values. y is an n×1 vector representing a quantitative trait, and a mixed linear model containing m markers can be expressed as:
$$y = X\beta + \sum_{k=1}^{m} Z_k \gamma_k + \varepsilon$$
where X is an n×q design matrix for the fixed effects; β is a q×1 vector giving the magnitudes of the fixed effects; $Z_k$ is an n×1 vector of the genotypes of all individuals at the k-th marker; and ε is the residual vector, which follows $N(0, I\sigma^2)$. m is the total number of markers, n is the number of samples, q is the number of fixed effects, and $\gamma_k$ is the magnitude of the k-th marker effect. The mixed linear model is solved by restricted maximum likelihood (REML) to estimate the fixed effects β and the random effects γ. Predicted values for the test set are obtained from the estimated parameters, and cross-validation is then performed to evaluate the predictive ability of the model.
The GA-GBLUP algorithm mainly comprises the following steps:
1) Chromosome representation
All m molecular markers are encoded in 0/1 form to obtain a vector $\delta = [\delta_1\ \delta_2\ \ldots\ \delta_m]$, where $\delta_k = 0$ means that the k-th marker is excluded and $\delta_k = 1$ means that it is retained. The random selection is repeated 100 times, yielding 100 different δ vectors for initializing the GA-GBLUP algorithm.
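The chromosome initialization can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: the function name `init_population` and its parameters are hypothetical, the 1% selection proportion is the value mentioned in the description, and the sketch does not enforce that all 100 vectors are distinct (duplicates are merely very unlikely for realistic marker counts).

```python
import random

def init_population(m, n_chromosomes=100, proportion=0.01, seed=None):
    """Return n_chromosomes random 0/1 vectors (chromosomes) of length m.

    Each vector keeps round(proportion * m) markers, at least one.
    The names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(proportion * m))
    population = []
    for _ in range(n_chromosomes):
        delta = [0] * m
        # Draw n_keep distinct marker positions and switch them on.
        for k in rng.sample(range(m), n_keep):
            delta[k] = 1
        population.append(delta)
    return population

# Example with the marker count of the rice dataset (1619 bin markers).
population = init_population(m=1619, seed=1)
print(len(population), sum(population[0]))  # 100 chromosomes, 16 markers kept in each
```

Fixing the number of retained markers per chromosome is one design choice; drawing each site independently with probability 0.01 would give the same proportion only on average.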
2) Fitness calculation
For each of the above δ vectors, the markers with $\delta_k = 1$ are retained from among all markers, and the retained markers are used to construct a genome prediction model, whose fitness is calculated according to a fitness function. Any one of the following functions may be used:
Akaike information criterion (AIC)
AIC = 2m - 2 ln(L)
Where: m is the number of estimated parameters and L is the likelihood of the model; AIC is the fitness value calculated using the Akaike information criterion.
Bayesian information criterion (BIC)
BIC = m ln(n) - 2 ln(L)
Where: m is the number of estimated parameters, L is the likelihood of the model, and n is the sample size; BIC is the fitness value calculated using the Bayesian information criterion.
FIT function
FIT = 1 - SSE/SST
Where: SST is the total sum of squares of the phenotypic values and SSE is the residual sum of squares; FIT is the fitness value calculated using the FIT function.
HAT function
HAT = 1 - PRESS/SST
Where: PRESS is the sum of squares of the prediction residuals of the mixed linear model and SST is the total sum of squares of the phenotypic values; HAT is the fitness value calculated using the HAT function.
After the fitness calculation is complete, the δ vectors are ranked by fitness; the top 5% with the highest fitness are retained and the remaining δ vectors are eliminated. The 10-fold cross-validation results show that the FIT and HAT functions are more effective.
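The four fitness functions and the top-5% selection step can be sketched as follows. The sketch makes several assumptions that the text does not settle: the log-likelihood, fitted values and prediction residuals are taken as inputs (the model fit itself is not shown), the out-of-sample predictions used for PRESS are assumed to come from some leave-out scheme, and all function names are hypothetical. For AIC and BIC smaller values indicate a better model, so a caller ranking by "highest fitness" would presumably rank on the negated value.

```python
import math

def aic(n_params, log_lik):
    """AIC = 2m - 2 ln(L); log_lik is ln(L). Lower is better."""
    return 2 * n_params - 2 * log_lik

def bic(n_params, n_samples, log_lik):
    """BIC = m ln(n) - 2 ln(L). Lower is better."""
    return n_params * math.log(n_samples) - 2 * log_lik

def _one_minus_ratio(observed, predicted):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    mean = sum(observed) / len(observed)
    sst = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_res / sst

def fit(observed, fitted):
    """FIT = 1 - SSE/SST, using in-sample fitted values."""
    return _one_minus_ratio(observed, fitted)

def hat(observed, out_of_sample):
    """HAT = 1 - PRESS/SST, using predictions made without each observation."""
    return _one_minus_ratio(observed, out_of_sample)

def select_top(population, scores, fraction=0.05):
    """Keep the highest-scoring fraction of the population (at least one)."""
    n_keep = max(1, round(fraction * len(population)))
    order = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    return [population[i] for i in order[:n_keep]]

observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
print(round(fit(observed, fitted), 4))
```

FIT and HAT share one formula and differ only in which predictions are supplied, which is why the sketch routes both through the same helper.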
3) Genetic manipulation
For the 5 retained δ vectors, each site is first mutated from 1 to 0 or from 0 to 1 with a probability of 0.1 per site, e.g.
δ(i) = [1 0 1 1 0 0 1 0 1 1] (pre-mutation)
δ(j) = [1 0 1 1 0 0 1 0 0 1] (post-mutation)
Here the 9th site of the δ vector has mutated from 1 to 0, so the 9th marker is excluded from the model.
Then a pair of δ vectors is selected at random each time, and crossover is performed between the two paired δ vectors, so that the information at several sites or across large regions of the two vectors is recombined.
parent(i) = [1 0 1 0 0 0 1 0 1 1]
parent(j) = [1 0 1 1 0 0 1 0 0 1]
child(i) = [1 0 1 0 0 0 1 0 1 1]
child(j) = [1 0 1 1 0 0 1 0 0 1]
Pairing and crossover together produce new individuals; the above process is repeated 50 times until 100 different δ vectors have been produced.
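The genetic operators of this step can be sketched as below. The text fixes the per-site mutation probability (0.1) and the number of pairings (50), but it does not specify the crossover scheme, so the two-point segment swap used here is an assumption, as are all function names. Each pairing yields two children, which is how 50 pairings give 100 vectors.

```python
import random

def mutate(delta, rate=0.1, rng=random):
    """Flip each site (1 -> 0 or 0 -> 1) independently with probability rate."""
    return [1 - bit if rng.random() < rate else bit for bit in delta]

def crossover(parent_i, parent_j, rng=random):
    """Two-point crossover (assumed scheme): swap the segment between two cuts."""
    m = len(parent_i)
    a, b = sorted(rng.sample(range(m + 1), 2))
    child_i = parent_i[:a] + parent_j[a:b] + parent_i[b:]
    child_j = parent_j[:a] + parent_i[a:b] + parent_j[b:]
    return child_i, child_j

def breed(survivors, n_pairings=50, rate=0.1, seed=None):
    """Mutate the survivors, then pair and cross them n_pairings times."""
    rng = random.Random(seed)
    mutated = [mutate(d, rate, rng) for d in survivors]
    children = []
    for _ in range(n_pairings):
        p_i, p_j = rng.sample(mutated, 2)  # random pair of distinct survivors
        children.extend(crossover(p_i, p_j, rng))
    return children

rng = random.Random(0)
parent_i = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
parent_j = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
child_i, child_j = crossover(parent_i, parent_j, rng)

survivors = [[rng.randint(0, 1) for _ in range(10)] for _ in range(5)]
children = breed(survivors, seed=0)
print(len(children))  # 100
```

Whether the 5 survivors are also carried into the next generation unchanged is not stated in the text; this sketch returns only the newly generated children.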
The fitness of the different δ vectors is then calculated again and the 5 individuals with the highest fitness are selected. These steps are repeated until the fitness of the model no longer increases or the preset number of iterations is reached. The molecular marker subset finally selected by the algorithm is then used as the new marker matrix in the mixed linear model, which is solved by restricted maximum likelihood to estimate the fixed and random effects. On this basis, the genotypes of the test data are substituted into the mixed linear model to obtain the phenotypic values of the test set, and the predictive ability of the model is evaluated by 10-fold cross-validation. The phenotypes of all potential hybrids are then predicted, and hybrids with better target traits are selected for field evaluation.
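The iteration and stopping rule described here can be sketched as a generic loop that takes the fitness function and the breeding step as arguments. The `patience` parameter (how many generations without improvement count as "no longer increases") is an assumption, and the toy fitness and breeding functions in the usage part are hypothetical stand-ins for the mixed-model fit and the genetic operators; they only make the sketch runnable.

```python
import random

def run_ga(population, fitness, breed, n_keep=5, max_iter=50, patience=5):
    """Select, breed and repeat until fitness stalls or max_iter is reached.

    fitness: callable mapping a 0/1 vector to a score (higher is better).
    breed:   callable mapping the retained vectors to a new population.
    """
    best_score, best_delta, stale = float("-inf"), None, 0
    for _ in range(max_iter):
        # Score every chromosome once, then rank from best to worst.
        scored = sorted(((fitness(d), d) for d in population),
                        key=lambda pair: pair[0], reverse=True)
        survivors = [d for _, d in scored[:n_keep]]
        top_score = scored[0][0]
        if top_score > best_score:
            best_score, best_delta, stale = top_score, scored[0][1], 0
        else:
            stale += 1
            if stale >= patience:  # fitness no longer increases
                break
        population = breed(survivors)
    return best_delta, best_score

# Toy usage: fitness counts agreement with a hidden target mask (hypothetical).
rng = random.Random(0)
m = 20
target = [rng.randint(0, 1) for _ in range(m)]

def toy_fitness(delta):
    return sum(1 for d, t in zip(delta, target) if d == t)

def toy_breed(survivors):
    children = []
    for _ in range(20):
        child = list(rng.choice(survivors))
        k = rng.randrange(m)
        child[k] = 1 - child[k]  # flip one random site
        children.append(child)
    return children

start = [[rng.randint(0, 1) for _ in range(m)] for _ in range(20)]
best, score = run_ga(start, toy_fitness, toy_breed)
print(score, toy_fitness(best))
```

The loop returns the best vector seen in any generation, so a later generation that happens to be worse cannot degrade the final marker subset.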
Example 2
Use of GA-GBLUP algorithm on rice hybrid population
The 1619 bin markers of the 278 hybrids of the rice IMF2 population are used as genotype data, and four traits (yield, tiller number per plant, weight per panicle and thousand-grain weight) are used as phenotype data. The 278 hybrids are randomly divided into 10 parts of approximately equal size, of which 9 serve as the training set and 1 as the test set. On the training set, GA-GBLUP models with different hyperparameters are used to select markers from all 1619 markers, and marker selection is complete once the specified number of iterations is reached. The selected marker subset is then used to construct the kinship matrix for the training and test sets and to predict the traits of the test set. This procedure is carried out in turn until every part has been predicted once as the test set, and the coefficient of determination between the predicted and observed values is taken as the prediction accuracy. The whole process is repeated 15 times to remove the random variation introduced by the GA. In FIG. 2 the dashed line shows the predictive ability of the GBLUP method and the box plots show the prediction accuracy of the GA-GBLUP algorithm with different hyperparameters. The figure shows that combining the GA-GBLUP algorithm with the FIT or HAT fitness function effectively improves the predictive ability of whole genome selection for rice hybrids. Compared with the conventional GBLUP algorithm, GA-GBLUP improves predictive ability by up to 24.2%, 12.6%, 3.9% and 2.2% for yield, tiller number per plant, panicle weight and thousand-grain weight, respectively, which is of great significance for low-heritability traits such as yield.
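The evaluation protocol can be sketched as a fold assignment plus an accuracy measure. Two assumptions are made: the coefficient of determination is computed as the squared Pearson correlation between predicted and observed values (the text does not say whether this or 1 - SSE/SST is meant), and the function names are hypothetical. Marker selection and model fitting inside each fold are left as a comment.

```python
import random

def kfold_indices(n, k=10, seed=None):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def squared_correlation(observed, predicted):
    """Squared Pearson correlation between observed and predicted values.

    Assumes both inputs vary; a constant vector would give a zero denominator.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_op * s_op / (s_oo * s_pp)

# Fold assignment for the 278 rice hybrids; one pass of 10-fold cross-validation.
folds = kfold_indices(278, k=10, seed=1)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(278) if i not in held_out]
    # Marker selection and model fitting on train_idx, then prediction
    # of test_idx, would go here.

print([len(f) for f in folds])
```

Repeating the pass with 15 different seeds and collecting one accuracy value per pass would give the replicate values summarized by a box plot.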
Example 3
Use of GA-GBLUP algorithm on corn hybrid population
The 11255 SNP markers of 305 maize hybrids are used as genotype data, and two traits (ear weight and plant height) are used as phenotype data. The 305 hybrids are randomly divided into 10 parts of approximately equal size, of which 9 serve as the training set and 1 as the test set. On the training set, GA-GBLUP models with different hyperparameters are used to select markers from all 11255 markers, and marker selection is complete once the specified number of iterations is reached. The selected marker subset is then used to construct the kinship matrix for the training and test sets and to predict the traits of the test set. This procedure is carried out in turn until every part has been predicted once as the test set, and the coefficient of determination between the predicted and observed values is taken as the prediction accuracy. The whole process is repeated 15 times to remove the random variation introduced by the GA. In FIG. 3 the dashed line shows the predictive ability of the GBLUP method and the box plots show the prediction accuracy of the GA-GBLUP algorithm with different hyperparameters. The figure shows that combining the GA-GBLUP algorithm with the FIT or HAT fitness function effectively improves the predictive ability of whole genome selection for maize hybrids. For the prediction of ear weight, GA-GBLUP increases predictive ability by 11.2% compared with the GBLUP method.
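Constructing the kinship matrix from the selected marker subset can be sketched as follows. The text does not give the formula, so the sketch assumes one common form, K = Z_s Z_s^T / m_s, where Z_s holds the genotype codes of the m_s selected markers; the -1/0/1 coding, the absence of centring, and the function name are all assumptions.

```python
def kinship(genotypes, delta):
    """Kinship matrix from the markers switched on in delta.

    genotypes: list of n rows, each a list of m numeric marker codes.
    delta:     0/1 list of length m; markers with 1 are used.
    Assumed form: K = Z_s Z_s^T / m_s (no centring or scaling).
    """
    keep = [k for k, d in enumerate(delta) if d == 1]
    if not keep:
        raise ValueError("delta selects no markers")
    sub = [[row[k] for k in keep] for row in genotypes]
    n = len(sub)
    return [
        [sum(a * b for a, b in zip(sub[i], sub[j])) / len(keep) for j in range(n)]
        for i in range(n)
    ]

# Three individuals, three markers; the second marker is masked out.
genotypes = [[1, 0, -1], [1, 1, 0], [-1, 0, 1]]
K = kinship(genotypes, delta=[1, 0, 1])
for row in K:
    print(row)
```

Because the matrix is built over training and test individuals together, the rows of `genotypes` would include both sets, and the training-by-test block of K is what carries information to the unphenotyped hybrids.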
Claims (10)
1. A genetic algorithm-based whole genome prediction method, characterized in that a genetic algorithm is used to select an optimal set of molecular markers and, on that basis, a genomic best linear unbiased prediction model is used to predict the breeding value of individuals, the method comprising the following steps:
obtaining the molecular markers of the crop to be predicted;
randomly selecting a certain proportion of the molecular markers to initialize the genetic algorithm, constructing a genome prediction model, calculating the fitness of the different molecular marker subsets, retaining the subsets with higher fitness, and applying mutation, pairing and crossover to the retained subsets at a certain rate to generate new molecular marker subsets;
recalculating the fitness of the different molecular marker subsets and retaining those with higher fitness until the maximum number of iterations or convergence is reached, thereby obtaining a final molecular marker subset, and constructing a genomic best linear unbiased prediction model; and substituting the genotypes of the crop to be predicted into the genomic best linear unbiased prediction model to obtain the phenotypic values of the crop to be predicted.
2. The genetic algorithm-based whole genome prediction method according to claim 1, wherein the method of constructing the genomic kinship matrix prediction model comprises:
y is an n×1 vector representing a quantitative trait, and the mixed linear model containing m markers is expressed as:
$$y = X\beta + \sum_{k=1}^{m} Z_k \gamma_k + \varepsilon$$
where X is an n×q design matrix for the fixed effects; β is a q×1 vector giving the magnitudes of the fixed effects; $Z_k$ is an n×1 vector of the genotypes of all individuals at the k-th marker; ε is the residual vector, which follows $N(0, I\sigma^2)$; m is the total number of markers; n is the number of samples; q is the number of fixed effects; and $\gamma_k$ is the magnitude of the k-th marker effect; the mixed linear model is solved by restricted maximum likelihood to estimate the fixed effects β and the random effects γ; and predicted values for the test set are obtained from the estimated parameters, and cross-validation is then performed to evaluate the predictive ability of the model.
3. The genetic algorithm-based whole genome prediction method according to claim 1, wherein the step of randomly selecting comprises: encoding all m molecular markers in 0/1 form to obtain a vector $\delta = [\delta_1\ \delta_2\ \ldots\ \delta_m]$, where $\delta_k = 0$ means that the k-th marker is excluded and $\delta_k = 1$ means that it is retained; the random selection is repeated 100 times, yielding 100 different δ vectors for initializing the GA.
4. The genetic algorithm-based whole genome prediction method according to claim 1, wherein the fitness of the different molecular marker subsets is calculated by any one of the following methods:
Akaike information criterion:
AIC=2m-2ln(L)
where m is the number of parameters being estimated and L is the likelihood value of the model; AIC represents the fitness calculation result obtained using the Akaike information criterion;
Bayesian information criterion:
BIC=mln(n)-2ln(L)
where m is the number of parameters being estimated, L is the likelihood value of the model, and n is the sample size; BIC represents a fitness calculation result calculated by adopting a Bayesian information criterion;
FIT function:
FIT=1-SSE/SST
where SST is the total sum of squares of the phenotypic values and SSE is the residual sum of squares; FIT represents the fitness calculation result obtained using the FIT function;
HAT function:
HAT=1-PRESS/SST
where PRESS is the predicted residual sum of squares of the mixed linear model and SST is the total sum of squares of the phenotypic values; HAT represents the fitness calculation result obtained using the HAT function.
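The four fitness formulas in claim 4 translate directly into code. The sketch below is illustrative only: the function and argument names are assumptions, and the inputs (log-likelihood, SSE, PRESS, SST) are taken as already computed from a fitted model. Note that AIC and BIC are smaller for better models, whereas FIT and HAT are larger, so a genetic algorithm using them has to rank subsets in the matching direction.

```python
import math

def aic(n_params, log_lik):
    """AIC = 2m - 2 ln(L), with m = n_params and ln(L) = log_lik."""
    return 2 * n_params - 2 * log_lik

def bic(n_params, n_samples, log_lik):
    """BIC = m ln(n) - 2 ln(L)."""
    return n_params * math.log(n_samples) - 2 * log_lik

def fit_score(sse, sst):
    """FIT = 1 - SSE / SST (goodness of fit on the training data)."""
    return 1 - sse / sst

def hat_score(press, sst):
    """HAT = 1 - PRESS / SST (based on prediction residuals)."""
    return 1 - press / sst

# Worked example with made-up numbers.
example_aic = aic(n_params=3, log_lik=-10.0)   # 2*3 + 20 = 26
example_fit = fit_score(sse=2.0, sst=8.0)      # 1 - 0.25 = 0.75
```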
5. The genetic algorithm-based whole genome prediction method according to claim 1, wherein mutating, pairing and crossing over the retained molecular marker subsets at a certain ratio to generate new molecular marker subsets comprises:
mutating the retained molecular marker vectors from 1 to 0 or from 0 to 1 with a probability of 0.1 at each site; randomly selecting a pair of δ vectors at a time and performing crossover between the two paired δ vectors, so that the information at multiple positions or in large regions of the two δ vectors is recombined; pairing and crossover together produce new molecular marker vectors.
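The mutation and crossover operators in claim 5 might look like the sketch below. The per-site mutation probability of 0.1 comes from the claim; the single-point crossover is an assumption chosen for brevity, since the claim only says that multiple positions or large regions are recombined. The names `mutate` and `crossover` are hypothetical.

```python
import random

def mutate(delta, rate=0.1, rng=random):
    """Flip each 0/1 site independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in delta]

def crossover(a, b, rng=random):
    """Single-point crossover (an assumed scheme): swap the tails of a and b.

    Requires len(a) == len(b) >= 2.
    """
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

rng = random.Random(0)
parent_a = [1, 1, 1, 1, 1, 1]
parent_b = [0, 0, 0, 0, 0, 0]

# Pair two delta vectors, recombine them, then mutate one parent.
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(parent_a, rate=0.1, rng=rng)
```

A multi-point or uniform crossover would fit the claim's wording equally well; only the recombination function would change.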
6. The genetic algorithm-based whole genome prediction method according to claim 1, wherein the predictive power of the model is evaluated by 10-fold cross-validation.
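A 10-fold cross-validation loop of the kind named in claim 6 can be sketched as follows. The helper names and the `fit_predict` callback are hypothetical, and the placeholder model simply predicts the training mean; in practice the callback would fit the mixed linear model on the training individuals and predict the held-out ones.

```python
import random

def k_fold_indices(n, k=10, seed=None):
    """Shuffle range(n) and split it into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(y, fit_predict, k=10, seed=None):
    """Return one out-of-fold prediction per sample.

    fit_predict(train_idx, test_idx) must return predictions for test_idx.
    """
    preds = [None] * len(y)
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        for i, p in zip(test_idx, fit_predict(train_idx, test_idx)):
            preds[i] = p
    return preds

# Placeholder data and model (assumptions for illustration only).
y = [float(v) for v in range(25)]

def mean_model(train_idx, test_idx):
    mu = sum(y[i] for i in train_idx) / len(train_idx)
    return [mu] * len(test_idx)

oof = cross_validate(y, mean_model, k=10, seed=0)
```

The out-of-fold predictions in `oof` can then be compared with the observed phenotypes using whichever accuracy measure the evaluation calls for.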
7. A method of predicting agronomic traits in a crop hybrid, comprising predicting agronomic traits in a crop hybrid using the method of claim 1.
8. The method of predicting agronomic traits in crop hybrids as claimed in claim 7, wherein the crop is rice or maize; the agronomic trait is a quantitative trait controlled by minor-effect polygenes.
9. The method of predicting agronomic traits in crop hybrids of claim 7, wherein the agronomic traits comprise crop yield and quality traits.
10. The method of predicting agronomic traits in crop hybrids of claim 7, wherein the agronomic traits comprise yield, tiller number per plant, ear weight per spike, thousand kernel weight and plant height.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310741264.1A CN116863998B (en) | 2023-06-21 | 2023-06-21 | Genetic algorithm-based whole genome prediction method and application thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116863998A (en) | 2023-10-10 |
CN116863998B (en) | 2024-04-05 |
Family
ID=88220750
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310741264.1A Active CN116863998B (en) | 2023-06-21 | 2023-06-21 | Genetic algorithm-based whole genome prediction method and application thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116863998B (en) |
Patent Citations (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005061731A1 (en) * | 2003-12-24 | 2005-07-07 | Nanyang Polytechnic | Method and system for unbiased genome amplification using genetic algorithms to select primers for genomic dna amplification |
WO2005078133A2 (en) * | 2004-02-09 | 2005-08-25 | Monsanto Technology Llc | Marker assisted best linear unbiased predicted (ma-blup): software adaptions for practical applications for large breeding populations in farm animal species |
WO2008025093A1 (en) * | 2006-09-01 | 2008-03-06 | Innovative Dairy Products Pty Ltd | Whole genome based genetic evaluation and selection process |
AU2007214360A1 (en) * | 2006-09-01 | 2008-03-20 | Innovative Dairy Products Pty Ltd | Whole genome based genetic evaluation and selection process |
WO2009035560A1 (en) * | 2007-09-12 | 2009-03-19 | Pfizer, Inc. | Methods of using genetic markers and related epistatic interactions |
WO2010020252A1 (en) * | 2008-08-19 | 2010-02-25 | Viking Genetics Fmba | Methods for determining a breeding value based on a plurality of genetic markers |
CN103026361A (en) * | 2010-06-03 | 2013-04-03 | 先正达参股股份有限公司 | Methods and compositions for predicting unobserved phenotypes (PUP) |
US20120151625A1 (en) * | 2010-11-30 | 2012-06-14 | Zhigang Guo | Methods for increasing genetic gain in a breeding population |
WO2013107048A1 (en) * | 2012-01-20 | 2013-07-25 | 深圳华大基因健康科技有限公司 | Method and system for determining whether copy number variation exists in sample genome, and computer readable medium |
US20140283152A1 (en) * | 2013-03-14 | 2014-09-18 | University Of Florida Research Foundation, Inc. | Method for artificial selection |
CN106255764A (en) * | 2013-12-20 | 2016-12-21 | 比勒陀利亚大学 | The Disease Resistance labelling of Semen Maydis |
US20150181822A1 (en) * | 2013-12-31 | 2015-07-02 | Dow Agrosciences Llc | Selection based on optimal haploid value to create elite lines |
CN106028798A (en) * | 2013-12-31 | 2016-10-12 | 美国陶氏益农公司 | Selection based on optimal haploid value to create elite lines |
WO2017210102A1 (en) * | 2016-06-01 | 2017-12-07 | Institute For Systems Biology | Methods and system for generating and comparing reduced genome data sets |
CN109997192A (en) * | 2016-06-15 | 2019-07-09 | 哈佛学院董事及会员团体 | Method for rule-based genome design |
CN109688805A (en) * | 2016-07-11 | 2019-04-26 | 先锋国际良种公司 | The method for generating gray leaf spot resistance maize |
CN106779076A (en) * | 2016-11-18 | 2017-05-31 | 栾图 | Breeding variety system and its algorithm based on biological information |
CN110476214A (en) * | 2017-03-30 | 2019-11-19 | 孟山都技术有限公司 | System and method for identifying the Aggregate effect of the genome editor of multiple genome editors and prediction identification |
CN112601826A (en) * | 2018-02-27 | 2021-04-02 | 康奈尔大学 | Ultrasensitive detection of circulating tumor DNA by whole genome integration |
CN112204156A (en) * | 2018-05-25 | 2021-01-08 | 先锋国际良种公司 | Systems and methods for improving breeding by modulating recombination rates |
CA3105404A1 (en) * | 2018-07-03 | 2020-01-09 | New West Genetics Inc. | Cannabis variety which produces greater than 50% female plants |
CN110273007A (en) * | 2019-06-27 | 2019-09-24 | 广西扬翔农牧有限责任公司 | SNP marker relevant to the effective sperm count of boar and its preparation method and application |
CN112980962A (en) * | 2019-12-12 | 2021-06-18 | 深圳华大生命科学研究院 | SNP marker related to birth weight trait of pig and application thereof |
WO2021202910A1 (en) * | 2020-04-02 | 2021-10-07 | Embark Veterinary, Inc. | Methods and systems for determining pigmentation phenotypes |
CN111863137A (en) * | 2020-05-28 | 2020-10-30 | 上海朴岱生物科技合伙企业(有限合伙) | Complex disease state evaluation method established based on high-throughput sequencing data and clinical phenotype and application |
CN111640508A (en) * | 2020-05-28 | 2020-09-08 | 上海生物信息技术研究中心 | Method for constructing pan-tumor targeted drug susceptibility state evaluation model based on high-throughput sequencing data and clinical phenotype and application |
CN112802548A (en) * | 2021-01-07 | 2021-05-14 | 深圳吉因加医学检验实验室 | Method for predicting allele-specific copy number variation of single-sample whole genome |
CN113223606A (en) * | 2021-05-13 | 2021-08-06 | 浙江大学 | Genome selection method for genetic improvement of complex traits |
CN113234848A (en) * | 2021-05-26 | 2021-08-10 | 北京林业大学 | Molecular marker related to poplar stomatal morphology and photosynthetic efficiency and application thereof |
CN114317779A (en) * | 2022-01-19 | 2022-04-12 | 华中农业大学 | SNP molecular marker related to pig carcass traits and application |
CN114863991A (en) * | 2022-06-21 | 2022-08-05 | 沈阳农业大学 | Method for improving whole genome prediction precision based on two-step prediction model establishment |
CN116210571A (en) * | 2023-03-06 | 2023-06-06 | 广州市林业和园林科学研究院 | Three-dimensional greening remote sensing intelligent irrigation method and system |
Non-Patent Citations (2)
Title |
---|
TENG JIN-YAN et al.: "Incorporating genomic annotation into single-step genomic prediction with imputed whole-genome sequence data", 《JOURNAL OF INTEGRATIVE AGRICULTURE》, vol. 21, no. 4, pages 1126 - 1136, XP086994044, DOI: 10.1016/S2095-3119(21)63813-3 * |
朱墨 (ZHU Mo) et al.: "基于GBLUP 和BayesB 方法对肉鸡屠宰性状基因组预测准确性的比较" [Comparison of genomic prediction accuracy for broiler slaughter traits based on the GBLUP and BayesB methods], 《中国农业科学》 (Scientia Agricultura Sinica), vol. 54, no. 23, pages 5125 - 5131 * |
Also Published As
Publication number | Publication date |
---|---|
CN116863998B (en) | 2024-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wallace et al. | On the road to breeding 4.0: unraveling the good, the bad, and the boring of crop quantitative genomics | |
Mir et al. | Genetic dissection of grain weight in bread wheat through quantitative trait locus interval and association mapping | |
Riedelsheimer et al. | Comparison of whole-genome prediction models for traits with contrasting genetic architecture in a diversity panel of maize inbred lines | |
Yin et al. | Genetic dissection on rice grain shape by the two-dimensional image analysis in one japonica× indica population consisting of recombinant inbred lines | |
CN107278877B (en) | A kind of full-length genome selection and use method of corn seed-producing rate | |
Pace et al. | Genomic prediction of seedling root length in maize (Zea mays L.) | |
Liu et al. | The impact of genetic relationship and linkage disequilibrium on genomic selection | |
Mir et al. | Allelic diversity, structural analysis, and Genome-Wide Association Study (GWAS) for yield and related traits using unexplored common bean (Phaseolus vulgaris L.) germplasm from Western Himalayas | |
Gosseau et al. | Heliaphen, an outdoor high-throughput phenotyping platform for genetic studies and crop modeling | |
CN114292928B (en) | Molecular marker related to sow breeding traits and screening method and application | |
Geuten et al. | Conflicting phylogenies of balsaminoid families and the polytomy in Ericales: combining data in a Bayesian framework | |
CN113053459A (en) | Hybrid prediction method for integrating parental phenotypes based on Bayesian model | |
McGaugh et al. | The utility of genomic prediction models in evolutionary genetics | |
Gonzalo et al. | Direct mapping of density response in a population of B73× Mo17 recombinant inbred lines of maize (Zea mays L.) | |
CN113223606B (en) | Genome selection method for genetic improvement of complex traits | |
Hodgins et al. | Asymmetrical mating patterns and the evolution of biased style-morph ratios in a tristylous daffodil | |
CN108197435B (en) | Marker locus genotype error-containing multi-character multi-interval positioning method | |
CN116863998B (en) | Genetic algorithm-based whole genome prediction method and application thereof | |
Howard et al. | Overview of Genomic Prediction Methods and the Associated Assumptions on the Variance of Marker Effect, and on the Architecture of the Target Trait | |
You et al. | Genomic cross prediction for linseed improvement | |
CN115732027A (en) | Genome selection method and application thereof in breeding of autopolyploid species | |
Fu et al. | A statistical model for mapping morphological shape | |
El-Kassaby et al. | Modern advances in tree breeding | |
Huang et al. | Evidence for two types of Aquilegia ecalcarata and its implications for adaptation to new environments | |
Sehgal et al. | Genomic selection in wheat: Progress, opportunities and challenges |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||