CN116863998A

CN116863998A - Genetic algorithm-based whole genome prediction method and application thereof

Info

Publication number: CN116863998A
Application number: CN202310741264.1A
Authority: CN
Inventors: 徐扬; 张宇翔; 周恺; 于广宁; 李成; 杨文艳; 王欣; 徐辰武; 杨泽峰; 鲁月; 陈茹佳; 陶天云; 李鹏程
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2023-06-21
Filing date: 2023-06-21
Publication date: 2023-10-10
Anticipated expiration: 2043-06-21
Also published as: CN116863998B

Abstract

The invention belongs to the field of bioinformatics and relates to a genetic algorithm-based whole genome prediction method and its application, in which a genomic best linear unbiased prediction (GBLUP) model is used to predict the breeding value of an individual. The method comprises the following steps: obtaining the molecular markers of the crop to be predicted; repeatedly and randomly selecting a certain proportion of the molecular markers to form molecular marker subsets for initializing the genetic algorithm; constructing a genome prediction model and calculating the fitness of the different molecular marker subsets; retaining the subsets with higher fitness, and applying mutation, pairing and crossover to the retained subsets at a certain rate to generate new molecular marker subsets; calculating the fitness of the different subsets again and retaining those with higher fitness until the maximum number of iterations or convergence is reached, thereby obtaining a final molecular marker subset; and constructing a GBLUP model with it for whole genome prediction. The method can be used to improve the accuracy of whole genome selection of hybrids and can provide important technical support for the precision breeding of hybrids.

Description

Genetic algorithm-based whole genome prediction method and application thereof

Technical Field

The invention belongs to the field of bioinformatics and relates to a genetic algorithm-based whole genome prediction method and its application.

Background

Breeding high-quality, high-yield, green and efficient crop varieties is a top priority in current crop genetics and breeding. Traditional crop breeding relies on phenotypic selection: breeders examine the phenotypes of crop lines in the field and in the laboratory and, drawing on their breeding experience, select lines with the traits of interest for further evaluation. However, many traits related to crop yield and quality are quantitative traits that are controlled by a large number of minor-effect quantitative trait loci and are susceptible to environmental influences, so selection by phenotype alone is unreliable. Advances in molecular biology have made marker-assisted selection possible; however, marker-assisted selection is only applicable to traits controlled by a few major quantitative trait loci and is ineffective for traits such as crop yield and quality. Whole genome selection builds statistical models from high-density molecular markers covering the whole genome together with crop phenotypes, in order to predict the performance of materials whose genotype is known but whose phenotype is unknown. Whole genome selection includes the effects of all markers on the genome in the model regardless of their significance level and is therefore particularly suited to quantitative traits such as crop yield and quality, which are controlled by many minor-effect genes. The genomic best linear unbiased prediction (GBLUP) model is the most robust and general of the whole genome selection models. However, the GBLUP model assumes that all molecular markers contribute equally to the target trait, which contradicts the findings of modern molecular genetics and limits further improvement of the prediction accuracy of the GBLUP method.

Disclosure of Invention

The invention aims to provide an application of GA-GBLUP, a genetic algorithm-based genomic best linear unbiased prediction algorithm, in predicting the agronomic traits of hybrids. The GA-GBLUP algorithm can effectively improve the predictive ability for agronomic traits such as the yield of rice and maize hybrids. The invention can therefore be used to improve the accuracy of whole genome selection of hybrids, is of great significance for exploiting heterosis in rice and maize, and can provide important technical support for the precision breeding of hybrids.

The object of the invention is achieved by the following technical solution:

A genetic algorithm-based whole genome prediction method, in which optimal molecular markers are selected by a genetic algorithm and the breeding value of an individual is predicted with a genomic best linear unbiased prediction (GBLUP) model, comprises the following steps:

obtaining the molecular markers of the crop to be predicted;

randomly selecting a certain proportion of the molecular markers to initialize the genetic algorithm, constructing a genome prediction model, calculating the fitness of the different molecular marker subsets, retaining the molecular marker subsets with higher fitness, and applying mutation, pairing and crossover to the retained subsets at a certain rate to generate new molecular marker subsets;

and calculating the fitness of the different molecular marker subsets again, retaining the subsets with higher fitness until the maximum number of iterations or convergence is reached, obtaining a final molecular marker subset, and constructing a GBLUP model.

Further, the genome prediction model is constructed as follows:

y is an n × 1 vector representing a quantitative trait, and the mixed linear model containing m markers is expressed as:

$$y = X\beta + \sum_{k=1}^{m} Z_k \gamma_k + \varepsilon$$

wherein $X$ is an n × q fixed-effect design matrix; $\beta$ is a q × 1 vector representing the magnitudes of the fixed effects; $Z_k$ is an n × 1 vector representing the genotypes of all individuals at the k-th marker; $\varepsilon$ is the residual vector, which follows $N(0, I\sigma^2)$; m is the number of markers, n is the number of samples, q is the number of fixed effects, and $\gamma_k$ is the magnitude of the k-th marker effect. The mixed linear model is solved by restricted maximum likelihood (REML) estimation to estimate the fixed effects $\beta$ and the random effects $\gamma$; predicted values for the test set are obtained from the estimated parameters, and cross-validation is then performed to evaluate the predictive ability of the model.

Further, the step of randomly selecting includes: all m molecular markers are encoded in 0/1 form to obtain a vector $\delta = [\delta_1\ \delta_2\ \dots\ \delta_m]$, wherein $\delta_k = 0$ means that the k-th marker is excluded and $\delta_k = 1$ means that the k-th marker is retained; the above procedure is repeated randomly 100 times, yielding 100 different δ vectors for initializing the GA.

Further, the fitness of the different molecular marker subsets is calculated by any one of the following methods:

Akaike information criterion:

$$\mathrm{AIC} = 2m - 2\ln(L)$$

where m is the number of estimated parameters and L is the likelihood of the model; AIC denotes the fitness value calculated with the Akaike information criterion;

Bayesian information criterion:

$$\mathrm{BIC} = m\ln(n) - 2\ln(L)$$

where m is the number of estimated parameters, L is the likelihood of the model, and n is the sample size; BIC denotes the fitness value calculated with the Bayesian information criterion;

FIT function:

$$\mathrm{FIT} = 1 - \mathrm{SSE}/\mathrm{SST}$$

where SST is the total sum of squares of the phenotypic values and SSE is the residual sum of squares; FIT denotes the fitness value calculated with the FIT function;

HAT function:

$$\mathrm{HAT} = 1 - \mathrm{PRESS}/\mathrm{SST}$$

where PRESS is the sum of squares of the prediction residuals of the mixed linear model and SST is the total sum of squares of the phenotypic values; HAT denotes the fitness value calculated with the HAT function.

Further, applying mutation, pairing and crossover to the retained molecular marker subsets at a certain rate to generate new molecular marker subsets comprises:

performing a 1→0 or 0→1 mutation on each retained molecular marker vector with a probability of 0.1 per site; randomly selecting a pair of δ vectors each time and performing crossover between the two paired δ vectors, so that the information at several sites or over large regions of the two δ vectors is recombined; pairing and crossover together produce new molecular marker vectors.

Further, the predictive ability of the model is evaluated by 10-fold cross-validation.

The invention also provides a method for predicting the agronomic traits of crop hybrids.

Further, the crop is rice or maize.

Further, the agronomic trait is a quantitative trait controlled by minor-effect polygenes.

Further, the agronomic traits include crop yield and quality traits.

Further, the agronomic traits include yield, tiller number of individual plants, spike weight, thousand seed weight and plant height.


In the embodiments of the invention, the crops are specifically rice and maize, both of which are gramineous plants. The algorithms are specifically a genetic algorithm and the genomic best linear unbiased prediction algorithm.

The method comprises the following steps. First, 1% of all molecular markers input by the user are randomly selected, and this is repeated 100 times to obtain 100 different chromosomes for initializing the GA. Genome prediction models are then constructed with the markers selected by each chromosome, and the fitness of each chromosome is calculated. The 5 chromosomes with the highest fitness are retained and the rest are eliminated. The 5 retained chromosomes then undergo mutation at a certain rate (a marker at a given position changes from selected to unselected, or from unselected to selected), pairing (chromosomes are paired two by two to generate new chromosomes) and crossover, finally generating 100 new chromosomes. Markers are again selected according to these chromosomes to construct the relationship matrix, and the above process is repeated until the maximum number of iterations or convergence is reached. A GBLUP model is constructed with the molecular marker subset finally selected by the algorithm, the phenotypes of the training set are predicted and the predictive ability of the model is evaluated; on this basis, the phenotypes of all potential hybrids are predicted, and hybrids with better target traits are selected for field evaluation.

The algorithm provided by the invention is named as a whole genome prediction method based on a genetic algorithm.

Advantageous effects

By adopting a genetic algorithm-based whole genome prediction method, the invention can improve the accuracy of whole genome selection of hybrids. Compared with the traditional GBLUP method, the GA-GBLUP method can effectively improve the predictive ability for agronomic traits of rice and maize hybrids such as yield, grain weight and plant height, is of great significance for rice and maize hybrid breeding, and provides an effective tool for improving the efficiency of crop variety breeding.

The genetic algorithm is combined with the traditional genomic best linear unbiased prediction method to form the GA-GBLUP algorithm. The algorithm can effectively improve the predictive ability of whole genome selection for the agronomic traits of rice and maize hybrids, improve the accuracy of whole genome selection of hybrid rice, and provide an accurate and reliable quantitative reference for breeding new crop varieties, thereby raising the level of breeding research and the efficiency of breeding.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 shows the performance of the invention on the rice hybrid dataset.

FIG. 3 shows the performance of the invention on the maize hybrid dataset.

Detailed Description

The technical solution provided by the invention is described in detail below with reference to examples, but the invention is not limited to these examples.

The genotype and phenotype data used in the following examples, for the rice IMF2 population (Hua, J.P., Xing, Y.Z., Wu, W.R., Xu, C.G., Sun, X.L., Yu, S.B., & Zhang, Q.F. (2003). Single-locus heterotic effects and dominance by dominance interactions can adequately explain the genetic basis of heterosis in an elite rice hybrid. Proceedings of the National Academy of Sciences of the United States of America, 100(5), 2574-2579) and the population of 305 maize hybrids (Wang, X., Zhang, Z., Xu, Y., Li, P., Zhang, X., & Xu, C. (2020). Using genomic data to improve the estimation of general combining ability based on sparse partial diallel cross designs in maize. The Crop Journal, 8(5), 819-829), are all publicly available.

Example 1

Implementation of GA-GBLUP algorithm

The GA-GBLUP algorithm uses a genomic best linear unbiased prediction model to predict individual breeding values. y is an n × 1 vector representing a quantitative trait, and a mixed linear model comprising m markers can be expressed as:

$$y = X\beta + \sum_{k=1}^{m} Z_k \gamma_k + \varepsilon$$

wherein $X$ is an n × q fixed-effect design matrix; $\beta$ is a q × 1 vector representing the magnitudes of the fixed effects; $Z_k$ is an n × 1 vector representing the genotypes of all individuals at the k-th marker; and $\varepsilon$ is the residual vector, which follows $N(0, I\sigma^2)$. m is the number of markers, n is the number of samples, q is the number of fixed effects, and $\gamma_k$ is the magnitude of the k-th marker effect. The mixed linear model is solved by restricted maximum likelihood (REML) estimation to estimate the fixed effects $\beta$ and the random effects $\gamma$; predicted values for the test set are obtained from the estimated parameters, and cross-validation is then performed to evaluate the predictive ability of the model.
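For reference, the per-marker sum in the mixed linear model can be written compactly in matrix form. The display below is only a restatement under stated assumptions: the stacked matrix $Z$, the stacked effect vector $\gamma$ and the independent normal distribution of the marker effects with common variance $\sigma_\gamma^2$ are notation introduced here for illustration and are not taken from the original text.

```latex
\[
Z = [\,Z_1\ Z_2\ \cdots\ Z_m\,] \in \mathbb{R}^{n \times m},
\qquad
\gamma = (\gamma_1, \gamma_2, \dots, \gamma_m)^{\top}
\]
\[
y = X\beta + Z\gamma + \varepsilon,
\qquad
\varepsilon \sim N(0,\ I\sigma^2)
\]
\[
\text{assuming } \gamma \sim N(0,\ I\sigma_\gamma^2) \text{ independent of } \varepsilon:
\qquad
\operatorname{Var}(y) = ZZ^{\top}\sigma_\gamma^2 + I\sigma^2
\]
```

Under that assumption, $ZZ^{\top}$ plays the role of a marker-derived relationship matrix (up to centring and scaling conventions), which is why changing the set of markers that enter $Z$ changes the covariance structure used for prediction.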

The GA-GBLUP algorithm mainly comprises the following steps:

1) Chromosome representation

All m molecular markers are encoded in 0/1 form to obtain a vector $\delta = [\delta_1\ \delta_2\ \dots\ \delta_m]$, wherein $\delta_k = 0$ means that the k-th marker is excluded and $\delta_k = 1$ means that the k-th marker is retained. The above procedure is repeated randomly 100 times, yielding 100 different δ vectors for initializing the GA-GBLUP algorithm.
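The chromosome initialization can be sketched in a few lines of Python. This is a minimal illustration rather than the inventors' implementation: the function name `init_population` and its parameters are hypothetical, and the choice of exactly 1% of the markers per vector follows the 1% proportion mentioned elsewhere in the description.

```python
import random


def init_population(m, pop_size=100, proportion=0.01, seed=None):
    """Return pop_size random 0/1 vectors of length m.

    Each vector keeps round(m * proportion) markers (at least one),
    chosen uniformly at random without replacement.
    """
    rng = random.Random(seed)
    n_kept = max(1, round(m * proportion))
    population = []
    for _ in range(pop_size):
        kept = set(rng.sample(range(m), n_kept))
        population.append([1 if k in kept else 0 for k in range(m)])
    return population


# Example with the marker count of the rice dataset (1619 bin markers).
population = init_population(m=1619, seed=1)
print(len(population), sum(population[0]))
```

Sampling without replacement guarantees that every δ vector starts with the same number of retained markers; drawing each site independently with probability 0.01 would be an equally plausible reading of the text.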

2) Fitness calculation

For each δ vector described above, the markers with $\delta_k = 1$ are retained from among all markers and used to construct a genome prediction model, whose fitness is calculated according to a fitness function. Any one of the following fitness functions may be employed:

Akaike information criterion (AIC)

$$\mathrm{AIC} = 2m - 2\ln(L)$$

Wherein: m is the number of estimated parameters and L is the likelihood of the model; AIC denotes the fitness value calculated with the Akaike information criterion.

Bayesian information criterion (BIC)

$$\mathrm{BIC} = m\ln(n) - 2\ln(L)$$

Wherein: m is the number of estimated parameters, L is the likelihood of the model, and n is the sample size; BIC denotes the fitness value calculated with the Bayesian information criterion.

FIT function

$$\mathrm{FIT} = 1 - \mathrm{SSE}/\mathrm{SST}$$

Wherein: SST is the total sum of squares of the phenotypic values and SSE is the residual sum of squares; FIT denotes the fitness value calculated with the FIT function.

HAT function

$$\mathrm{HAT} = 1 - \mathrm{PRESS}/\mathrm{SST}$$

Wherein: PRESS is the sum of squares of the prediction residuals of the mixed linear model and SST is the total sum of squares of the phenotypic values; HAT denotes the fitness value calculated with the HAT function.
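The four fitness functions are simple to evaluate once the model has been fitted. The sketch below assumes the fitted values, the held-out predictions used for PRESS and the log-likelihood are already available from a separate model fit; the function names and argument names are hypothetical. Note also that AIC and BIC are conventionally lower-is-better, so a GA that keeps the highest fitness would presumably rank on their negated values; the text does not state how this is handled.

```python
import math


def aic(n_params, log_lik):
    """AIC = 2m - 2 ln(L), with m = n_params and ln(L) = log_lik."""
    return 2 * n_params - 2 * log_lik


def bic(n_params, n_samples, log_lik):
    """BIC = m ln(n) - 2 ln(L)."""
    return n_params * math.log(n_samples) - 2 * log_lik


def total_sum_of_squares(y):
    mean_y = sum(y) / len(y)
    return sum((v - mean_y) ** 2 for v in y)


def fit_score(y, fitted):
    """FIT = 1 - SSE/SST, using fitted values from the full model."""
    sse = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1 - sse / total_sum_of_squares(y)


def hat_score(y, held_out_pred):
    """HAT = 1 - PRESS/SST.

    held_out_pred[i] is the prediction for sample i from a model
    that did not use sample i (assumed to be computed elsewhere).
    """
    press = sum((v - p) ** 2 for v, p in zip(y, held_out_pred))
    return 1 - press / total_sum_of_squares(y)


# Toy numbers, purely illustrative.
y = [1.0, 2.0, 3.0, 4.0]
print(fit_score(y, [1.1, 1.9, 3.2, 3.8]))
print(hat_score(y, [1.2, 1.8, 3.3, 3.7]))
print(aic(3, -10.0), bic(3, 100, -10.0))
```

Because PRESS uses predictions for samples that were left out of the fit, HAT penalizes overfitting in a way that FIT does not.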

After the fitness calculation is completed, the δ vectors are sorted by fitness; the top 5% with the highest fitness are retained and the remaining δ vectors are eliminated. The 10-fold cross-validation results show that the FIT and HAT functions are the more effective fitness functions.

3) Genetic manipulation

For the 5 retained δ vectors, a 1→0 or 0→1 mutation is first performed with a probability of 0.1 per site, for example:

$\delta^{(i)} = [1\ 0\ 1\ 1\ 0\ 0\ 1\ 0\ 1\ 1]$ (pre-mutation)

$\delta^{(j)} = [1\ 0\ 1\ 1\ 0\ 0\ 1\ 0\ 0\ 1]$ (post-mutation)

This indicates that the 9th site of the δ vector is mutated from 1 to 0, whereby the 9th marker is excluded from the model.
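The per-site mutation can be illustrated as follows. This is a minimal sketch; the function name `mutate` and the `rng` parameter are hypothetical, and only the flip probability of 0.1 per site comes from the text.

```python
import random


def mutate(delta, rate=0.1, rng=random):
    """Flip each site (1 to 0 or 0 to 1) independently with probability rate."""
    return [1 - bit if rng.random() < rate else bit for bit in delta]


# The pre-mutation vector from the example in the text.
before = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
after = mutate(before, rate=0.1, rng=random.Random(42))
flipped_sites = [k + 1 for k, (a, b) in enumerate(zip(before, after)) if a != b]
print(after, flipped_sites)
```

The function returns a new list and leaves its input unchanged, so the retained vectors can still be inspected after mutation.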

Then, a pair of δ vectors is randomly selected each time, and crossover is performed between the two paired δ vectors, so that the information at several sites or over large regions of the two δ vectors is recombined.

$\mathrm{parent}^{(i)} = [1\ 0\ 1\ 0\ 0\ 0\ 1\ 0\ 1\ 1]$

$\mathrm{parent}^{(j)} = [1\ 0\ 1\ 1\ 0\ 0\ 1\ 0\ 0\ 1]$

$\mathrm{child}^{(i)} = [1\ 0\ 1\ 0\ 0\ 0\ 1\ 0\ 1\ 1]$

$\mathrm{child}^{(j)} = [1\ 0\ 1\ 1\ 0\ 0\ 1\ 0\ 0\ 1]$

Each pairing and crossover produces new individuals, and the above process is repeated 50 times until 100 different δ vectors have been produced.
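Pairing and crossover can be sketched as below. The text does not specify the crossover operator, so a two-point segment exchange is assumed here as one way to recombine "several sites or large regions"; the function names `crossover` and `next_generation` are hypothetical. With two children per pairing, 50 pairings give the 100 vectors mentioned above, although this sketch does not check that the 100 vectors are all distinct.

```python
import random


def crossover(parent_i, parent_j, rng=random):
    """Exchange the segment between two random cut points (assumed operator)."""
    m = len(parent_i)
    start, stop = sorted(rng.sample(range(m + 1), 2))
    child_i = parent_i[:start] + parent_j[start:stop] + parent_i[stop:]
    child_j = parent_j[:start] + parent_i[start:stop] + parent_j[stop:]
    return child_i, child_j


def next_generation(survivors, pop_size=100, rng=random):
    """Pair random survivors and cross them until pop_size children exist."""
    children = []
    while len(children) < pop_size:
        parent_i, parent_j = rng.sample(survivors, 2)
        children.extend(crossover(parent_i, parent_j, rng))
    return children[:pop_size]


# The parent vectors from the example in the text.
parent_i = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
parent_j = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
print(crossover(parent_i, parent_j, rng=random.Random(7)))
```

At every site the two children together carry exactly the two parental values, so crossover only reshuffles existing selections; new selections enter the population through mutation.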

The fitness of the different δ vectors is then calculated again, and the 5 individuals with the highest fitness are selected. These steps are repeated until the fitness of the model no longer increases or the preset number of iterations is reached. The molecular marker subset finally selected by the algorithm is taken as a new molecular marker matrix and substituted into the mixed linear model, which is solved by restricted maximum likelihood to estimate the fixed and random effects. On this basis, the genotypes of the test data are substituted into the mixed linear model to obtain the phenotypic values of the test set, and the predictive ability of the model is evaluated by 10-fold cross-validation. The phenotypes of all potential hybrids are then predicted, and hybrids with better target traits are selected for field evaluation.
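Putting the steps together, the iteration can be sketched as a single loop. This is a hedged outline, not the patented implementation: the function `run_ga`, the `patience` stopping rule and the two-point crossover are assumptions, and the fitness is passed in as a callback so that any of the four fitness functions could be plugged in. The `toy_fitness` function below is a stand-in with a made-up set of "informative" markers; it does not fit a mixed linear model.

```python
import random


def run_ga(m, fitness, pop_size=100, n_keep=5, proportion=0.01,
           mut_rate=0.1, max_iter=50, patience=5, seed=None):
    """Evolve 0/1 marker-selection vectors; return (best_vector, best_fitness)."""
    rng = random.Random(seed)
    n_kept = max(1, round(m * proportion))
    population = []
    for _ in range(pop_size):
        kept = set(rng.sample(range(m), n_kept))
        population.append([1 if k in kept else 0 for k in range(m)])

    best_delta, best_fit, stall = None, float("-inf"), 0
    for _ in range(max_iter):
        # Rank by fitness and keep the n_keep best vectors.
        survivors = sorted(population, key=fitness, reverse=True)[:n_keep]
        top_fit = fitness(survivors[0])
        if top_fit > best_fit:
            best_delta, best_fit, stall = survivors[0], top_fit, 0
        else:
            stall += 1
            if stall >= patience:  # assumed convergence rule
                break
        # Mutation: flip each site with probability mut_rate.
        mutated = [[1 - b if rng.random() < mut_rate else b for b in d]
                   for d in survivors]
        # Pairing and two-point crossover until the population is refilled.
        population = []
        while len(population) < pop_size:
            p_i, p_j = rng.sample(mutated, 2)
            start, stop = sorted(rng.sample(range(m + 1), 2))
            population.append(p_i[:start] + p_j[start:stop] + p_i[stop:])
            population.append(p_j[:start] + p_i[start:stop] + p_j[stop:])
        population = population[:pop_size]
    return best_delta, best_fit


# Stand-in fitness: reward hypothetical informative markers, penalize size.
informative = set(range(0, 200, 20))


def toy_fitness(delta):
    hits = sum(delta[k] for k in informative)
    return hits - 0.05 * sum(delta)


best, score = run_ga(m=200, fitness=toy_fitness, max_iter=30, seed=0)
print(sum(best), score)
```

In a real application the callback would fit the mixed linear model on the markers with $\delta_k = 1$ and return FIT or HAT, which is by far the most expensive part of each iteration.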

Example 2

Use of GA-GBLUP algorithm on rice hybrid population

The 1619 bin markers of 278 hybrids of the rice IMF2 population are used as genotype data, and four traits (yield, tiller number per plant, weight per spike and thousand-grain weight) are used as phenotype data. The 278 hybrids are randomly divided into 10 equal parts, 9 of which serve as the training set and 1 as the test set. The GA-GBLUP model with different hyperparameters is applied to the training set to select markers from all 1619 markers, and marker selection is complete once the specified number of iterations is reached. A genetic relationship matrix over the training and test sets is then constructed from the selected marker subset, and the traits of the test set are predicted. This procedure is carried out in turn until every part has been predicted once as the test set, and the coefficient of determination between the predicted and observed values is taken as the prediction accuracy; the whole procedure is repeated 15 times to average out the random variation introduced by the GA. The dashed line in FIG. 2 represents the predictive ability of the GBLUP method, and the box plots represent the prediction accuracy of the GA-GBLUP algorithm with different hyperparameters. The figure shows that, when combined with the FIT and HAT fitness functions, the GA-GBLUP algorithm effectively improves the predictive ability of whole genome selection for rice hybrids. Compared with the traditional GBLUP algorithm, the GA-GBLUP algorithm improves predictive ability by up to 24.2%, 12.6%, 3.9% and 2.2% for yield, tiller number per plant, weight per spike and thousand-grain weight, respectively, which is of great significance for low-heritability traits such as yield.
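The evaluation protocol (10 folds, 15 repeats) can be sketched as follows. Everything model-specific is abstracted into a hypothetical callback `fit_predict(train_idx, test_idx)`, which in the real workflow would run marker selection and GBLUP on the training fold only. The coefficient of determination is computed here as the squared Pearson correlation between observed and predicted values, which is an assumption, since the text does not give its formula; the toy data and the through-origin regression are made up for illustration.

```python
import random


def kfold_indices(n, k=10, seed=None):
    """Shuffle 0..n-1 and split into k nearly equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed definition of accuracy)."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy * sxy / (sxx * syy)


def cross_validate(y, fit_predict, k=10, seed=None):
    """Predict every sample once as test data and score all predictions."""
    predicted = [None] * len(y)
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        for i, p in zip(test_idx, fit_predict(train_idx, test_idx)):
            predicted[i] = p
    return r_squared(y, predicted)


# Toy stand-in for the real model: regression through the origin.
xs = [i / 10 for i in range(278)]
ys = [2 * x + random.Random(i).gauss(0, 1) for i, x in enumerate(xs)]


def toy_fit_predict(train_idx, test_idx):
    slope = (sum(xs[i] * ys[i] for i in train_idx)
             / sum(xs[i] ** 2 for i in train_idx))
    return [slope * xs[i] for i in test_idx]


accuracies = [cross_validate(ys, toy_fit_predict, seed=r) for r in range(15)]
print(min(accuracies), max(accuracies))
```

Keeping marker selection inside the callback matters: if the GA saw the test fold while selecting markers, the reported accuracy would be optimistically biased.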

Example 3

Use of GA-GBLUP algorithm on corn hybrid population

The 11255 SNP markers of 305 maize hybrids are used as genotype data, and two traits (ear weight and plant height) are used as phenotype data. The 305 hybrids are randomly divided into 10 equal parts, 9 of which serve as the training set and 1 as the test set. The GA-GBLUP model with different hyperparameters is applied to the training set to select markers from all 11255 markers, and marker selection is complete once the specified number of iterations is reached. A genetic relationship matrix over the training and test sets is then constructed from the selected marker subset, and the traits of the test set are predicted. This procedure is carried out in turn until every part has been predicted once as the test set, and the coefficient of determination between the predicted and observed values is taken as the prediction accuracy; the whole procedure is repeated 15 times to average out the random variation introduced by the GA. The dashed line in FIG. 3 represents the predictive ability of the GBLUP method, and the box plots represent the prediction accuracy of the GA-GBLUP algorithm with different hyperparameters. The figure shows that, when combined with the FIT and HAT fitness functions, the GA-GBLUP algorithm effectively improves the predictive ability of whole genome selection for maize hybrids. For the prediction of ear weight, GA-GBLUP improves predictive ability by 11.2% compared with the GBLUP method.

Claims

1. A genetic algorithm-based whole genome prediction method, characterized in that optimal molecular markers are selected by a genetic algorithm and, on this basis, the breeding value of an individual is predicted with a genomic best linear unbiased prediction (GBLUP) model, the method comprising the following steps:

obtaining the molecular markers of the crop to be predicted;

calculating the fitness of the different molecular marker subsets again, retaining the molecular marker subsets with higher fitness until the maximum number of iterations or convergence is reached, obtaining a final molecular marker subset, and constructing a GBLUP model; and substituting the genotype of the crop to be predicted into the GBLUP model to obtain the phenotypic value of the crop to be predicted.

2. The genetic algorithm-based whole genome prediction method according to claim 1, wherein the method of constructing a genome genetic relationship matrix prediction model comprises:

$$y = X\beta + \sum_{k=1}^{m} Z_k \gamma_k + \varepsilon$$

wherein y is an n × 1 vector representing a quantitative trait; $X$ is an n × q fixed-effect design matrix; $\beta$ is a q × 1 vector representing the magnitudes of the fixed effects; $Z_k$ is an n × 1 vector representing the genotypes of all individuals at the k-th marker; $\varepsilon$ is the residual vector, which follows $N(0, I\sigma^2)$; m is the number of markers, n is the number of samples, q is the number of fixed effects, and $\gamma_k$ is the magnitude of the k-th marker effect; the mixed linear model is solved by restricted maximum likelihood estimation to estimate the fixed effects $\beta$ and the random effects $\gamma$; predicted values for the test set are obtained from the estimated parameters, and cross-validation is then performed to evaluate the predictive ability of the model.

3. The genetic algorithm-based whole genome prediction method according to claim 1, wherein the step of randomly selecting comprises: all m molecular markers are encoded in 0/1 form to obtain a vector $\delta = [\delta_1\ \delta_2\ \dots\ \delta_m]$, wherein $\delta_k = 0$ means that the k-th marker is excluded and $\delta_k = 1$ means that the k-th marker is retained; the above procedure is repeated randomly 100 times, yielding 100 different δ vectors for initializing the GA.

4. The genetic algorithm-based whole genome prediction method according to claim 1, wherein the fitness of the different molecular marker subsets is calculated by any one of the following methods:

Akaike information criterion:

$$\mathrm{AIC} = 2m - 2\ln(L)$$

Bayesian information criterion:

$$\mathrm{BIC} = m\ln(n) - 2\ln(L)$$

FIT function:

$$\mathrm{FIT} = 1 - \mathrm{SSE}/\mathrm{SST}$$

HAT function:

$$\mathrm{HAT} = 1 - \mathrm{PRESS}/\mathrm{SST}$$

5. The genetic algorithm-based whole genome prediction method according to claim 1, wherein mutating, pairing, cross-interchanging the remaining subset of molecular markers at a certain ratio, generating a new subset of molecular markers comprises:

6. The genetic algorithm-based whole genome prediction method according to claim 1, wherein the predictive power of the model is evaluated by 10-fold cross-validation.

7. A method of predicting agronomic traits in a crop hybrid, comprising predicting agronomic traits in a crop hybrid using the method of claim 1.

8. The method of predicting agronomic traits in crop hybrids as claimed in claim 7, wherein the crop is rice or maize; the agronomic trait is a quantitative trait controlled by minor-effect polygenes.

9. The method of predicting agronomic traits in crop hybrids of claim 7, wherein the agronomic traits comprise crop yield and quality traits.

10. The method of predicting agronomic traits in crop hybrids of claim 7, wherein the agronomic traits comprise yield, tiller number per plant, ear weight per spike, thousand kernel weight and plant height.