
CN113611367A - CRISPR/Cas9 off-target prediction method based on VAE data enhancement - Google Patents

CRISPR/Cas9 off-target prediction method based on VAE data enhancement

Info

Publication number
CN113611367A
CN113611367A (application number CN202110898820.7A)
Authority
CN
China
Prior art keywords
data
vae
training
layer
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110898820.7A
Other languages
Chinese (zh)
Other versions
CN113611367B (en)
Inventor
彭绍亮
向伟铭
陈东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202110898820.7A priority Critical patent/CN113611367B/en
Publication of CN113611367A publication Critical patent/CN113611367A/en
Application granted granted Critical
Publication of CN113611367B publication Critical patent/CN113611367B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a CRISPR/Cas9 off-target prediction method based on variational autoencoder (VAE) data enhancement, which comprises the steps of: S1, processing training data by using Pair coding; S2, pre-training the data processed in step S1 with an H-VAE model to obtain the parameters of the hidden-variable distribution; S3, sampling new positive samples from the given posterior distribution combined with the parameters of the hidden-variable distribution; S4, fusing the newly sampled positive samples with the previous training data, replacing the last fully connected layer while keeping the information extraction module of the original model, and performing joint training with the fused data; and S5, using the trained classification model to perform off-target prediction on new input tasks. The invention alleviates problems such as unstable learning caused by class-imbalanced data.

Description

CRISPR/Cas9 off-target prediction method based on VAE data enhancement
Technical Field
The invention relates to the technical field of computer science, in particular to a CRISPR/Cas9 off-target prediction method based on VAE data enhancement.
Background
CRISPR/Cas9 off-target data must be obtained through biological experiments, which have inherent disadvantages such as high cost, slow speed and many uncontrollable factors. As a result, very little CRISPR/Cas9 off-target data is available, which makes model training difficult. A further problem with CRISPR/Cas9 off-target data is that the numbers of positive and negative samples differ enormously, which makes training conventional deep learning algorithms very challenging. Conventional models trained on such unbalanced datasets easily achieve high accuracy on the majority class, but this high accuracy has little practical value, since such models tend to perform poorly on the classification accuracy of the truly important positive samples. In previous research, DeepCRISPR adopted an oversampling method, copying positive samples until their number matches the number of negative samples, or generating new positive-sample data with the SMOTE algorithm to compensate for the shortage of positive samples.
Disclosure of Invention
The invention aims to provide a CRISPR/Cas9 off-target prediction method based on VAE data enhancement, so as to overcome the defects in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a CRISPR/Cas9 off-target prediction method based on VAE data enhancement, comprising the steps of:
S1, processing the training data by using Pair coding;
S2, pre-training the data processed in step S1 with an H-VAE model to obtain the parameters of the hidden-variable distribution;
S3, sampling new positive samples using the given posterior distribution combined with the parameters of the hidden-variable distribution;
S4, fusing the newly sampled positive samples with the previous training data, replacing the last fully connected layer while keeping the information extraction module of the original model, and performing joint training with the fused data;
and S5, using the trained classification model to perform off-target prediction on new input tasks.
Further, in step S1, Pair coding is specifically used to process the sgRNAs and the target DNAs in the training data as one-to-one corresponding pairs.
Further, the framework of the H-VAE model in step S2 includes an Embedding layer, an Encoder layer and a Decoder layer; wherein the Embedding layer is composed of a word embedding matrix, and the input mapped through the Embedding layer changes from N×24 to N×24×d_h, which serves as the input to the Encoder layer; the Encoder layer consists of four Blocks, each Block consisting of the three operations convolution, batch normalization and activation function; the Decoder layer consists of four Blocks, each Block consisting of the three operations deconvolution, batch normalization and activation function.
Further, the training step of the H-VAE model comprises the following steps:
S21, for any batch of input samples {x_1, ..., x_n}, denoted by X, the dimension of X is R^(N×24), where N is the batch size; each sample is the length-24 output of the sequence coding module and covers the three cases of mismatch, insertion and deletion; the sample X is input into the word embedding layer to obtain a tensor X_1 of dimension R^(N×24×d_e), where d_e is the dimension of the word embedding layer;
S22, the word-embedded tensor X_1 passes through a series of convolution operations in the Encoder layer to obtain the mean μ and variance σ² of the posterior distribution; data are sampled from this posterior distribution, and the reparameterization trick converts the sampling operation into a sampled result that participates in the computation:

z = μ + ξ ⊙ σ

where ξ ~ N(0, I) is a Gaussian distribution with mean 0 and variance 1; sampling z from N(μ, σ²) is thus equivalent to sampling ξ from N(0, I) and letting z = μ + ξ × σ;
S23, the sampled result is input into the Decoder layer, and the deconvolution operations yield x̂_k = g(z_k), where k indicates the sample z corresponding to each sample x and g can be regarded as the deconvolution process; x̂_k is the reconstructed x;
S24, a reconstruction loss D(x̂_k, x_k) is adopted to constrain the generator to recover the original input data from the hidden variables; in the training computation, the L2 distance is used:

D(x̂_k, x_k) = ||x_k - x̂_k||².
further, step S24 includes adding a loss function to constrain the generator, the formula of the loss function is as follows:
Figure BDA00031989641100000210
where d is the dimension of the hidden variable, μ(i)And
Figure BDA0003198964110000031
respectively representing the mean and variance of the ith component.
Further, in step S3, a plurality of different probability distributions are selected by using the parameters of the hidden variable distribution, and the plurality of different probability distributions are combined to sample the positive sample.
Compared with the prior art, the invention has the following advantages. Aiming at the weak ability of existing models to extract base-pair matching information, the invention provides a deep learning framework based on Pair coding, so that the model can make full use of the matching information of sgRNA-DNA base pairs; this coding scheme can also handle off-target types other than mismatches. Aiming at the extremely unstable model training caused by extreme class imbalance in the data, a CRISPR/Cas9 off-target prediction method based on variational autoencoder (VAE) data enhancement is provided. After training converges, the mean and variance of the Gaussian distribution of the hidden-space information of the minority class can be obtained. In the data expansion stage, random numbers of the corresponding Gaussian distribution are generated to determine the sampling variables; the sampling variables are input into the decoder of the variational autoencoder to generate similar samples, and the final classification model performs mixed training on the generated samples and the real data, thereby alleviating the learning instability and other problems caused by class-imbalanced data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flow chart of the CRISPR/Cas9 off-target prediction method based on VAE data enhancement of the present invention.
FIG. 2 is a diagram of the Pair sequence representation method of the present invention.
FIG. 3 is a diagram of the H-VAE pre-training module of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the present invention can be more easily understood by those skilled in the art, and the scope of the present invention will be more clearly and clearly defined.
Referring to fig. 1, the present embodiment discloses a CRISPR/Cas9 off-target prediction method based on VAE data enhancement, comprising the following steps:
step S1, processing the training data by using Pair coding.
As shown in fig. 2, base sequences and text sequences have a natural similarity, so a model using word-embedding representations can achieve very good results, and the word-embedding method has strong representation capability; the sequences are therefore represented by word embedding. Unlike conventional methods, the two different sequences, sgRNA and DNA, are not encoded separately; instead, the sgRNA and DNA are encoded together as pairing information. In this embodiment, the sgRNAs and the target DNAs correspond one to one, and 25 different base combinations can be obtained when indels are taken into account. By considering the matching information between the sequences, this embodiment obtains an efficient pairing representation. After the coding information is obtained, it is input into the word embedding layer to obtain the representation of each base pair in a high-dimensional space, so that the model has a larger hypothesis space in the pre-training module, improving its representation capability.
And S2, pre-training the data processed in the step S1 by adopting an H-VAE model to obtain parameters of hidden variable distribution.
In image generation and sequence generation tasks, a VAE usually uses a single convolutional neural network or recurrent neural network for encoding and decoding. Since image-style encoding captures only limited information about a sequence pair, in order to enhance the representation capability of the model, this embodiment learns the hidden variables of the positive samples using a VAE model based on mixed word embedding and a convolutional neural network (H-VAE), thereby obtaining the parameters of the hidden-variable distribution of the positive samples. Data generated from the learned distribution are then used during training as a supplement to the original data to alleviate the class imbalance problem.
The pre-training framework of the H-VAE model is divided into an Embedding layer, an Encoder layer and a Decoder layer, which are introduced below.
Embedding layer: the Embedding layer is composed of a word embedding matrix, and the input mapped through the Embedding layer changes from N×24 to N×24×d_h, which serves as the input to the Encoder layer.
Encoder layer: the Encoder layer consists of four Blocks, each of which consists of the three operations convolution, batch normalization and activation function. The convolution operation extracts data features with a convolution kernel. The region covered by the convolution kernel is called the receptive field, and the size of the receptive field is the size of the kernel. The convolution operation is divided into two steps: local aggregation and window sliding. During local aggregation, the data in the receptive field are multiplied elementwise by the parameters of the convolution kernel, then summed and output to a feature map. After local aggregation, the convolution kernel slides to the next region, with a step size specified in advance.
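The two steps described here, local aggregation within the receptive field followed by window sliding, can be illustrated with a minimal 1-D convolution (an illustration only; the patent's kernels operate on embedded sequence tensors):

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid 1-D convolution: multiply the receptive field elementwise by the
    kernel parameters (local aggregation), sum into the feature map, then
    slide the window forward by `stride`."""
    k = len(kernel)
    out = []
    for start in range(0, len(x) - k + 1, stride):
        out.append(float(np.sum(x[start:start + k] * kernel)))
    return np.array(out)

# A length-3 kernel over a length-4 input yields a length-2 feature map.
feature_map = conv1d(np.array([1.0, 2.0, 3.0, 4.0]),
                     np.array([1.0, 0.0, -1.0]))
```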
After the convolution operation, the next operation is the LeakyReLU activation, which non-linearly maps the result of the linear convolution transformation. Unlike the conventional ReLU activation function, LeakyReLU does not set values less than 0 to 0, but scales them, which alleviates to some extent problems such as vanishing gradients caused by ReLU:

f(x) = x, if x >= 0; f(x) = a_i × x, if x < 0

where a_i is a preset value that controls the scaling ratio.
The last operation is Batch Normalization. Batch normalization normalizes the hidden-layer input so that the resulting output lies in the non-saturated region of the activation function, which aids gradient descent and speeds up network training.
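The two operations just described might be sketched as follows (the slope a_i and the ε stabilizer are illustrative values; the learnable scale and shift parameters of batch normalization are omitted for brevity):

```python
import numpy as np

def leaky_relu(x, a_i=0.01):
    """LeakyReLU: keep positive inputs, scale negative inputs by a_i
    instead of zeroing them as ReLU would."""
    return np.where(x >= 0, x, a_i * x)

def batch_norm(x, eps=1e-5):
    """Normalize a batch to zero mean and unit variance per feature, so
    activations stay in the non-saturated region of the activation."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

h = batch_norm(np.array([[1.0, 2.0], [3.0, 6.0]]))
y = leaky_relu(np.array([-2.0, 0.5]))  # negative input scaled, not zeroed
```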
In this embodiment, the posterior distribution of the hidden variable is assumed to be a normal distribution, and the goal of the Encoder layer is to learn this distribution, while the following Decoder layer reduces the z obtained by sampling from p(z|x_k) back into x_k. If a posterior distribution of the hidden variables can be obtained, a series of samples can be drawn at random from p(z|x_k), and these samples are similar to x_k. After the four Block operations, the final output passes through two fully connected layers, which output the mean and variance of the posterior distribution of the hidden variables.
Decoder layer: the Decoder layer also consists of four Blocks, each of which consists of the three operations deconvolution, batch normalization and activation function. Because data need to be generated at the decoding layer, the hidden-layer input is processed by deconvolution; the batch normalization and activation functions are the same as in the Encoder layer.
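The Embedding/Encoder/Decoder layout described above can be summarized as a small configuration sketch (a reading of the text, not the patent's code; the op order inside a Block follows the "convolution-batch normalization-activation" listing, and all shapes are symbolic):

```python
# Shape flow of the H-VAE as described: N x 24 pair-token IDs -> word
# embedding (N x 24 x d_e) -> 4 encoder Blocks -> two fully connected
# heads (mean, variance) -> reparameterized sample z -> 4 decoder Blocks.
H_VAE_SKETCH = {
    "embedding": {"in_shape": ("N", 24), "out_shape": ("N", 24, "d_e")},
    "encoder": [["conv", "batch_norm", "leaky_relu"]] * 4,
    "latent_heads": ["mu", "log_var"],  # one fully connected head each
    "decoder": [["deconv", "batch_norm", "leaky_relu"]] * 4,
}
```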
In this embodiment, the training step of the H-VAE model includes:
S21, for a batch of input samples {x_1, ..., x_n}, denoted as a whole by X, the dimension of X is R^(N×24), where N is the batch size. Each sample is the output obtained after the sequence coding module, has length 24, and covers the three cases of mismatch, insertion and deletion. Inputting X into the word embedding layer yields a tensor X_1 of dimension R^(N×24×d_e), where d_e is the dimension of the word embedding layer.
S22, obtaining the mean value mu and the variance sigma of the posterior distribution through a series of convolution operations of Encoder layers by the tensor X1 subjected to word embedding2To get the input to the Decoder layer, the data needs to be sampled over this distribution, since the sampling operation is not conducive. To train the network, a transformation is performed using a heavy parameter technique (reparameterization technique) such that transforming the sampling operation into a sampling result participates in the computation:
Figure BDA0003198964110000052
wherein
Figure BDA0003198964110000053
Is subject to a Gaussian distribution with a mean of 0 and a variance of 1, and is therefore from N (μ, σ)2) Middle sampling z, N (mu, sigma)2) Is a gaussian (normal) distribution giving mean and variance, a distribution commonly used by many models, which is equivalent to sampling ξ from N (0, I) and making z ═ μ + ξ × σ. Therefore, the original distributed sampling data is changed into a series of data sampled in N (0, I) distribution, and the result of the original distributed sampling is obtained through transformation, so that the sampling operation does not need to participate in gradient descent, and the sampling result is changed to participate, so that the model can be normally trained.
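The reparameterization step z = μ + ξ × σ can be illustrated numerically (a sketch, not the patent's implementation; the random draw ξ sits outside the differentiable path):

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample z ~ N(mu, sigma^2) as z = mu + xi * sigma with xi ~ N(0, I),
    so only the deterministic transform carries gradients."""
    xi = rng.standard_normal(mu.shape)
    return mu + xi * sigma

rng = np.random.default_rng(0)
mu = np.zeros(10000)
sigma = np.full(10000, 2.0)
z = reparameterize(mu, sigma, rng)
# empirically, z has mean near 0 and standard deviation near 2
```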
S23, after obtaining the result after sampling, inputting the result into the Decoder layer because of the obtained zkIs specific to xkThus, through a series of deconvolution operations in the generator, can be obtained
Figure BDA0003198964110000054
S24, for the generator to learn p (x)k|zk) Similar to the AE model, reconstruction loss is required
Figure BDA0003198964110000055
The generator is constrained so that it can recover the original input data from the hidden variables. For the training of the model, the L2 distance function is chosen herein as the reconstruction loss D. In addition, unlike conventional AE models, the process of reconstructing the VAE can be noisy. If the model is optimized by simply using the reconstruction loss, the variance of the hidden variable is finally reduced to 0 by the model so as to reduce the influence of noise as much as possible, and therefore, the model is degraded into a common AE model. Thus, in addition to reconstruction loss, the VAE also tends all p (z | x) to a normal distribution, and to achieve this goal, an additional loss function, i.e., KL divergence of two normal distributions, is added in addition to the reconstruction loss:
Figure BDA0003198964110000061
where d is the dimension of the hidden variable, and μ^(i) and (σ^(i))² respectively represent the mean and variance of the ith component. The final loss function is therefore:

L = ||x_k - x̂_k||² + L_KL
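Under the formulas above, the two loss terms could be computed as follows (a sketch; summing the KL term over latent components as in the formula, with no batch averaging assumed):

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """KL( N(mu, var) || N(0, I) ) summed over the d latent components:
    0.5 * sum(mu^2 + var - log(var) - 1)."""
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0)

def recon_loss(x, x_hat):
    """L2 reconstruction loss ||x - x_hat||^2."""
    return float(np.sum((x - x_hat) ** 2))

# The KL term vanishes exactly when the posterior already equals N(0, I).
zero_kl = kl_to_standard_normal(np.zeros(4), np.ones(4))
total = recon_loss(np.array([1.0, 2.0]), np.array([1.0, 1.0])) + zero_kl
```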
and (5) stopping the machine after certain training steps until the loss value is not reduced any more. After training is completed, the mean and variance of the hidden variable distribution of the positive sample can be obtained.
Step S3, sampling new positive samples using the given posterior distribution combined with the parameters of the hidden-variable distribution, that is: selecting a plurality of different probability distributions using the hidden-variable parameters and combining them to sample positive samples, thereby alleviating the shortage of positive samples.
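As a sketch of this sampling step, latent vectors drawn from the learned parameters are decoded into synthetic positive samples; the identity `decode` below stands in for the trained Decoder and is purely illustrative:

```python
import numpy as np

def augment_positives(mu, var, n_new, decode, rng):
    """Draw n_new latent vectors from N(mu, var) and decode each into a
    synthetic positive sample to supplement the minority class."""
    sigma = np.sqrt(var)
    samples = []
    for _ in range(n_new):
        z = mu + rng.standard_normal(mu.shape) * sigma
        samples.append(decode(z))
    return np.stack(samples)

rng = np.random.default_rng(1)
new_pos = augment_positives(np.zeros(8), np.ones(8), n_new=50,
                            decode=lambda z: z, rng=rng)
# 50 synthetic samples, one per drawn latent vector of dimension 8
```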
Step S4, the newly sampled positive samples are fused with the previous training data; the last fully connected layer is replaced while keeping the information extraction module of the original model, and joint training is performed with the fused data.
Specifically, in this embodiment, after the H-VAE pre-training is completed, in order to train the final CRISPR/Cas9 off-target prediction task, the last fully connected layer is replaced while the information extraction module of the original model is retained, so that the model can predict the off-target activity of CRISPR/Cas9. Meanwhile, during the training of each batch, generated samples drawn from the positive-sample distribution obtained in pre-training are added for joint training.
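The per-batch fusion in this step can be sketched as follows (a minimal illustration; the ratio of generated to real samples and the shuffling are assumptions):

```python
import numpy as np

def fuse_batch(real_x, real_y, gen_x, rng):
    """Append generated positive samples (label 1) to a real batch and
    shuffle, so each training step sees a more balanced class mix."""
    x = np.concatenate([real_x, gen_x])
    y = np.concatenate([real_y, np.ones(len(gen_x))])
    order = rng.permutation(len(x))
    return x[order], y[order]

rng = np.random.default_rng(2)
# 6 real negatives plus 2 generated positives -> a fused batch of 8
x, y = fuse_batch(np.zeros((6, 4)), np.zeros(6), np.ones((2, 4)), rng)
```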
Step S5, the trained classification model is used to perform off-target prediction on new input tasks.
Specifically, this embodiment uses the final model trained in the previous steps, combined with handcrafted features, to process and predict new data.
Aiming at the weak ability of existing models to extract base-pair matching information, the invention provides a deep learning framework based on Pair coding, so that the model can make full use of the matching information of sgRNA-DNA base pairs; this coding scheme can also handle off-target types other than mismatches. Aiming at the extremely unstable model training caused by extreme class imbalance in the data, a CRISPR/Cas9 off-target prediction method based on variational autoencoder (VAE) data enhancement is provided. After training converges, the mean and variance of the Gaussian distribution of the hidden-space information of the minority class can be obtained. In the data expansion stage, random numbers of the corresponding Gaussian distribution are generated to determine the sampling variables; the sampling variables are input into the decoder of the variational autoencoder to generate similar samples, and the final classification model performs mixed training on the generated samples and the real data, thereby alleviating the learning instability and other problems caused by class-imbalanced data.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, various changes or modifications may be made by the patentees within the scope of the appended claims, as long as they do not exceed the scope of the invention described in the claims.

Claims (6)

1. A CRISPR/Cas9 off-target prediction method based on VAE data enhancement is characterized by comprising the following steps:
S1, processing the training data by using Pair coding;
S2, pre-training the data processed in step S1 with an H-VAE model to obtain the parameters of the hidden-variable distribution;
S3, sampling new positive samples using the given posterior distribution combined with the parameters of the hidden-variable distribution;
S4, fusing the newly sampled positive samples with the previous training data, replacing the last fully connected layer while keeping the information extraction module of the original model, and performing joint training with the fused data;
and S5, using the trained classification model to perform off-target prediction on new input tasks.
2. The CRISPR/Cas9 off-target prediction method based on VAE data enhancement according to claim 1, wherein in step S1, specifically, Pair processing of sgRNA and target DNA in training data is performed in a one-to-one correspondence manner by using Pair coding.
3. The CRISPR/Cas9 off-target prediction method based on VAE data enhancement according to claim 1, wherein the framework of the H-VAE model in step S2 comprises an Embedding layer, an Encoder layer and a Decoder layer; wherein the Embedding layer is composed of a word embedding matrix, and the input mapped through the Embedding layer changes from N×24 to N×24×d_h, which serves as the input to the Encoder layer; the Encoder layer consists of four Blocks, each Block consisting of the three operations convolution, batch normalization and activation function; the Decoder layer consists of four Blocks, each Block consisting of the three operations deconvolution, batch normalization and activation function.
4. The CRISPR/Cas9 off-target prediction method based on VAE data enhancement according to claim 3, wherein the training step of the H-VAE model comprises:
S21, for any batch of input samples {x_1, ..., x_n}, denoted by X, the dimension of X is R^(N×24), where N is the batch size; each sample is the length-24 output of the sequence coding module and covers the three cases of mismatch, insertion and deletion; the sample X is input into the word embedding layer to obtain a tensor X_1 of dimension R^(N×24×d_e), where d_e is the dimension of the word embedding layer;
S22, the word-embedded tensor X_1 passes through a series of convolution operations in the Encoder layer to obtain the mean μ and variance σ² of the posterior distribution; data are sampled from this posterior distribution, and the reparameterization trick converts the sampling operation into a sampled result that participates in the computation:

z = μ + ξ ⊙ σ

where ξ ~ N(0, I) is a Gaussian distribution with mean 0 and variance 1; sampling z from N(μ, σ²) is equivalent to sampling ξ from N(0, I) and letting z = μ + ξ × σ;
S23, the sampled result is input into the Decoder layer, and the deconvolution operations yield x̂_k = g(z_k), where k indicates the sample z corresponding to each sample x and g can be regarded as the deconvolution process; x̂_k is the reconstructed x;
S24, a reconstruction loss D(x̂_k, x_k) is adopted to constrain the generator to recover the original input data from the hidden variables; in the training computation, the L2 distance is used:

D(x̂_k, x_k) = ||x_k - x̂_k||².
5. The method for CRISPR/Cas9 off-target prediction based on VAE data enhancement according to claim 4, further comprising, in step S24, adding a loss function to constrain the generator, the formula of the loss function being as follows:

L_KL = (1/2) Σ_{i=1}^{d} ( (μ^(i))² + (σ^(i))² - log (σ^(i))² - 1 )

where d is the dimension of the hidden variable, and μ^(i) and (σ^(i))² respectively represent the mean and variance of the ith component.
6. The CRISPR/Cas9 off-target prediction method based on VAE data enhancement according to claim 1, wherein the step S3 is specifically to select a plurality of different probability distributions by using parameters of implicit variable distribution, and sample a positive sample by combining the plurality of different probability distributions.
CN202110898820.7A 2021-08-05 2021-08-05 CRISPR/Cas9 off-target prediction method based on VAE data enhancement Active CN113611367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110898820.7A CN113611367B (en) 2021-08-05 2021-08-05 CRISPR/Cas9 off-target prediction method based on VAE data enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110898820.7A CN113611367B (en) 2021-08-05 2021-08-05 CRISPR/Cas9 off-target prediction method based on VAE data enhancement

Publications (2)

Publication Number Publication Date
CN113611367A true CN113611367A (en) 2021-11-05
CN113611367B CN113611367B (en) 2022-12-13

Family

ID=78307284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110898820.7A Active CN113611367B (en) 2021-08-05 2021-08-05 CRISPR/Cas9 off-target prediction method based on VAE data enhancement

Country Status (1)

Country Link
CN (1) CN113611367B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114334007A (en) * 2022-01-20 2022-04-12 腾讯科技(深圳)有限公司 Gene off-target prediction model training method, prediction method, device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070912A (en) * 2019-04-15 2019-07-30 桂林电子科技大学 A kind of prediction technique of CRISPR/Cas9 undershooting-effect
CN111261223A (en) * 2020-01-12 2020-06-09 湖南大学 CRISPR off-target effect prediction method based on deep learning
CN111258992A (en) * 2020-01-09 2020-06-09 电子科技大学 Seismic data expansion method based on variational self-encoder
US20200226475A1 (en) * 2019-01-14 2020-07-16 Cambia Health Solutions, Inc. Systems and methods for continual updating of response generation by an artificial intelligence chatbot
CN111581962A (en) * 2020-05-14 2020-08-25 福州大学 Text representation method based on subject word vector and hybrid neural network
CN111613267A (en) * 2020-05-21 2020-09-01 中山大学 CRISPR/Cas9 off-target prediction method based on attention mechanism
CN111782799A (en) * 2020-06-30 2020-10-16 湖南大学 Enhanced text abstract generation method based on replication mechanism and variational neural reasoning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GAO Y et al.: "Data imbalance in CRISPR off-target", Briefings in Bioinformatics *
LIN J et al.: "Off-target predictions in CRISPR-Cas9 gene editing using deep", Bioinformatics *
张桂珊 et al.: "Applications of machine learning methods in the CRISPR/Cas9 system", 《遗传》 (Hereditas) *
徐海波: "Machine learning-based prediction of off-target effects and targeting efficiency of the CRISPR/Cas9 system", China Master's Theses Full-text Database *

Also Published As

Publication number Publication date
CN113611367B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
CN110765966B (en) One-stage automatic recognition and translation method for handwritten characters
CN107506823B (en) Construction method of hybrid neural network model for dialog generation
CN106650813A (en) Image understanding method based on deep residual networks and LSTM
Zhang et al. Unsupervised representation learning from pre-trained diffusion probabilistic models
CN101310294A (en) Method for training neural networks
CN110060657B (en) SN-based many-to-many speaker conversion method
CN114443827A (en) Local information perception dialogue method and system based on pre-training language model
CN111125333B (en) Generative knowledge question-answering method based on representation learning and a multi-layer coverage mechanism
CN111898689A (en) Image classification method based on neural network architecture search
CN113822054A (en) Chinese grammar error correction method and device based on data enhancement
Krokotsch et al. Improving semi-supervised learning for remaining useful lifetime estimation through self-supervision
Chen et al. Learning multiscale consistency for self-supervised electron microscopy instance segmentation
Wehenkel et al. Diffusion priors in variational autoencoders
CN116740223A (en) Method for generating image based on text
CN114170461A (en) Teacher-student framework image classification method containing noise labels based on feature space reorganization
CN113611367B (en) CRISPR/Cas9 off-target prediction method based on VAE data enhancement
EP4196918A1 (en) System and method for generating parametric activation functions
Sarrouti NLM at VQA-Med 2020: Visual Question Answering and Generation in the Medical Domain.
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
Londt et al. Evolving character-level densenet architectures using genetic programming
CN113204640B (en) Text classification method based on attention mechanism
CN114757177B (en) Text summarization method based on a BART-fused pointer-generator network
CN110399619A (en) Positional encoding method for neural machine translation, and computer storage medium
CN115101122A (en) Protein processing method, apparatus, storage medium, and computer program product
CN114548293A (en) Video-text cross-modal retrieval method based on cross-granularity self-distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant