CN113808671A

CN113808671A - Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning

Info

Publication number: CN113808671A
Application number: CN202111008365.5A
Authority: CN
Inventors: 李爱民; 熊思琪; 周红芳; 费蓉; 刘雅君; 王竹荣; 魏嵬; 袁细国; 黑新宏; 王磊
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2021-12-17
Anticipated expiration: 2041-08-30
Also published as: CN113808671B

Abstract

The invention discloses a method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning, which comprises the following steps: screening long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences longer than 200 nt from a database; carrying out class balance processing on the screened sequences; converting each processed transcript sequence into k-mer frequencies; constructing a convolutional neural network model; taking the class-balanced long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences as training sample data and training the constructed convolutional neural network model on them to obtain a prediction model; and inputting the nucleic acid sequence to be distinguished into the prediction model to obtain a distinguishing result. The invention solves the problems that the prior art is affected by poor gene annotation and consumes a large amount of computation time.

Description

Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning

Technical Field

The invention belongs to the technical field of computational bioinformatics, and relates to a method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning.

Background

Long-chain ribonucleic acids are RNA molecules whose transcripts are longer than 200 bases; they include long-chain coding ribonucleic acids and long-chain non-coding ribonucleic acids. Long non-coding RNAs do not encode proteins and were initially regarded as noise of genome transcription, that is, as byproducts of RNA polymerase transcription with no biological function. However, recent studies have shown that long non-coding ribonucleic acids are involved in many important regulatory processes, such as chromatin modification, transcriptional activation and transcriptional interference. Many methods currently exist for distinguishing long-chain coding from non-coding transcript sequences; they are based mainly on open reading frame features, evolutionary features and the like, are affected by poor gene annotation, and consume a lot of time.

Disclosure of Invention

The invention aims to provide a method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning, which solves the problems that prior-art methods are affected by poor gene annotation and require a large amount of computation time.

The technical scheme adopted by the invention is a method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning, which is implemented according to the following steps:

step 1, screening a long non-coding ribonucleic acid transcript sequence and a messenger ribonucleic acid transcript sequence with the length of more than 200nt from a database, and carrying out class balance treatment on the screened long non-coding ribonucleic acid transcript sequence and messenger ribonucleic acid transcript sequence;

step 2, converting each transcript sequence in the long non-coding ribonucleic acid transcript sequence and the messenger ribonucleic acid transcript sequence after the class balance treatment in the step 1 into k-mer frequency;

step 3, constructing a convolutional neural network model, selecting equal numbers of the long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences subjected to class balance processing in step 1 as training sample data, training the constructed convolutional neural network model on the training sample data to obtain a prediction model, and inputting the nucleic acid sequences to be distinguished into the prediction model to obtain a distinguishing result.

The present invention is also characterized in that,

in step 1, human long non-coding RNA transcript data and mRNA transcript data are downloaded from the RefSeq database, and the long non-coding RNA transcripts and mRNA transcripts with a sequence length greater than 200 nt are then screened from the transcript data.

In step 1, the class balance processing of the screened long non-coding RNA transcript sequences and messenger RNA transcript sequences is as follows:

randomly selecting the same number of long non-coding RNA transcripts and messenger RNA transcripts from the selected long non-coding RNA transcripts and messenger RNA transcripts.
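The screening and class balance processing described above can be sketched as follows. This is a minimal illustration, not the inventors' code: the function name `balance_classes`, the fixed random seed and the toy sequences are assumptions made only for this example.

```python
import random

def balance_classes(lncrna_seqs, mrna_seqs, min_len=200, seed=0):
    """Keep sequences longer than min_len nt, then draw the same number from each class."""
    lnc = [s for s in lncrna_seqs if len(s) > min_len]
    mrna = [s for s in mrna_seqs if len(s) > min_len]
    n = min(len(lnc), len(mrna))          # size of the smaller class
    rng = random.Random(seed)             # fixed seed: an assumption, for reproducibility
    return rng.sample(lnc, n), rng.sample(mrna, n)

# Toy stand-ins for RefSeq transcripts (hypothetical data).
lnc_toy = ["A" * 250, "C" * 300, "G" * 150]
mrna_toy = ["ATG" * 100, "T" * 500, "GC" * 200, "A" * 100]
lnc_bal, mrna_bal = balance_classes(lnc_toy, mrna_toy)
```

With the toy data, the 150 nt and 100 nt sequences are discarded, and two sequences are kept from each class.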

In step 2, each of the long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences is converted into k-mer frequencies as follows:

firstly, each transcript sequence is converted into k-mer patterns, wherein a k-mer pattern is a specific character string of k nucleotides, each character being one of the four bases A, T, G and C, and k = 1, 2, 3, 4, 5, 6; when k = 1, there are four patterns: A, T, G and C; when k = 2, there are 16 patterns: AA, AT, AC, AG, TA, TT, TC, TG, …, GG; similarly, there are 64 patterns when k = 3, 256 patterns when k = 4, 1024 patterns when k = 5 and 4096 patterns when k = 6, so that each transcript sequence has 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 patterns;

then, with k as the length of the sliding window, the sliding window is slid along each transcript sequence with a step size of 1; whenever the character string in the sliding window matches one of the 5460 patterns, the number of occurrences of that pattern in the transcript sequence is increased by 1; let c_i, i = 1, 2, 3, …, 5460, denote the number of occurrences of pattern i in a transcript sequence;

then, the frequency f_i of occurrence of pattern i in the transcript sequence is calculated according to the following formula:

f_i = w_k · c_i / s_k (1)

wherein k is the length of pattern i, and s_k is the total number of positions of the k-mer sliding window along the transcript sequence, calculated as follows:

s_k = L - k + 1 (2)

wherein L is the length of the transcript sequence;

wherein w_k is the weight coefficient, calculated according to the following formula:

w_k = 1/4^(5-k) (3).
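The counting and weighting in step 2 can be illustrated with a short sketch. It assumes that formula (1) has the form f_i = w_k · c_i / s_k, which is what the definitions of c_i, s_k and w_k suggest. The function name `kmer_frequencies` and the choice to skip windows containing characters other than A, T, G and C are assumptions of this example.

```python
from itertools import product

def kmer_frequencies(seq, max_k=6):
    """Return {pattern: weighted frequency} for all k-mers with k = 1..max_k."""
    seq = seq.upper()
    L = len(seq)
    if L < max_k:
        raise ValueError("sequence is shorter than the largest window")
    freqs = {}
    for k in range(1, max_k + 1):
        counts = {"".join(p): 0 for p in product("ATGC", repeat=k)}
        s_k = L - k + 1                       # formula (2): number of window positions
        for start in range(s_k):              # window of length k, step size 1
            window = seq[start:start + k]
            if window in counts:              # assumption: windows with other symbols are skipped
                counts[window] += 1
        w_k = 1 / 4 ** (5 - k)                # formula (3): weight coefficient
        for pattern, c_i in counts.items():
            freqs[pattern] = w_k * c_i / s_k  # assumed form of formula (1)
    return freqs

freqs = kmer_frequencies("ATGCATGC")
```

For the toy sequence `ATGCATGC` (L = 8), the base `A` occurs twice in s_1 = 8 windows with w_1 = 1/256, and the 6-mer `ATGCAT` occurs once in s_6 = 3 windows with w_6 = 4.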

The convolutional neural network model constructed in step 3 has the following structure:

the first layer is a convolutional layer; because each codon consists of three adjacent nucleotides in the mRNA, the convolution kernel (sliding window) size is set to 1 x 3, the number of convolution kernels is 32, and the activation function is ReLU; the second layer is also a convolutional layer, with 32 convolution kernels of size 1 x 3 and the ReLU activation function; the third layer is a max pooling layer with a pooling region of size 1 x 1; the fourth layer is a fully connected layer with 256 neurons, Dropout with a probability of 0.5 is applied to the fully connected layers of the fourth and fifth layers to prevent over-fitting, and ReLU is selected as the activation function; the fifth and sixth layers are also fully connected layers, each with 64 neurons, the ReLU activation function and Dropout with a probability of 0.5; finally, a softmax function is used as the activation function to obtain the prediction result, and the output is 0 or 1, wherein 0 represents long non-coding ribonucleic acid and 1 represents messenger ribonucleic acid.
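To make the layer sizes concrete, the following sketch tabulates output shapes and parameter counts for the structure described above. It is bookkeeping only, not a trainable model. The text does not state the padding, the stride, the number of input channels or the width of the output layer, so the sketch assumes no padding, stride 1, a single input channel of length 5460 and a two-unit softmax output. The function names are hypothetical.

```python
def conv_1x3(length, in_ch, out_ch, kernel=3):
    """Shape and parameter count of a 1 x 3 convolution (assumed: no padding, stride 1)."""
    params = kernel * in_ch * out_ch + out_ch   # weights + biases
    return length - kernel + 1, out_ch, params

def dense(in_units, out_units):
    """Parameter count of a fully connected layer (weights + biases)."""
    return in_units * out_units + out_units

def describe_model(input_len=5460):
    rows = []
    length, ch, p = conv_1x3(input_len, 1, 32)
    rows.append(("conv 1x3, 32 kernels, ReLU", (length, ch), p))
    length, ch, p = conv_1x3(length, ch, 32)
    rows.append(("conv 1x3, 32 kernels, ReLU", (length, ch), p))
    rows.append(("max pooling 1x1", (length, ch), 0))   # a 1 x 1 pool leaves the shape unchanged
    units = length * ch                                  # flatten before the dense layers
    for name, n in [("dense 256, ReLU, dropout 0.5", 256),
                    ("dense 64, ReLU, dropout 0.5", 64),
                    ("dense 64, ReLU, dropout 0.5", 64),
                    ("softmax output (assumed 2 units)", 2)]:
        rows.append((name, (n,), dense(units, n)))
        units = n
    return rows

rows = describe_model()
total_params = sum(p for _, _, p in rows)
```

Under these assumptions almost all parameters sit in the first fully connected layer, because the flattened convolution output has 5456 x 32 values.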

In step 3, during training, the k-mer frequencies of the 5460 k-mer patterns of the selected long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences are used as the input of the convolutional neural network model and as the basis of model prediction.
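Using the 5460 frequencies as network input requires every transcript to place each pattern at the same position of its input vector. The sketch below shows one way to fix such an order. The ordering itself (k ascending, bases in the order A, T, G, C) and the names `pattern_order` and `to_input_vector` are assumptions of this example; the text does not specify an order.

```python
from itertools import product

def pattern_order(max_k=6, bases="ATGC"):
    """All k-mer patterns for k = 1..max_k in one fixed, reproducible order."""
    return ["".join(p) for k in range(1, max_k + 1) for p in product(bases, repeat=k)]

PATTERNS = pattern_order()

def to_input_vector(freqs):
    """Map a {pattern: frequency} dict to a fixed-order list of floats (missing patterns -> 0.0)."""
    return [freqs.get(p, 0.0) for p in PATTERNS]

# Hypothetical frequencies for two patterns only.
vec = to_input_vector({"A": 0.001, "ATG": 0.01})
```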

In step 3, a callback function is added to the training process to adjust the learning rate dynamically: the initial learning rate is 0.001 and the learning rate is modified automatically at every epoch; the optimizer is Adam, 128 samples are selected for each training step (a batch size of 128), and training runs for 100 epochs.
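The text gives the initial learning rate and says it changes every epoch, but not the rule by which it changes. The sketch below uses exponential decay with a factor of 0.95 per epoch purely as an assumed example; both the rule and the factor are hypothetical. In a Keras-style workflow, a function of this shape is the kind of thing one would typically hand to a learning-rate scheduler callback.

```python
def learning_rate(epoch, initial_lr=0.001, decay=0.95):
    """Learning rate for a given epoch (assumed rule: exponential decay per epoch)."""
    return initial_lr * decay ** epoch

# One value per epoch for the 100 training epochs.
schedule = [learning_rate(epoch) for epoch in range(100)]
```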

In step 3, inputting the nucleic acid sequences to be distinguished into the prediction model to obtain a distinguishing result specifically comprises: inputting the k-mer frequencies of the 5460 k-mer patterns of the nucleic acid sequence to be distinguished into the prediction model to obtain the distinguishing result.
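The final step from network output to a distinguishing result can be sketched as follows. It assumes that the last layer produces two raw scores that a softmax turns into class probabilities, with the larger probability deciding the label; the two-score layout and the names `softmax`, `decode` and `LABELS` are assumptions of this example.

```python
import math

LABELS = {0: "long non-coding RNA", 1: "messenger RNA"}   # label coding as stated in the text

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decode(scores):
    """Return (class index, label, probabilities) for one sequence."""
    probs = softmax(scores)
    cls = max(range(len(probs)), key=probs.__getitem__)
    return cls, LABELS[cls], probs

# Hypothetical raw scores for one sequence.
cls, label, probs = decode([0.2, 1.5])
```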

The invention has the beneficial effects that:

the invention is alignment-free: it uses the frequencies of k-mers, an intrinsic feature of the RNA-seq sequence, as the computational features, and then uses a deep learning model to predict long non-coding ribonucleic acid and messenger ribonucleic acid, and therefore has high computational efficiency.

The method of the invention distinguishes coding ribonucleic acid from non-coding ribonucleic acid with an accuracy of 97.2%. In addition, the method uses the k-mer sequence as the input of the deep learning model, which preserves the information of the three adjacent nucleotides in a codon and yields better sampling and prediction on long-chain ribonucleic acid transcripts.

Drawings

FIG. 1 is a flow chart of a method of the present invention for discriminating between coding and non-coding ribonucleic acids based on deep learning;

FIG. 2 is a block diagram of a convolutional neural network model in the method of the present invention for discriminating between coding and non-coding ribonucleic acids based on deep learning.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The process of the method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning is shown in figure 1 and is specifically implemented according to the following steps:

step 1, downloading human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data from a RefSeq database, then screening long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with sequence length larger than 200nt from the transcript data, and randomly selecting the same number of long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts from the screened long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts;

step 2, converting each transcript sequence in the long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences after the class balance processing in step 1 into k-mer frequencies; the conversion of each of the long non-coding RNA transcript sequences and mRNA transcript sequences into k-mer frequencies is specifically:

s_k = L - k + 1 (2)

wherein L is the length of the transcript sequence;

w_k = 1/4^(5-k) (3).

step 3, constructing a convolutional neural network model, selecting equal numbers of the long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences after the class balance processing in step 1 as training sample data, training the constructed convolutional neural network model on the training sample data to obtain a prediction model, and inputting the k-mer frequencies of the 5460 k-mer patterns of the nucleic acid sequence to be distinguished into the prediction model to obtain a distinguishing result; as shown in FIG. 2, the convolutional neural network model has the following structure:

the first layer is a convolutional layer; because each codon consists of three adjacent nucleotides in the mRNA, the convolution kernel (sliding window) size is set to 1 x 3, the number of convolution kernels is 32, and the activation function is ReLU; the second layer is also a convolutional layer, with 32 convolution kernels of size 1 x 3 and the ReLU activation function; the third layer is a max pooling layer with a pooling region of size 1 x 1; the fourth layer is a fully connected layer with 256 neurons, Dropout with a probability of 0.5 is applied to the fully connected layers of the fourth and fifth layers to prevent over-fitting, and ReLU is selected as the activation function; the fifth and sixth layers are also fully connected layers, each with 64 neurons, the ReLU activation function and Dropout with a probability of 0.5; finally, a softmax function is used as the activation function to obtain the prediction result, and the output is 0 or 1, wherein 0 represents long non-coding ribonucleic acid and 1 represents messenger ribonucleic acid; during training, the k-mer frequencies of the 5460 k-mer patterns of the selected long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences are used as the input of the convolutional neural network model and as the basis of model prediction; a callback function is added to the training process to adjust the learning rate dynamically, wherein the initial learning rate is 0.001, the learning rate is modified automatically at every epoch, the optimizer is Adam, 128 samples are selected for each training step (a batch size of 128), and training runs for 100 epochs.

Example 1

Step 1, downloading human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data from the RefSeq database, and screening the long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with a sequence length greater than 200 nt, which yields 48471 long non-coding ribonucleic acid transcripts and 62197 messenger ribonucleic acid transcripts; to balance the numbers of the two classes, 48471 transcripts are randomly selected from the 62197 messenger ribonucleic acid transcripts, and these 48471 messenger ribonucleic acid transcripts together with the 48471 long non-coding ribonucleic acid transcripts are taken as the experimental data;

step 2, calculating the k-mer frequencies of 48471 mRNA transcripts and 48471 long non-coding RNA transcripts;

step 3, using the 48471 long non-coding ribonucleic acid transcript sequences and the 48471 messenger ribonucleic acid transcript sequences from step 1 as experimental data, wherein 38971 long non-coding ribonucleic acid transcripts and 38971 messenger ribonucleic acid transcripts are selected as the training sample data of the model, a further 5000 transcript sequences of each class are taken as the verification data set, and the remaining 4500 transcript sequences of each class are taken as the test data set; the constructed convolutional neural network model is trained on the training sample data to obtain the prediction model;
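The split sizes in this example add up per class (38971 + 5000 + 4500 = 48471). A minimal sketch of such a split is given below. The text does not say how the three subsets were drawn, so the random shuffle, the seeds and the function name `split_class` are assumptions. Integer IDs stand in for the actual transcripts, and the steps-per-epoch figure reads the "128 per training step" setting as a batch size of 128.

```python
import math
import random

def split_class(items, n_train=38971, n_val=5000, n_test=4500, seed=0):
    """Shuffle one class and cut it into training / verification / test parts."""
    if len(items) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the class size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)     # assumed: random assignment to subsets
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Integer IDs as stand-ins for the 48471 transcripts of each class.
lnc_train, lnc_val, lnc_test = split_class(range(48471), seed=1)
mrna_train, mrna_val, mrna_test = split_class(range(48471), seed=2)

n_training_samples = len(lnc_train) + len(mrna_train)
steps_per_epoch = math.ceil(n_training_samples / 128)   # batches per epoch at batch size 128
```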

in this embodiment, the prediction model achieves an average classification accuracy of 99.5% on the training sample data, 99.7% on the verification data set, and 97.2% on the test data set.

Claims

1. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning is implemented according to the following steps:

2. The method for distinguishing between coding and non-coding ribonucleic acids based on deep learning of claim 1, wherein step 1 is to download human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data from RefSeq database, and then to select long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with sequence length greater than 200nt from the transcript data.

3. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 2, wherein the class balance processing of the screened long non-coding RNA transcript sequences and messenger RNA transcript sequences in step 1 is:

4. The method for distinguishing between coding and non-coding ribonucleic acids based on deep learning of claim 3, wherein the step 2 of converting each of the long non-coding ribonucleic acid transcript sequence and the messenger ribonucleic acid transcript sequence into k-mer frequencies is specifically as follows:

then, with k as the length of the sliding window, the sliding window is slid along each transcript sequence with a step size of 1; whenever the character string in the sliding window matches one of the 5460 patterns, the number of occurrences of that pattern in the transcript sequence is increased by 1; let c_i, i = 1, 2, 3, …, 5460, denote the number of occurrences of pattern i in a transcript sequence;

s_k = L - k + 1 (2)

wherein L is the length of the transcript sequence;

w_k = 1/4^(5-k) (3).

5. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 4, wherein the convolutional neural network model constructed in step 3 has the following structure:

6. The method of claim 5, wherein during the training in step 3, the k-mer frequencies of the 5460 k-mer patterns of the selected long non-coding RNA transcript sequences and messenger RNA transcript sequences are used as the input of the convolutional neural network model and as the basis for model prediction.

7. The method of claim 6, wherein in step 3 a callback function is added to the training process to adjust the learning rate dynamically, wherein the initial learning rate is 0.001, the learning rate is modified automatically at every epoch, the optimizer is Adam, 128 samples are selected for each training step (a batch size of 128), and training runs for 100 epochs.

8. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 7, wherein inputting the nucleic acid sequences to be distinguished into the prediction model in step 3 to obtain the distinguishing result specifically comprises: inputting the k-mer frequencies of the 5460 k-mer patterns of the nucleic acid sequence to be distinguished into the prediction model to obtain the distinguishing result.