CN113808671A - Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning - Google Patents
- Publication number
- CN113808671A
- Authority
- CN
- China
- Prior art keywords
- ribonucleic acid
- coding
- transcript sequence
- layer
- long non
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a deep-learning-based method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid, comprising the following steps: screening long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences longer than 200 nt from a database; performing class balancing on the screened long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences; converting each balanced transcript sequence into k-mer frequencies; constructing a convolutional neural network model; using the class-balanced long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences as training sample data and training the constructed convolutional neural network model on them to obtain a prediction model; and inputting the nucleic acid sequence to be distinguished into the prediction model to obtain a distinguishing result. The invention addresses two problems of the prior art: sensitivity to poor gene annotation and long computation time.
Description
Technical Field
The invention belongs to the technical field of computational bioinformatics, and relates to a method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning.
Background
Long-chain ribonucleic acids are RNA molecules whose transcripts are longer than 200 bases, and include long-chain coding ribonucleic acids and long non-coding ribonucleic acids. Long non-coding RNAs do not encode proteins. They were initially thought to be transcriptional noise of the genome, that is, biologically nonfunctional byproducts of RNA polymerase transcription. However, recent studies have shown that long non-coding ribonucleic acids are involved in many important regulatory processes, such as chromatin modification, transcriptional activation, and transcriptional interference. Many methods currently exist for distinguishing long-chain coding transcript sequences from non-coding ones. They are mainly based on open reading frame features, evolutionary features, and the like, so they are affected by poor gene annotation and are time-consuming.
Disclosure of Invention
The invention aims to provide a deep-learning-based method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid, which solves the problems that prior-art methods are affected by poor gene annotation and consume a large amount of computation time.
The technical scheme adopted by the invention is a method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning, implemented according to the following steps:
step 1, screening long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences longer than 200 nt from a database, and performing class balancing on the screened long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences;
step 2, converting each of the class-balanced long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences from step 1 into k-mer frequencies;
step 3, constructing a convolutional neural network model, selecting equal numbers of the class-balanced long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences from step 1 as training sample data, training the constructed convolutional neural network model on the training sample data to obtain a prediction model, and inputting the nucleic acid sequences to be distinguished into the prediction model to obtain a distinguishing result.
The present invention is also characterized in that,
in step 1, human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data are downloaded from the RefSeq database, and long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with a sequence length greater than 200 nt are then screened from the transcript data.
In step 1, the class balancing of the screened long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences is specifically:
randomly selecting the same number of long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts from the screened transcripts.
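The screening and class-balancing of step 1 can be sketched as follows. This is a minimal illustration, assuming the transcripts are already loaded as plain strings; the function name `filter_and_balance`, the fixed random seed, and the toy sequences are hypothetical and do not come from the source.

```python
import random

def filter_and_balance(lncrna, mrna, min_len=200, seed=0):
    """Keep sequences longer than min_len, then undersample the larger class."""
    lnc = [s for s in lncrna if len(s) > min_len]
    mr = [s for s in mrna if len(s) > min_len]
    n = min(len(lnc), len(mr))          # size of the smaller class
    rng = random.Random(seed)           # seed is an assumption, for reproducibility only
    return rng.sample(lnc, n), rng.sample(mr, n)

# Toy data standing in for RefSeq transcripts (hypothetical sequences).
lnc_toy = ["A" * 250, "C" * 150, "G" * 300]
mr_toy = ["T" * 210, "A" * 220, "C" * 400, "G" * 100, "T" * 500]
lnc_bal, mr_bal = filter_and_balance(lnc_toy, mr_toy)
print(len(lnc_bal), len(mr_bal))
```

Undersampling the larger class keeps every sequence of the smaller class, so no transcript is duplicated in the balanced set.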
In step 2, each of the long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences is converted into k-mer frequencies as follows:
first, each transcript sequence is described in terms of k-mer patterns, where a k-mer pattern is a specific string of k nucleotides drawn from the four bases A, T, G and C, and k = 1, 2, 3, 4, 5, 6. When k = 1, there are 4 patterns: A, T, G and C; when k = 2, there are 16 patterns: AA, AT, AC, AG, TA, TT, TC, TG, …, GG; likewise, when k = 3, there are 64 patterns; when k = 4, there are 256 patterns; when k = 5, there are 1024 patterns; and when k = 6, there are 4096 patterns. Each transcript sequence is therefore described by 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 patterns;
then, a sliding window of length k is slid along each transcript sequence with a step size of 1; whenever the string in the sliding window matches one of the 5460 patterns, the number of occurrences of that pattern in the transcript sequence is increased by 1. The number of occurrences of pattern i in a transcript sequence is denoted $c_i$, i = 1, 2, 3, …, 5460;
then, the frequency of occurrence of pattern i in the transcript sequence is calculated according to the following formula:
$$f_i = \frac{c_i}{s_k} \, w_k \qquad (1)$$
wherein $s_k$ is the total number of positions of the k-mer sliding window along the transcript sequence, calculated as follows:
$$s_k = L - k + 1 \qquad (2)$$
wherein L is the length of the transcript sequence;
wherein $w_k$ is a weight coefficient, calculated according to the following formula:
$$w_k = \frac{1}{4^{5-k}} \qquad (3)$$
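The k-mer conversion of step 2 can be sketched in Python as below. The sketch assumes that the frequency of pattern i is its count divided by s_k and scaled by w_k, and that the flattened weight expression in the source reads w_k = 1/4^(5-k); both readings are assumptions. The names `PATTERNS`, `INDEX` and `kmer_frequencies`, the pattern ordering, and the toy sequence are hypothetical.

```python
from itertools import product

BASES = "ATGC"
KS = range(1, 7)
# All 5460 patterns, ordered by k and then by the order of BASES (ordering is an assumption).
PATTERNS = ["".join(p) for k in KS for p in product(BASES, repeat=k)]
INDEX = {p: i for i, p in enumerate(PATTERNS)}

def kmer_frequencies(seq):
    """Return the 5460 weighted k-mer frequencies of one transcript sequence."""
    seq = seq.upper()
    L = len(seq)
    counts = [0] * len(PATTERNS)
    for k in KS:
        for start in range(L - k + 1):            # sliding window, step size 1
            i = INDEX.get(seq[start:start + k])
            if i is not None:                     # windows with other characters (e.g. N) are skipped
                counts[i] += 1
    freqs = [0.0] * len(PATTERNS)
    for p, i in INDEX.items():
        k = len(p)
        s_k = L - k + 1                           # number of window positions for this k
        if s_k > 0:
            w_k = 1.0 / 4 ** (5 - k)              # assumed reading of the weight coefficient
            freqs[i] = counts[i] / s_k * w_k
    return freqs

features = kmer_frequencies("ATGGCGTACGTTAGC")    # toy sequence, far shorter than 200 nt
print(len(features))
```

Under this reading the weight grows with k, so the sparser counts of longer patterns are not swamped by the 1-mer and 2-mer frequencies.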
The convolutional neural network model constructed in step 3 has the following structure:
the first layer is a convolutional layer; because each codon consists of three adjacent nucleotides in the mRNA, the convolution kernel is given a 1 x 3 sliding window, the number of convolution kernels is 32, and the activation function is ReLU; the second layer is also a convolutional layer, with 32 convolution kernels of size 1 x 3 and a ReLU activation function; the third layer is a max pooling layer with a pooling region of size 1 x 1; the fourth layer is a fully connected layer with 256 neurons and a ReLU activation function; Dropout with a probability of 0.5 is applied to the fourth and fifth fully connected layers to prevent overfitting; the fifth and sixth layers are also fully connected layers, each with 64 neurons, a ReLU activation function, and Dropout with a probability of 0.5; finally, a softmax function is used as the activation function to obtain the prediction result, and the output is 0 or 1, where 0 represents long non-coding ribonucleic acid and 1 represents messenger ribonucleic acid.
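To make the layer sizes concrete, the following sketch walks the 5460-value input through the layers and counts trainable parameters. It assumes a single input channel, unpadded convolutions with stride 1, and a two-unit softmax output; none of these details is stated in the source, and the helper name `conv1d_out_len` is hypothetical.

```python
def conv1d_out_len(n, kernel=3, stride=1):
    """Output length of an unpadded ("valid") 1D convolution."""
    return (n - kernel) // stride + 1

n_features = 5460                    # one weighted frequency per k-mer pattern
filters = 32
len1 = conv1d_out_len(n_features)    # after the first 1 x 3 convolution
len2 = conv1d_out_len(len1)          # after the second 1 x 3 convolution
len3 = len2 // 1                     # 1 x 1 max pooling leaves the length unchanged
flat = len3 * filters                # values fed to the 256-neuron layer

dense_sizes = [256, 64, 64, 2]       # final 2 = assumed two-way softmax output
params = {"conv1": 3 * 1 * filters + filters,
          "conv2": 3 * filters * filters + filters}
fan_in = flat
for idx, units in enumerate(dense_sizes, start=1):
    params[f"dense{idx}"] = fan_in * units + units   # weights plus biases
    fan_in = units
print(len1, len2, flat)
print(params)
```

Under these assumptions almost all parameters sit in the first fully connected layer, which is consistent with the use of Dropout on the fully connected layers.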
In step 3, during training, the k-mer frequencies of the 5460 k-mer patterns of each selected long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequence are used as the input of the convolutional neural network model and as the basis of model prediction.
In step 3, a callback function is added to the training process to adjust the learning rate dynamically: the initial learning rate is 0.001, the learning rate is automatically modified every epoch, the optimizer is Adam, each training batch contains 128 samples, and training runs for 100 epochs.
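The source does not state the rule by which the callback modifies the learning rate. As one possible illustration only, the sketch below uses exponential decay from the stated initial rate of 0.001 over the stated 100 epochs; the decay factor 0.95 and the name `make_schedule` are hypothetical.

```python
def make_schedule(initial_lr=0.001, decay=0.95):
    """Return a function epoch -> learning rate (exponential decay, an assumed rule)."""
    def schedule(epoch):
        return initial_lr * decay ** epoch
    return schedule

schedule = make_schedule()
EPOCHS, BATCH_SIZE = 100, 128        # values stated in the source
lrs = [schedule(e) for e in range(EPOCHS)]
print(lrs[0], lrs[-1])
```

If a framework such as Keras were used (the source does not name one), a function of this shape could be passed to its `LearningRateScheduler` callback.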
In step 3, inputting the nucleic acid sequences to be distinguished into the prediction model to obtain a distinguishing result is specifically: the k-mer frequencies of the 5460 k-mer patterns of each nucleic acid sequence to be distinguished are input into the prediction model to obtain the distinguishing result.
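The final step, turning the network's softmax output into a 0/1 label, can be sketched as follows. The sketch assumes the network emits two raw scores, one per class; the example scores and the names `softmax`, `classify` and `LABELS` are hypothetical.

```python
import math

LABELS = {0: "long non-coding RNA", 1: "messenger RNA"}

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return (label, probabilities); the label is the index of the larger probability."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs

label, probs = classify([0.3, 2.1])  # hypothetical network outputs for one transcript
print(label, LABELS[label])
```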
The invention has the beneficial effects that:
the invention is alignment-free: it uses the frequencies of k-mers, an inherent feature of RNA-seq sequences, as the computed features, and then uses a deep learning model to predict long non-coding ribonucleic acid and messenger ribonucleic acid, and therefore has high computational efficiency.
The method of the invention distinguishes coding ribonucleic acid from non-coding ribonucleic acid with an accuracy of 97.2%. In addition, because the method uses k-mer frequencies as the input of the deep learning model, it retains the information of the three adjacent nucleotides in a codon and can achieve better sampling and prediction on long-chain ribonucleic acid transcripts.
Drawings
FIG. 1 is a flow chart of a method of the present invention for discriminating between coding and non-coding ribonucleic acids based on deep learning;
FIG. 2 is a block diagram of a convolutional neural network model in the method of the present invention for discriminating between coding and non-coding ribonucleic acids based on deep learning.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The process of the method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning is shown in figure 1 and is specifically implemented according to the following steps:
step 1, downloading human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data from the RefSeq database, then screening long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with a sequence length greater than 200 nt from the transcript data, and randomly selecting the same number of long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts from the screened transcripts;
step 2, converting each of the class-balanced long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences from step 1 into k-mer frequencies, specifically as follows:
first, each transcript sequence is described in terms of k-mer patterns, where a k-mer pattern is a specific string of k nucleotides drawn from the four bases A, T, G and C, and k = 1, 2, 3, 4, 5, 6. When k = 1, there are 4 patterns: A, T, G and C; when k = 2, there are 16 patterns: AA, AT, AC, AG, TA, TT, TC, TG, …, GG; likewise, when k = 3, there are 64 patterns; when k = 4, there are 256 patterns; when k = 5, there are 1024 patterns; and when k = 6, there are 4096 patterns. Each transcript sequence is therefore described by 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 patterns;
then, a sliding window of length k is slid along each transcript sequence with a step size of 1; whenever the string in the sliding window matches one of the 5460 patterns, the number of occurrences of that pattern in the transcript sequence is increased by 1. The number of occurrences of pattern i in a transcript sequence is denoted $c_i$, i = 1, 2, 3, …, 5460;
then, the frequency of occurrence of pattern i in the transcript sequence is calculated according to the following formula:
$$f_i = \frac{c_i}{s_k} \, w_k \qquad (1)$$
wherein $s_k$ is the total number of positions of the k-mer sliding window along the transcript sequence, calculated as follows:
$$s_k = L - k + 1 \qquad (2)$$
wherein L is the length of the transcript sequence;
wherein $w_k$ is a weight coefficient, calculated according to the following formula:
$$w_k = \frac{1}{4^{5-k}} \qquad (3)$$
step 3, constructing a convolutional neural network model, selecting equal numbers of the class-balanced long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences from step 1 as training sample data, training the constructed convolutional neural network model on the training sample data to obtain a prediction model, and inputting the k-mer frequencies of the 5460 k-mer patterns of each nucleic acid sequence to be distinguished into the prediction model to obtain a distinguishing result, wherein, as shown in FIG. 2, the convolutional neural network model has the following structure:
the first layer is a convolutional layer; because each codon consists of three adjacent nucleotides in the mRNA, the convolution kernel is given a 1 x 3 sliding window, the number of convolution kernels is 32, and the activation function is ReLU; the second layer is also a convolutional layer, with 32 convolution kernels of size 1 x 3 and a ReLU activation function; the third layer is a max pooling layer with a pooling region of size 1 x 1; the fourth layer is a fully connected layer with 256 neurons and a ReLU activation function; Dropout with a probability of 0.5 is applied to the fourth and fifth fully connected layers to prevent overfitting; the fifth and sixth layers are also fully connected layers, each with 64 neurons, a ReLU activation function, and Dropout with a probability of 0.5; finally, a softmax function is used as the activation function to obtain the prediction result, and the output is 0 or 1, where 0 represents long non-coding ribonucleic acid and 1 represents messenger ribonucleic acid; during training, the k-mer frequencies of the 5460 k-mer patterns of each selected long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequence are used as the input of the convolutional neural network model and as the basis of model prediction; a callback function is added to the training process to adjust the learning rate dynamically: the initial learning rate is 0.001, the learning rate is automatically modified every epoch, the optimizer is Adam, each training batch contains 128 samples, and training runs for 100 epochs.
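What a single 1 x 3 convolution kernel with ReLU, followed by 1 x 1 max pooling, does to a frequency vector can be illustrated in plain Python. This is a toy sketch: the five-value input, the kernel weights, and the function names are hypothetical, and the convolution is assumed to be unpadded with stride 1.

```python
def conv1d_relu(x, kernel, bias=0.0):
    """Unpadded 1D convolution with stride 1, followed by ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        z = sum(w * v for w, v in zip(kernel, x[i:i + k])) + bias
        out.append(max(0.0, z))      # ReLU keeps positive responses only
    return out

def max_pool1d(x, size=1):
    """Non-overlapping max pooling; with size 1 the input passes through unchanged."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

x = [0.0, 1.0, 2.0, 0.0, 1.0]        # toy stand-in for a 5460-value frequency vector
kernel = [1.0, 0.0, -1.0]            # hypothetical weights; real ones are learned
y = max_pool1d(conv1d_relu(x, kernel))
print(y)
```

In the described model 32 such kernels run in parallel in each convolutional layer, each producing its own output sequence.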
Example 1
Step 1, downloading human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data from the RefSeq database, and screening long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with a sequence length greater than 200 nt, which yields 48471 long non-coding ribonucleic acid transcripts and 62197 messenger ribonucleic acid transcripts; to balance the two classes, 48471 transcripts are randomly selected from the 62197 messenger ribonucleic acid transcripts, and the 48471 randomly selected messenger ribonucleic acid transcripts and the 48471 long non-coding ribonucleic acid transcripts are taken as experimental data;
step 2, calculating the k-mer frequencies of the 48471 messenger ribonucleic acid transcripts and the 48471 long non-coding ribonucleic acid transcripts;
step 3, using the 48471 long non-coding ribonucleic acid transcript sequences and the 48471 messenger ribonucleic acid transcript sequences from step 1 as experimental data, of which 38971 long non-coding ribonucleic acid transcripts and 38971 messenger ribonucleic acid transcripts are selected as training sample data for the model, a further 5000 transcript sequences of each class are taken as a validation data set, and the final 4500 transcript sequences of each class are taken as a test data set; the constructed convolutional neural network model is trained on the training sample data to obtain a prediction model;
in this embodiment, the prediction model achieves an average classification accuracy of 99.5% on the training sample data, 99.7% on the validation data set, and 97.2% on the test data set.
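The per-class split of Example 1 (38971 training, 5000 validation, 4500 test, totalling 48471) and the accuracy measure can be sketched as follows. The source does not say how transcripts were assigned to each subset, so the random shuffle, the seed, and the names `split_class` and `accuracy` are assumptions; integer IDs stand in for transcripts.

```python
import random

def split_class(items, n_train=38971, n_val=5000, n_test=4500, seed=0):
    """Shuffle one class and cut it into train / validation / test subsets."""
    assert len(items) == n_train + n_val + n_test
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)        # random assignment is an assumption
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

ids = list(range(48471))                         # stand-in for one class of transcripts
train, val, test = split_class(ids)
print(len(train), len(val), len(test))
```

Applying the same split to both classes keeps every subset balanced, so plain accuracy is a meaningful summary for each of them.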
Claims (8)
1. A method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning, implemented according to the following steps:
step 1, screening long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences longer than 200 nt from a database, and performing class balancing on the screened long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences;
step 2, converting each of the class-balanced long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences from step 1 into k-mer frequencies;
step 3, constructing a convolutional neural network model, selecting equal numbers of the class-balanced long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences from step 1 as training sample data, training the constructed convolutional neural network model on the training sample data to obtain a prediction model, and inputting the nucleic acid sequences to be distinguished into the prediction model to obtain a distinguishing result.
2. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 1, wherein in step 1, human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data are downloaded from the RefSeq database, and long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with a sequence length greater than 200 nt are then screened from the transcript data.
3. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 2, wherein the class balancing of the screened long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences in step 1 is specifically:
randomly selecting the same number of long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts from the screened transcripts.
4. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 3, wherein in step 2, each of the long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences is converted into k-mer frequencies as follows:
first, each transcript sequence is described in terms of k-mer patterns, where a k-mer pattern is a specific string of k nucleotides drawn from the four bases A, T, G and C, and k = 1, 2, 3, 4, 5, 6. When k = 1, there are 4 patterns: A, T, G and C; when k = 2, there are 16 patterns: AA, AT, AC, AG, TA, TT, TC, TG, …, GG; likewise, when k = 3, there are 64 patterns; when k = 4, there are 256 patterns; when k = 5, there are 1024 patterns; and when k = 6, there are 4096 patterns. Each transcript sequence is therefore described by 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 patterns;
then, a sliding window of length k is slid along each transcript sequence with a step size of 1; whenever the string in the sliding window matches one of the 5460 patterns, the number of occurrences of that pattern in the transcript sequence is increased by 1. The number of occurrences of pattern i in a transcript sequence is denoted $c_i$, i = 1, 2, 3, …, 5460;
then, the frequency of occurrence of pattern i in the transcript sequence is calculated according to the following formula:
$$f_i = \frac{c_i}{s_k} \, w_k \qquad (1)$$
wherein $s_k$ is the total number of positions of the k-mer sliding window along the transcript sequence, calculated as follows:
$$s_k = L - k + 1 \qquad (2)$$
wherein L is the length of the transcript sequence;
wherein $w_k$ is a weight coefficient, calculated according to the following formula:
$$w_k = \frac{1}{4^{5-k}} \qquad (3)$$
5. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 4, wherein the convolutional neural network model constructed in step 3 has the following structure:
the first layer is a convolutional layer; because each codon consists of three adjacent nucleotides in the mRNA, the convolution kernel is given a 1 x 3 sliding window, the number of convolution kernels is 32, and the activation function is ReLU; the second layer is also a convolutional layer, with 32 convolution kernels of size 1 x 3 and a ReLU activation function; the third layer is a max pooling layer with a pooling region of size 1 x 1; the fourth layer is a fully connected layer with 256 neurons and a ReLU activation function; Dropout with a probability of 0.5 is applied to the fourth and fifth fully connected layers to prevent overfitting; the fifth and sixth layers are also fully connected layers, each with 64 neurons, a ReLU activation function, and Dropout with a probability of 0.5; finally, a softmax function is used as the activation function to obtain the prediction result, and the output is 0 or 1, where 0 represents long non-coding ribonucleic acid and 1 represents messenger ribonucleic acid.
6. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 5, wherein in step 3, during training, the k-mer frequencies of the 5460 k-mer patterns of each selected long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequence are used as the input of the convolutional neural network model and as the basis of model prediction.
7. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 6, wherein in step 3, a callback function is added to the training process to adjust the learning rate dynamically: the initial learning rate is 0.001, the learning rate is automatically modified every epoch, the optimizer is Adam, each training batch contains 128 samples, and training runs for 100 epochs.
8. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 7, wherein in step 3, inputting the nucleic acid sequences to be distinguished into the prediction model to obtain a distinguishing result is specifically: the k-mer frequencies of the 5460 k-mer patterns of each nucleic acid sequence to be distinguished are input into the prediction model to obtain the distinguishing result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111008365.5A CN113808671B (en) | 2021-08-30 | 2021-08-30 | Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113808671A (en) | 2021-12-17 |
CN113808671B (en) | 2024-02-06 |
Family
ID=78941981
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111008365.5A Active CN113808671B (en) | 2021-08-30 | 2021-08-30 | Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113808671B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1999066302A2 (en) * | 1998-06-17 | 1999-12-23 | Musc Foundation For Research Development | Recognition of protein coding regions in genomic dna sequences |
CN108595913A (en) * | 2018-05-11 | 2018-09-28 | 武汉理工大学 | Differentiate the supervised learning method of mRNA and lncRNA |
CN111462820A (en) * | 2020-03-31 | 2020-07-28 | 浙江科技学院 | Non-coding RNA prediction method based on feature screening and integration algorithm |
Non-Patent Citations (2)
Title |
---|
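Sun Lei; Xu Chi; Hu Xuelong: "A long non-coding RNA prediction method based on random forest", Journal of Yangzhou University (Natural Science Edition), no. 04 * |
Yang Yang: "Research on identification methods for long non-coding RNA", Intelligent Computer and Applications, no. 03 * |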
Also Published As
Publication number | Publication date |
---|---|
CN113808671B (en) | 2024-02-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108595913B (en) | Supervised learning method for identifying mRNA and lncRNA | |
CN108564117B (en) | SVM-based poverty and life assisting identification method | |
CN106295246A (en) | Find the lncRNA relevant to tumor and predict its function | |
CN111276187B (en) | Gene expression profile feature learning method based on self-encoder | |
CN108920895A (en) | A kind of incidence relation prediction technique of circular rna and disease | |
CN112669905B (en) | RNA sequence coding potential prediction method and system based on data enhancement | |
Chakraborty et al. | Predicting MicroRNA sequence using CNN and LSTM stacked in Seq2Seq architecture | |
Min et al. | TargetNet: functional microRNA target prediction with deep neural networks | |
Nelander et al. | Predictive screening for regulators of conserved functional gene modules (gene batteries) in mammals | |
CN105279396B (en) | The Drought-resistant gene of plant module method of excavation | |
CN110534154B (en) | Whale DNA sequence optimization method based on harmony search | |
CN113808671B (en) | Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning | |
CN113658643B (en) | Method for predicting lncRNA and mRNA based on attention mechanism | |
CN108288074A (en) | A kind of selection method and system of data characteristics | |
Forêt et al. | Characterizing the D2 statistic: word matches in biological sequences | |
CN108596239A (en) | A kind of theme temperature trend forecasting method based on Markov Chain and dynamic backtracking | |
CN115715415A (en) | Variant pathogenicity scoring and classification and uses thereof | |
Cai et al. | Discrete binary adaptive bat algorithm for RNA secondary structure prediction | |
CN109033743B (en) | Method for reducing technical noise in single-cell transcriptome data | |
Mohammed et al. | Novel algorithms for accurate DNA base-calling | |
Xu et al. | The wide and deep flexible neural tree and its ensemble in predicting long non-coding RNA subcellular localization | |
CN116994645B (en) | Prediction method of piRNA and mRNA target pair based on interactive reasoning network | |
Datta | Statistical techniques for microarray data: A partial overview | |
CN112786112B (en) | Method and system for predicting combination of lncRNA and target DNA | |
CN112989918B (en) | On-line electroencephalogram signal prediction method based on kernel recursive least square adaptive tracking algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||