
CN113808671A - Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning - Google Patents


Info

Publication number
CN113808671A
Authority
CN
China
Prior art keywords
ribonucleic acid
coding
transcript sequence
layer
long non
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111008365.5A
Other languages
Chinese (zh)
Other versions
CN113808671B (en)
Inventor
李爱民
熊思琪
周红芳
费蓉
刘雅君
王竹荣
魏嵬
袁细国
黑新宏
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202111008365.5A
Publication of CN113808671A
Application granted
Publication of CN113808671B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a deep-learning-based method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid, comprising the following steps: screening long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences longer than 200 nt from a database, applying class balance processing to the screened sequences, and converting each processed transcript sequence into k-mer frequencies; then constructing a convolutional neural network model, using the class-balanced long non-coding ribonucleic acid and messenger ribonucleic acid transcript sequences as training sample data, training the constructed model on that data to obtain a prediction model, and inputting the nucleic acid sequence to be distinguished into the prediction model to obtain a distinguishing result. The invention addresses two problems of the prior art: sensitivity to poor gene annotation and high computation time.

Description

Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning
Technical Field
The invention belongs to the technical field of computational bioinformatics, and relates to a method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning.
Background
Long-chain ribonucleic acids are RNA molecules whose transcripts are longer than 200 bases; they include long-chain coding ribonucleic acids and long non-coding ribonucleic acids. Long non-coding RNAs do not encode proteins and were initially thought to be noise of genome transcription: byproducts of RNA polymerase transcription with no biological function. However, recent studies have shown that long non-coding ribonucleic acids are involved in many important regulatory processes, such as chromatin modification, transcriptional activation, and transcriptional interference. Many methods currently exist for distinguishing long-chain coding from non-coding transcript sequences. They rely mainly on open reading frame features, evolutionary features, and the like; they are affected by poor gene annotation and are computationally time-consuming.
Disclosure of Invention
The invention aims to provide a deep-learning-based method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid, which solves the problems that prior-art methods are affected by poor gene annotation and consume a large amount of computation time.
The technical scheme adopted by the invention is a method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning, which is implemented according to the following steps:
step 1, screening long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences longer than 200 nt from a database, and applying class balance processing to the screened long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences;
step 2, converting each of the class-balanced long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences from step 1 into k-mer frequencies;
step 3, constructing a convolutional neural network model, selecting equal numbers of the class-balanced long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences from step 1 as training sample data, training the constructed convolutional neural network model on the training sample data to obtain a prediction model, and inputting the nucleic acid sequence to be distinguished into the prediction model to obtain a distinguishing result.
The invention is further characterized as follows.
In step 1, human long non-coding RNA transcript data and messenger RNA transcript data are downloaded from the RefSeq database, and long non-coding RNA transcripts and messenger RNA transcripts with a sequence length greater than 200 nt are then screened from the transcript data.
In step 1, the class balance processing of the screened long non-coding RNA transcript sequences and messenger RNA transcript sequences is:
randomly selecting the same number of long non-coding RNA transcripts and messenger RNA transcripts from the screened long non-coding RNA transcripts and messenger RNA transcripts.
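The length screening and class balancing of step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the inventors' code: the function name `filter_and_balance`, the in-memory lists of sequence strings, and the fixed random seed are all assumptions made for the example.

```python
import random

def filter_and_balance(lncrna, mrna, min_len=200, seed=0):
    """Keep transcripts longer than min_len nt, then draw equal numbers per class."""
    lnc = [s for s in lncrna if len(s) > min_len]
    m = [s for s in mrna if len(s) > min_len]
    n = min(len(lnc), len(m))      # size of the smaller class after screening
    rng = random.Random(seed)      # fixed seed is an assumption, for reproducibility
    return rng.sample(lnc, n), rng.sample(m, n)
```

Sampling without replacement from both classes leaves the smaller class intact (in a random order) and downsamples the larger one to the same size.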
In step 2, each of the long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences is converted into k-mer frequencies as follows:
first, each transcript sequence is converted into k-mer patterns. A k-mer pattern is a specific character string of k nucleotides, each drawn from the four bases A, T, G and C, where k = 1, 2, 3, 4, 5, 6. When k = 1 there are 4 patterns: A, T, G and C; when k = 2 there are 16 patterns: AA, AT, AC, AG, TA, TT, TC, TG, ..., GG; and so on, with 64 patterns when k = 3, 256 when k = 4, 1024 when k = 5, and 4096 when k = 6. Each transcript sequence therefore has 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 patterns;
then, a sliding window of length k is moved along each transcript sequence with a step size of 1. Whenever the character string in the window matches one of the 5460 patterns, the occurrence count of that pattern in the transcript sequence is increased by 1. Let $c_i$, $i = 1, 2, 3, \dots, 5460$, denote the number of occurrences of pattern i in a transcript sequence;
then, the frequency of occurrence of pattern i in the transcript sequence is calculated according to formula (1):
[Formula (1), image BDA0003236690990000031]
where $s_k$, the total number of positions of the k-mer sliding window along the transcript sequence, is calculated as follows:
$s_k = L - k + 1$ (2)
where L is the length of the transcript sequence,
and $w_k$, the weight coefficient, is calculated according to the following formula:
$w_k = \dfrac{1}{4^{5-k}}$ (3).
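Formula (1) appears in this text only as an image reference. One reading that is consistent with the definitions of the occurrence count, the window count of formula (2), and the weight of formula (3) is sketched below; the symbol $f_i$ and the product form are assumptions, not taken from the source.

```latex
f_i = \frac{c_i}{s_k}\, w_k, \qquad
s_k = L - k + 1, \qquad
w_k = \frac{1}{4^{5-k}}, \qquad k = 1, \dots, 6
```

Under this reading, a 3-mer that occurs $c_i = 10$ times in a transcript of length $L = 1000$ has $s_3 = 998$ and $w_3 = 1/16$, so $f_i = 10 / (998 \cdot 16) \approx 6.3 \times 10^{-4}$. The weight grows with k, which offsets the fact that longer patterns are individually rarer.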
The convolutional neural network model constructed in step 3 has the following structure:
the first layer is a convolution layer. Because each codon in mRNA consists of three adjacent nucleotides, a convolution kernel with a 1 x 3 sliding window is used, with 32 kernels and ReLU as the activation function. The second layer is also a convolution layer, with 32 kernels of size 1 x 3 and ReLU activation. The third layer is a max pooling layer with a pooling region of size 1 x 1. The fourth layer is a fully connected layer with 256 neurons and ReLU activation; Dropout with probability 0.5 is applied to the fully connected layers of the fourth and fifth layers to prevent overfitting. The fifth and sixth layers are also fully connected layers, each with 64 neurons, ReLU activation, and Dropout with probability 0.5. Finally, a softmax function is used as the activation function to obtain the prediction result; the output is 0 or 1, where 0 represents long non-coding ribonucleic acid and 1 represents messenger ribonucleic acid.
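To make the layer stack concrete, the following pure-Python sketch traces tensor shapes and trainable weight counts through the layers described above. Several details are assumptions because the text does not state them: a single input channel of 5460 features, no padding, stride 1, a flatten step before the first fully connected layer, and a 2-unit softmax output. The names `conv1d_out_len` and `layer_summary` are hypothetical.

```python
def conv1d_out_len(n, kernel=3, stride=1):
    """Output length of an unpadded ('valid') 1-D convolution."""
    return (n - kernel) // stride + 1

def layer_summary(n_features=5460, n_filters=32, kernel=3):
    """Return (name, output shape, weight count) for each layer of the stack."""
    rows = []
    length = conv1d_out_len(n_features, kernel)
    # conv1 sees one input channel (assumption)
    rows.append(("conv1", (length, n_filters), kernel * 1 * n_filters + n_filters))
    length = conv1d_out_len(length, kernel)
    rows.append(("conv2", (length, n_filters),
                 kernel * n_filters * n_filters + n_filters))
    # a pooling region of size 1 leaves the shape unchanged and has no weights
    rows.append(("maxpool_1x1", (length, n_filters), 0))
    fan_in = length * n_filters  # flatten before the dense layers (assumption)
    for name, units in [("dense_256", 256), ("dense_64_a", 64),
                        ("dense_64_b", 64), ("softmax_2", 2)]:
        rows.append((name, (units,), fan_in * units + units))
        fan_in = units
    return rows

for name, shape, n_weights in layer_summary():
    print(f"{name:12s} {str(shape):12s} {n_weights}")
```

Under these assumptions almost all weights sit in the first fully connected layer, because the 1 x 1 pooling does not shrink the feature map before flattening.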
In step 3, during training, the k-mer frequencies of the 5460 k-mer patterns of the selected long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences are used as the input of the convolutional neural network model and as the basis of model prediction.
In step 3, a callback function is added during training to adjust the learning rate dynamically: the initial learning rate is 0.001 and the learning rate is automatically modified every epoch. The optimizer is Adam, 128 samples are selected per training batch, and training runs for 100 epochs.
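The text gives the initial learning rate, the per-epoch adjustment, and the batch size, but not the adjustment rule itself. The sketch below therefore uses a hypothetical exponential decay (the `decay` factor of 0.95 is an assumption chosen only for illustration) to show what a per-epoch schedule function might look like, together with the number of mini-batches per epoch for a given training set size.

```python
import math

def lr_for_epoch(epoch, initial_lr=0.001, decay=0.95):
    """Hypothetical schedule: multiply the learning rate by `decay` each epoch."""
    return initial_lr * decay ** epoch

def steps_per_epoch(n_samples, batch_size=128):
    """Number of mini-batches needed to see every training sample once."""
    return math.ceil(n_samples / batch_size)

# Learning rate for the first few of the 100 epochs.
for epoch in range(3):
    print(epoch, lr_for_epoch(epoch))
```

A function of this shape is what a learning-rate callback in a deep learning framework would typically call at the start of each epoch; which framework and which rule the inventors used is not stated.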
In step 3, inputting the nucleic acid sequence to be distinguished into the prediction model to obtain a distinguishing result specifically means: the k-mer frequencies of the 5460 k-mer patterns of the nucleic acid sequence to be distinguished are input into the prediction model to obtain the distinguishing result.
The invention has the beneficial effects that:
the invention uses an alignment-free approach: the frequencies of k-mers, an intrinsic feature of RNA-seq sequences, serve as the computational features, and a deep learning model then predicts long non-coding ribonucleic acid and messenger ribonucleic acid, which gives high computational efficiency.
The method of the invention distinguishes coding ribonucleic acid from non-coding ribonucleic acid with an accuracy of 97.2%. In addition, the method uses k-mer features as the input of the deep learning model, which retains the information of the three adjacent nucleotides in a codon and allows better sampling and prediction on long-chain ribonucleic acid transcripts.
Drawings
FIG. 1 is a flow chart of a method of the present invention for discriminating between coding and non-coding ribonucleic acids based on deep learning;
FIG. 2 is a block diagram of a convolutional neural network model in the method of the present invention for discriminating between coding and non-coding ribonucleic acids based on deep learning.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The flow of the method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning is shown in FIG. 1, and the method is implemented according to the following steps:
step 1, downloading human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data from the RefSeq database, then screening long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with a sequence length greater than 200 nt from the transcript data, and randomly selecting the same number of long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts from the screened transcripts;
step 2, converting each of the class-balanced long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences from step 1 into k-mer frequencies, specifically as follows:
first, each transcript sequence is converted into k-mer patterns. A k-mer pattern is a specific character string of k nucleotides, each drawn from the four bases A, T, G and C, where k = 1, 2, 3, 4, 5, 6. When k = 1 there are 4 patterns: A, T, G and C; when k = 2 there are 16 patterns: AA, AT, AC, AG, TA, TT, TC, TG, ..., GG; and so on, with 64 patterns when k = 3, 256 when k = 4, 1024 when k = 5, and 4096 when k = 6. Each transcript sequence therefore has 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 patterns;
then, a sliding window of length k is moved along each transcript sequence with a step size of 1. Whenever the character string in the window matches one of the 5460 patterns, the occurrence count of that pattern in the transcript sequence is increased by 1. Let $c_i$, $i = 1, 2, 3, \dots, 5460$, denote the number of occurrences of pattern i in a transcript sequence;
then, the frequency of occurrence of pattern i in the transcript sequence is calculated according to formula (1):
[Formula (1), image BDA0003236690990000061]
where $s_k$, the total number of positions of the k-mer sliding window along the transcript sequence, is calculated as follows:
$s_k = L - k + 1$ (2)
where L is the length of the transcript sequence,
and $w_k$, the weight coefficient, is calculated according to the following formula:
$w_k = \dfrac{1}{4^{5-k}}$ (3).
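The feature extraction of step 2 can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the inventors' implementation: formula (1) is not reproduced in this text, so the code assumes the frequency is the count divided by the window total of formula (2) and multiplied by the weight of formula (3). The ordering of the 5460 patterns, the mapping of U to T, and the skipping of windows that contain other symbols are also assumptions.

```python
from itertools import product

BASES = "ATGC"
# All 5460 patterns for k = 1..6 in a fixed order (the ordering is an assumption).
PATTERNS = ["".join(p) for k in range(1, 7) for p in product(BASES, repeat=k)]
INDEX = {p: i for i, p in enumerate(PATTERNS)}

def kmer_frequencies(seq):
    """Return the 5460 weighted k-mer frequencies of one transcript sequence."""
    seq = seq.upper().replace("U", "T")   # treat RNA and cDNA alphabets alike
    freqs = [0.0] * len(PATTERNS)
    for k in range(1, 7):
        s_k = len(seq) - k + 1            # window positions, formula (2)
        if s_k <= 0:
            continue                      # sequence shorter than the window
        w_k = 1.0 / 4 ** (5 - k)          # weight coefficient, formula (3)
        for start in range(s_k):          # sliding window with step size 1
            i = INDEX.get(seq[start:start + k])
            if i is not None:             # windows with other symbols are skipped
                freqs[i] += w_k / s_k     # assumed form of formula (1)
    return freqs

features = kmer_frequencies("ATGGCGTACGTTAGC" * 20)
print(len(features), sum(1 for f in features if f > 0))
```

The resulting vector of length 5460 is what the text describes as the model input for both training and prediction.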
step 3, constructing a convolutional neural network model, selecting equal numbers of the class-balanced long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences from step 1 as training sample data, training the constructed convolutional neural network model on the training sample data to obtain a prediction model, and inputting the k-mer frequencies of the 5460 k-mer patterns of the nucleic acid sequence to be distinguished into the prediction model to obtain a distinguishing result. As shown in FIG. 2, the convolutional neural network model has the following structure:
the first layer is a convolution layer. Because each codon in mRNA consists of three adjacent nucleotides, a convolution kernel with a 1 x 3 sliding window is used, with 32 kernels and ReLU as the activation function. The second layer is also a convolution layer, with 32 kernels of size 1 x 3 and ReLU activation. The third layer is a max pooling layer with a pooling region of size 1 x 1. The fourth layer is a fully connected layer with 256 neurons and ReLU activation; Dropout with probability 0.5 is applied to the fully connected layers of the fourth and fifth layers to prevent overfitting. The fifth and sixth layers are also fully connected layers, each with 64 neurons, ReLU activation, and Dropout with probability 0.5. Finally, a softmax function is used as the activation function to obtain the prediction result; the output is 0 or 1, where 0 represents long non-coding ribonucleic acid and 1 represents messenger ribonucleic acid. During training, the k-mer frequencies of the 5460 k-mer patterns of the selected long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences are used as the input of the convolutional neural network model and as the basis of model prediction. A callback function is added during training to adjust the learning rate dynamically: the initial learning rate is 0.001 and the learning rate is automatically modified every epoch. The optimizer is Adam, 128 samples are selected per training batch, and training runs for 100 epochs.
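The final step, turning the softmax output into the label 0 or 1, can be illustrated with a small stand-alone sketch. The two-score input and the helper names `softmax` and `predict_label` are assumptions for the example; only the class encoding (0 for long non-coding RNA, 1 for messenger RNA) comes from the text.

```python
import math

LABELS = {0: "lncRNA", 1: "mRNA"}  # class encoding stated in the text

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(scores):
    """Return (class index, class name, probability) of the most likely class."""
    probs = softmax(scores)
    cls = max(range(len(probs)), key=probs.__getitem__)
    return cls, LABELS[cls], probs[cls]

print(predict_label([0.2, 2.5]))
```

Taking the index of the larger probability is what reduces the two softmax outputs to the single 0 or 1 result described above.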
Example 1
Step 1, downloading human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data from the RefSeq database, and screening long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with a sequence length greater than 200 nt, which yields 48471 long non-coding ribonucleic acid transcripts and 62197 messenger ribonucleic acid transcripts. To balance the numbers of the two classes, 48471 transcripts are randomly selected from the 62197 messenger ribonucleic acid transcripts, and the 48471 randomly selected messenger ribonucleic acid transcripts together with the 48471 long non-coding ribonucleic acid transcripts are taken as the experimental data;
step 2, calculating the k-mer frequencies of the 48471 messenger ribonucleic acid transcripts and the 48471 long non-coding ribonucleic acid transcripts;
step 3, using the 48471 long non-coding ribonucleic acid transcript sequences and 48471 messenger ribonucleic acid transcript sequences from step 1 as experimental data, of which 38971 long non-coding ribonucleic acid transcripts and 38971 messenger ribonucleic acid transcripts are selected as the training sample data of the model, a further 5000 transcript sequences of each class are taken as the validation data set, and the remaining 4500 transcript sequences of each class are taken as the test data set; the constructed convolutional neural network model is trained on the training sample data to obtain a prediction model;
in this embodiment, inputting the training sample data into the prediction model gives an average classification accuracy of 99.5%, inputting the validation data set gives an average classification accuracy of 99.7%, and inputting the test data set gives an average classification accuracy of 97.2%.
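The per-class split used in step 3 of the example (38971 training, 5000 validation and 4500 test transcripts, which together account for all 48471 transcripts of a class) can be sketched as below. The shuffle, the fixed seed, and the function name `split_per_class` are assumptions; the text does not say how the subsets were drawn.

```python
import random

def split_per_class(items, n_train=38971, n_val=5000, n_test=4500, seed=0):
    """Shuffle one class and cut it into train / validation / test parts."""
    if len(items) < n_train + n_val + n_test:
        raise ValueError("not enough items for the requested split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Stand-in IDs for the 48471 transcripts of one class.
train, val, test = split_per_class(range(48471))
print(len(train), len(val), len(test))
```

Applying the same function to each class separately keeps the three subsets balanced between long non-coding RNA and messenger RNA.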

Claims (8)

1. A method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning, implemented according to the following steps:
step 1, screening long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences longer than 200 nt from a database, and applying class balance processing to the screened long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences;
step 2, converting each of the class-balanced long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences from step 1 into k-mer frequencies;
step 3, constructing a convolutional neural network model, selecting equal numbers of the class-balanced long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences from step 1 as training sample data, training the constructed convolutional neural network model on the training sample data to obtain a prediction model, and inputting the nucleic acid sequence to be distinguished into the prediction model to obtain a distinguishing result.
2. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 1, wherein in step 1, human long non-coding ribonucleic acid transcript data and messenger ribonucleic acid transcript data are downloaded from the RefSeq database, and long non-coding ribonucleic acid transcripts and messenger ribonucleic acid transcripts with a sequence length greater than 200 nt are then screened from the transcript data.
3. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 2, wherein the class balance processing in step 1 of the screened long non-coding RNA transcript sequences and messenger RNA transcript sequences is:
randomly selecting the same number of long non-coding RNA transcripts and messenger RNA transcripts from the screened long non-coding RNA transcripts and messenger RNA transcripts.
4. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 3, wherein in step 2 each of the long non-coding ribonucleic acid transcript sequences and messenger ribonucleic acid transcript sequences is converted into k-mer frequencies as follows:
first, each transcript sequence is converted into k-mer patterns. A k-mer pattern is a specific character string of k nucleotides, each drawn from the four bases A, T, G and C, where k = 1, 2, 3, 4, 5, 6. When k = 1 there are 4 patterns: A, T, G and C; when k = 2 there are 16 patterns: AA, AT, AC, AG, TA, TT, TC, TG, ..., GG; and so on, with 64 patterns when k = 3, 256 when k = 4, 1024 when k = 5, and 4096 when k = 6. Each transcript sequence therefore has 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 patterns;
then, a sliding window of length k is moved along each transcript sequence with a step size of 1. Whenever the character string in the window matches one of the 5460 patterns, the occurrence count of that pattern in the transcript sequence is increased by 1. Let $c_i$, $i = 1, 2, 3, \dots, 5460$, denote the number of occurrences of pattern i in a transcript sequence;
then, the frequency of occurrence of pattern i in the transcript sequence is calculated according to formula (1):
[Formula (1), image FDA0003236690980000021]
where $s_k$, the total number of positions of the k-mer sliding window along the transcript sequence, is calculated as follows:
$s_k = L - k + 1$ (2)
where L is the length of the transcript sequence,
and $w_k$, the weight coefficient, is calculated according to the following formula:
$w_k = \dfrac{1}{4^{5-k}}$ (3).
5. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 4, wherein the convolutional neural network model constructed in step 3 has the following structure:
the first layer is a convolution layer. Because each codon in mRNA consists of three adjacent nucleotides, a convolution kernel with a 1 x 3 sliding window is used, with 32 kernels and ReLU as the activation function. The second layer is also a convolution layer, with 32 kernels of size 1 x 3 and ReLU activation. The third layer is a max pooling layer with a pooling region of size 1 x 1. The fourth layer is a fully connected layer with 256 neurons and ReLU activation; Dropout with probability 0.5 is applied to the fully connected layers of the fourth and fifth layers to prevent overfitting. The fifth and sixth layers are also fully connected layers, each with 64 neurons, ReLU activation, and Dropout with probability 0.5. Finally, a softmax function is used as the activation function to obtain the prediction result; the output is 0 or 1, where 0 represents long non-coding ribonucleic acid and 1 represents messenger ribonucleic acid.
6. The method of claim 5, wherein during training in step 3, the k-mer frequencies of the 5460 k-mer patterns of the selected long non-coding RNA transcript sequences and messenger RNA transcript sequences are used as the input of the convolutional neural network model and as the basis for model prediction.
7. The method of claim 6, wherein in step 3 a callback function is added during training to dynamically adjust the learning rate, the initial learning rate is 0.001, the learning rate is automatically modified every epoch, the optimizer is Adam, 128 samples are selected per training batch, and training runs for 100 epochs.
8. The method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning of claim 7, wherein in step 3, inputting the nucleic acid sequence to be distinguished into the prediction model to obtain the distinguishing result specifically comprises: inputting the k-mer frequencies of the 5460 k-mer patterns of the nucleic acid sequence to be distinguished into the prediction model to obtain the distinguishing result.
CN202111008365.5A 2021-08-30 2021-08-30 Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning Active CN113808671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111008365.5A CN113808671B (en) 2021-08-30 2021-08-30 Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111008365.5A CN113808671B (en) 2021-08-30 2021-08-30 Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning

Publications (2)

Publication Number Publication Date
CN113808671A true CN113808671A (en) 2021-12-17
CN113808671B CN113808671B (en) 2024-02-06

Family

ID=78941981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111008365.5A Active CN113808671B (en) 2021-08-30 2021-08-30 Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning

Country Status (1)

Country Link
CN (1) CN113808671B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999066302A2 (en) * 1998-06-17 1999-12-23 Musc Foundation For Research Development Recognition of protein coding regions in genomic dna sequences
CN108595913A (en) * 2018-05-11 2018-09-28 武汉理工大学 Differentiate the supervised learning method of mRNA and lncRNA
CN111462820A (en) * 2020-03-31 2020-07-28 浙江科技学院 Non-coding RNA prediction method based on feature screening and integration algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
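孙磊; 许驰; 胡学龙: "A random forest-based method for predicting long non-coding RNA", Journal of Yangzhou University (Natural Science Edition), no. 04 *
杨阳: "Research on identification methods for long non-coding RNA", Intelligent Computer and Applications, no. 03 *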

Also Published As

Publication number Publication date
CN113808671B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN108595913B (en) Supervised learning method for identifying mRNA and lncRNA
CN108564117B (en) SVM-based poverty and life assisting identification method
CN106295246A (en) Find the lncRNA relevant to tumor and predict its function
CN111276187B (en) Gene expression profile feature learning method based on self-encoder
CN108920895A (en) A kind of incidence relation prediction technique of circular rna and disease
CN112669905B (en) RNA sequence coding potential prediction method and system based on data enhancement
Chakraborty et al. Predicting MicroRNA sequence using CNN and LSTM stacked in Seq2Seq architecture
Min et al. TargetNet: functional microRNA target prediction with deep neural networks
Nelander et al. Predictive screening for regulators of conserved functional gene modules (gene batteries) in mammals
CN105279396B (en) The Drought-resistant gene of plant module method of excavation
CN110534154B (en) Whale DNA sequence optimization method based on harmony search
CN113808671B (en) Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning
CN113658643B (en) Method for predicting lncRNA and mRNA based on attention mechanism
CN108288074A (en) A kind of selection method and system of data characteristics
Forêt et al. Characterizing the D2 statistic: word matches in biological sequences
CN108596239A (en) A kind of theme temperature trend forecasting method based on Markov Chain and dynamic backtracking
CN115715415A (en) Variant pathogenicity scoring and classification and uses thereof
Cai et al. Discrete binary adaptive bat algorithm for RNA secondary structure prediction
CN109033743B (en) Method for reducing technical noise in single-cell transcriptome data
Mohammed et al. Novel algorithms for accurate DNA base-calling
Xu et al. The wide and deep flexible neural tree and its ensemble in predicting long non-coding RNA subcellular localization
CN116994645B (en) Prediction method of piRNA and mRNA target pair based on interactive reasoning network
Datta Statistical techniques for microarray data: A partial overview
CN112786112B (en) Method and system for predicting combination of lncRNA and target DNA
CN112989918B (en) On-line electroencephalogram signal prediction method based on kernel recursive least square adaptive tracking algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant