WO1999066302A2 - Recognition of protein coding regions in genomic dna sequences - Google Patents
- Publication number
- WO1999066302A2, PCT/US1999/013705
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- coding
- sequence
- neighboring
- stream
- neural network
- Prior art date
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Definitions
- GRAIL (Uberbacher and Mural 1991; Xu et al 1994)
- GeneID (Guigo et al 1992)
- GeneParser (Snyder and Stormo 1993, 1995)
- GenLang (Dong and Searls 1994)
- Genie (Kulp et al 1996)
- VEIL (Henderson et al 1997)
- GENSCAN (Burge and Karlin 1997) used hidden Markov models to model human gene structure. Since the performance of these programs is still not satisfactory (see review in Burset and Guigo 1996), the development of new methods, and/or the improvement of existing methods, remain important objectives.
- a sequence of nucleotides within a DNA sequence may have associated therewith several variables, referred to as "content variables," that are thought to be useful for discriminating between coding regions and non-coding regions.
- Known approaches for combining content variables include classic linear discriminant methods (Solovyev et al 1994) and feedforward neural networks (Snyder and Stormo 1993, 1995; Guigo et al 1992; Xu et al 1994).
- Feedforward neural networks benefit from the fact that they may be trained using gradient descent optimization algorithms such as the backpropagation algorithm.
- neural networks with feedbacks may provide significant advantages over purely feedforward networks. Feedbacks provide recursive computation and the ability to represent state information. In some cases, a neural network with feedbacks may be equivalent to a much larger feedforward neural network. Neural networks with feedbacks are generally referred to as recurrent neural networks.
- In general, the use of recurrent neural networks has not been nearly as extensive as that of feedforward neural networks. A primary reason for the under-utilization of recurrent neural networks is the difficulty involved in developing generally applicable learning algorithms for them. Because the gradient of the error with respect to the connection strengths is not easily computed for recurrent neural networks, gradient-based optimization algorithms are not always applicable. As a result, the benefits of recurrent neural networks over purely feedforward neural networks have not been exploited with regard to extracting information from content variables of nucleotide sequences in order to identify coding/non-coding regions.
- the present invention provides a coding sensor that utilizes a recurrent neural network model.
- the coding sensor indicates the coding potential of a gene sequence and plays a vital role in the overall prediction of the gene structure.
- a DNA sequence may be imagined as comprising a discrete chain of nucleotides.
- the recognition of the potential coding regions in a DNA sequence may be achieved by determining whether each individual nucleotide position in the sequence is in a coding region.
- Determining whether an individual nucleotide position is in a coding region may be accomplished through a systematic sampling process carried out along the nucleotide chain from start to end. At each nucleotide position, content variables are calculated based on a window centered on the nucleotide position. As mentioned, content variables are thought to be useful for discriminating between coding regions and non-coding regions.
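The sampling process described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the names `sample_positions` and `compute_content_variables` are hypothetical, and a trivial stand-in content variable (C+G fraction) is used in place of the nine variables defined later.

```python
def sample_positions(sequence, window_size, compute_content_variables):
    """Yield (position, content_variables) for every nucleotide position
    whose centered window fits entirely inside the sequence."""
    half = window_size // 2
    for pos in range(half, len(sequence) - half):
        window = sequence[pos - half: pos + half + 1]  # window centered on pos
        yield pos, compute_content_variables(window)

# Toy stand-in content variable: fraction of C and G bases in the window.
cg = lambda w: [(w.count("C") + w.count("G")) / len(w)]

seq = "ACGT" * 30
samples = list(sample_positions(seq, 43, cg))  # 43 bp window, as in the text
```

The first sampled position is 21 (0-based), the center of the first full 43 bp window, and the process advances one nucleotide at a time, as described.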
- the present invention combines the calculated content variables in a specific way in order to provide an overall "coding sensor value.”
- the coding sensor value indicates whether or not the nucleotide position is in a coding region. Coding sensor values for each nucleotide position in the DNA sequence are aligned with the overall DNA sequence to generate a coding/non-coding picture of the DNA sequence.
- coding nucleotides i.e. nucleotides in a coding region
- Identifying "transition characteristics" between neighboring segments of a DNA sequence may provide additional information that is useful for detecting coding regions. In other words, detecting whether a particular nucleotide position is in a coding or non- coding region may depend not only on information determined from its own content variables but also information determined from the content variables of nearby nucleotides.
- the invention provides a novel method for using a recurrent neural network to determine up-stream and down-stream transition characteristics between nucleotide chains in a DNA sequence. Transition characteristics may be used to assist the coding sensor of the present invention in finding potential protein coding regions in unannotated genomic DNA sequences.
- FIG. 2 shows an illustrative recurrent neural network architecture in accordance with an exemplary embodiment of the present invention.
- FIG. 3, comprising FIG. 3A, FIG. 3B, FIG. 3C, FIG. 3D, FIG. 3E, FIG. 3F, FIG. 3G, FIG. 3H and FIG. 3I, shows one-dimensional distributions of nine content variables in accordance with an exemplary embodiment of the present invention.
- FIG. 4 comprising FIG. 4A and FIG. 4B, shows coding differentials of an exemplary data test set.
- FIG. 5 comprising FIG. 5A and FIG. 5B, shows coding differentials of an exemplary Burset/Guigo data set.
- FIG. 6 illustrates exemplary results obtained by operation of an exemplary embodiment of the present invention.
- FIG. 1, comprising FIG. 1A and FIG. 1B, and the following discussion are intended to provide a brief and general description of a suitable computing environment for implementing the present invention.
- neural networks are implemented in a computer environment.
- the computer 100 includes a processor 122, a system memory 120, and an Input/Output ("I/O") bus 126.
- a system bus 121 couples the central processing unit 122 to the system memory 120.
- a bus controller 123 controls the flow of data on the I/O bus 126 and between the central processing unit 122 and a variety of internal and external I/O devices.
- the I/O devices connected to the I/O bus 126 may have direct access to the system memory 120 using a Direct Memory Access (“DMA”) controller 124.
- DMA Direct Memory Access
- the I/O devices are connected to the I/O bus 126 via a set of device interfaces.
- the device interfaces may include both hardware components and software components.
- a hard disk drive 130 and a floppy disk drive 132 for reading or writing removable media 150 may be connected to the I/O bus 126 through disk drive controllers 140.
- An optical disk drive 134 for reading or writing optical media 152 may be connected to the I/O bus 126 using a Small Computer System Interface ("SCSI") 141.
- SCSI Small Computer System Interface
- an IDE (ATAPI) or EIDE interface may be associated with an optical drive, as may be the case with a CD-ROM drive.
- the drives and their associated computer-readable media provide nonvolatile storage for the computer 100.
- other types of computer-readable media may also be used, such as ZIP drives, or the like.
- a display device 153 such as a monitor, is connected to the I/O bus 126 via another interface, such as a video adapter 142.
- a parallel interface 143 connects synchronous peripheral devices, such as a laser printer 156, to the I/O bus 126.
- a serial interface 144 connects communication devices to the I/O bus 126.
- a user may enter commands and information into the computer 100 via the serial interface 144 or by using an input device, such as a keyboard 138, a mouse 136 or a modem 157.
- Other peripheral devices may also be connected to the computer 100, such as audio input output devices or image capture devices.
- a number of program modules may be stored on the drives and in the system memory 120.
- the system memory 120 can include both Random Access Memory (“RAM”) and Read Only Memory (“ROM”).
- the program modules control how the computer 100 functions and interacts with the user, with I/O devices or with other computers.
- Program modules include routines, operating systems 165, application programs, data structures, and other software or firmware components.
- the present invention may comprise one or more coding sensor program modules 170 stored on the drives or in the system memory 120 of the computer 100. Coding sensor modules 170 may comprise one or more content variable calculation program modules 170A, one or more recurrent neural network program modules 170B, and one or more post-processing and prediction program modules 170C.
- Coding sensor program module(s) 170 may thus comprise computer-executable instructions for calculating content variables, analyzing content variables with a recurrent neural network model, and post-processing the output of the neural network model in order to predict whether a nucleotide position is in a coding region, according to exemplary methods to be described herein.
- the computer 100 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer 160.
- the remote computer 160 may be a server, a router, a peer device or other common network node, and typically includes many or all of the elements described in connection with the computer 100.
- program modules and data may be stored on the remote computer 160.
- the logical connections depicted in FIG. 1 include a local area network ("LAN") 154 and a wide area network (“WAN”) 155.
- a network interface 145 such as an Ethernet adapter card, can be used to connect the computer 100 to the remote computer 160.
- the computer 100 may use a telecommunications device, such as a modem 157, to establish a connection.
- a telecommunications device such as a modem 157
- the network connections shown are illustrative and other means of establishing a communications link between the computers may be used.
- FIG. 1B provides a graphical demonstration of the processing performed by the exemplary coding sensor program module 170.
- a DNA sequence 180 is sampled using a sliding window technique, whereby a window 185 is advanced one nucleotide at a time.
- content variables are calculated by the content variable computation program module 170A.
- Content variables for a current window, as well as the content variables for up-stream and down-stream windows, are input to the recurrent neural network program module 170B.
- the output from the recurrent neural network program module 170B is input to the post-processing and prediction program module 170C in order to account for noise, etc.
- the output from the recurrent neural network program module 170B represents a coding potential or a coding score, referred to herein as a coding sensor value. Coding sensor values for each nucleotide position are subsequently concatenated to determine a coding/non-coding picture of the DNA sequence.
- a neural network consists of a number of inter-connected computational neurons that operate in parallel to produce an output result. While each neuron within a neural network operates independently, the inputs and/or output of the neurons are connected to one another and are assigned a weight. The manner in which weights are assigned to each neuron determines the behavior of the neural network.
- a neural network may be trained by altering the values of the weights in a well-defined manner, described by a learning rule. As an example, a neural network may be trained to map a set of input patterns onto a set of output patterns.
- One method of training a neural network is referred to as "supervised learning.”
- Supervised learning employs an external teacher and requires a knowledge of the desired responses to input signals. The goal of supervised learning is to minimize the error between the desired output neuron values and computed output neuron values. The value of an output signal of a neuron depends upon the activation of the neuron, which is expressed as an output transfer function.
- the architecture of a neural network is formed by organizing neurons into layers. There may be connections between neurons in the same layer and connections between neurons in different layers. Interlayer connections allow the propagation of signals in one direction or in both directions.
- Input neurons receive signals from external sources and send output signals to other neurons.
- Output neurons receive signals from other neurons and send signals to the environment.
- Hidden neurons have no contact with the environment.
- a recurrent neural network is a special type of neural network that provides for internal memory. Apart from the regular input neurons, output neurons and hidden neurons that exist in common feedforward multilayer neural networks, recurrent neural networks include a special type of neuron called a context neuron. Context neurons help the neural network to memorize its previous states and thus may model the associations that exist among these states.
- An illustrative embodiment of a recurrent neural network architecture that may be used in accordance with an exemplary embodiment of the present invention is shown in FIG. 2.
- the illustrative recurrent neural network comprises a one-hidden-layer, partially-connected recurrent network.
- the feedforward connections are modifiable while the recurrent connections are fixed.
- Input neurons 202 accept input signals from the environment and transmit output signals to hidden neurons 204.
- Hidden neurons 204 in turn transmit output signals to output neurons 208 and also to context neurons 206. Signals transmitted from hidden neurons 204 to context neurons 206 are referred to as feedback. Tanh and linear activation functions are employed for hidden neurons 204 and context neurons 206, respectively. The use of a tanh activation function in hidden neurons 204 introduces a nonlinear component to the system. A logistic function is used in the output neurons 208. In an exemplary embodiment, sixty hidden neurons 204 are used in the recurrent neural network. Generalization errors were estimated using the split-sample validation method.
Content Variables
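The architecture just described (tanh hidden layer, linear context layer that copies the previous hidden state over fixed recurrent connections, logistic output) can be sketched as a minimal Elman-style network. This is an illustrative reconstruction, not the patent's code; the class name, weight initialization and input values are hypothetical.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

class ElmanCodingSensor:
    """Sketch of an Elman-style recurrent network: hidden layer uses tanh,
    the context layer is a linear copy of the previous hidden state over
    fixed recurrent connections, and the output neuron is logistic."""

    def __init__(self, n_in, n_hidden=60, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))       # modifiable feedforward weights
        self.W_ctx = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # context -> hidden weights
        self.b_h = np.zeros(n_hidden)
        self.w_out = rng.normal(0.0, 0.1, n_hidden)
        self.b_out = 0.0
        self.context = np.zeros(n_hidden)                        # context neurons

    def step(self, x):
        h = np.tanh(self.W_in @ x + self.W_ctx @ self.context + self.b_h)
        self.context = h.copy()  # fixed one-to-one copy-back (the feedback)
        return logistic(self.w_out @ h + self.b_out)

net = ElmanCodingSensor(n_in=9)  # nine content variables per window
scores = [net.step(np.ones(9)) for _ in range(5)]  # one coding score per window
```

Because the context neurons carry the previous hidden state forward, the score at each window depends on the windows that preceded it, which is what allows the network to capture up-stream transition characteristics.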
- Content variables capture the statistical differences between coding and non-coding regions.
- a window of empirically selected size (for example, 43 base pairs)
- nine content variables were calculated at each nucleotide position.
- the 5' and 3' flanking regions of the sequence were treated the same as introns.
- Hexamer 1: Let the preference value of each hexamer be the logarithmic ratio of its normalized probabilities in exons versus introns in human genes.
- Hexamer 1 is defined as the sum of the preference values in the window, adjusted by the number of hexamers (W-6).
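A minimal sketch of the hexamer 1 computation follows. The function name and the toy preference table are hypothetical; the code normalizes by the count of overlapping hexamers actually present in the window, which the text denotes by (W-6).

```python
import math

def hexamer1(window, pref):
    """Sum of log-ratio preference values over all overlapping hexamers in
    the window, normalized by the number of hexamers. `pref` maps a hexamer
    to log(P_exon / P_intron); unseen hexamers contribute zero here."""
    hexes = [window[i:i + 6] for i in range(len(window) - 5)]
    total = sum(pref.get(h, 0.0) for h in hexes)
    return total / len(hexes)

# Toy preference table with a single entry (hypothetical value):
pref = {"ACGTAC": math.log(2.0)}
score = hexamer1("ACGTACGT", pref)  # 3 hexamers, one with nonzero preference
```

A positive score indicates that the window's hexamer composition is more exon-like than intron-like under the preference table.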
- In-frame hexamer 1: This variable is similar to hexamer 1 except that the observed hexamers in the sequence are compared with the preference values of in-frame hexamers in human exons. The total preference is computed three times for the window, once for each reading frame. The predicted reading frame is taken to be the one that provides the highest in-frame hexamer coding-versus-non-coding preference, and the variable is defined as the total preference for this frame adjusted by the number of hexamers, (W-6)/3.
- Hexamer 2 and in-frame hexamer 2: These two variables are similar to the previous two except that the probabilities F are now the frequencies of the hexamers in a random population based on the base composition of the sequence.
- Base composition: The C+G percentage is taken as the base composition variable.
- Fickett variable: Fickett (1982) developed an algorithm for predicting coding regions by considering several properties of coding sequences. In a given window, the 3-periodicity of each of the four bases is independently examined and compared to the periodic properties of coding DNA. The overall base composition of the sequence under investigation is also compared with the known composition for coding and non-coding DNA.
- Uneven position bias: First proposed by Staden, this variable measures the asymmetry of the base composition across the three codon positions.
- Let f(b,i) denote the frequency of base b at codon position i in the window. Define μ(b) = (Σ_i f(b,i))/3 and diff(b) = Σ_i |f(b,i) − μ(b)|.
- the uneven position bias variable is defined as (Σ_b diff(b))/W, where W is the width of the window.
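The definition above translates directly into code. This is an illustrative sketch (the function name and input layout are hypothetical): `counts[b]` holds the occurrences of base `b` at the three codon positions within the window.

```python
def uneven_position_bias(counts, W):
    """counts[b] is a 3-element list: occurrences of base b at codon
    positions 0, 1, 2 inside a window of width W. Implements
    diff(b) = sum_i |f(b,i) - mu(b)|, summed over bases and divided by W."""
    total = 0.0
    for b in "ACGT":
        f = counts[b]
        mu = sum(f) / 3.0                       # mu(b): mean over positions
        total += sum(abs(fi - mu) for fi in f)  # diff(b)
    return total / W

# Toy window of 12 bases with maximally uneven A/C/G and perfectly even T:
counts = {"A": [3, 0, 0], "C": [0, 3, 0], "G": [0, 0, 3], "T": [1, 1, 1]}
bias = uneven_position_bias(counts, 12)
```

An evenly distributed base contributes nothing, while a base concentrated at one codon position contributes strongly, which is exactly the asymmetry the variable measures.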
- Codon prototype: First proposed by Fickett and Tung (1992). Let f(b,i) be the probability of finding base b at position i in an actual codon and q(b,i) be the probability of finding nucleotide b at position i in a trinucleotide that is not a codon.
- the codon prototype variable is the sum over the window of the dot product of B and the codons of the window.
- Frame bias: Mural et al (1991) used the frame bias variable in their CRM module to recognize exons in DNA sequences. This variable is very similar to the codon prototype variable. Let f(b,i) be defined as in the uneven position bias variable. If a window codes for protein, one frame should have a significantly better correlation with the f(b,i) matrix than the other two possible reading frames. The correlation coefficient between f(b,i) and each reading frame is calculated, and the difference between the best and worst coefficient is taken as the frame bias variable.
- sequences encoding more than one gene; sequences having introns whose lengths were less than 5 bp; sequences having introns not starting with GT or not ending with AG; sequences with a CDS not starting with an ATG or not ending with a stop codon; sequences with CDS lengths not divisible by three.
- sequences corresponding to immunoglobulins and histocompatibility antigens were also discarded due to their ability to undergo complex DNA rearrangement.
- the final dataset consisted of 548 sequences. Each sequence encoded one and only one complete, spliceable, functional protein product in the forward strand. This set (dataset_A) contained 2,926,880 nucleotides, of which 597,720 were exon bases and 1,308,300 were intron bases.
- dataset_B was constructed from dataset_A for derivation and testing of the recurrent neural network by dropping the single-exon sequences (263 sequences). Since considerable evidence suggests that the human genome is heterogeneous with respect to C+G content (Burge 1997), the sequences in dataset_B were further divided into four groups according to the C+G composition of the sequences: I (<43% C+G); II (43-51%); III (51-57%); and IV (>57%). There were 45, 73, 67, and 79 sequences in groups I, II, III and IV, respectively.
- Each sequence (sequences longer than 15 kb were avoided) in dataset_B was assigned to one of three sets: the training, validation or test set.
- the resultant training set consisted of 15, 38, 36 and 43 sequences for groups I, II, III and IV respectively while the validation set contained 4, 8, 8, and 9 sequences respectively.
- the test set shown below in Table 1, contained 10, 25, 23 and 27 sequences in each group. Table 1. Test sets for the four groups
- each sequence in dataset_B was sampled using a sliding-window technique with a window size of 43 bp and sliding the window one nucleotide at a time.
- One-dimensional distributions of these variables were studied.
- the results for group IV are shown in FIG. 3.
- two features stand out. First, as one would hope, the distributions of nearly all variables are approximately normal. Secondly, there is significant overlap between the coding and non-coding classes for all variables, meaning that there is little information available to distinguish the two classes in one dimension. In particular, for variables such as codon prototype and C+G% content, the distribution of the coding class lies completely inside that of the non-coding class.
- the results for the other three groups demonstrate similar features.
- Bhattacharyya distances (B), showing the significance of each variable, were calculated under the equal-variance assumption for these variables for each group. This statistical distance is defined as:
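The equation itself did not survive extraction. Under the stated equal-variance assumption, the Bhattacharyya distance between two classes with means μ₁, μ₂ and common covariance Σ takes the standard form (a reconstruction consistent with the surrounding text, not the patent's exact notation):

```latex
B = \tfrac{1}{8}\,(\mu_1 - \mu_2)^{\mathsf{T}}\,\Sigma^{-1}\,(\mu_1 - \mu_2)
```

For a single variable with common variance σ² this reduces to B = (μ₁ − μ₂)² / (8σ²); the log-determinant term of the general Bhattacharyya distance vanishes when the two covariances are equal.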
- the discriminative information correlates with the C+G percentage: there is more information in the high C+G% groups than in the low C+G% group. Thus, the Bhattacharyya distance of 0.560 for group IV is higher than the distance of 0.493 for group I. This phenomenon may in part explain the observation that gene prediction programs tend to perform less well on A+T-rich sequences (e.g. Snyder and Stormo, 1995).
- the in-frame hexamer 1 is the most discriminative content variable in the high C+G% groups, consistent with the previous result (Fickett and Tung, 1992), but it is not in the low C+G% group, either individually or in the combined case.
- the in-frame hexamer 1 variable, in the case of group IV, contributes only one third of the total statistical distance.
- although the hexamer, codon prototype, frame bias and uneven position bias variables all depend on the positional base frequency information in the gene, they capture non-redundant statistical aspects of this information.
- Training of the exemplary recurrent neural network described above was performed in the following manner.
- the training set of related values of inputs and targets from a sequence is represented by {x(i), d(i)}, 1 ≤ i ≤ L, where L is the total sample size from the sequence.
- Training is done by adjusting the weights assigned to neurons of the neural network in order to minimize a cost function.
- the cost function used was the sum of squared errors augmented by a simple weight-decay regularization term
- the networks were trained by 200 epochs using the backpropagation method. During training, the networks were evaluated using the mean-squared error (MSE) defined as follows:
- NP is the total number of observations from the training set.
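Neither equation survived extraction. Standard forms consistent with the description (sum of squared errors augmented by a weight-decay term, and mean-squared error over NP observations) would be, as a reconstruction rather than the patent's exact notation:

```latex
E = \sum_{i=1}^{L} \bigl(d(i) - y(i)\bigr)^{2} \;+\; \lambda \sum_{j} w_{j}^{2},
\qquad
\mathrm{MSE} = \frac{1}{NP} \sum_{i=1}^{NP} \bigl(d(i) - y(i)\bigr)^{2}
```

Here y(i) is the network output for input x(i), w_j are the connection weights, and λ is the regularization constant; the patent does not state a value for λ.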
- the validation error was calculated similarly. MSE values of 0.084, 0.088, 0.093 and 0.099 were achieved for groups I, II, III and IV, respectively. The corresponding validation errors were 0.063, 0.050, 0.060 and 0.087, respectively.
- the recurrent neural network coding sensor was evaluated using the coding differential measure (Δ), first proposed by Burge (1997).
- the coding differential for each sequence in the test set was calculated. The result is shown in FIG. 4 along with the results from the inhomogeneous 3-periodic fifth-order Markov model.
- the following formula (Bayes theorem) was used in the calculations,
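The formula itself did not survive extraction. The standard Bayes-theorem posterior for a position being coding, given its content variables x, would be (a reconstruction, not the patent's exact notation):

```latex
P(\text{coding} \mid x) =
\frac{P(x \mid \text{coding})\, P(\text{coding})}
     {P(x \mid \text{coding})\, P(\text{coding}) \;+\; P(x \mid \text{non-coding})\, P(\text{non-coding})}
```

This is consistent with the later statement that the network output can be interpreted as the probability of a nucleotide position being a coding nucleotide.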
- the Δ_RNN mean values for the four C+G% groups were 2.088, 3.913, 5.700 and 6.166, while the corresponding Δ_MARKOV mean values were 0.047, 0.076, 0.097 and 0.105.
- the Δ_RNN value correlates significantly with the sequence C+G% content (statistical significance level P < 0.01). On average, high C+G% sequences have high Δ_RNN values.
- the output of the neural network for a certain nucleotide position can be interpreted as the probability of that nucleotide position being a coding nucleotide.
- the post-processing and prediction method of the present invention concatenates the outputs of one or more neural networks to provide an overall coding/non-coding arrangement of the DNA sequence.
- An exemplary post-processing and prediction method is described by the following steps:
- the output value sequence is smoothed by a 5-point median filter applied twice.
- the output sequence is scanned from left to right using a global threshold technique.
- the threshold value is decided empirically. During scanning, starting from the first position:
- the output for gene HUMPNMTA (accession J03280) from group IV is shown in FIG. 6, in which the curve represents the output of the neural network while the straight line represents the annotated gene arrangement.
- the dots represent the prediction locations.
- the probability 0.8 was used as the global threshold value, which roughly means that the probability of correctness of the predicted exons is 0.8.
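The smoothing-and-threshold pipeline described above can be sketched as follows. The detailed scanning steps are elided in the source, so this is a minimal reconstruction with hypothetical names: two passes of a 5-point median filter, then a left-to-right scan that emits runs of above-threshold positions as predicted coding regions.

```python
def median5(values):
    """One pass of a 5-point median filter (edge positions kept unchanged)."""
    out = list(values)
    for i in range(2, len(values) - 2):
        out[i] = sorted(values[i - 2:i + 3])[2]
    return out

def predict_regions(scores, threshold=0.8):
    """Smooth per-nucleotide coding scores with the 5-point median filter
    twice, then scan left to right with a global threshold, returning
    (start, end) runs of above-threshold positions (0-based, end-exclusive)."""
    smoothed = median5(median5(scores))
    regions, start = [], None
    for i, s in enumerate(smoothed):
        if s >= threshold and start is None:
            start = i                       # entering a predicted coding region
        elif s < threshold and start is not None:
            regions.append((start, i))      # leaving the region
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))
    return regions

scores = [0.1] * 5 + [0.9] * 10 + [0.1] * 5
regions = predict_regions(scores, threshold=0.8)  # one predicted region
```

The median filtering suppresses isolated noisy scores so that short spikes do not become spurious one-nucleotide "exons" during the threshold scan.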
- the identification of coding regions relies on the global threshold technique, so the predicted coding region type (initial/internal/terminal) cannot be known in advance. As a compromise, all predicted regions were treated as internal exons in this study. The shortcoming is that the initiation site and stop site locations cannot be located precisely.
- the second example gives some insight into how the exemplary method will behave in a real situation.
- the sequence HSNCAMX1 (accession Z29373)
- the neural network output is shown in Figure 6 and the text values are shown in Table 4.
- Columns 1 through 7 were output by the exemplary method.
- Columns 9 through 12 are from the GenBank annotation.
- Column 2 is the beginning position of the predicted exon.
- Column 3 is the ending position of the predicted exon.
- Column 4 is the length of the predicted exon.
- Column 5 is the coding sensor score of the coding portion of the exon.
- Column 6 is the score of the acceptor signal at the 5' end of the predicted exon.
- Column 7 is the score of the donor signal at the 3' end of the exon.
- Column 8 and Column 13 are provided here for illustrative purposes. The coding sensor score and the acceptor signal score of the first predicted exon were not calculated, nor were the coding sensor score and the donor signal score of the last predicted exon.
- the probability 0.8 was used as the global threshold value, as in the previous example. Sensitivity and specificity are both 0.84. The correlation coefficient is 0.79.
- the annotated HSNCAMX1 gene contains 28 coding exons, of which 14 were predicted exactly, eight were predicted partially, two were predicted by overlapping exons and four were missed completely. In addition, one wrong exon was predicted.
- the predicted exon which is wrong has an unusually weak acceptor signal score (weaker than any score for a true splice site in this gene) and a relatively weak coding sensor score.
- the splice signal and exon coding sensor scores may provide useful information about the reliability of the prediction.
- the most distinctive property of the four annotated exons which were missed is their small size (15, 132, 71 and 12, respectively).
- there were small peaks at levels of 0.20, 0.60 and 0.40 in the regions spanned by annotated exons 02, 09 and 17. Therefore, it may be possible to pick up these exons if a better assembly algorithm were used instead of the simple algorithm.
- Table 5 shows the nucleotide level accuracy for different C+G% compositional test groups along with the results from two of the most-widely used gene prediction programs for the test sets.
- Probabilities of 0.4, 0.6, 0.8 and 0.8 were used as the global threshold value for groups I, II, III and IV respectively.
- GeneID was assessed using the email service geneid@darwin.bu.edu, and the "-noexonblast" option was used to suppress the protein database search.
- the first ranked potential gene was used.
- SORFIND Version 2.8 (dated: July 9, 1996) was downloaded from website www.rabbithutch.com and the default parameter values were used in evaluation.
- the recurrent neural network is able to capture the information efficiently, as evidenced by its good performance in the high C+G% groups. In fact, the results are competitive with other more sophisticated systems at the nucleotide level, which probably implies that the recurrent neural network extracts coding information more efficiently than the coding-region subsystems in these leading systems.
- the performance decreases gradually as expected, due to the global threshold operation. This decrease is evident at the nucleotide level as well as at the exon level. At the nucleotide level, the correlation coefficient decreases from 0.66 to 0.40.
- in column 1, the number of sequences in each test set is given in the first parentheses, followed by the number of sequences for which no gene was predicted in the second parentheses.
- the Generalized hidden Markov model contains ten states and is similar in structure to the ones used in Genie and GENSCAN. All the parameters (state length distributions, Markov transition probabilities, and state initial probabilities) were estimated from the dataset_A (Lou 1997).
- the state sequence generating models for splice sites and initiation sites are WAM models.
- the sequence generating model for the coding/non-coding regions is the recurrent neural network model (converting the posterior probability to the sequence generating model using the Bayes theorem).
- the performance of the model (program GeneACT) was tested on the set of 570 vertebrate genes constructed by Burset and Guigo (1996). The results are shown in Table 8 and the comparisons with other systems are shown in Table 9, below.
- GeneACT is comparable with all leading systems. Although the sensitivity and specificity at the exon level are low, the missing exon percentage and wrong exon percentage are comparable with other systems. It should be noted that, because of the overlap between the training sets of all these systems and the Burset and Guigo dataset, truly objective comparisons of these systems are not obtainable and are probably not even meaningful.
- To increase the exon-level sensitivity and specificity, one obvious approach is to build more sophisticated splice site models (Burge and Karlin 1997). Another approach is to incorporate promoter, polyA, and other signals (such as signal peptide and CpG signals) into the generalized HMM model. It is anticipated that these two approaches will substantially improve the overall performance of the system. Once promoter and polyA signals are incorporated into the HMM model, further improvement may come from an RNN model that treats the 5' UTR, introns, 3' UTR, and intergenic regions differently.
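The kind of recurrent network discussed above can be sketched minimally as an Elman-style RNN that reads one-hot-encoded nucleotides and emits a coding probability for a window. This is an illustrative sketch only: the weights below are random (untrained), and the hidden size and architecture are assumptions, not the patent's actual network or training procedure.

```python
import math
import random

BASES = "ACGT"

def one_hot(base):
    """4-dimensional one-hot encoding of a nucleotide."""
    v = [0.0] * 4
    v[BASES.index(base)] = 1.0
    return v

class SimpleRNN:
    """Elman-style RNN: h_t = tanh(W_in x_t + W_h h_(t-1)), sigmoid readout."""

    def __init__(self, hidden=8, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.W_in = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(4)]
        self.W_h = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(hidden)]
        self.w_out = [rng.gauss(0, 0.1) for _ in range(hidden)]

    def coding_probability(self, seq):
        h = [0.0] * self.hidden
        for base in seq:
            x = one_hot(base)
            # new hidden state from current input and previous hidden state
            h = [
                math.tanh(
                    sum(x[i] * self.W_in[i][j] for i in range(4))
                    + sum(h[k] * self.W_h[k][j] for k in range(self.hidden))
                )
                for j in range(self.hidden)
            ]
        # sigmoid readout of the final hidden state
        z = sum(h[j] * self.w_out[j] for j in range(self.hidden))
        return 1.0 / (1.0 + math.exp(-z))

rnn = SimpleRNN()
p = rnn.coding_probability("ATGGCCATTGTAATG")
```

In a trained version, `p` would be pushed toward 1 on windows from coding regions and toward 0 elsewhere; here it is only a well-formed probability.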
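The nucleotide-level correlation coefficient cited above (the standard measure used by Burset and Guigo (1996)) can be computed from per-nucleotide confusion counts. The label vectors in the example are invented for illustration:

```python
def nucleotide_cc(true_labels, pred_labels):
    """Correlation coefficient over per-nucleotide coding (1) / non-coding (0) calls."""
    tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true_labels, pred_labels) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true_labels, pred_labels) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_labels, pred_labels) if t == 1 and p == 0)
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Perfect agreement gives CC = 1.0; one missed coding nucleotide lowers it.
perfect = nucleotide_cc([1, 1, 0, 0], [1, 1, 0, 0])
one_miss = nucleotide_cc([1, 1, 0, 0], [1, 0, 0, 0])
```

A drop such as the 0.66-to-0.40 decrease reported above reflects growing disagreement in these counts as the threshold is held fixed across sequence groups.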
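The role of the generalized HMM can be illustrated with a toy Viterbi decoder over explicit-duration states. This is a sketch only: the two states, flat duration score, and single-character emission probabilities below are invented for illustration and are far simpler than the ten-state model with estimated length distributions described above.

```python
import math

def ghmm_viterbi(x, states, log_init, log_trans, log_dur, log_emit, max_dur):
    """Best segmentation of x into (state, segment_length) pairs."""
    n = len(x)
    NEG = float("-inf")
    V = [{s: NEG for s in states} for _ in range(n + 1)]      # best log-prob of x[:j] ending in s
    back = [{s: None for s in states} for _ in range(n + 1)]  # (segment start, previous state)
    for j in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_dur, j) + 1):
                i = j - d
                seg = log_dur(s, d) + log_emit(s, x[i:j])
                if i == 0:
                    cand, prev = log_init[s] + seg, (0, None)
                else:
                    cand, prev = NEG, None
                    for sp in states:
                        if sp == s:
                            continue  # self-transitions are absorbed into durations
                        c = V[i][sp] + log_trans[sp][s] + seg
                        if c > cand:
                            cand, prev = c, (i, sp)
                if cand > V[j][s]:
                    V[j][s], back[j][s] = cand, prev
    # backtrack from the best final state
    s = max(states, key=lambda t: V[n][t])
    segments, j = [], n
    while j > 0:
        i, sp = back[j][s]
        segments.append((s, j - i))
        j = i
        if sp is not None:
            s = sp
    return list(reversed(segments))

# Toy two-state model: state "C" prefers character 'C', state "N" prefers 'a'.
states = ["N", "C"]
log_init = {"N": math.log(0.5), "C": math.log(0.5)}
log_trans = {"N": {"C": 0.0}, "C": {"N": 0.0}}

def log_dur(s, d):
    return 0.0  # flat duration score; real models use empirical length distributions

def log_emit(s, segment):
    good = "C" if s == "C" else "a"
    return sum(math.log(0.7 if ch == good else 0.1) for ch in segment)

parse = ghmm_viterbi("aaCCaa", states, log_init, log_trans, log_dur, log_emit, max_dur=6)
```

The decoder recovers the obvious segmentation N(2), C(2), N(2); in the real model, `log_emit` for the coding state would come from the recurrent network's converted likelihood.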
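A weight array model of the sort mentioned above scores a fixed-length signal window with position-specific first-order Markov probabilities, estimated by counting dinucleotides at each position across aligned true-site windows. The training windows below are invented for illustration, not taken from the patent's data:

```python
from collections import defaultdict

def train_wam(site_windows):
    """Estimate P(base at position i | base at position i-1) for each i >= 1."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in site_windows:
        for i in range(1, len(w)):
            counts[(i, w[i - 1])][w[i]] += 1
    probs = {}
    for (i, prev), nxt in counts.items():
        total = sum(nxt.values())
        for base, c in nxt.items():
            probs[(i, prev, base)] = c / total
    return probs

def wam_score(window, probs, floor=1e-6):
    """Product of the position-specific conditional probabilities over the window."""
    score = 1.0
    for i in range(1, len(window)):
        score *= probs.get((i, window[i - 1], window[i]), floor)
    return score

# Toy donor-site-like training windows (illustrative only)
sites = ["AGGTAAGT", "AGGTGAGT", "AGGTAAGT"]
probs = train_wam(sites)
```

With these three windows, the consensus `"AGGTAAGT"` scores 2/3 while an unrelated window such as `"CCCCCCCC"` falls to the floor, which is the discrimination a splice-site WAM provides.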
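The Bayes-theorem conversion mentioned above can be made concrete. A discriminative network outputs a posterior p = P(coding | x), while a generative HMM state needs a sequence likelihood; by Bayes' theorem the usable ratio is P(x | coding) / P(x | non-coding) = [p / (1 - p)] * [P(non-coding) / P(coding)], the unknown factor P(x) cancelling. The prior value below is an illustrative assumption, not a figure from the patent:

```python
import math

def log_likelihood_ratio(posterior, prior_coding=0.1):
    """log [P(x | coding) / P(x | non-coding)] recovered from a posterior P(coding | x)."""
    prior_odds = prior_coding / (1.0 - prior_coding)
    post_odds = posterior / (1.0 - posterior)
    return math.log(post_odds / prior_odds)

# A posterior equal to the prior carries no evidence either way
neutral = log_likelihood_ratio(0.1, prior_coding=0.1)
# A posterior above the prior favours the coding state
positive = log_likelihood_ratio(0.5, prior_coding=0.1)
```

This log-ratio is exactly the quantity a Viterbi-style decoder can add to its segment scores in place of a directly estimated emission model.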
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biotechnology (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Organic Chemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Analytical Chemistry (AREA)
- Data Mining & Analysis (AREA)
- Zoology (AREA)
- Genetics & Genomics (AREA)
- Wood Science & Technology (AREA)
- General Engineering & Computer Science (AREA)
- Biochemistry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Microbiology (AREA)
- Molecular Biology (AREA)
- Bioethics (AREA)
- Plant Pathology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Immunology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Description
Claims
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU46917/99A AU4691799A (en) | 1998-06-17 | 1999-06-17 | Recognition of protein coding regions in genomic dna sequences |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US8968098P | 1998-06-17 | 1998-06-17 | |
US60/089,680 | 1998-06-17 |
Publications (3)
Publication Number | Publication Date |
---|---|
WO1999066302A2 true WO1999066302A2 (en) | 1999-12-23 |
WO1999066302A3 WO1999066302A3 (en) | 2000-06-22 |
WO1999066302A9 WO1999066302A9 (en) | 2000-07-27 |
Family
ID=22219015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1999/013705 WO1999066302A2 (en) | 1998-06-17 | 1999-06-17 | Recognition of protein coding regions in genomic dna sequences |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU4691799A (en) |
WO (1) | WO1999066302A2 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001095230A2 (en) * | 2000-06-08 | 2001-12-13 | Virco Bvba | Method for predicting therapeutic agent resistance using neural networks |
US7158889B2 (en) | 2002-12-20 | 2007-01-02 | International Business Machines Corporation | Gene finding using ordered sets |
CN111370055A (en) * | 2020-03-05 | 2020-07-03 | 中南大学 | Intron retention prediction model establishing method and prediction method thereof |
US10957421B2 (en) | 2014-12-03 | 2021-03-23 | Syracuse University | System and method for inter-species DNA mixture interpretation |
CN113808671A (en) * | 2021-08-30 | 2021-12-17 | 西安理工大学 | Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning |
CN117745704A (en) * | 2023-09-27 | 2024-03-22 | 深圳泰康医疗设备有限公司 | Vertebral region segmentation system for osteoporosis recognition |
JP7583153B2 (ja) | 2020-08-21 | 2024-11-13 | Regeneron Pharmaceuticals, Inc. | Methods and systems for sequence generation and prediction |
1999
- 1999-06-17 AU AU46917/99A patent/AU4691799A/en not_active Abandoned
- 1999-06-17 WO PCT/US1999/013705 patent/WO1999066302A2/en active Application Filing
Non-Patent Citations (3)
Title |
---|
SNYDER ET AL.: 'Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks' NUCLEIC ACIDS RESEARCH, vol. 21, no. 3, 1993, pages 607 - 613, XP002925273 * |
SNYDER ET AL.: 'Identification of Protein Coding Regions in Genomic DNA' JOURNAL OF MOLECULAR BIOLOGY, vol. 248, 1995, pages 1 - 18, XP002925271 * |
UBERBACHER ET AL.: 'Locating protein-encoding regions in human DNA sequence by a multiple sensor-neural network approach' PROC. NATL. ACAD. SCI. USA, vol. 88, December 1991, pages 11261 - 11265, XP002925272 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001095230A2 (en) * | 2000-06-08 | 2001-12-13 | Virco Bvba | Method for predicting therapeutic agent resistance using neural networks |
WO2001095230A3 (en) * | 2000-06-08 | 2003-08-21 | Virco Bvba | Method for predicting therapeutic agent resistance using neural networks |
US7158889B2 (en) | 2002-12-20 | 2007-01-02 | International Business Machines Corporation | Gene finding using ordered sets |
US8738299B2 (en) | 2002-12-20 | 2014-05-27 | International Business Machines Corporation | Gene finding using ordered sets of distinct marker strings |
US10957421B2 (en) | 2014-12-03 | 2021-03-23 | Syracuse University | System and method for inter-species DNA mixture interpretation |
CN111370055A (en) * | 2020-03-05 | 2020-07-03 | 中南大学 | Intron retention prediction model establishing method and prediction method thereof |
CN111370055B (en) * | 2020-03-05 | 2023-05-23 | 中南大学 | Intron retention prediction model establishment method and prediction method thereof |
JP7583153B2 (ja) | 2020-08-21 | 2024-11-13 | Regeneron Pharmaceuticals, Inc. | Methods and systems for sequence generation and prediction |
CN113808671A (en) * | 2021-08-30 | 2021-12-17 | 西安理工大学 | Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning |
CN113808671B (en) * | 2021-08-30 | 2024-02-06 | 西安理工大学 | Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning |
CN117745704A (en) * | 2023-09-27 | 2024-03-22 | 深圳泰康医疗设备有限公司 | Vertebral region segmentation system for osteoporosis recognition |
Also Published As
Publication number | Publication date |
---|---|
AU4691799A (en) | 2000-01-05 |
WO1999066302A9 (en) | 2000-07-27 |
WO1999066302A3 (en) | 2000-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102433458B1 (en) | Semi-supervised learning for training an ensemble of deep convolutional neural networks | |
Uberbacher et al. | [16] Discovering and understanding genes in human DNA sequence using GRAIL | |
Sonnhammer et al. | A hidden Markov model for predicting transmembrane helices in protein sequences. | |
US20030077586A1 (en) | Method and apparatus for combining gene predictions using bayesian networks | |
Hatzigeorgiou | Translation initiation start prediction in human cDNAs with high accuracy | |
NZ759659A (en) | Deep learning-based variant classifier | |
EP2320343A2 (en) | System and process for validating, aligning and reordering one or more genetic sequence maps using at least one ordered restriction map | |
Choo et al. | Recent applications of hidden Markov models in computational biology | |
EP4254419A1 (en) | Artificial-intelligence-based cancer diagnosis and cancer type prediction method | |
Azad et al. | Probabilistic methods of identifying genes in prokaryotic genomes: connections to the HMM theory | |
EP4182928B1 (en) | Method, system and computer program product for determining presentation likelihoods of neoantigens | |
CN111180013A (en) | Device for detecting blood disease fusion gene | |
WO1999066302A2 (en) | Recognition of protein coding regions in genomic dna sequences | |
Reese | Computational prediction of gene structure and regulation in the genome of Drosophila melanogaster | |
Yi et al. | Learning from data-rich problems: a case study on genetic variant calling | |
Kashiwabara et al. | Splice site prediction using stochastic regular grammars | |
US20240185953A1 (en) | Systems and methods for high-throughput predictions | |
Van Haeverbeke | DETECTION OF M6A MODIFICATIONS IN NATIVE RNA USING OXFORD NANOPORE TECHNOLOGY | |
US20240112751A1 (en) | Copy number variation (cnv) breakpoint detection | |
Sidi et al. | Predicting gene sequences with AI to study codon usage patterns | |
Gunady | Applications of Graph Segmentation Algorithms for Quantitative Genomic Analyses | |
Elst | RECOGNIZING IRREGULARITIES IN STACKED NANOPORE SIGNALS FROM IN SILICO PERMUTED SEQUENCING DATA | |
Tenney | Basecalling for Traces Derived for Multiple Templates | |
Uberbacher et al. | DNA sequence pattern recognition methods in GRAIL | |
Rose et al. | Mutual Information Measure for Distinguishing Coding and Non-Coding DNA Sequences. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AU CA JP US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
AK | Designated states |
Kind code of ref document: A3 Designated state(s): AU CA JP US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
AK | Designated states |
Kind code of ref document: C2 Designated state(s): AU CA JP US |
|
AL | Designated countries for regional patents |
Kind code of ref document: C2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
COP | Corrected version of pamphlet |
Free format text: PAGES 1/7-7/7, DRAWINGS, REPLACED BY NEW PAGES 1/17-17/17; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE
|
WWE | Wipo information: entry into national phase |
Ref document number: 09719887 Country of ref document: US |
|
122 | Ep: pct application non-entry in european phase |