
WO1999066302A2 - Recognition of protein coding regions in genomic DNA sequences


Info

Publication number
WO1999066302A2
Authority
WO
WIPO (PCT)
Prior art keywords
coding
sequence
neighboring
stream
neural network
Prior art date
Application number
PCT/US1999/013705
Other languages
French (fr)
Other versions
WO1999066302A9 (en)
WO1999066302A3 (en)
Inventor
Yuandan Lou
Zhen Zhang
Original Assignee
Musc Foundation For Research Development
Priority date
Filing date
Publication date
Application filed by Musc Foundation For Research Development
Priority to AU46917/99A
Publication of WO1999066302A2
Publication of WO1999066302A3
Publication of WO1999066302A9

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • Each sequence in dataset_B was sampled using a sliding-window technique, with a window size of 43 bp and the window advanced one nucleotide at a time.
  • One-dimensional distributions of these variables were studied. The results for group IV are shown in FIG. 3. Two features stand out. First, as one would hope, the distributions of nearly all variables are approximately normal. Second, there is significant overlap between the coding and non-coding classes for all variables, meaning that there is little information available to distinguish the two classes in any one dimension. For variables such as codon prototype and C+G% content in particular, the distribution of the coding class lies entirely within that of the non-coding class. The results for the other three groups show similar features.
  • Bhattacharyya distances (B), indicating the discriminative significance of each variable, were calculated under the equal-variance assumption for these variables for each group. Under that assumption this statistical distance is defined as B = (1/8)(μ1 − μ2)' Σ⁻¹ (μ1 − μ2), where μ1 and μ2 are the class mean vectors and Σ is the common covariance; for a single variable this reduces to (μ1 − μ2)²/(8σ²).
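The following sketch computes this distance from two samples under the equal-variance (pooled covariance) assumption; the synthetic data and dimensionality are illustrative only, not the actual feature values of the embodiment.

```python
import numpy as np

def bhattacharyya_equal_cov(X1, X2):
    """Bhattacharyya distance between two samples under a pooled covariance."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Equal-variance assumption: pool the two sample covariances.
    cov = (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)) / 2.0
    d = mu1 - mu2
    return float(d @ np.linalg.solve(np.atleast_2d(cov), d)) / 8.0

# Synthetic "coding" vs. "non-coding" nine-variable vectors (illustrative).
rng = np.random.default_rng(2)
coding = rng.normal(0.3, 1.0, size=(500, 9))
noncoding = rng.normal(0.0, 1.0, size=(500, 9))
B = bhattacharyya_equal_cov(coding, noncoding)
```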
  • The discriminative information correlates with the C+G percentage: there is more information in the high C+G% groups than in the low C+G% group. Thus, the Bhattacharyya distance of 0.560 for group IV is higher than the distance of 0.493 for group I. This phenomenon may in part explain the observation that gene prediction programs tend to perform less well on A+T-rich sequences (e.g. Snyder and Stormo, 1995).
  • The in-frame hexamer 1 is the most discriminative content variable in the high C+G% groups, consistent with the previous result of Fickett and Tung (1992), but it is not in the low C+G% group, either individually or in the combined case. Even in group IV, the in-frame hexamer 1 variable contributes only one third of the total statistical distance. Although the hexamer, codon prototype, frame bias and uneven position bias variables all depend on the positional base frequency information in the gene, they evidently capture non-redundant statistical aspects of this information.
  • Training of the exemplary recurrent neural network described above was performed in the following manner. The training set of related input and target values from a sequence is represented by {x(i), d(i)}, 1 ≤ i ≤ L, where L is the total sample size from the sequence. Training is done by adjusting the weights assigned to the neurons of the neural network in order to minimize a cost function. The cost function used was the sum of squared errors augmented by a simple weight-decay regularization term.
  • The networks were trained for 200 epochs using the backpropagation method. During training, the networks were evaluated using the mean-squared error, MSE = (1/NP) ∑ (d(i) − y(i))², where NP is the total number of observations from the training set and y(i) is the network output. The validation error was calculated similarly. MSE values of 0.084, 0.088, 0.093 and 0.099 were achieved for groups I, II, III and IV, respectively. The corresponding validation errors were 0.063, 0.050, 0.060 and 0.087, respectively.
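As a minimal sketch of the criteria just described, the following functions implement the sum-of-squared-errors cost with a weight-decay term and the MSE evaluation; the decay coefficient and the toy data are illustrative assumptions, as the patent does not state their values.

```python
import numpy as np

def cost(outputs, targets, weights, decay=1e-4):
    """Sum of squared errors plus a simple weight-decay regularization term.

    `decay` is an illustrative value; the patent does not specify it.
    """
    sse = np.sum((targets - outputs) ** 2)
    penalty = decay * sum(np.sum(w ** 2) for w in weights)
    return sse + penalty

def mse(outputs, targets):
    """Mean-squared error over the NP observations of the training set."""
    return np.mean((targets - outputs) ** 2)

y = np.array([0.2, 0.9, 0.7])      # network outputs
d = np.array([0.0, 1.0, 1.0])      # targets
J = cost(y, d, weights=[np.ones((2, 2))])
e = mse(y, d)
```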
  • The recurrent neural network coding sensor was evaluated using the coding differential measure first proposed by Burge (1997). The coding differential for each sequence in the test set was calculated; the results are shown in FIG. 4 along with the results from the inhomogeneous 3-periodic fifth-order Markov model. The following formula (Bayes' theorem) was used in the calculations: P(coding | x) = P(x | coding)·P(coding) / P(x). The mean RNN coding differential values for the four C+G% groups were 2.088, 3.913, 5.700 and 6.166, while the corresponding Markov model values were 0.047, 0.076, 0.097 and 0.105. The RNN coding differential correlates significantly with the sequence C+G% content (statistical significance level P < 0.01); on average, high C+G% sequences have high values.
  • The output of the neural network for a given nucleotide position can be interpreted as the probability of that nucleotide position being a coding nucleotide. The post-processing and prediction method of the present invention concatenates the outputs of one or more neural networks to provide an overall coding/non-coding arrangement of the DNA sequence.
  • An exemplary post-processing and prediction method is described by the following steps. The output value sequence is first smoothed twice by a 5-point median filter. The smoothed output sequence is then scanned from left to right using a global threshold technique; the threshold value is decided empirically. During scanning, starting from the first position, stretches of consecutive positions whose smoothed values exceed the threshold are identified as predicted coding regions (sketched below).
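A minimal sketch of these post-processing steps, assuming scipy is available; the interval-extraction loop is an illustrative reading of the global threshold scan, and the default threshold of 0.8 matches the value used in the example below.

```python
import numpy as np
from scipy.signal import medfilt

def predict_regions(sensor_values, threshold=0.8):
    """Smooth per-nucleotide coding sensor values and extract coding regions."""
    smoothed = medfilt(medfilt(sensor_values, 5), 5)  # 5-point median filter, twice
    regions, start = [], None
    # Scan left to right: open a region when the smoothed value rises above
    # the global threshold, close it when the value falls back below.
    for i, v in enumerate(smoothed):
        if v >= threshold and start is None:
            start = i
        elif v < threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(smoothed) - 1))
    return regions

vals = np.concatenate([np.full(50, 0.1), np.full(120, 0.9), np.full(60, 0.2)])
exons = predict_regions(vals)      # [(50, 169)]
```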
  • The output for gene HUMPNMTA (accession J03280) from group IV is shown in FIG. 6, in which the curve represents the output of the neural network, the straight line represents the annotated gene arrangement, and the dots represent the predicted locations. The probability 0.8 was used as the global threshold value, which roughly means that the probability of correctness of the predicted exons is 0.8.
  • The identification of coding regions relies on the global threshold technique, so the predicted coding region type (initial/internal/terminal) cannot be known in advance. As a compromise, all predicted regions were treated as internal exons in this study. The shortcoming is that the initiation site and stop site locations cannot be located precisely.
  • The second example gives some insight into how the exemplary method behaves in a real situation. For the sequence HSNCAMX1 (accession Z29373), the neural network output is shown in FIG. 6 and the text values are shown in Table 4.
  • Columns 1 through 7 were output by the exemplary method; columns 9 through 12 are from the GenBank annotation. Column 2 is the beginning position of the predicted exon. Column 3 is the ending position of the predicted exon. Column 4 is the length of the predicted exon. Column 5 is the coding sensor score of the coding portion of the exon. Column 6 is the score of the acceptor signal at the 5' end of the predicted exon. Column 7 is the score of the donor signal at the 3' end of the exon. Columns 8 and 13 are provided for illustrative purposes. The coding sensor score and the acceptor signal score of the first predicted exon were not calculated, nor were the coding sensor score and the donor signal score of the last predicted exon.
  • The probability 0.8 was used as the global threshold value, as in the previous example. Sensitivity and specificity are both 0.84, and the correlation coefficient is 0.79.
  • The annotated HSNCAMX1 gene contains 28 coding exons, of which 14 were predicted exactly, eight were predicted partially, two were predicted by overlapping exons and four were missed completely. In addition, one wrong exon was predicted. The wrongly predicted exon has an unusually weak acceptor signal score (weaker than any score for a true splice site in this gene) and a relatively weak coding sensor score. Thus, the splice signal and exon coding sensor scores may provide useful information about the reliability of the prediction.
  • The most distinctive property of the four annotated exons that were missed is their small size (15, 132, 71 and 12 bp, respectively). There were small peaks at levels 0.20, 0.60 and 0.40 in the regions spanned by the annotated exons 02, 09 and 17; therefore, it may be possible to pick up these exons if a better assembly algorithm were used instead of the simple algorithm.
  • Table 5 shows the nucleotide-level accuracy for the different C+G% compositional test groups, along with the results from two of the most widely used gene prediction programs on the same test sets. Probabilities of 0.4, 0.6, 0.8 and 0.8 were used as the global threshold values for groups I, II, III and IV, respectively. GeneID was assessed using the e-mail service geneid@darwin.bu.edu, with the "-noexonblast" option used to suppress the protein database search; the first-ranked potential gene was used. SORFIND Version 2.8 (dated July 9, 1996) was downloaded from www.rabbithutch.com and the default parameter values were used in the evaluation.
  • The recurrent neural network is able to capture the information efficiently, as evidenced by its good performance in the high C+G% groups. In fact, the results are competitive with other, more sophisticated systems at the nucleotide level, which probably implies that the recurrent neural network extracts coding information more efficiently than the coding-region subsystems in these leading systems. Moving toward the lower C+G% groups, the performance decreases gradually, as expected, due to the global threshold operation. This decrease is evident at the nucleotide level as well as at the exon level; at the nucleotide level, the correlation coefficient decreases from 0.66 to 0.40.
  • In column 1, the number of sequences in each test set is given in the first parentheses, followed by the number of sequences for which no gene was predicted in the second parentheses.
  • The generalized hidden Markov model contains ten states and is similar in structure to the ones used in Genie and GENSCAN. All of the parameters (state length distributions, Markov transition probabilities, and state initial probabilities) were estimated from dataset_A (Lou 1997). The state sequence-generating models for splice sites and initiation sites are weight array model (WAM) models. The sequence-generating model for the coding/non-coding regions is the recurrent neural network model (the network's posterior probability is converted to a sequence-generating probability using Bayes' theorem).
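The parenthetical conversion can be sketched as follows: by Bayes' theorem, the likelihood ratio P(x | coding)/P(x | non-coding) needed by the generalized HMM equals the network's posterior odds divided by the prior odds. The prior value below is an illustrative assumption.

```python
def posterior_to_likelihood_ratio(p_coding_given_x, prior_coding=0.2):
    """Convert the network posterior P(coding | x) into the likelihood ratio
    P(x | coding) / P(x | non-coding) via Bayes' theorem:
    P(x|C)/P(x|N) = [P(C|x)/P(N|x)] * [P(N)/P(C)].  The prior is illustrative.
    """
    posterior_odds = p_coding_given_x / (1.0 - p_coding_given_x)
    prior_odds = prior_coding / (1.0 - prior_coding)
    return posterior_odds / prior_odds

lr = posterior_to_likelihood_ratio(0.9)   # posterior 0.9 -> likelihood ratio 36
```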
  • The performance of the model (the program GeneACT) was tested on the set of 570 vertebrate genes constructed by Burset and Guigo (1996). The results are shown in Table 8 and comparisons with other systems are shown in Table 9, below. GeneACT is comparable with all leading systems. Although the sensitivity and specificity at the exon level are low, the missing-exon percentage and wrong-exon percentage are comparable with other systems. It should be noted that, because of the overlap between the training sets of all these systems and the Burset and Guigo dataset, truly objective comparisons of these systems are not obtainable and probably not even meaningful.
  • To increase the exon-level sensitivity and specificity, one obvious approach is to build more sophisticated splice site models (Burge and Karlin 1997). Another approach is to incorporate promoter, polyA and other signals (such as signal peptide and CpG signals) into the generalized HMM model. It is anticipated that by using these two approaches the overall performance of the system will be substantially improved. After the incorporation of promoter and polyA signals into the HMM model, further improvement may come from an RNN model that treats the 5' UTR, introns, 3' UTR and intergenic regions differently.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Analytical Chemistry (AREA)
  • Data Mining & Analysis (AREA)
  • Zoology (AREA)
  • Genetics & Genomics (AREA)
  • Wood Science & Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biochemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Plant Pathology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Immunology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A coding sensor using a recurrent neural network technique is provided. The coding sensor indicates the coding potential of a gene sequence and plays a vital role in the overall prediction of the gene structure. The recognition of the potential coding regions in a DNA sequence may be achieved by determining whether each individual nucleotide position in a nucleotide chain is in a coding region. Determining whether an individual nucleotide position is in a coding region may be accomplished through a systematic sampling process carried out along the nucleotide chain from start to end. The content variables of neighboring nucleotide positions are processed using a trained recurrent neural network in order to provide a coding sensor value. In this way, transition characteristics may be used to assist the coding sensor in determining whether a nucleotide position is in a coding region. The coding sensor value represents a prediction of whether or not the nucleotide position is in a coding region. Coding sensor values for each nucleotide position in the DNA sequence are aligned with the overall DNA sequence to generate a coding/non-coding picture of the DNA sequence.

Description

Recognition of Protein Coding Regions in Genomic DNA Sequences
Background of the Invention
In recent years, the human genome project has generated an enormous amount of DNA and protein sequence data. It is well recognized that efficient and reliable computer predictions, coupled with experimental verification, can greatly speed up the identification and mapping of complex genes, especially in large-scale genomic sequencing projects. To accommodate this need, several computer programs using various techniques have been developed to predict the complete exon-intron structure of genes in large unannotated sequences. GeneModeler (Fields and Soderlund 1990), SORFIND (Hutchinson and Hayden 1992), and HEXON (Solovyev et al 1994) used heuristic methods to find potential genes in raw sequences. GRAIL (Uberbacher and Mural 1991, Xu et al 1994), GeneID (Guigo et al 1992), GeneParser (Snyder and Stormo 1993, 1995) and GenLang (Dong and Searls 1994) are examples of a machine learning approach. Genie (Kulp et al 1996), VEIL (Henderson et al 1997) and GENSCAN (Burge and Karlin 1997) used hidden Markov models to model the human gene structure. Since the performance of these programs is still not satisfactory (see the review in Burset and Guigo 1996), the development of new methods and the improvement of existing methods continue to be important objectives.
A sequence of nucleotides within a DNA sequence may have associated therewith several variables, referred to as "content variables," that are thought to be useful for discriminating between coding regions and non-coding regions. Prior art computer-implemented models for extracting information from content variables in order to identify coding/non-coding regions include classic linear discriminant methods (Solovyev et al 1994) and feedforward neural networks (Snyder and Stormo 1993, 1995, Guigo et al 1992, Xu et al 1994). Feedforward neural networks benefit from the fact that they may be trained using gradient descent optimization algorithms such as the backpropagation algorithm. However, when employing neural networks to solve problems involving nonlinear dynamical or state-dependent systems, neural networks with feedback may provide significant advantages over purely feedforward networks. Feedback provides recursive computation and the ability to represent state information. In some cases, a neural network with feedback may be equivalent to a much larger feedforward neural network. Neural networks with feedback are generally referred to as recurrent neural networks.
In general, the use of recurrent neural networks has not been nearly as extensive as that of feedforward neural networks. A primary reason for the under-utilization of recurrent neural networks is the difficulty involved in developing generally applicable learning algorithms for them. Because the gradient of the error with respect to the connection strengths is not easily solvable for recurrent neural networks, gradient-based optimization algorithms are not always applicable. As a result, the benefits of recurrent neural networks over purely feedforward neural networks have not been exploited with regard to extracting information from content variables of nucleotide sequences in order to identify coding/non-coding regions.
Thus, there remains a need for a new approach based on a recurrent neural network to extract information from content variables of nucleotide sequences in order to identify coding/non-coding regions.
Summary of the Invention
The present invention provides a coding sensor that utilizes a recurrent neural network model. The coding sensor indicates the coding potential of a gene sequence and plays a vital role in the overall prediction of the gene structure. A DNA sequence may be imagined as a discrete chain of nucleotides. The recognition of potential coding regions in a DNA sequence may be achieved by determining whether each individual nucleotide position in the sequence is in a coding region.
Determining whether an individual nucleotide position is in a coding region may be accomplished through a systematic sampling process carried out along the nucleotide chain from start to end. At each nucleotide position, content variables are calculated based on a window centered on the nucleotide position. As mentioned, content variables are thought to be useful for discriminating between coding regions and non-coding regions. The present invention combines the calculated content variables in a specific way in order to provide an overall "coding sensor value." The coding sensor value indicates whether or not the nucleotide position is in a coding region. Coding sensor values for each nucleotide position in the DNA sequence are aligned with the overall DNA sequence to generate a coding/non-coding picture of the DNA sequence.
It is assumed that neighboring segments of a particular DNA sequence have similar characteristics. For example, coding nucleotides (i.e. nucleotides in a coding region) are likely to have neighboring coding nucleotides. Identifying "transition characteristics" between neighboring segments of a DNA sequence may provide additional information that is useful for detecting coding regions. In other words, detecting whether a particular nucleotide position is in a coding or non-coding region may depend not only on information determined from its own content variables but also on information determined from the content variables of nearby nucleotides.
The invention provides a novel method for using a recurrent neural network to determine up-stream and down-stream transition characteristics between nucleotide chains in a DNA sequence. Transition characteristics may be used to assist the coding sensor of the present invention in finding potential protein coding regions in unannotated genomic DNA sequences.
Brief Description of the Drawings
FIG. 1, comprising FIG. 1A and FIG. 1B, shows functional block diagrams of a suitable computing environment for implementing the present invention.
FIG. 2 shows an illustrative recurrent neural network architecture in accordance with an exemplary embodiment of the present invention.
FIG. 3, comprising FIG. 3A, FIG. 3B, FIG. 3C, FIG. 3D, FIG. 3E, FIG. 3F, FIG. 3G, FIG. 3H and FIG. 3I, shows one-dimensional distributions of nine content variables in accordance with an exemplary embodiment of the present invention.
FIG. 4, comprising FIG. 4A and FIG. 4B, shows coding differentials of an exemplary data test set.
FIG. 5, comprising FIG. 5A and FIG. 5B, shows coding differentials of an exemplary Burset/Guigo data set.
FIG. 6 illustrates exemplary results obtained by operation of an exemplary embodiment of the present invention.
Detailed Description of the Invention
FIG. 1, comprising FIG. 1A and FIG. 1B, and the following discussion are intended to provide a brief and general description of a suitable computing environment for implementing the present invention. As is well-known in the art, neural networks are implemented in a computer environment. Although the system shown in FIG. 1A represents a conventional personal computer 100, those skilled in the art will recognize that the invention also may be implemented using other types of processor-based systems. The computer 100 includes a processor 122, a system memory 120, and an Input/Output ("I/O") bus 126. A system bus 121 couples the central processing unit 122 to the system memory 120. A bus controller 123 controls the flow of data on the I/O bus 126 and between the central processing unit 122 and a variety of internal and external I/O devices. The I/O devices connected to the I/O bus 126 may have direct access to the system memory 120 using a Direct Memory Access ("DMA") controller 124.
The I/O devices are connected to the I/O bus 126 via a set of device interfaces. The device interfaces may include both hardware components and software components. For instance, a hard disk drive 130 and a floppy disk drive 132 for reading or writing removable media 150 may be connected to the I/O bus 126 through disk drive controllers 140. An optical disk drive 134 for reading or writing optical media 152 may be connected to the I/O bus 126 using a Small Computer System Interface ("SCSI") 141. Alternatively, an IDE (ATAPI) or EIDE interface may be associated with an optical drive, as may be the case with a CD-ROM drive. The drives and their associated computer-readable media provide nonvolatile storage for the computer 100. In addition to the computer-readable media described above, other types of computer-readable media may also be used, such as ZIP drives, or the like.
A display device 153, such as a monitor, is connected to the I/O bus 126 via another interface, such as a video adapter 142. A parallel interface 143 connects synchronous peripheral devices, such as a laser printer 156, to the I/O bus 126. A serial interface 144 connects communication devices to the I/O bus 126. A user may enter commands and information into the computer 100 via the serial interface 144 or by using an input device, such as a keyboard 138, a mouse 136 or a modem 157. Other peripheral devices (not shown) may also be connected to the computer 100, such as audio input output devices or image capture devices.
A number of program modules may be stored on the drives and in the system memory 120. The system memory 120 can include both Random Access Memory ("RAM") and Read Only Memory ("ROM"). The program modules control how the computer 100 functions and interacts with the user, with I/O devices or with other computers. Program modules include routines, operating systems 165, application programs, data structures, and other software or firmware components. In an illustrative embodiment, the present invention may comprise one or more coding sensor program modules 170 stored on the drives or in the system memory 120 of the computer 100. Coding sensor modules 170 may comprise one or more content variable calculation program modules 170A, one or more recurrent neural network program modules 170B, and one or more post-processing and prediction program modules 170C. Coding sensor program module(s) 170 may thus comprise computer-executable instructions for calculating content variables, analyzing content variables with a recurrent neural network model, and post-processing the output of the neural network model in order to predict whether a nucleotide position is in a coding region, according to exemplary methods to be described herein.
The computer 100 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer 160. The remote computer 160 may be a server, a router, a peer device or other common network node, and typically includes many or all of the elements described in connection with the computer 100. In a networked environment, program modules and data may be stored on the remote computer 160. The logical connections depicted in FIG. 1 include a local area network ("LAN") 154 and a wide area network ("WAN") 155. In a LAN environment, a network interface 145, such as an Ethernet adapter card, can be used to connect the computer 100 to the remote computer 160. In a WAN environment, the computer 100 may use a telecommunications device, such as a modem 157, to establish a connection. It will be appreciated that the network connections shown are illustrative and other means of establishing a communications link between the computers may be used.
FIG. 1B provides a graphical demonstration of the processing performed by the exemplary coding sensor program module 170. As shown, a DNA sequence 180 is sampled using a sliding window technique, whereby a window 185 is advanced one nucleotide at a time. At each nucleotide position, content variables are calculated by the content variable computation program module 170A. Content variables for a current window, as well as the content variables for up-stream and down-stream windows, are input to the recurrent neural network program module 170B. The output from the recurrent neural network program module 170B is input to the post-processing and prediction program module 170C in order to account for noise, etc. The output from the recurrent neural network program module 170B represents a coding potential or a coding score, referred to herein as a coding sensor value. Coding sensor values for each nucleotide position are subsequently concatenated to determine a coding/non-coding picture of the DNA sequence.
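The flow of FIG. 1B can be summarized in code. The following is a minimal sketch, not the patent's implementation: the toy scorer stands in for the content variable computation of module 170A and the recurrent network of module 170B, and all names and defaults are illustrative.

```python
import numpy as np

def coding_sensor(seq, window=43, score_fn=None):
    """Slide a window along `seq` and return one score per center position.

    `score_fn` stands in for modules 170A/170B (content variables followed by
    the recurrent network); the default is a toy G+C scorer so the sketch runs.
    """
    if score_fn is None:
        score_fn = lambda w: (w.count('G') + w.count('C')) / len(w)
    half = window // 2
    scores = np.full(len(seq), np.nan)  # positions too close to the ends are skipped
    for pos in range(half, len(seq) - half):
        scores[pos] = score_fn(seq[pos - half:pos + half + 1])
    return scores  # per-nucleotide coding sensor values, ready for module 170C

scores = coding_sensor("ATGGCGGCGCTGACG" * 20)
```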
Exemplary Recurrent Neural Network Architecture
A neural network consists of a number of inter-connected computational neurons that operate in parallel to produce an output result. While each neuron within a neural network operates independently, the inputs and/or outputs of the neurons are connected to one another and are assigned weights. The manner in which weights are assigned to each neuron determines the behavior of the neural network. A neural network may be trained by altering the values of the weights in a well-defined manner, described by a learning rule. As an example, a neural network may be trained to map a set of input patterns onto a set of output patterns. One method of training a neural network is referred to as "supervised learning." Supervised learning employs an external teacher and requires knowledge of the desired responses to input signals. The goal of supervised learning is to minimize the error between the desired output neuron values and the computed output neuron values. The value of an output signal of a neuron depends upon the activation of the neuron, which is expressed as an output transfer function.
The architecture of a neural network is formed by organizing neurons into layers. There may be connections between neurons in the same layers and connections between neurons in different layers. Interlayer connections allow the propagation of signals in one direction or in both directions. In the common feedforward neural network, there are three types of neurons. Input neurons receive signals from external sources and send output signals to other neurons. Output neurons receive signals from other neurons and send signals to the environment. Hidden neurons have no contact with the environment.
A recurrent neural network is a special type of neural network that provides for internal memory. Apart from the regular input neurons, output neurons and hidden neurons that exist in common feedforward multilayer neural networks, recurrent neural networks include a special type of neuron called a context neuron. Context neurons help the neural network to memorize its previous states and thus may model the associations that exist among these states. An illustrative embodiment of a recurrent neural network architecture that may be used in accordance with an exemplary embodiment of the present invention is shown in FIG. 2. The illustrative recurrent neural network comprises a partially connected recurrent network with one hidden layer. The feedforward connections are modifiable while the recurrent connections are fixed. Input neurons 202 accept input signals from the environment and transmit output signals to hidden neurons 204. Hidden neurons 204 in turn transmit output signals to output neurons 208 and also to context neurons 206. Signals transmitted from hidden neurons 204 to context neurons 206 are referred to as feedback. Tanh and linear activation functions are employed for hidden neurons 204 and context neurons 206, respectively. The use of a tanh activation function in the hidden neurons 204 introduces a nonlinear component to the system. A logistic function is used in the output neurons 208. In an exemplary embodiment, sixty hidden neurons 204 are used in the recurrent neural network. Generalization errors were estimated using the split-sample validation method.
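A numerical sketch of such a network follows. It assumes, consistent with the description above, tanh hidden units, linear context units, a logistic output, and a fixed one-to-one copy of the hidden activations into the context layer; the layer sizes and random initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 9, 60             # nine content variables, sixty hidden neurons

# Modifiable feedforward weights (input->hidden, context->hidden, hidden->output).
W_in = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
w_out = rng.normal(scale=0.1, size=n_hidden)
b_h, b_o = np.zeros(n_hidden), 0.0

def forward(inputs):
    """Run the recurrent network over a sequence of input vectors."""
    context = np.zeros(n_hidden)   # linear context neurons: the network's memory
    outputs = []
    for x in inputs:
        hidden = np.tanh(W_in @ x + W_ctx @ context + b_h)  # tanh hidden layer
        context = hidden.copy()    # fixed recurrent connection: copy hidden state
        y = 1.0 / (1.0 + np.exp(-(w_out @ hidden + b_o)))   # logistic output
        outputs.append(y)
    return np.array(outputs)

ys = forward(rng.normal(size=(5, n_in)))   # five consecutive window inputs
```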
Content Variables
Content variables capture the statistical differences between coding and non-coding regions. For a specific DNA sequence, a window of empirically selected size (for example, 43 base pairs) was advanced by one nucleotide at a time along the DNA sequence. In an exemplary embodiment, nine content variables were calculated at each nucleotide position. The 5' and 3' flanking regions of the sequence were treated the same as introns. The following is a description of the content variables calculated in the exemplary embodiment (a computational sketch of two of these variables follows the list). (1) Hexamer 1: Let the preference value of each hexamer be the logarithmic ratio of its normalized probabilities in exons versus introns in human genes. The hexamer 1 variable is defined as the sum of the preference values in the window divided by the number of hexamers (W − 6). Algebraically,
Hexamer 1 = (∑ ln(f_i / F_i)) / (W − 6), summed over i = 1, ..., W − 6,
where W is the window size and f_i and F_i are the normalized probabilities, in exons and introns respectively, of the hexamer beginning at position i.
(2) In-frame hexamer 1: This variable is similar to hexamer 1 except that the observed hexamers in the sequence are compared with the preference values of in-frame hexamers in human exons. The total preference is computed three times for the window, once for each reading frame. The predicted reading frame is taken to be the one that provides the highest in-frame coding versus non-coding preference, and the variable is defined as the total preference for this frame divided by the number of in-frame hexamers, (W − 6)/3. Mathematically,
In-frame hexamer 1 = max_k T_k / ((W − 6)/3), for k = 1, 2, 3,
where T_k = ∑ ln(f_i / F_i) taken over the hexamers of reading frame k.
(3) Hexamer 2 and (4) In-frame hexamer 2: These two variables are similar to the previous two except that the probabilities F are now the frequencies of the hexamers in a random population based on the base composition of the sequence. Mathematically,
F_i = ∏ freq(b_j), for j = 1, ..., 6,
where b_1, ..., b_6 are the bases of the i-th hexamer and freq(b) is the frequency of nucleotide b in the sequence under consideration.
(5) Base composition: The C+G percentage is taken as the base composition variable.
(6) Fickett variable: Fickett (1982) developed an algorithm for predicting coding regions by considering several properties of coding sequences. In a given window, the 3-periodicity of each of the four bases is independently examined and compared to the periodic properties of coding DNA. The overall base composition of the sequence under investigation is also compared with the known composition of coding and non-coding DNA.
(7) Uneven position bias: First proposed by Staden (1984), this variable measures the asymmetry of the base composition over the three codon positions. Let f(b,i) be the probability of base b in position i, where b is the base (b = A, C, G, T) and i is the codon position (i = 1, 2, 3). Then μ(b) = (∑_i f(b,i))/3 and diff(b) = ∑_i |f(b,i) − μ(b)|. The uneven position bias variable is defined as (∑_b diff(b))/W, where W is the width of the window.
(8) Codon prototype: First proposed by Fickett and Tung (1992). Let f(b,i) be the probability of finding base b at position i in an actual codon and q(b,i) be the probability of finding nucleotide b at position i in a trinucleotide that is not a codon. Define B to be the matrix B(b,i) = f(b,i) − q(b,i); the codon prototype variable is then the sum, over the window, of the dot products of B with the codons of the window.
(9) Frame bias: Mural et al (1991) used the frame bias variable in their CRM module to recognize exons in DNA sequences. This variable is very similar to the codon prototype variable. Let f(b,i) be defined as in the uneven position bias variable. If a window codes for protein, one frame should have a significantly better correlation with the f(b,i) matrix than the other two possible reading frames. The correlation coefficient between f(b,i) and each reading frame is calculated, and the difference between the best and worst coefficients is taken as the frame bias variable.
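As referenced above, the following sketch computes two of the nine variables for a single window: hexamer 1 and the uneven position bias. The hexamer preference table here is random, a stand-in for the ln(f/F) values that would be estimated from annotated exons and introns.

```python
import numpy as np
from itertools import product

# Stand-in hexamer preference table ln(f/F); real values would be estimated
# from exon/intron hexamer probabilities in the training data.
rng = np.random.default_rng(1)
PREF = {''.join(h): rng.normal(scale=0.5) for h in product('ACGT', repeat=6)}

def hexamer1(window):
    """Sum of hexamer log preferences in the window, divided by (W - 6)."""
    W = len(window)
    return sum(PREF[window[i:i + 6]] for i in range(W - 6)) / (W - 6)

def uneven_position_bias(window):
    """Staden's asymmetry of base composition over the three codon positions."""
    W = len(window)
    counts = {b: np.zeros(3) for b in 'ACGT'}
    n = np.zeros(3)                          # positions observed in each frame
    for i, b in enumerate(window):
        n[i % 3] += 1
        if b in counts:
            counts[b][i % 3] += 1
    total = 0.0
    for b in 'ACGT':
        f = counts[b] / n                    # f(b, i), estimated from the window
        total += np.abs(f - f.mean()).sum()  # diff(b)
    return total / W                         # (sum_b diff(b)) / W

w = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAGCATG"   # a 43 bp example window
x = (hexamer1(w), uneven_position_bias(w))
```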
Datasets Used in the Embodiment of the Instant Invention
The data sets used in operation of an exemplary embodiment were obtained from the primate division of GenBank Release 98.0 (December 28, 1996). Sequences with an annotation of "DNA" in the LOCUS field, "complete cds" (coding sequence) in the DEFINITION field, "Homo sapiens" in the SOURCE field, and at least one CDS entry in the FEATURE table were extracted. From this initial set, the following sequences were discarded: sequences encoding incomplete protein products, sequences encoding pseudogenes, sequences encoding putative genes, sequences encoding protein coding genes in the complementary strand, fragmented sequences requiring sequence assembly, sequences having alternatively spliced sites, and sequences containing ambiguous nucleotides.
The following sequences were further dropped to ensure dataset integrity: sequences encoding more than one gene, sequences having introns whose lengths were less than 5 bp, sequences having introns not starting with GT or not ending with AG, sequences with a CDS not starting with an ATG or not ending with a stop codon, and sequences with CDS lengths not divisible by three. Finally, sequences corresponding to immunoglobulins and histocompatibility antigens were also discarded because of their ability to undergo complex DNA rearrangement. The final dataset consisted of 548 sequences, each encoding one and only one complete, spliceable, functional protein product in the forward strand. This set (dataset_A) contained 2,926,880 nucleotides, of which 597,720 were exon bases and 1,308,300 were intron bases.
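The primary selection step described above can be sketched in code. The sketch is illustrative only and assumes Biopython's GenBank parser; the file name primate.gb and the annotation keys molecule_type and organism are assumptions about the local data layout rather than part of the original procedure, and the secondary integrity filters (intron lengths, GT/AG boundaries, CDS structure, and so on) would be applied afterwards by inspecting each record's CDS features.

```python
from Bio import SeqIO

def primary_candidate(record):
    """First-pass selection: DNA molecule, 'complete cds' in the definition,
    human source, and at least one CDS feature."""
    return (record.annotations.get("molecule_type", "") == "DNA"
            and "complete cds" in record.description
            and record.annotations.get("organism", "") == "Homo sapiens"
            and any(f.type == "CDS" for f in record.features))

# Parse the primate division flat file and keep the candidate records.
selected = [rec for rec in SeqIO.parse("primate.gb", "genbank")
            if primary_candidate(rec)]
```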
All the exon and intron parameters used in calculating the content variables were estimated from this dataset. A second dataset (dataset_B) was constructed from dataset_A for derivation and testing of the recurrent neural network by dropping the single-exon sequences (263 sequences). Since considerable evidence suggests that the human genome is heterogeneous with respect to C+G content (Burge 1997), the sequences in dataset_B were further divided into four groups according to the C+G composition of the sequences: I (< 43 C+G%); II (43-51); III (51-57); and IV (> 57). There were 45, 73, 67 and 79 sequences in groups I, II, III and IV, respectively. Each sequence in dataset_B (sequences longer than 15 kb were avoided) was assigned to one of three sets: a training, validation or test set. The resultant training set consisted of 15, 38, 36 and 43 sequences for groups I, II, III and IV, respectively, while the validation set contained 4, 8, 8 and 9 sequences, respectively. The test set, shown below in Table 1, contained 10, 25, 23 and 27 sequences in groups I, II, III and IV, respectively.

Table 1. Test sets for the four groups
Group IV (C+G > 57): HSU48869 (U48869), HUMHBQ1A (M33022), HUMMK (D10604), HSU62025 (U62025), HUMMIF (L19686), HUMPEPYYA (L25648), HUMALPHA (J03252), HUMPNMTA (J03280), HSU19816 (U19816), HUMURAGLY (M87499), HUMAZCDIT (M96326), HSU20982 (U20982), HUMACTGA (M19283), HUMAK1 (J04809), HUMCP210H (M26856), HUMAPOE4 (M10065), HUMPEM (M61170), HUMALIFA (M63420), HUMPGAMMG (J05073), HUMMHCW1 (M16272), HSU10307 (U10307), HUMMHHLAJB (M80469), HSU48865 (U48865), HUMTNFBA (M55913), HSU05259 (U05259), HUMPROT2 (M60332), HUMPRF1A (M31951)

Group III (C+G 51-57): HUMPRCA (M11228), HUMCSFGMA (M13207), HUMMH6 (J03027), HSU71086 (U71086), HUMCD19A (M84371), HUMGHV (K00470), HSU54701 (U54701), HUMPP14B (M34046), HSU22027 (U22027), HSU48795 (U48795), HUMKEREP (J00124), HSU47654 (U47654), HUMPLA (J00289), HSU32576 (U32576), HUMCOLA (M95529), HUMAPOCII (M10612), HUMAGAL (M5 199), HUMANFA (K02043), HSU20223 (U20223), HUMIMPDHB (L33842), HUMCTLA1 (M38193), HUMCAPG (J04990), HUMCBRG (M62420)

Group II (C+G 43-51): HUMTHROMA (L36051), HUMP45C17S (M63871), HUMTRHYA (L09190), HUMRPS17A (M18000), HSU57623 (U57623), HUMGASTA (M15958), HSU07807 (U07807), HUMPHOSA (L12760), HUMGAD45A (L24498), HUMPF4V1A (M26167), HUMKALLIS (L28101), HUMHAP (M92444), HUMCRPG (M11880), HUMEF1A (J04617), HUMPCBD (L41560), HUNPIV (U18745), HUMENA78A (L37036), HUMATPSYB (M27132), HSU31929 (U31929), HUMNUCLEO (M60858), HUMTCRBAP (L48728), HUMG0S19A (M23178), HSU19906 (U19906), HSU12709 (U12709), HUMPRPH2 (M13058)

Group I (C+G < 43): HUMHSD3BA (M77144), HUMREGB (J05412), HUMIL9A (M86593), HUMPALC (M11518), HUMBETGLOD (L26465), HSU20758 (U20758), HUMIL5A (J02971), HUMLUCT (D14283), HUMHIAPPA (M26650), HUMIL2B (K02056)
Properties Of Content Variables
In an exemplary embodiment, each sequence in dataset_B was sampled using a sliding-window technique with a window size of 43 bp, sliding the window one nucleotide at a time. One-dimensional distributions of the content variables were studied; the results for group IV are shown in FIG. 3. Two features stand out. First, as one would hope, the distributions of nearly all the variables are approximately normal. Secondly, there is significant overlap between the coding and non-coding classes for all variables, meaning that there is little information available to distinguish the two classes in any one dimension. In particular, for variables such as codon prototype and C+G% content, the distribution of the coding class lies completely inside that of the non-coding class. The results for the other three groups (data not shown) demonstrate similar features. To evaluate the relative contributions of individual variables to the total discriminative power, the Bhattacharyya distance (B), which indicates the significance of each variable, was calculated for each variable in each group under the equal variance assumption. This statistical distance is defined as:
B = (1/8)(M₂ - M₁)ᵀ [(Σ₁ + Σ₂)/2]⁻¹ (M₂ - M₁) + (1/2) ln( |(Σ₁ + Σ₂)/2| / √(|Σ₁| |Σ₂|) )

where M₁ and M₂ are the means and Σ₁ and Σ₂ are the covariance matrices of the coding and non-coding regions, respectively. The results for group IV and group I are shown in Table 2 and Table 3, respectively. The combined Bhattacharyya distances were calculated using the forward searching procedure under the same assumption.
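For illustration, the distance can be computed directly from sampled content-variable vectors. The sketch below is a minimal implementation of the Gaussian form of the distance given above; it assumes two NumPy arrays whose rows are window samples and whose columns are the content variables.

```python
import numpy as np

def bhattacharyya(X_coding, X_noncoding):
    """Bhattacharyya distance between the coding and non-coding classes of
    content-variable vectors (rows = window samples, columns = variables)."""
    m1, m2 = X_coding.mean(axis=0), X_noncoding.mean(axis=0)
    s1 = np.cov(X_coding, rowvar=False)
    s2 = np.cov(X_noncoding, rowvar=False)
    s = (s1 + s2) / 2.0                           # pooled covariance
    d = m2 - m1
    term1 = d @ np.linalg.solve(s, d) / 8.0       # mean-separation term
    term2 = 0.5 * np.log(np.linalg.det(s) /
                         np.sqrt(np.linalg.det(s1) * np.linalg.det(s2)))
    return term1 + term2
```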
Table 2. Significance of content variables - group IV (C+G > 57)
Content variable    Order in forward searching    Individual B    Combined B
InFrame Hexamer 1 1 0.229 0.229
Codon Prototype 2 0.050 0.364
Frame Bias 3 0.087 0.411
Hexamer 2 4 0.067 0.447
Hexamer 1 5 0.161 0.478
Uneven Positional Base 6 0.142 0.500
Fickett 7 0.092 0.538
InFrame Hexamer 2 8 0.160 0.551
C+G% 9 0.046 0.560
Table 3. Significance of content variables - group I (C+G < 43)

Content variable    Order in forward searching    Individual B    Combined B
C+G% 1 0.166 0.166
Hexamer 1 2 0.040 0.317
Uneven Positional Base 3 0.089 0.365
Fickett 4 0.111 0.396
Frame Bias 5 0.038 0.420
Codon Prototype 6 0.010 0.442
InFrame Hexamer 2 7 0.159 0.465
Hexamer 2 8 0.161 0.481
InFrame Hexamer 1 9 0.066 0.493

There are a few notable observations concerning these calculations. First, the discriminative information correlates with the C+G percentage: there is more information in the high C+G% groups than in the low C+G% group. Thus, the combined Bhattacharyya distance of 0.560 for group IV is higher than the distance of 0.493 for group I. This phenomenon may in part explain the observation that gene prediction programs tend to perform less well on A+T-rich sequences (e.g., Snyder and Stormo, 1995). Secondly, in-frame hexamer 1 is the most discriminative content variable in the high C+G% groups, consistent with previous results (Fickett and Tung, 1992), but it is not in the low C+G% group, either individually or in the combined case. Thirdly, even the most discriminative variable (InFrame hexamer 1 in the case of group IV) contributes only about one third of the total statistical distance. Although the hexamer, codon prototype, frame bias and uneven position bias variables all depend on the positional base frequency information in the gene, they evidently capture non-redundant statistical aspects of this information.
Neural Network Training Details
Training of the exemplary recurrent neural network described above was performed in the following manner. Suppose the training set of related values of inputs and targets from a sequence is represented by {x(i), d(i)}, 1 ≤ i ≤ L, where L is the total sample size from the sequence. Training is done by adjusting the weights assigned to the neurons of the neural network in order to minimize a cost function. The cost function used was the sum of squared errors augmented by a simple weight-decay regularization term:

E(w) = ∑_{i=1}^{L} [d(i) - y(i)]² + α ∑_j w_j²

where w is the set of weights and α is a small regularization parameter. The weight decay is added to avoid overfitting, as it puts constraints on the parameters and thus reduces the degrees of freedom.
The networks were trained for 200 epochs using the backpropagation method. During training, the networks were evaluated using the mean-squared error (MSE), defined as follows:

MSE = (1/NP) ∑_{i=1}^{NP} [d(i) - y(i)]²

where NP is the total number of observations from the training set. The validation error was calculated similarly. MSE values of 0.084, 0.088, 0.093 and 0.099 were achieved for groups I, II, III and IV, respectively. The corresponding validation errors were 0.063, 0.050, 0.060 and 0.087, respectively.
Several techniques were used to increase the speed of convergence. (1) Stochastic updating was used instead of batch updating. (2) Training was performed without a momentum term; it was found that training without momentum gave better results than training with momentum (momentum coefficient 0.8). A possible explanation is that the error surface is so complicated that any change in the weights must be very small. (3) The learning rate was adjusted sequentially during training using the search-then-converge technique. Rapid convergence was achieved after applying these techniques, usually within 50 epochs.
Early stopping and simple weight decay were employed to increase the generalization ability of the trained network. The decay parameter α was taken as 0.0001, and the biases were not subjected to the decay. A second technique was also employed: at the end of each epoch, the sequences in the training set were shuffled randomly to vary the order of sequence presentation to the network. This procedure decreased the possibility of the network becoming stuck in a local minimum.
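A minimal sketch of this training schedule is given below. The network object and its backprop, mse, get_weights and set_weights methods are hypothetical names standing in for a recurrent network implementation, and the schedule constants eta0 and tau are illustrative; only the 200 epochs, the decay parameter α = 0.0001, the absence of momentum, the per-epoch shuffling and the validation-based early stopping come from the text.

```python
import random

def train(net, train_seqs, val_seqs, epochs=200, eta0=0.1, tau=50.0, alpha=1e-4):
    """Sketch of the training schedule described above.  Each sequence in
    train_seqs is a list of (x, d) input/target pairs."""
    best_val, best_weights = float("inf"), None
    for epoch in range(epochs):
        lr = eta0 / (1.0 + epoch / tau)            # search-then-converge schedule
        random.shuffle(train_seqs)                 # reshuffle sequence order each epoch
        for seq in train_seqs:
            for x, d in seq:                       # stochastic (per-sample) updating
                net.backprop(x, d, lr=lr, decay=alpha)   # no momentum term
        val = net.mse(val_seqs)
        if val < best_val:                         # early stopping on validation error
            best_val, best_weights = val, net.get_weights()
    net.set_weights(best_weights)                  # keep the best validated weights
    return best_val
```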
Coding Differential

The recurrent neural network coding sensor was evaluated using the coding differential measure (Δ), first proposed by Burge (1997). The coding differential for each sequence in the test set was calculated. The result is shown in FIG. 4 along with the results from the inhomogeneous 3-periodic fifth-order Markov model. In order to obtain the sequence-generating probabilities under the recurrent neural network model, the following formula (Bayes theorem) was used in the calculations:

P(x | coding) / P(x | noncoding) = [P(coding | x) / P(noncoding | x)] · [P(noncoding) / P(coding)]

where P(coding | x) is the network output for input x, and P(coding) and P(noncoding) are the class priors.
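In code, the conversion from the network's posterior output to the log-likelihood ratio implied by this formula can be sketched as follows; the prior p_coding = P(coding) is assumed to be estimated separately, for example from the exon/intron base counts of dataset_A.

```python
import math

def log_likelihood_ratio(y, p_coding):
    """ln[P(x | coding) / P(x | noncoding)] from the posterior y = P(coding | x)
    and the prior p_coding = P(coding), by Bayes theorem.  Requires 0 < y < 1."""
    return math.log(y / (1.0 - y)) - math.log(p_coding / (1.0 - p_coding))
```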
It may be observed from FIG. 4 that the recurrent neural network model increases the coding differential dramatically over the inhomogeneous 3-periodic fifth-order Markov model in almost all sequences. The only exception is sequence HUMTCRBAP (accession L48728), where ΔRNN = -1.065124 and ΔMARKOV = -0.005051; in this case the RNN model is actually worse than the Markov model, and both models tend to misclassify the exons as introns and the introns as exons. Overall, these results suggest that long-range interactions (at least 6 nucleotides apart) exist among the nucleotides and that capturing these interactions (such as with RNN modeling) can increase the coding differentials substantially, leading to potentially better gene identification. The ΔRNN mean values for the four C+G% groups (I, II, III, IV) were 2.088, 3.913, 5.700 and 6.166, while the corresponding ΔMARKOV mean values were 0.047, 0.076, 0.097 and 0.105. The ΔRNN value correlates significantly with the sequence C+G% content (statistical significance level P < 0.01); on average, high C+G% sequences have high ΔRNN values.
Evaluation was also made using a data set constructed by Burset and Guigo (1996) comprising 570 vertebrate genes. The results are shown in FIG. 5. The features demonstrated in FIG. 4 are also prominent in FIG. 5, suggesting that the superiority of the RNN model over the fifth-order Markov models for distinguishing coding from non-coding regions is independent of the gene set used. For this gene set, the ΔRNN mean values for the four C+G% groups (I, II, III, IV) were 2.457, 3.882, 4.890 and 5.766, while the corresponding ΔMARKOV mean values were 0.004, 0.059, 0.094 and 0.096.
Prediction of Coding Regions
The output of the neural network for a given nucleotide position can be interpreted as the probability of that nucleotide position being a coding nucleotide. The post-processing and prediction method of the present invention concatenates the outputs of one or more neural networks to provide an overall coding/non-coding arrangement of the DNA sequence. An exemplary post-processing and prediction method is described by the following steps:
(1) The sequence is sampled using a window size of 43 bp, with the window sliding one nucleotide at a time. At each position, the content variables are calculated.
(2) The content variables are input into the neural network and the output value sequence is obtained.
(3) To reduce statistical fluctuation, the output value sequence is smoothed twice with a 5-point median filter. (4) To find coding regions, the output sequence is scanned from left to right using a global threshold technique (a sketch of this procedure appears after step (8) below). The threshold value is empirically decided. During scanning, starting from the first position:
(a) If the output value at position i is larger than the threshold, continue scanning to the right until the output value no longer increases; let that position be j. Then, starting from position i, search to the left until the output value no longer decreases; let that be position k. The middle position between j and k is taken as the left boundary of a potential coding region. (b) From position j+1, continue scanning to the right until the output value falls below the threshold; let that position be i. From position i, scan to the right until the output value no longer decreases; let that be k. From position i, scan to the left until the output value no longer increases; let that be j. The middle position between j and k is taken as the right boundary of the potential coding region.
(c) Repeat (a) and (b) until the end of the sequence is reached.
(5) The boundaries of these potential coding regions are adjusted by taking into consideration the exon and intron length distributions (Lou 1997). Specifically, if two potential coding regions are separated by a non-coding region of less than 65 bp (introns are usually > 64 bp in length), then they are combined into a single coding region.
(6) Every potential donor site (containing GT) and acceptor site (containing AG) is evaluated by WAM matrices (Zhang and Marr 1993). Two sequences are thus obtained, one for potential donor sites and another for potential acceptor sites; each contains the locations of the potential sites and the corresponding WAM scores. (7) The boundaries are further adjusted by consideration of the WAM scores. Specifically, for each boundary point p, the potential splice sites located within a pre-specified distance d of point p are selected. The site with the largest WAM value among the selected potential splice sites is taken as the true site. If no site is selected, the distance d is doubled (constrained by the previous boundary point and the next boundary point) and the procedure is repeated (and so on). The pre-specified distance d was empirically determined as 120 bp for both donor and acceptor sites. (8) Repeat step (5).
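The following sketch illustrates steps (3) to (5); the WAM-based boundary refinement of steps (6) to (8) is omitted. The 65-bp merge rule comes from the text and the default threshold of 0.8 matches the worked examples below, while the function names and the edge padding of the median filter are implementation assumptions.

```python
def median5(y):
    """One pass of a 5-point median filter (edge-padded)."""
    pad = [y[0], y[0]] + list(y) + [y[-1], y[-1]]
    return [sorted(pad[i:i + 5])[2] for i in range(len(y))]

def scan_regions(y, threshold=0.8):
    """Steps (4)(a)-(c): boundaries are midpoints between the threshold
    crossing's shoulder and the adjacent peak or valley."""
    regions, i, n = [], 0, len(y)
    while i < n:
        if y[i] <= threshold:
            i += 1
            continue
        j = i
        while j + 1 < n and y[j + 1] > y[j]:      # right until no longer increasing
            j += 1
        k = i
        while k - 1 >= 0 and y[k - 1] < y[k]:     # left until no longer decreasing
            k -= 1
        left = (j + k) // 2                       # left boundary
        i = j + 1
        while i < n and y[i] > threshold:         # cross back below the threshold
            i += 1
        if i >= n:
            regions.append((left, n - 1))
            break
        k = i
        while k + 1 < n and y[k + 1] < y[k]:      # right until no longer decreasing
            k += 1
        j = i
        while j - 1 >= 0 and y[j - 1] > y[j]:     # left until no longer increasing
            j -= 1
        regions.append((left, (j + k) // 2))      # right boundary
        i = k + 1
    return regions

def merge_close(regions, min_intron=65):
    """Step (5): merge regions separated by an implausibly short 'intron'."""
    merged = [regions[0]] if regions else []
    for lo, hi in regions[1:]:
        if lo - merged[-1][1] - 1 < min_intron:
            merged[-1] = (merged[-1][0], hi)
        else:
            merged.append((lo, hi))
    return merged

def predict_regions(y, threshold=0.8):
    y = median5(median5(y))                       # step (3): smooth twice
    return merge_close(scan_regions(y, threshold))
```

In use, y would be the per-nucleotide network output for one sequence; for the compositional groups I and II the threshold would be lowered to 0.4 and 0.6, as noted in the evaluation below.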
As an example, the output for gene HUMPNMTA (accession J03280) from group IV is shown in FIG. 6, in which the curve represents the output of the neural network and the straight line represents the annotated gene arrangement. The dots represent the predicted locations.
The probability 0.8 was used as the global threshold value, which roughly means that the probability that a predicted exon is correct is 0.8. This gene has three annotated exons at locations 1958-2159, 3159-3366 and 3480-3918. The algorithm found all three exons, with predicted regions 2008-2159, 3159-3366 and 3480-3884. Except for the translation initiation site and the stop site, the four donor/acceptor sites match the actual (annotated) locations exactly, with a correlation coefficient CC = 0.94. Though the overall level of accuracy in this example is somewhat higher than average for the algorithm, it is by no means atypical. The example also serves to illustrate some weaknesses of the algorithm. Because the identification of coding regions relies on the global threshold technique, the predicted coding region type (initial/internal/terminal) cannot be known in advance. As a compromise, all predicted regions were treated as internal exons in this study. The shortcoming is that the initiation site and stop site locations cannot be located precisely.
The second example gives some insight into how the exemplary method behaves in a realistic situation. The sequence HSNCAMX1 (accession Z29373) was not in dataset_A because there is no "complete cds" in its DEFINITION line. It is a completely new sequence to the algorithm, since it was not involved in any step of the development of the algorithm. The neural network output is shown in FIG. 6 and the corresponding values are shown in Table 4.
Table 4. Prediction of the simple algorithm for GenBank sequence HSNCAMX1

Predicted exons (exon #, start, end, length, coding score, acceptor score, donor score, status) | Annotated gene arrangement (exon #, start, end, length, status)

01 1508 1793 286 n/a n/a 3.01 overlap | 01 1533 1608 76 overlap
-- | 02 4127 4141 15 missed
02 4732 4777 46 -22.33 6.18 3.58 partial | 03 4672 4777 106 partial
03 5015 5217 203 -53.65 7.36 0.65 exact | 04 5015 5217 203 exact
04 6192 6314 123 199.65 2.93 -0.01 exact | 05 6192 6314 123 exact
05 6411 6581 171 -11.75 2.19 2.53 exact | 06 6411 6581 171 exact
06 6871 6982 112 -56.75 2.19 1.69 exact | 07 6871 6982 112 exact
07 7198 7315 118 -5.21 3.41 3.26 partial | 08 7131 7315 185 partial
-- | 09 7437 7568 132 missed
08 7708 7794 87 -19.78 -0.33 3.69 partial | 10 7708 7851 144 partial
09 7888 8032 145 -210.51 -1.21 4.21 wrong | --
10 8417 8528 112 -37.54 1.95 4.56 exact | 11 8417 8528 112 exact
11 8642 8808 167 -62.80 3.09 1.76 exact | 12 8642 8808 167 exact
12 8911 9067 157 -214.68 3.16 3.00 exact | 13 8911 9067 157 exact
13 9248 9372 125 -35.53 5.98 5.99 exact | 14 9248 9372 125 exact
14 9460 9570 111 -82.50 3.92 3.85 exact | 15 9460 9570 111 exact
15 9817 10014 198 -512.01 4.71 6.88 exact | 16 9817 10014 198 exact
-- | 17 10246 10316 71 missed
16 10499 10721 223 -164.65 2.18 4.92 exact | 18 10499 10721 223 exact
17 11482 11666 185 -442.92 3.00 5.23 partial | 19 11551 11666 116 partial
18 11870 12282 413 -400.45 5.78 -1.46 partial | 20 11870 12071 202 partial
-- | 21 12160 12282 123 partial
19 12376 12549 174 -77.68 4.26 4.44 exact | 22 12376 12549 174 exact
20 12666 12785 120 -136.99 4.09 7.54 exact | 23 12666 12785 120 exact
21 12925 13048 124 -14.98 2.70 5.54 partial | 24 12893 13048 156 partial
22 13257 13407 151 -423.52 0.31 3.20 overlap | 25 13352 13486 135 overlap
23 13820 13892 73 -16.07 3.86 5.08 exact | 26 13820 13892 73 exact
-- | 27 13990 14001 12 missed
24 14475 14641 167 n/a 2.19 n/a partial | 28 14475 14706 232 partial
In Table 4, columns 1-7 were output by the exemplary method and columns 9-12 are from the GenBank annotation. Column 2 is the beginning position of the predicted exon, column 3 is its ending position, and column 4 is its length. Column 5 is the coding sensor score of the coding portion of the exon, column 6 is the score of the acceptor signal at the 5' end of the predicted exon, and column 7 is the score of the donor signal at the 3' end of the exon. Columns 8 and 13 are provided here for illustrative purposes. The coding sensor score and the acceptor signal score of the first predicted exon were not calculated, nor were the coding sensor score and the donor signal score of the last predicted exon.
The probability 0.8 was used as the global threshold value, as in the previous example. Sensitivity and specificity are both 0.84, and the correlation coefficient is 0.79. In this example, the annotated HSNCAMX1 gene contains 28 coding exons, of which 14 were predicted exactly, eight were predicted partially, two were predicted by overlapping exons and four were missed completely. In addition, one wrong exon was predicted.
It is notable that the wrongly predicted exon has an unusually weak acceptor signal score (weaker than any score for a true splice site in this gene) and a relatively weak coding sensor score. Thus, the splice signal and exon coding sensor scores may provide useful information about the reliability of a prediction. The most distinctive property of the four annotated exons that were missed (exons 02, 09, 17 and 27) is their small size (15, 132, 71 and 12 bp, respectively). In the neural network output for this gene, there were small peaks (at levels 0.20, 0.60 and 0.40) in the regions spanned by annotated exons 02, 09 and 17. It could therefore be possible to pick up these exons if a better assembly algorithm were used in place of the simple algorithm. On the other hand, there was no distinguishable peak around the region 13990-14001 where the fourth missed exon is located.
Evaluation Of Coding Sensor
The measures established by Burset and Guigo (1996) were used to evaluate the accuracy performance of the recurrent neural network on the test sets. Table 5 shows the nucleotide-level accuracy for the different C+G% compositional test groups, along with the results from two of the most widely used gene prediction programs on the same test sets.
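For reference, the nucleotide-level measures can be computed from the per-nucleotide confusion counts. The sketch below assumes the standard definitions from Burset and Guigo (1996): Sn = TP/(TP+FN), Sp = TP/(TP+FP), AC the approximate correlation derived from the four conditional probabilities, and CC the (Matthews) correlation coefficient.

```python
import math

def nucleotide_accuracy(tp, fp, tn, fn):
    """Nucleotide-level accuracy measures of Burset and Guigo (1996)."""
    sn = tp / (tp + fn)                           # sensitivity
    sp = tp / (tp + fp)                           # specificity
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp) +
                  tn / (tn + fp) + tn / (tn + fn))
    ac = (acp - 0.5) * 2.0                        # approximate correlation
    cc = ((tp * tn - fp * fn) /
          math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)))
    return sn, sp, ac, cc
```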
Table 5. Nucleotide-level accuracy for the test sets.

C+G%    RNN    GeneID    SORFIND
> 57 (27 sequences)
Sn 0.73 0.70 0.66
Sp 0.75 0.70 0.79
AC 0.67 0.63 0.67
CC 0.66 0.62 0.66
51-57 (23 sequences)
Sn 0.63 0.58 0.72
Sp 0.81 0.77 0.86
AC 0.66 0.61 0.74
CC 0.65 0.59 0.73
43-51 (25 sequences)
Sn 0.50 0.71 0.74
Sp 0.62 0.78 0.81
AC 0.50 0.69 0.70
CC 0.48 0.67 0.69
<= 43 (10 sequences)
Sn 0.41 0.57 0.67
Sp 0.53 0.90 0.83
AC 0.41 0.71 0.72
CC 0.40 0.65 0.71
Probabilities of 0.4, 0.6, 0.8 and 0.8 were used as the global threshold values for groups I, II, III and IV, respectively. GeneID was assessed using the e-mail service geneid@darwin.bu.edu, and the "-noexonblast" option was used to suppress the protein database search. The first-ranked potential gene was used. SORFIND Version 2.8 (dated July 9, 1996) was downloaded from www.rabbithutch.com and the default parameter values were used in the evaluation.
The accuracy at the exon level is shown in Table 6.

Table 6. Exon-level accuracy for the test sets
C+G%    Predicted exons: #  Exact(%)  Part(%)  Overlap(%)  Wrong(%)    Annotated exons: #  Exact(%)  Part(%)  Overlap(%)  Miss(%)
> 57 135 16 42 17 25 112 19 54 15 12
51-57 118 29 48 9 14 143 24 41 7 28
43-51 104 14 38 13 35 112 13 35 12 40
<= 43 62 9 24 11 56 40 13 38 13 38
As may be seen in Table 6, several features stand out. First, the recurrent neural network is able to capture the coding information efficiently, as evidenced by its good performance in the high C+G% groups. In fact, the results are competitive with other, more sophisticated systems at the nucleotide level, which probably implies that the recurrent neural network extracts coding information more efficiently than the coding-region subsystems in those leading systems. Secondly, with the decreasing information available in the coding region (correlated with C+G content), the performance decreases gradually as expected, due to the global threshold operation. This decrease is evident at the nucleotide level as well as at the exon level: at the nucleotide level, the correlation coefficient decreases from 0.66 to 0.40, and at the exon level, while 88% of annotated exons were identified in group IV (exact + partial + overlap), only 62% were identified in group I. Thirdly, the exon-level accuracy is low. For example, in the best case (group III, C+G 51-57), only 29 percent of the 118 predicted exons are exactly correct (while 48 percent are partially correct). This is mainly due to the inefficient use of information from the splice site and initiation site signals.
A large percentage of incorrect exons was predicted. For example, 35% of the 104 predicted exons in group II were wrong (see Table 6). The reason is that there were some strong peaks (comparable to the true exon peaks) in the neural network output at non-exon sites. To eliminate these peaks, a possible approach is a good assembly algorithm that can effectively assemble the coding region model (the recurrent neural network), the splice site models, the initiation site model and the stop site model into a gene model, so that in the overall model these false peaks would not pose a problem. It is quite possible that at these false peak regions either no splice site signals are available or the signals are very weak. This approach is demonstrated by integrating the coding sensor into a rather simple generalized hidden Markov model (Rabiner 1989), which substantially improves the overall prediction accuracy at both the nucleotide level and the exon level (Table 7).
[Table 7. Accuracy of the generalized hidden Markov model integration; reproduced as an image in the source document.]
In Table 7, the number of sequences in each test set is given in the first parentheses of column 1, followed by the number of sequences for which no gene was predicted in the second parentheses. The generalized hidden Markov model contains ten states and is similar in structure to the ones used in Genie and GENSCAN. All the parameters (state length distributions, Markov transition probabilities and initial state probabilities) were estimated from dataset_A (Lou 1997). The state sequence-generating models for the splice sites and initiation sites are WAM models. The sequence-generating model for the coding/non-coding regions is the recurrent neural network model (converting the posterior probability to a sequence-generating model using Bayes theorem). The performance of the model (program GeneACT) was tested on the set of 570 vertebrate genes constructed by Burset and Guigo (1996). The results are shown in Table 8 and comparisons with other systems are shown in Table 9, below.
At the nucleotide level, GeneACT is comparable with all leading systems. Although the sensitivity and specificity at the exon level are low, the missing-exon and wrong-exon percentages are comparable with those of other systems. It should be noted that, because of the overlap between the training sets of all these systems and the Burset and Guigo dataset, truly objective comparisons of these systems are not obtainable and probably not even meaningful. To increase the exon-level sensitivity and specificity, one obvious approach is to build more sophisticated splice site models (Burge and Karlin 1997). Another approach is to incorporate promoter, polyA and other signals (such as signal peptide and CpG signals) into the generalized HMM model. It is anticipated that by using these two approaches the overall performance of the system will be substantially improved. After the incorporation of promoter and polyA signals into the HMM model, further improvement of the HMM modeling may come from an RNN model that treats the 5' UTR, introns, 3' UTR and intergenic regions differently.
[Table 8. GeneACT accuracy on the Burset and Guigo set of 570 vertebrate genes; reproduced as an image in the source document.]
In Table 8, the number of sequences in each subgroup is given under the heading "# Seqs", followed by the number of sequences for which no gene was predicted, in parentheses.
[Table 9. Comparison of GeneACT with other gene prediction systems; reproduced as an image in the source document.]
In Table 9, the performance of the GeneACT model is shown in the first row. Most of these results are from Burset and Guigo's comprehensive study (1996), with the exception of Genie, GENSCAN and VEIL: Genie results are from Kulp et al. (1996), GENSCAN results are from Burge and Karlin (1997), and VEIL results are from Henderson et al. (1997). Under the heading "# Seqs", the number of sequences (out of 570) effectively analyzed by each program is given (some programs failed to run on certain sequences), followed by the number of sequences for which no gene was predicted, in parentheses. GeneID+ and GeneParser3 make use of amino acid similarity searches and were tested only on sequences less than 8 kb in length. It will be understood that the foregoing is intended to be illustrative of the invention and that other examples and embodiments are contemplated within the scope and spirit of the invention and in the claims.
References

Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Burge, C. 1997. Identification of complete gene structures in human genomic DNA. PhD thesis. Stanford University, Stanford, CA.

Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.
Burset, M. and Guigo, R. 1996. Evaluation of gene structure prediction programs. Genomics 34, 353-367.

Dong, S. and Searls, D.B. 1994. Gene structure prediction by linguistic methods. Genomics 23, 540-551.

Fickett, J.W. 1982. Recognition of protein coding regions in DNA sequences. Nucl. Acids Res. 10, 5303-5318.
Fickett, J.W. and Tung, C.S. 1992. Assessment of protein coding measures. Nucl. Acids Res. 20, 6441-6450.
Fields, C.A. and Soderlund, C.A. 1990. gm: a practical tool for automating DNA sequence analysis. Comp. Appl. Biol. Sci. 6, 263-270.
Guigo, R., Knudsen, S., Drake, N. and Smith, T. 1992. Prediction of Gene Structure. J. Mol. Biol., 226, 141-157.
Henderson, J., Salzberg, S. and Fasman, K. 1997. Finding genes in human DNA with a hidden Markov model. J. Comp. Biol. 4, 119-126.
Hutchinson, G.B. and Hayden, M.R. 1992. The prediction of exons through an analysis of spliceable open reading frames. Nucl. Acids Res. 20, 3453-3462.
Kulp, D., Haussler, D., Reese, M.G. and Eeckman, F.H. 1996. A generalized hidden Markov model for the recognition of human genes in DNA. In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA.
Lou, Y. 1997. Recognition of Protein Coding Regions in Human Genomic DNA. PhD thesis. Medical University of South Carolina, SC.
Rabiner, L.R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257-286.

Ripley, B.D. 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Snyder, E.E. and Stormo, G.D. 1993. Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucl. Acids Res. 21, 607-613.
Snyder, E.E. and Stormo, G.D. 1995. Identification of Protein Coding Regions in Genomic DNA. J. Mol. Biol. 248, 1-18.
Staden, R. 1984. Measurements of the effect that coding for a protein has on a DNA sequence and their use for finding genes. Nucl. Acids Res. 12, 551-567.
Solovyev, V.V., Salamov, A.A. and Lawrence, C.B. 1994. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucl. Acids Res. 22, 5156-5163.
Uberbacher, E.C. and Mural, R.J. 1991. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA 88, 11261-11265.

Xu, Y., Mural, R.J. and Uberbacher, E.C. 1994. Constructing gene models from accurately predicted exons: an application of dynamic programming. Comp. Appl. Biol. Sci. 10, 613-623.
Zhang, M.Q. and Marr, T.G. 1993. A weight array method for splicing signal analysis. Comp. Appl. Biol. Sci. 9, 499-509.

Claims

What is claimed is:
1. A method for analyzing and determining a genetic coding sequence from a nucleotide sequence of interest, comprising applying a preprocessing step using information from the sequence segment of interest and from segments of the sequence that are up-stream or down-stream adjacent to the current segment of the sequence of interest.
2. The method of claim 1, wherein the preprocessing step comprises a recurrent neural network to analyze information from the sequence segment of interest and from segments of the sequence that are up-stream or down-stream adjacent to the current segment of the sequence of interest.
3. The method of claim 1, wherein the preprocessing step implements a coding sensor to determine the presence or absence of coding information in a nucleotide sequence of interest.
4. The method of claim 1, wherein the preprocessing step comprises applying a coding sensor using a recurrent neural network to the sequence of interest.
5. A method for determining coding regions within a DNA sequence, the DNA sequence comprising a chain of nucleotides, comprising the steps of: calculating at least one content variable associated with a sampling window of a predetermined number of nucleotides centered at a selected nucleotide position; calculating at least one neighboring content variable associated with a neighboring sampling window of the predetermined number of nucleotides centered at a neighboring nucleotide; and based on the at least one content variable and the at least one neighboring content variable, predicting whether the selected nucleotide position is within a coding region of the DNA sequence.
6. The method of claim 5, wherein the at least one neighboring sampling window comprises an up-stream sampling window and wherein the at least one neighboring content variable comprises at least one up-stream content variable.
7. The method of claim 5, wherein the at least one neighboring sampling window comprises a down-stream sampling window and wherein the at least one neighboring content variable comprises at least one down-stream content variable.
8. The method of claim 5, wherein the at least one neighboring sampling window comprises at least one down-stream sampling window and at least one up-stream sampling window; and wherein the at least one neighboring content variable comprises at least one down-stream content variable and at least one up-stream content variable.
9. The method of claim 5, wherein predicting whether the selected nucleotide position is within the coding region of the DNA sequence comprises processing the at least one content variable and the at least one neighboring content variable in a trained recurrent neural network in order to determine transition characteristics between the selected nucleotide position and the neighboring nucleotide position, said transition characteristics indicating whether the selected nucleotide position is in the coding region.
PCT/US1999/013705 1998-06-17 1999-06-17 Recognition of protein coding regions in genomic dna sequences WO1999066302A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU46917/99A AU4691799A (en) 1998-06-17 1999-06-17 Recognition of protein coding regions in genomic dna sequences

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US8968098P 1998-06-17 1998-06-17
US60/089,680 1998-06-17

Publications (3)

Publication Number Publication Date
WO1999066302A2 true WO1999066302A2 (en) 1999-12-23
WO1999066302A3 WO1999066302A3 (en) 2000-06-22
WO1999066302A9 WO1999066302A9 (en) 2000-07-27

Family

ID=22219015

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/013705 WO1999066302A2 (en) 1998-06-17 1999-06-17 Recognition of protein coding regions in genomic dna sequences

Country Status (2)

Country Link
AU (1) AU4691799A (en)
WO (1) WO1999066302A2 (en)


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SNYDER ET AL.: 'Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks' NUCLEIC ACIDS RESEARCH, vol. 21, no. 3, 1993, pages 607 - 613, XP002925273 *
SNYDER ET AL.: 'Identification of Protein Coding Regions in Genomic DNA' JOURNAL OF MOLECULAR BIOLOGY, vol. 248, 1995, pages 1 - 18, XP002925271 *
UBERBACHER ET AL.: 'Locating protein-encoding regions in human DNA sequence by a multiple sensor-neural network approach' PROC. NATL. ACAD. SCI. USA, vol. 88, December 1991, pages 11261 - 11265, XP002925272 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001095230A2 (en) * 2000-06-08 2001-12-13 Virco Bvba Method for predicting therapeutic agent resistance using neural networks
WO2001095230A3 (en) * 2000-06-08 2003-08-21 Virco Bvba Method for predicting therapeutic agent resistance using neural networks
US7158889B2 (en) 2002-12-20 2007-01-02 International Business Machines Corporation Gene finding using ordered sets
US8738299B2 (en) 2002-12-20 2014-05-27 International Business Machines Corporation Gene finding using ordered sets of distinct marker strings
US10957421B2 (en) 2014-12-03 2021-03-23 Syracuse University System and method for inter-species DNA mixture interpretation
CN111370055A (en) * 2020-03-05 2020-07-03 中南大学 Intron retention prediction model establishing method and prediction method thereof
CN111370055B (en) * 2020-03-05 2023-05-23 中南大学 Intron retention prediction model establishment method and prediction method thereof
JP7583153B2 (ja) 2024-11-13 Regeneron Pharmaceuticals, Inc. Methods and systems for sequence generation and prediction
CN113808671A (en) * 2021-08-30 2021-12-17 西安理工大学 Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning
CN113808671B (en) * 2021-08-30 2024-02-06 西安理工大学 Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning
CN117745704A (en) * 2023-09-27 2024-03-22 深圳泰康医疗设备有限公司 Vertebral region segmentation system for osteoporosis recognition

Also Published As

Publication number Publication date
AU4691799A (en) 2000-01-05
WO1999066302A9 (en) 2000-07-27
WO1999066302A3 (en) 2000-06-22


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU CA JP US

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AU CA JP US

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

AK Designated states

Kind code of ref document: C2

Designated state(s): AU CA JP US

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

COP Corrected version of pamphlet

Free format text: PAGES 1/7-7/7, DRAWINGS, REPLACED BY NEW PAGES 1/17-17/17; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

WWE Wipo information: entry into national phase

Ref document number: 09719887

Country of ref document: US

122 Ep: pct application non-entry in european phase