WO1999066302A2 - Recognition of protein coding regions in genomic dna sequences - Google Patents
- Publication number
- WO1999066302A2, PCT/US1999/013705
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- coding
- sequence
- neighboring
- stream
- neural network
- Prior art date
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Definitions
- GRAIL (Uberbacher and Mural 1991; Xu et al 1994)
- GeneID (Guigo et al 1992)
- GeneParser (Snyder and Stormo 1993, 1995)
- GenLang (Dong and Searls 1994)
- Genie (Kulp et al 1996)
- VEIL (Henderson et al 1997)
- GENSCAN (Burge and Karlin 1997) used hidden Markov models to model human gene structure. Since the performance of these programs is still not satisfactory (see review in Burset and Guigo 1996), the development of new methods, and/or the improvement of existing methods, remain important objectives.
- a sequence of nucleotides within a DNA sequence may have associated therewith several variables, referred to as "content variables," that are thought to be useful for discriminating between coding regions and non-coding regions.
- Known approaches for combining content variables include classic linear discriminant methods (Solovyev et al 1994) and feedforward neural networks (Snyder and Stormo 1993, 1995; Guigo et al 1992; Xu et al 1994).
- Feedforward neural networks benefit from the fact that they may be trained using gradient descent optimization algorithms such as the backpropagation algorithm.
- neural networks with feedbacks may provide significant advantages over purely feedforward networks. Feedbacks provide recursive computation and the ability to represent state information. In some cases, a neural network with feedbacks may be equivalent to a much larger feedforward neural network. Neural networks with feedbacks are generally referred to as recurrent neural networks.
- In general, the use of recurrent neural networks has not been nearly as extensive as that of feedforward neural networks. A primary reason for the under-utilization of recurrent neural networks is the difficulty involved in developing generally applicable learning algorithms for them. Because the gradient of the error with respect to the connection strengths is not easily computed for recurrent neural networks, gradient-based optimization algorithms are not always applicable. As a result, the benefits of recurrent neural networks over purely feedforward neural networks have not been exploited with regard to extracting information from content variables of nucleotide sequences in order to identify coding/non-coding regions.
- the present invention provides a coding sensor that utilizes a recurrent neural network model.
- the coding sensor indicates the coding potential of a gene sequence and plays a vital role in the overall prediction of the gene structure.
- a DNA sequence may be imagined as comprising a discrete chain of nucleotides.
- the recognition of the potential coding regions in a DNA sequence may be achieved by determining whether each individual nucleotide position in the sequence is in a coding region.
- Determining whether an individual nucleotide position is in a coding region may be accomplished through a systematic sampling process carried out along the nucleotide chain from start to end. At each nucleotide position, content variables are calculated based on a window centered on the nucleotide position. As mentioned, content variables are thought to be useful for discriminating between coding regions and non-coding regions.
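The sampling process described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the names `sample_positions` and `compute_content_variables` are hypothetical, and a trivial stand-in content variable (C+G fraction) is used in place of the nine variables defined later.

```python
def sample_positions(sequence, window_size, compute_content_variables):
    """Yield (position, content_variables) for every nucleotide position
    whose centered window fits entirely inside the sequence."""
    half = window_size // 2
    for pos in range(half, len(sequence) - half):
        window = sequence[pos - half: pos + half + 1]  # window centered on pos
        yield pos, compute_content_variables(window)

# Toy stand-in content variable: fraction of C and G bases in the window.
cg = lambda w: [(w.count("C") + w.count("G")) / len(w)]

seq = "ACGT" * 30
samples = list(sample_positions(seq, 43, cg))  # 43 bp window, as in the text
```

The first sampled position is 21 (0-based), the center of the first full 43 bp window, and the process advances one nucleotide at a time, as described.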
- the present invention combines the calculated content variables in a specific way in order to provide an overall "coding sensor value.”
- the coding sensor value indicates whether or not the nucleotide position is in a coding region. Coding sensor values for each nucleotide position in the DNA sequence are aligned with the overall DNA sequence to generate a coding/non-coding picture of the DNA sequence.
- coding nucleotides i.e. nucleotides in a coding region
- Identifying "transition characteristics" between neighboring segments of a DNA sequence may provide additional information that is useful for detecting coding regions. In other words, detecting whether a particular nucleotide position is in a coding or non- coding region may depend not only on information determined from its own content variables but also information determined from the content variables of nearby nucleotides.
- the invention provides a novel method for using a recurrent neural network to determine up-stream and down-stream transition characteristics between nucleotide chains in a DNA sequence. Transition characteristics may be used to assist the coding sensor of the present invention in finding potential protein coding regions in unannotated genomic DNA sequences.
- FIG. 2 shows an illustrative recurrent neural network architecture in accordance with an exemplary embodiment of the present invention.
- FIG. 3, comprising FIG. 3A, FIG. 3B, FIG. 3C, FIG. 3D, FIG. 3E, FIG. 3F, FIG. 3G, FIG. 3H and FIG. 3I, shows one-dimensional distributions of nine content variables in accordance with an exemplary embodiment of the present invention.
- FIG. 4 comprising FIG. 4A and FIG. 4B, shows coding differentials of an exemplary data test set.
- FIG. 5 comprising FIG. 5A and FIG. 5B, shows coding differentials of an exemplary Burset/Guigo data set.
- FIG. 6 illustrates exemplary results obtained by operation of an exemplary embodiment of the present invention.
- FIG. 1, comprising FIG. 1A and FIG. 1B, and the following discussion are intended to provide a brief and general description of a suitable computing environment for implementing the present invention.
- neural networks are implemented in a computer environment.
- the computer 100 includes a processor 122, a system memory 120, and an Input/Output ("I/O") bus 126.
- a system bus 121 couples the central processing unit 122 to the system memory 120.
- a bus controller 123 controls the flow of data on the I/O bus 126 and between the central processing unit 122 and a variety of internal and external I/O devices.
- the I/O devices connected to the I/O bus 126 may have direct access to the system memory 120 using a Direct Memory Access (“DMA”) controller 124.
- DMA Direct Memory Access
- the I/O devices are connected to the I/O bus 126 via a set of device interfaces.
- the device interfaces may include both hardware components and software components.
- a hard disk drive 130 and a floppy disk drive 132 for reading or writing removable media 150 may be connected to the I/O bus 126 through disk drive controllers 140.
- An optical disk drive 134 for reading or writing optical media 152 may be connected to the I/O bus 126 using a Small Computer System Interface ("SCSI") 141.
- SCSI Small Computer System Interface
- an IDE (ATAPI) or EIDE interface may be associated with an optical drive, as may be the case with a CD-ROM drive.
- the drives and their associated computer-readable media provide nonvolatile storage for the computer 100.
- other types of computer-readable media may also be used, such as ZIP drives, or the like.
- a display device 153 such as a monitor, is connected to the I/O bus 126 via another interface, such as a video adapter 142.
- a parallel interface 143 connects synchronous peripheral devices, such as a laser printer 156, to the I/O bus 126.
- a serial interface 144 connects communication devices to the I/O bus 126.
- a user may enter commands and information into the computer 100 via the serial interface 144 or by using an input device, such as a keyboard 138, a mouse 136 or a modem 157.
- Other peripheral devices may also be connected to the computer 100, such as audio input output devices or image capture devices.
- a number of program modules may be stored on the drives and in the system memory 120.
- the system memory 120 can include both Random Access Memory (“RAM”) and Read Only Memory (“ROM”).
- the program modules control how the computer 100 functions and interacts with the user, with I/O devices or with other computers.
- Program modules include routines, operating systems 165, application programs, data structures, and other software or firmware components.
- the present invention may comprise one or more coding sensor program modules 170 stored on the drives or in the system memory 120 of the computer 100. Coding sensor modules 170 may comprise one or more content variable calculation program modules 170A, one or more recurrent neural network program modules 170B, and one or more post-processing and prediction program modules 170C.
- Coding sensor program module(s) 170 may thus comprise computer-executable instructions for calculating content variables, analyzing content variables with a recurrent neural network model, and post-processing the output of the neural network model in order to predict whether a nucleotide position is in a coding region, according to exemplary methods to be described herein.
- the computer 100 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer 160.
- the remote computer 160 may be a server, a router, a peer device or other common network node, and typically includes many or all of the elements described in connection with the computer 100.
- program modules and data may be stored on the remote computer 160.
- the logical connections depicted in FIG. 1 include a local area network ("LAN") 154 and a wide area network (“WAN”) 155.
- a network interface 145 such as an Ethernet adapter card, can be used to connect the computer 100 to the remote computer 160.
- the computer 100 may use a telecommunications device, such as a modem 157, to establish a connection.
- a telecommunications device such as a modem 157
- the network connections shown are illustrative and other means of establishing a communications link between the computers may be used.
- FIG. 1B provides a graphical demonstration of the processing performed by the exemplary coding sensor program module 170.
- a DNA sequence 180 is sampled using a sliding window technique, whereby a window 185 is advanced one nucleotide at a time.
- content variables are calculated by the content variable computation program module 170A.
- Content variables for a current window, as well as the content variables for up-stream and down-stream windows, are input to the recurrent neural network program module 170B.
- the output from the recurrent neural network program module 170B is input to the post-processing and prediction program module 170C in order to account for noise, etc.
- the output from the recurrent neural network program module 170B represents a coding potential or a coding score, referred to herein as a coding sensor value. Coding sensor values for each nucleotide position are subsequently concatenated to determine a coding/non-coding picture of the DNA sequence.
- a neural network consists of a number of inter-connected computational neurons that operate in parallel to produce an output result. While each neuron within a neural network operates independently, the inputs and/or output of the neurons are connected to one another and are assigned a weight. The manner in which weights are assigned to each neuron determines the behavior of the neural network.
- a neural network may be trained by altering the values of the weights in a well-defined manner, described by a learning rule. As an example, a neural network may be trained to map a set of input patterns onto a set of output patterns.
- One method of training a neural network is referred to as "supervised learning.”
- Supervised learning employs an external teacher and requires a knowledge of the desired responses to input signals. The goal of supervised learning is to minimize the error between the desired output neuron values and computed output neuron values. The value of an output signal of a neuron depends upon the activation of the neuron, which is expressed as an output transfer function.
- the architecture of a neural network is formed by organizing neurons into layers. There may be connections between neurons in the same layer and connections between neurons in different layers. Interlayer connections allow the propagation of signals in one direction or in both directions.
- Input neurons receive signals from external sources and send output signals to other neurons.
- Output neurons receive signals from other neurons and send signals to the environment.
- Hidden neurons have no contact with the environment.
- a recurrent neural network is a special type of neural network that provides for internal memory. Apart from the regular input neurons, output neurons and hidden neurons that exist in common feedforward multilayer neural networks, recurrent neural networks include a special type of neuron called a context neuron. Context neurons help the neural network to memorize its previous states and thus may model the associations that exist among these states.
- An illustrative embodiment of a recurrent neural network architecture that may be used in accordance with an exemplary embodiment of the present invention is shown in FIG. 2.
- the illustrative recurrent neural network comprises a one-hidden-layer, partially-connected recurrent network.
- the feedforward connections are modifiable while the recurrent connections are fixed.
- Input neurons 202 accept input signals from the environment and transmit output signals to hidden neurons 204.
- Hidden neurons 204 in turn transmit output signals to output neurons 208 and also to context neurons 206. Signals transmitted from hidden neurons 204 to context neurons 206 are referred to as feedback. Tanh and linear activation functions are employed for hidden neurons 204 and context neurons 206, respectively. The use of a tanh activation function in hidden neurons 204 introduces a nonlinear component to the system. A logistic function is used in the output neurons 208. In an exemplary embodiment, sixty hidden neurons 204 are used in the recurrent neural network. Generalization errors were estimated using the split-sample validation method.
Content Variables
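The architecture just described (tanh hidden layer, linear context layer that copies the previous hidden state over fixed recurrent connections, logistic output) can be sketched as a minimal Elman-style network. This is an illustrative reconstruction, not the patent's code; the class name, weight initialization and input values are hypothetical.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

class ElmanCodingSensor:
    """Sketch of an Elman-style recurrent network: hidden layer uses tanh,
    the context layer is a linear copy of the previous hidden state over
    fixed recurrent connections, and the output neuron is logistic."""

    def __init__(self, n_in, n_hidden=60, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))       # modifiable feedforward weights
        self.W_ctx = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # context -> hidden weights
        self.b_h = np.zeros(n_hidden)
        self.w_out = rng.normal(0.0, 0.1, n_hidden)
        self.b_out = 0.0
        self.context = np.zeros(n_hidden)                        # context neurons

    def step(self, x):
        h = np.tanh(self.W_in @ x + self.W_ctx @ self.context + self.b_h)
        self.context = h.copy()  # fixed one-to-one copy-back (the feedback)
        return logistic(self.w_out @ h + self.b_out)

net = ElmanCodingSensor(n_in=9)  # nine content variables per window
scores = [net.step(np.ones(9)) for _ in range(5)]  # one coding score per window
```

Because the context neurons carry the previous hidden state forward, the score at each window depends on the windows that preceded it, which is what allows the network to capture up-stream transition characteristics.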
- Content variables capture the statistical differences between coding and non-coding regions.
- a window of empirically selected size (for example, 43 base pairs)
- nine content variables were calculated at each nucleotide position.
- the 5' and 3' flanking regions of the sequence were treated the same as introns.
- Hexamer 1: Let the preference value of each hexamer be the logarithmic ratio of its normalized probabilities in exons versus introns in human genes.
- Hexamer 1 is defined as the sum of the preference values in the window, adjusted by the number of hexamers (W-6).
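A minimal sketch of the hexamer 1 computation follows. The function name and the toy preference table are hypothetical; the code normalizes by the count of overlapping hexamers actually present in the window, which the text denotes by (W-6).

```python
import math

def hexamer1(window, pref):
    """Sum of log-ratio preference values over all overlapping hexamers in
    the window, normalized by the number of hexamers. `pref` maps a hexamer
    to log(P_exon / P_intron); unseen hexamers contribute zero here."""
    hexes = [window[i:i + 6] for i in range(len(window) - 5)]
    total = sum(pref.get(h, 0.0) for h in hexes)
    return total / len(hexes)

# Toy preference table with a single entry (hypothetical value):
pref = {"ACGTAC": math.log(2.0)}
score = hexamer1("ACGTACGT", pref)  # 3 hexamers, one with nonzero preference
```

A positive score indicates that the window's hexamer composition is more exon-like than intron-like under the preference table.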
- In-frame hexamer 1: This variable is similar to hexamer 1 except that the observed hexamers in the sequence are compared with the preference values of in-frame hexamers in human exons. The total preference is computed three times for the window, once for each reading frame. The predicted reading frame is taken to be the one that provides the highest in-frame hexamer coding-versus-non-coding preference, and the variable is defined as the total preference for this frame adjusted by the number of hexamers, (W-6)/3.
- Hexamer 2 and in-frame hexamer 2: These two variables are similar to the previous two except that the probabilities F are now the frequencies of the hexamers in a random population based on the base composition of the sequence.
- Base composition: The C+G percentage is taken as the base composition variable.
- Fickett variable: Fickett (1982) developed an algorithm for predicting coding regions by considering several properties of coding sequences. In a given window, the 3-periodicity of each of the four bases is independently examined and compared to the periodic properties of coding DNA. The overall base composition of the sequence under investigation is also compared with the known composition for coding and non-coding DNA.
- Uneven position bias: First proposed by Staden, this variable measures the asymmetry of the base composition across the three codon positions.
- Let f(b,i) denote the frequency of base b at codon position i in the window. Define μ(b) = (Σ_i f(b,i))/3 and diff(b) = Σ_i |f(b,i) − μ(b)|.
- the uneven position bias variable is defined as (Σ_b diff(b))/W, where W is the width of the window.
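The definition above translates directly into code. This is an illustrative sketch (the function name and input layout are hypothetical): `counts[b]` holds the occurrences of base `b` at the three codon positions within the window.

```python
def uneven_position_bias(counts, W):
    """counts[b] is a 3-element list: occurrences of base b at codon
    positions 0, 1, 2 inside a window of width W. Implements
    diff(b) = sum_i |f(b,i) - mu(b)|, summed over bases and divided by W."""
    total = 0.0
    for b in "ACGT":
        f = counts[b]
        mu = sum(f) / 3.0                       # mu(b): mean over positions
        total += sum(abs(fi - mu) for fi in f)  # diff(b)
    return total / W

# Toy window of 12 bases with maximally uneven A/C/G and perfectly even T:
counts = {"A": [3, 0, 0], "C": [0, 3, 0], "G": [0, 0, 3], "T": [1, 1, 1]}
bias = uneven_position_bias(counts, 12)
```

An evenly distributed base contributes nothing, while a base concentrated at one codon position contributes strongly, which is exactly the asymmetry the variable measures.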
- Codon prototype: First proposed by Fickett and Tung (1992). Let f(b,i) be the probability of finding base b at position i in an actual codon and q(b,i) be the probability of finding nucleotide b at position i in a trinucleotide that is not a codon.
- the codon prototype variable is the sum over the window of the dot product of B and the codons of the window.
- Frame bias: Mural et al (1991) used the frame bias variable in their CRM module to recognize exons in DNA sequences. This variable is very similar to the codon prototype variable. Let f(b,i) be defined as in the uneven position bias variable. If a window codes for protein, one frame should have a significantly better correlation with the f(b,i) matrix than the other two possible reading frames. The correlation coefficient between f(b,i) and each reading frame is calculated, and the difference between the best and worst coefficient is taken as the frame bias variable.
- sequences encoding more than one gene; sequences having introns whose lengths were less than 5 bp; sequences having introns not starting with GT or not ending with AG; sequences with a CDS not starting with an ATG or not ending with a stop codon; sequences with CDS lengths not divisible by three.
- sequences corresponding to immunoglobulins and histocompatibility antigens were also discarded due to their ability to undergo complex DNA rearrangement.
- the final dataset consisted of 548 sequences. Each sequence encoded one and only one complete, spliceable, functional protein product in the forward strand. This set (dataset_A) contained 2,926,880 nucleotides, of which 597,720 were exon bases and 1,308,300 were intron bases.
- dataset_B was constructed from dataset_A for derivation and testing of the recurrent neural network by dropping the single-exon sequences (263 sequences). Since considerable evidence suggests that the human genome is heterogeneous with respect to C+G content (Burge 1997), the sequences in dataset_B were further divided into four groups according to the C+G composition of the sequences: I (<43% C+G); II (43-51%); III (51-57%); and IV (>57%). There were 45, 73, 67, and 79 sequences in groups I, II, III and IV, respectively.
- Each sequence (sequences longer than 15 kb were avoided) in dataset_B was assigned to one of three sets: the training, validation or test set.
- the resultant training set consisted of 15, 38, 36 and 43 sequences for groups I, II, III and IV respectively while the validation set contained 4, 8, 8, and 9 sequences respectively.
- the test set shown below in Table 1, contained 10, 25, 23 and 27 sequences in each group. Table 1. Test sets for the four groups
- each sequence in dataset_B was sampled using a sliding-window technique with a window size of 43 bp and sliding the window one nucleotide at a time.
- One-dimensional distributions of these variables were studied.
- the results for group IV are shown in FIG. 3.
- two features stand out. First, as one would hope, the distributions of nearly all variables are approximately normal. Secondly, there is significant overlap between the coding and non-coding classes for all variables, meaning that there is little information available to distinguish the two classes in one dimension. In particular, for variables such as codon prototype and C+G% content, the distribution of the coding class lies completely inside that of the non-coding class.
- the results for the other three groups demonstrate similar features.
- Bhattacharyya distances (B), showing the significance of each variable, were calculated under the equal-variance assumption for these variables for each group. This statistical distance is defined as:
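The equation itself did not survive extraction. Under the stated equal-variance assumption, the Bhattacharyya distance between two classes with means μ₁, μ₂ and common covariance Σ takes the standard form (a reconstruction consistent with the surrounding text, not the patent's exact notation):

```latex
B = \tfrac{1}{8}\,(\mu_1 - \mu_2)^{\mathsf{T}}\,\Sigma^{-1}\,(\mu_1 - \mu_2)
```

For a single variable with common variance σ² this reduces to B = (μ₁ − μ₂)² / (8σ²); the log-determinant term of the general Bhattacharyya distance vanishes when the two covariances are equal.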
- the discriminative information correlates with the C+G percentage: there is more information in the high C+G% groups than in the low C+G% group. Thus, the Bhattacharyya distance of 0.560 for group IV is higher than the distance of 0.493 for group I. This phenomenon may in part explain the observation that gene prediction programs tend to perform less well on A+T-rich sequences (e.g. Snyder and Stormo, 1995).
- the in-frame hexamer 1 is the most discriminative content variable in the high C+G% groups, consistent with the previous result (Fickett and Tung, 1992), but it is not in the low C+G% group, either individually or in the combined case.
- the in-frame hexamer 1 variable, in the case of group IV, contributes only one third of the total statistical distance.
- although the hexamer, codon prototype, frame bias and uneven position bias variables all depend on the positional base frequency information in the gene, they capture non-redundant statistical aspects of this information.
- Training of the exemplary recurrent neural network described above was performed in the following manner.
- the training set of related values of inputs and targets from a sequence is represented by {x(i), d(i)}, 1 ≤ i ≤ L, where L is the total sample size from the sequence.
- Training is done by adjusting the weights assigned to neurons of the neural network in order to minimize a cost function.
- the cost function used was the sum of squared errors augmented by a simple weight-decay regularization term
- the networks were trained by 200 epochs using the backpropagation method. During training, the networks were evaluated using the mean-squared error (MSE) defined as follows:
- NP is the total number of observations from the training set.
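Neither equation survived extraction. Standard forms consistent with the description (sum of squared errors augmented by a weight-decay term, and mean-squared error over NP observations) would be, as a reconstruction rather than the patent's exact notation:

```latex
E = \sum_{i=1}^{L} \bigl(d(i) - y(i)\bigr)^{2} \;+\; \lambda \sum_{j} w_{j}^{2},
\qquad
\mathrm{MSE} = \frac{1}{NP} \sum_{i=1}^{NP} \bigl(d(i) - y(i)\bigr)^{2}
```

Here y(i) is the network output for input x(i), w_j are the connection weights, and λ is the regularization constant; the patent does not state a value for λ.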
- the validation error was calculated similarly. MSE values of 0.084, 0.088, 0.093 and 0.099 were achieved for groups I, II, III and IV, respectively. The corresponding validation errors were 0.063, 0.050, 0.060 and 0.087, respectively.
- the recurrent neural network coding sensor was evaluated using the coding differential measure (Δ), first proposed by Burge (1997).
- the coding differential for each sequence in the test set was calculated. The result is shown in FIG. 4 along with the results from the inhomogeneous 3-periodic fifth-order Markov model.
- the following formula (Bayes theorem) was used in the calculations,
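The formula itself did not survive extraction. The standard Bayes-theorem posterior for a position being coding, given its content variables x, would be (a reconstruction, not the patent's exact notation):

```latex
P(\text{coding} \mid x) =
\frac{P(x \mid \text{coding})\, P(\text{coding})}
     {P(x \mid \text{coding})\, P(\text{coding}) \;+\; P(x \mid \text{non-coding})\, P(\text{non-coding})}
```

This is consistent with the later statement that the network output can be interpreted as the probability of a nucleotide position being a coding nucleotide.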
- the Δ_RNN mean values for the four C+G% groups were 2.088, 3.913, 5.700 and 6.166, while the corresponding Δ_MARKOV mean values were 0.047, 0.076, 0.097 and 0.105.
- the Δ_RNN value correlates significantly with the sequence C+G% content (statistical significance level P < 0.01). On average, high C+G% sequences have high Δ_RNN values.
- the output of the neural network for a certain nucleotide position can be interpreted as the probability of that nucleotide position being a coding nucleotide.
- the post-processing and prediction method of the present invention concatenates the outputs of one or more neural networks to provide an overall coding/non-coding arrangement of the DNA sequence.
- An exemplary post-processing and prediction method is described by the following steps:
- the output value sequence is smoothed by a 5-point median filter applied twice.
- the output sequence is scanned from left to right using a global threshold technique.
- the threshold value is decided empirically. During scanning, starting from the first position:
- the output for gene HUMPNMTA (accession J03280) from group IV is shown in FIG. 6, in which the curve represents the output of the neural network while the straight line represents the annotated gene arrangement.
- the dots represent the prediction locations.
- the probability 0.8 was used as the global threshold value, which roughly means that the probability of correctness of the predicted exons is 0.8.
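The smoothing-and-threshold pipeline described above can be sketched as follows. The detailed scanning steps are elided in the source, so this is a minimal reconstruction with hypothetical names: two passes of a 5-point median filter, then a left-to-right scan that emits runs of above-threshold positions as predicted coding regions.

```python
def median5(values):
    """One pass of a 5-point median filter (edge positions kept unchanged)."""
    out = list(values)
    for i in range(2, len(values) - 2):
        out[i] = sorted(values[i - 2:i + 3])[2]
    return out

def predict_regions(scores, threshold=0.8):
    """Smooth per-nucleotide coding scores with the 5-point median filter
    twice, then scan left to right with a global threshold, returning
    (start, end) runs of above-threshold positions (0-based, end-exclusive)."""
    smoothed = median5(median5(scores))
    regions, start = [], None
    for i, s in enumerate(smoothed):
        if s >= threshold and start is None:
            start = i                       # entering a predicted coding region
        elif s < threshold and start is not None:
            regions.append((start, i))      # leaving the region
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))
    return regions

scores = [0.1] * 5 + [0.9] * 10 + [0.1] * 5
regions = predict_regions(scores, threshold=0.8)  # one predicted region
```

The median filtering suppresses isolated noisy scores so that short spikes do not become spurious one-nucleotide "exons" during the threshold scan.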
- the identification of coding regions relies on the global threshold technique, so the predicted coding region type (initial/internal/terminal) cannot be known in advance. As a compromise, all predicted regions were treated as internal exons in this study. The shortcoming is that the initiation site and stop site locations cannot be located precisely.
- the second example gives some insight into how the exemplary method will behave in a real situation.
- the sequence HSNCAMX1 (accession Z29373)
- the neural network output is shown in Figure 6 and the text values are shown in Table 4.
- Columns 1 through 7 were output by the exemplary method.
- Columns 9 through 12 are from the GenBank annotation.
- Column 2 is the beginning position of the predicted exon.
- Column 3 is the ending position of the predicted exon.
- Column 4 is the length of the predicted exon.
- Column 5 is the coding sensor score of the coding portion of the exon.
- Column 6 is the score of the acceptor signal at the 5' end of the predicted exon.
- Column 7 is the score of the donor signal at the 3' end of the exon.
- Column 8 and Column 13 are provided here for illustrative purposes. The coding sensor score and the acceptor signal score of the first predicted exon were not calculated, nor were the coding sensor score and the donor signal score of the last predicted exon.
- the probability 0.8 was used as the global threshold value, as in the previous example. Sensitivity and specificity are both 0.84. The correlation coefficient is 0.79.
- the annotated HSNCAMX1 gene contains 28 coding exons, of which 14 were predicted exactly, eight were predicted partially, two were predicted by overlapping exons and four were missed completely. In addition, one wrong exon was predicted.
- the predicted exon which is wrong has an unusually weak acceptor signal score (weaker than any score for a true splice site in this gene) and a relatively weak coding sensor score.
- the splice signal and exon coding sensor scores may provide useful information about the reliability of the prediction.
- the most distinctive property of the four annotated exons which were missed is their small size (15, 132, 71 and 12, respectively).
- there were small peaks at levels of 0.20, 0.60 and 0.40 in the regions spanned by annotated exons 02, 09 and 17. Therefore, it may be possible to pick up these exons if a better assembly algorithm were used instead of the simple algorithm.
- Table 5 shows the nucleotide level accuracy for different C+G% compositional test groups along with the results from two of the most-widely used gene prediction programs for the test sets.
- Probabilities of 0.4, 0.6, 0.8 and 0.8 were used as the global threshold value for groups I, II, III and IV respectively.
- GeneID was assessed using the email service geneid@darwin.bu.edu, and the "-noexonblast" option was used to suppress the protein database search.
- the first ranked potential gene was used.
- SORFIND Version 2.8 (dated: July 9, 1996) was downloaded from website www.rabbithutch.com and the default parameter values were used in evaluation.
- the recurrent neural network is able to capture the information efficiently, as evidenced by its good performance in the high C+G% groups. In fact, the results are competitive with other more sophisticated systems at the nucleotide level, which probably implies that the recurrent neural network extracts coding information more efficiently than the coding-region subsystems in these leading systems.
- the performance decreases gradually as expected, due to the global threshold operation. This decrease is evident at the nucleotide level as well as at the exon level. At the nucleotide level, the correlation coefficient decreases from 0.66 to 0.40.
- in column 1, the number of sequences in each test set is given in the first parentheses, followed by the number of sequences for which no gene was predicted in the second parentheses.
- the Generalized hidden Markov model contains ten states and is similar in structure to the ones used in Genie and GENSCAN. All the parameters (state length distributions, Markov transition probabilities, and state initial probabilities) were estimated from the dataset_A (Lou 1997).
- the state sequence generating models for splice sites and initiation sites are WAM models.
- the sequence generating model for the coding/non-coding regions is the recurrent neural network model (converting the posterior probability to the sequence generating model using the Bayes theorem).
- the performance of the model (program GeneACT) was tested on the set of 570 vertebrate genes constructed by Burset and Guigo (1996). The results are shown in Table 8 and the comparisons with other systems are shown in Table 9, below.
- GeneACT is comparable with all leading systems. Although the sensitivity and specificity at the exon level are low, the missing exon percentage and wrong exon percentage are comparable with other systems. It should be noted that, because of the overlap between the training sets of all these systems and the Burset and Guigo dataset, truly objective comparisons of these systems are not obtainable and are probably not even meaningful.
- To increase the exon-level sensitivity and specificity, one obvious approach is to build more sophisticated splice site models (Burge and Karlin 1997). Another approach is to incorporate promoter, polyA, and other signals (such as signal peptide and CpG signals) into the generalized HMM model. It is anticipated that these two approaches will substantially improve the overall performance of the system. Once promoter and polyA signals are incorporated into the HMM model, further improvement may come from an RNN model that treats the 5' UTR, introns, 3' UTR, and intergenic regions differently.
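The kind of recurrent network discussed above can be sketched minimally as an Elman-style RNN that reads one-hot-encoded nucleotides and emits a coding probability for a window. This is an illustrative sketch only: the weights below are random (untrained), and the hidden size and architecture are assumptions, not the patent's actual network or training procedure.

```python
import math
import random

BASES = "ACGT"

def one_hot(base):
    """4-dimensional one-hot encoding of a nucleotide."""
    v = [0.0] * 4
    v[BASES.index(base)] = 1.0
    return v

class SimpleRNN:
    """Elman-style RNN: h_t = tanh(W_in x_t + W_h h_(t-1)), sigmoid readout."""

    def __init__(self, hidden=8, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.W_in = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(4)]
        self.W_h = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(hidden)]
        self.w_out = [rng.gauss(0, 0.1) for _ in range(hidden)]

    def coding_probability(self, seq):
        h = [0.0] * self.hidden
        for base in seq:
            x = one_hot(base)
            # new hidden state from current input and previous hidden state
            h = [
                math.tanh(
                    sum(x[i] * self.W_in[i][j] for i in range(4))
                    + sum(h[k] * self.W_h[k][j] for k in range(self.hidden))
                )
                for j in range(self.hidden)
            ]
        # sigmoid readout of the final hidden state
        z = sum(h[j] * self.w_out[j] for j in range(self.hidden))
        return 1.0 / (1.0 + math.exp(-z))

rnn = SimpleRNN()
p = rnn.coding_probability("ATGGCCATTGTAATG")
```

In a trained version, `p` would be pushed toward 1 on windows from coding regions and toward 0 elsewhere; here it is only a well-formed probability.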
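The nucleotide-level correlation coefficient cited above (the standard measure used by Burset and Guigo (1996)) can be computed from per-nucleotide confusion counts. The label vectors in the example are invented for illustration:

```python
def nucleotide_cc(true_labels, pred_labels):
    """Correlation coefficient over per-nucleotide coding (1) / non-coding (0) calls."""
    tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true_labels, pred_labels) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true_labels, pred_labels) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_labels, pred_labels) if t == 1 and p == 0)
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Perfect agreement gives CC = 1.0; one missed coding nucleotide lowers it.
perfect = nucleotide_cc([1, 1, 0, 0], [1, 1, 0, 0])
one_miss = nucleotide_cc([1, 1, 0, 0], [1, 0, 0, 0])
```

A drop such as the 0.66-to-0.40 decrease reported above reflects growing disagreement in these counts as the threshold is held fixed across sequence groups.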
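The role of the generalized HMM can be illustrated with a toy Viterbi decoder over explicit-duration states. This is a sketch only: the two states, flat duration score, and single-character emission probabilities below are invented for illustration and are far simpler than the ten-state model with estimated length distributions described above.

```python
import math

def ghmm_viterbi(x, states, log_init, log_trans, log_dur, log_emit, max_dur):
    """Best segmentation of x into (state, segment_length) pairs."""
    n = len(x)
    NEG = float("-inf")
    V = [{s: NEG for s in states} for _ in range(n + 1)]      # best log-prob of x[:j] ending in s
    back = [{s: None for s in states} for _ in range(n + 1)]  # (segment start, previous state)
    for j in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_dur, j) + 1):
                i = j - d
                seg = log_dur(s, d) + log_emit(s, x[i:j])
                if i == 0:
                    cand, prev = log_init[s] + seg, (0, None)
                else:
                    cand, prev = NEG, None
                    for sp in states:
                        if sp == s:
                            continue  # self-transitions are absorbed into durations
                        c = V[i][sp] + log_trans[sp][s] + seg
                        if c > cand:
                            cand, prev = c, (i, sp)
                if cand > V[j][s]:
                    V[j][s], back[j][s] = cand, prev
    # backtrack from the best final state
    s = max(states, key=lambda t: V[n][t])
    segments, j = [], n
    while j > 0:
        i, sp = back[j][s]
        segments.append((s, j - i))
        j = i
        if sp is not None:
            s = sp
    return list(reversed(segments))

# Toy two-state model: state "C" prefers character 'C', state "N" prefers 'a'.
states = ["N", "C"]
log_init = {"N": math.log(0.5), "C": math.log(0.5)}
log_trans = {"N": {"C": 0.0}, "C": {"N": 0.0}}

def log_dur(s, d):
    return 0.0  # flat duration score; real models use empirical length distributions

def log_emit(s, segment):
    good = "C" if s == "C" else "a"
    return sum(math.log(0.7 if ch == good else 0.1) for ch in segment)

parse = ghmm_viterbi("aaCCaa", states, log_init, log_trans, log_dur, log_emit, max_dur=6)
```

The decoder recovers the obvious segmentation N(2), C(2), N(2); in the real model, `log_emit` for the coding state would come from the recurrent network's converted likelihood.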
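A weight array model of the sort mentioned above scores a fixed-length signal window with position-specific first-order Markov probabilities, estimated by counting dinucleotides at each position across aligned true-site windows. The training windows below are invented for illustration, not taken from the patent's data:

```python
from collections import defaultdict

def train_wam(site_windows):
    """Estimate P(base at position i | base at position i-1) for each i >= 1."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in site_windows:
        for i in range(1, len(w)):
            counts[(i, w[i - 1])][w[i]] += 1
    probs = {}
    for (i, prev), nxt in counts.items():
        total = sum(nxt.values())
        for base, c in nxt.items():
            probs[(i, prev, base)] = c / total
    return probs

def wam_score(window, probs, floor=1e-6):
    """Product of the position-specific conditional probabilities over the window."""
    score = 1.0
    for i in range(1, len(window)):
        score *= probs.get((i, window[i - 1], window[i]), floor)
    return score

# Toy donor-site-like training windows (illustrative only)
sites = ["AGGTAAGT", "AGGTGAGT", "AGGTAAGT"]
probs = train_wam(sites)
```

With these three windows, the consensus `"AGGTAAGT"` scores 2/3 while an unrelated window such as `"CCCCCCCC"` falls to the floor, which is the discrimination a splice-site WAM provides.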
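The Bayes-theorem conversion mentioned above can be made concrete. A discriminative network outputs a posterior p = P(coding | x), while a generative HMM state needs a sequence likelihood; by Bayes' theorem the usable ratio is P(x | coding) / P(x | non-coding) = [p / (1 - p)] * [P(non-coding) / P(coding)], the unknown factor P(x) cancelling. The prior value below is an illustrative assumption, not a figure from the patent:

```python
import math

def log_likelihood_ratio(posterior, prior_coding=0.1):
    """log [P(x | coding) / P(x | non-coding)] recovered from a posterior P(coding | x)."""
    prior_odds = prior_coding / (1.0 - prior_coding)
    post_odds = posterior / (1.0 - posterior)
    return math.log(post_odds / prior_odds)

# A posterior equal to the prior carries no evidence either way
neutral = log_likelihood_ratio(0.1, prior_coding=0.1)
# A posterior above the prior favours the coding state
positive = log_likelihood_ratio(0.5, prior_coding=0.1)
```

This log-ratio is exactly the quantity a Viterbi-style decoder can add to its segment scores in place of a directly estimated emission model.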
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biotechnology (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Organic Chemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Analytical Chemistry (AREA)
- Data Mining & Analysis (AREA)
- Zoology (AREA)
- Genetics & Genomics (AREA)
- Wood Science & Technology (AREA)
- General Engineering & Computer Science (AREA)
- Biochemistry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Microbiology (AREA)
- Molecular Biology (AREA)
- Bioethics (AREA)
- Plant Pathology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Immunology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Description
Claims
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU46917/99A AU4691799A (en) | 1998-06-17 | 1999-06-17 | Recognition of protein coding regions in genomic dna sequences |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US8968098P | 1998-06-17 | 1998-06-17 | |
US60/089,680 | 1998-06-17 |
Publications (3)
Publication Number | Publication Date |
---|---|
WO1999066302A2 true WO1999066302A2 (en) | 1999-12-23 |
WO1999066302A3 WO1999066302A3 (en) | 2000-06-22 |
WO1999066302A9 WO1999066302A9 (en) | 2000-07-27 |
Family
ID=22219015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1999/013705 WO1999066302A2 (en) | 1998-06-17 | 1999-06-17 | Recognition of protein coding regions in genomic dna sequences |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU4691799A (en) |
WO (1) | WO1999066302A2 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001095230A2 (en) * | 2000-06-08 | 2001-12-13 | Virco Bvba | Method for predicting therapeutic agent resistance using neural networks |
US7158889B2 (en) | 2002-12-20 | 2007-01-02 | International Business Machines Corporation | Gene finding using ordered sets |
CN111370055A (en) * | 2020-03-05 | 2020-07-03 | 中南大学 | Intron retention prediction model establishing method and prediction method thereof |
US10957421B2 (en) | 2014-12-03 | 2021-03-23 | Syracuse University | System and method for inter-species DNA mixture interpretation |
CN113808671A (en) * | 2021-08-30 | 2021-12-17 | 西安理工大学 | Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning |
CN117745704A (en) * | 2023-09-27 | 2024-03-22 | 深圳泰康医疗设备有限公司 | Vertebral region segmentation system for osteoporosis recognition |
JP7583153B2 (ja) | 2020-08-21 | 2024-11-13 | Regeneron Pharmaceuticals, Inc. | Methods and systems for sequence generation and prediction |
1999
- 1999-06-17 AU AU46917/99A patent/AU4691799A/en not_active Abandoned
- 1999-06-17 WO PCT/US1999/013705 patent/WO1999066302A2/en active Application Filing
Non-Patent Citations (3)
Title |
---|
SNYDER ET AL.: 'Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks' NUCLEIC ACIDS RESEARCH, vol. 21, no. 3, 1993, pages 607 - 613, XP002925273 * |
SNYDER ET AL.: 'Identification of Protein Coding Regions in Genomic DNA' JOURNAL OF MOLECULAR BIOLOGY, vol. 248, 1995, pages 1 - 18, XP002925271 * |
UBERBACHER ET AL.: 'Locating protein-encoding regions in human DNA sequence by a multiple sensor-neural network approach' PROC. NATL. ACAD. SCI. USA, vol. 88, December 1991, pages 11261 - 11265, XP002925272 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001095230A2 (en) * | 2000-06-08 | 2001-12-13 | Virco Bvba | Method for predicting therapeutic agent resistance using neural networks |
WO2001095230A3 (en) * | 2000-06-08 | 2003-08-21 | Virco Bvba | Method for predicting therapeutic agent resistance using neural networks |
US7158889B2 (en) | 2002-12-20 | 2007-01-02 | International Business Machines Corporation | Gene finding using ordered sets |
US8738299B2 (en) | 2002-12-20 | 2014-05-27 | International Business Machines Corporation | Gene finding using ordered sets of distinct marker strings |
US10957421B2 (en) | 2014-12-03 | 2021-03-23 | Syracuse University | System and method for inter-species DNA mixture interpretation |
CN111370055A (en) * | 2020-03-05 | 2020-07-03 | 中南大学 | Intron retention prediction model establishing method and prediction method thereof |
CN111370055B (en) * | 2020-03-05 | 2023-05-23 | 中南大学 | Intron retention prediction model establishment method and prediction method thereof |
JP7583153B2 (ja) | 2020-08-21 | 2024-11-13 | Regeneron Pharmaceuticals, Inc. | Methods and systems for sequence generation and prediction |
CN113808671A (en) * | 2021-08-30 | 2021-12-17 | 西安理工大学 | Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning |
CN113808671B (en) * | 2021-08-30 | 2024-02-06 | 西安理工大学 | Method for distinguishing coding ribonucleic acid from non-coding ribonucleic acid based on deep learning |
CN117745704A (en) * | 2023-09-27 | 2024-03-22 | 深圳泰康医疗设备有限公司 | Vertebral region segmentation system for osteoporosis recognition |
Also Published As
Publication number | Publication date |
---|---|
AU4691799A (en) | 2000-01-05 |
WO1999066302A9 (en) | 2000-07-27 |
WO1999066302A3 (en) | 2000-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102433458B1 (en) | Semi-supervised learning for training an ensemble of deep convolutional neural networks | |
Uberbacher et al. | [16] Discovering and understanding genes in human DNA sequence using GRAIL | |
Sonnhammer et al. | A hidden Markov model for predicting transmembrane helices in protein sequences. | |
US20030077586A1 (en) | Method and apparatus for combining gene predictions using bayesian networks | |
Hatzigeorgiou | Translation initiation start prediction in human cDNAs with high accuracy | |
NZ759659A (en) | Deep learning-based variant classifier | |
EP2320343A2 (en) | System and process for validating, aligning and reordering one or more genetic sequence maps using at least one ordered restriction map | |
Choo et al. | Recent applications of hidden Markov models in computational biology | |
EP4254419A1 (en) | Artificial-intelligence-based cancer diagnosis and cancer type prediction method | |
Azad et al. | Probabilistic methods of identifying genes in prokaryotic genomes: connections to the HMM theory | |
EP4182928B1 (en) | Method, system and computer program product for determining presentation likelihoods of neoantigens | |
CN111180013A (en) | Device for detecting blood disease fusion gene | |
WO1999066302A2 (en) | Recognition of protein coding regions in genomic dna sequences | |
Reese | Computational prediction of gene structure and regulation in the genome of Drosophila melanogaster | |
Yi et al. | Learning from data-rich problems: a case study on genetic variant calling | |
Kashiwabara et al. | Splice site prediction using stochastic regular grammars | |
US20240185953A1 (en) | Systems and methods for high-throughput predictions | |
Van Haeverbeke | DETECTION OF M6A MODIFICATIONS IN NATIVE RNA USING OXFORD NANOPORE TECHNOLOGY | |
US20240112751A1 (en) | Copy number variation (cnv) breakpoint detection | |
Sidi et al. | Predicting gene sequences with AI to study codon usage patterns | |
Gunady | Applications of Graph Segmentation Algorithms for Quantitative Genomic Analyses | |
Elst | RECOGNIZING IRREGULARITIES IN STACKED NANOPORE SIGNALS FROM IN SILICO PERMUTED SEQUENCING DATA | |
Tenney | Basecalling for Traces Derived for Multiple Templates | |
Uberbacher et al. | DNA sequence pattern recognition methods in GRAIL | |
Rose et al. | Mutual Information Measure for Distinguishing Coding and Non-Coding DNA Sequences. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AU CA JP US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
AK | Designated states |
Kind code of ref document: A3 Designated state(s): AU CA JP US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
AK | Designated states |
Kind code of ref document: C2 Designated state(s): AU CA JP US |
|
AL | Designated countries for regional patents |
Kind code of ref document: C2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
COP | Corrected version of pamphlet |
Free format text: PAGES 1/7-7/7, DRAWINGS, REPLACED BY NEW PAGES 1/17-17/17; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE
|
WWE | Wipo information: entry into national phase |
Ref document number: 09719887 Country of ref document: US |
|
122 | Ep: pct application non-entry in european phase |