WO2010045475A1 - Techniques for predicting hiv viral tropism and classifying amino acid sequences - Google Patents
- Publication number
- WO2010045475A1 (PCT/US2009/060871; US2009060871W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- class
- data points
- category
- data
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- Embodiments of the present invention generally relate to techniques for sequence-based testing. More particularly, the present invention relates to improving computational techniques for predicting HIV viral tropism.
- HIV is a lentivirus (a member of the retrovirus family), infection with which can lead to acquired immunodeficiency syndrome (AIDS), a condition in humans in which the immune system begins to fail under influence of the virus. HIV primarily infects vital cells in the human immune system such as helper T cells (specifically CD4+ T cells), macrophages and dendritic cells, which can lead to decreased immune response. When CD4+ T cell numbers decline below a critical level, cell-mediated immunity is lost, and the body becomes progressively more susceptible to opportunistic infections.
- A simple charge rule was first introduced to predict HIV tropism by de Jong et al., J. Virol. 66(2):757-765 (1992) and Fouchier et al., J. Clin. Microbiol. 33(4):906-911 (1995).
- the rule classifies the virus as "using CXCR4" if there is a positively charged amino acid at position 11 or 25, and as "not using CXCR4" otherwise.
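The charge rule above can be sketched in a few lines. This is a minimal illustration, not the cited authors' implementation: the function name is hypothetical, positions are taken as 1-based indices into an aligned V3 sequence, and "positively charged" is assumed here to mean lysine (K) or arginine (R); the residue set is a parameter so that a different definition can be substituted.

```python
# Assumption: "positively charged" means lysine (K) or arginine (R).
POSITIVE_RESIDUES = frozenset("KR")


def charge_rule_uses_cxcr4(v3_aligned, positive=POSITIVE_RESIDUES):
    """Return True ("using CXCR4") if position 11 or 25 is positively charged.

    Positions are 1-based, as in the text, so they are shifted by one
    for Python indexing. Sequences too short to have a position are
    treated as not carrying a positive charge there.
    """
    return any(
        len(v3_aligned) >= pos and v3_aligned[pos - 1] in positive
        for pos in (11, 25)
    )


# Synthetic toy sequences (not real V3 loops), 35 residues each.
neutral = "G" * 35
charged_at_11 = "G" * 10 + "R" + "G" * 24
charged_at_25 = "G" * 24 + "K" + "G" * 10
print(charge_rule_uses_cxcr4(neutral))        # False
print(charge_rule_uses_cxcr4(charged_at_11))  # True
print(charge_rule_uses_cxcr4(charged_at_25))  # True
```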
- Resch et al., Virology 288(1):51-62 (2001) proposed a neural network model using 16 amino acids in the V3 loop to predict the tropism. Pillai et al., AIDS Res Hum Retroviruses 19(2):145-149 (2003) introduced machine learning methods that included decision trees and Support Vector Machines (SVM).
- Sensitivity and specificity are statistical measures of the performance of a binary classification test. In statistics, sensitivity is defined to be a measure of the proportion of positives which are correctly identified - e.g., the percentage of sick people who are correctly identified as having the condition. Specificity is defined to be a measure of the proportion of negatives which are correctly identified - e.g., the percentage of well people who are correctly identified as not having the condition.
- a gain in specificity usually comes at the expense of the sensitivity, and vice versa.
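As a small worked illustration of the two measures, the sketch below computes them from paired true and predicted labels. The function name and the toy labels are hypothetical; a truthy label stands for a positive (for example, "uses CXCR4").

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Requires at least one positive and one negative in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    sensitivity = tp / (tp + fn)  # proportion of positives correctly identified
    specificity = tn / (tn + fp)  # proportion of negatives correctly identified
    return sensitivity, specificity


# Toy data: 4 positives (3 found), 6 negatives (5 correctly rejected).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, preds)
print(sens, spec)
```

Here sensitivity is 3/4 and specificity is 5/6; making the test stricter to remove the one false positive would typically cost some true positives, which is the trade-off described above.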
- sensitivity ranged from 22% to 44%.
- sensitivity ranged from 55% to 74%.
- sensitivity ranged from 66% to 79%.
- An exemplary technique includes providing a first training set that includes a plurality of sequences of the first class and a second training set that includes a plurality of sequences of the second class. The technique includes determining a plurality of probabilities associated with a plurality of positions that takes into account the dependency between elements in adjacent positions.
- the technique includes determining a plurality of probabilities associated with a plurality of positions wherein the plurality of positions include a position, a preceding position, and a succeeding position.
- the technique includes determining a probability that a position on a sequence of the first class and a position on the test sequence are occupied by elements belonging to a first specific category, given that a preceding position on the sequence of the first class and a preceding position on the test sequence are occupied by elements belonging to a second specific category, and given that a succeeding position on the sequence of the first class and a succeeding position on the test sequence are occupied by elements belonging to a third specific category.
- the technique includes determining a probability that a position on a sequence of the second class and a position on the test sequence are occupied by elements belonging to a fourth specific category, given that a preceding position on the sequence of the second class and a preceding position on the test sequence are occupied by elements belonging to a fifth specific category, and given that a succeeding position on the sequence of the second class and a succeeding position on the test sequence are occupied by elements belonging to a sixth specific category.
- two pluralities of elements are considered to be of the same type if each of every pair of corresponding elements belongs to a specific predetermined category of amino acids.
- the predetermined categories of amino acids can be defined differently.
- the categorization can be used to reduce the complexity of the calculations required to compare sequence similarities.
- the 20 known amino acids are divided into four categories.
- the first category consists of H, K and R (histidine, lysine, and arginine, respectively);
- the second category consists of A, F, I, L, M, P, V and W (alanine, phenylalanine, isoleucine, leucine, methionine, proline, valine, and tryptophan);
- the third category consists of C, G, N, Q, S, T and Y (cysteine, glycine, asparagine, glutamine, serine, threonine, and tyrosine);
- the fourth category consists of D and E (aspartic acid and glutamic acid).
- the 20 known amino acids are divided into twelve categories.
- the first category consists of A and P; the second category consists of F and W; the third category consists of I, L and V; the fourth category consists of M; the fifth category consists of H; the sixth category consists of K and R; the seventh category consists of D; the eighth category consists of E; the ninth category consists of N, S and T; the tenth category consists of Q; the eleventh category consists of C and G; and the twelfth category consists of Y.
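The two categorizations listed above can be written down as lookup tables. The sketch below is illustrative only: the variable and function names are hypothetical, and `same_type` encodes one reading of the "same type" definition, namely that corresponding elements of two equal-length sequences fall in the same category.

```python
# Category number -> one-letter amino acid codes, as listed in the text.
FOUR_CATEGORIES = {1: "HKR", 2: "AFILMPVW", 3: "CGNQSTY", 4: "DE"}
TWELVE_CATEGORIES = {
    1: "AP", 2: "FW", 3: "ILV", 4: "M", 5: "H", 6: "KR",
    7: "D", 8: "E", 9: "NST", 10: "Q", 11: "CG", 12: "Y",
}


def invert(groups):
    """Build an amino acid -> category number lookup."""
    return {aa: cat for cat, members in groups.items() for aa in members}


AA_TO_CAT4 = invert(FOUR_CATEGORIES)
AA_TO_CAT12 = invert(TWELVE_CATEGORIES)


def same_type(seq_a, seq_b, lookup):
    """True if every pair of corresponding elements shares a category."""
    return len(seq_a) == len(seq_b) and all(
        lookup[a] == lookup[b] for a, b in zip(seq_a, seq_b)
    )


print(same_type("KA", "RV", AA_TO_CAT4))   # True: K/R and A/V share categories
print(same_type("KA", "RV", AA_TO_CAT12))  # False: A and V are split apart
```

The finer 12-category scheme distinguishes pairs that the 4-category scheme merges, which is the resolution versus complexity trade-off the categorization is meant to control.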
- the technique for categorizing a test sequence as a first class (e.g. CXCR4) or a second class (e.g. CCR5) includes determining a score for the test sequence based on the above-described plurality of probabilities and categorizing the test sequence as the first class or the second class based on the score.
- Another embodiment of the invention provides for a technique for classifying a test data point based on a vote of a multitude of classifiers.
- the technique includes providing a training set that includes a plurality of data points and subdividing the plurality of data points into a plurality of data subsets.
- data points taken from each patient may be grouped into a data subset.
- data points taken from patients of a particular locale may instead be grouped into one specific data subset.
- the technique includes forming a plurality of training sets each formed with one data point from each data subset and training a plurality of classifiers each based on one of the plurality of training sets.
- each training set is made of data points where each data point is obtained from a separate patient and the total number of data points is equal to the number of patients.
- each training set is made of data points where each data point is obtained from a separate locale and the total number of data points is equal to the number of locales.
- the technique further includes determining a plurality of tentative categorizations for the test data point using a plurality of classifiers trained on the training set determined above.
- the technique includes categorizing a test data point based on a vote of the plurality of tentative categorizations.
- the plurality of data points that can be associated with this embodiment can include biomarkers, amino acid sequences, nucleotide sequences, and the like.
- Another embodiment of the invention provides for a technique for training a classifier based on weighting individual data points in accordance with a distance from some reference plurality of data points.
- the reference plurality of data points can be defined globally to be the total data points - or individually for each individual data point to be the total data points excluding each of the individual data points in question.
- the weighting can be based on a linear distance, a geometric distance, or other types of distance.
- the method attempts to compensate for the under-sampled data points relative to the over-sampled data points (i.e., the points near to the reference plurality of data points).
- some data points are derived from over-sampled sources, while others are from relatively under-sampled sources.
- the technique includes weighting each of the plurality of data points in accordance with a distance from an average of some reference plurality of data points.
- the plurality of data points that can be associated with this embodiment can include biomarkers, amino acid sequences, nucleotide sequences, and the like.
- Fig. 1A is a simplified ROC plot showing the performance of various previously existing techniques for predicting HIV viral tropism
- Fig. 1B is a simplified graph focusing on exemplary regions of interest for prediction of HIV viral tropism
- Fig. 2 is a simplified diagram illustrating generally a technique for classifying a test data point according to an embodiment of the invention
- FIG. 3 is a simplified diagram illustrating a technique for determining a test data point as belonging to a predetermined category using Position Specific Score Matrixes according to an embodiment of the invention
- Figs. 4A, 4B, and 4C illustrate three mathematical models associated with Position Specific Score Matrixes for predicting tropism according to embodiments of the invention
- Figs. 5A and 5B illustrate two embodiments for categorizing amino acids according to an embodiment of the invention
- Fig. 6 is a simplified flow chart illustrating techniques for classifying a data point based on a vote of a multitude of classifiers
- FIG. 7 is a simplified flow diagram illustrating techniques for weighting a training set according to an embodiment of the invention.
- Fig. 8 is a simplified block diagram of a computer system that can be used to practice various embodiments of the invention described in this application.
- Embodiments of the present invention can be applied to techniques for gene-based testing. More particularly, the present invention is useful for improving computational techniques for predicting HIV viral tropism.
- Position Specific Score Matrices offer a way to represent information in training sets in terms of probabilities that an element will occupy a particular position on a hypothetical sequence. Position Specific Score Matrices can be used to estimate the probability that two hypothetical sequences belong to a same class by comparing the specificity of each element on the two hypothetical sequences.
- a type of element e.g., A, C, G, or T for DNA sequences; one of twenty known amino acids for protein sequences.
- a Markov probability model is assumed, where each position depends on the position before it.
- a more dedicated Markov model can also be created that assumes that each position depends on its immediate neighbors.
- Fig. 1A is a simplified ROC plot showing the performance of various previously existing techniques for predicting HIV viral tropism. Depicted along the x axis is the false positive rate, which by definition is equal to (1 - specificity). Depicted on the y axis is the true positive rate, which by definition is equal to the sensitivity. A rigorous comparison of these methods is performed based on public data downloaded from Los Alamos National Lab (LANL).
- Fig. 1B is a simplified graph focusing on exemplary regions of interest according to embodiments of the invention.
- the graph shows, among others, the performances of previously known techniques for predicting HIV viral tropism.
- the range of interest spans specificities from 90% to 99%.
- the corresponding range of sensitivity extends from a little over 20% to about 80%.
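To compare techniques inside a fixed specificity range such as the one above, one common approach is to pick the score threshold that achieves a target specificity and then read off the sensitivity. The sketch below assumes that higher scores indicate the positive class and that a sample is called positive when its score exceeds the threshold; the function name and the toy scores are hypothetical and are not data from the application.

```python
import math


def sensitivity_at_specificity(pos_scores, neg_scores, target_specificity):
    """Return (threshold, sensitivity, specificity) at a target specificity.

    The threshold is the smallest negative-class score that leaves at
    least the requested fraction of negatives at or below it.
    """
    neg_sorted = sorted(neg_scores)
    n_neg = len(neg_sorted)
    # round() guards against floating-point noise before taking the ceiling.
    k = max(1, math.ceil(round(target_specificity * n_neg, 9)))
    threshold = neg_sorted[k - 1]
    specificity = sum(s <= threshold for s in neg_sorted) / n_neg
    sensitivity = sum(s > threshold for s in pos_scores) / len(pos_scores)
    return threshold, sensitivity, specificity


# Toy scores: ten negatives scored 0..9, four positives.
threshold, sens, spec = sensitivity_at_specificity(
    [5, 8.5, 9, 10], list(range(10)), 0.90
)
print(threshold, sens, spec)  # point on the ROC plot: (1 - spec, sens)
```

Each call yields one point of the ROC plot, with x equal to 1 - specificity and y equal to sensitivity.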
- Table I Comparison of the performance of several techniques for predicting HIV viral tropism.
- Table I is a simplified illustration of the performance of several techniques for predicting HIV viral tropism, including embodiments of the current invention. The results are given as the sensitivities achieved by each technique, including embodiments of the current invention ("New method" in the Table), at various specificities.
- Fig. 2 is a simplified flowchart illustrating generally a technique for classifying a test data point according to an embodiment of the invention.
- the techniques include steps for determining a training set of a first class (2010), determining a training set of a second class (2020), and training a classifier that can be used to categorize a test data point as belonging to the first class or the second class (2030).
- the classifier should be based on one or more Position Specific Scoring Matrices that take into account the dependencies between elements in adjacent positions on a sequence
- a classifier is trained to recognize one class of data point (i.e., the first class). Data points that are recognized not to be of the first class are assigned to the second class. Depending on the embodiments, a classifier may also be trained to recognize more than two classes of data points - in which case multiple sets of training sets corresponding to multiple classes may be needed.
- more than one classifier may be used, as disclosed in some embodiments below.
- some of the classifiers may be specialized to recognize a subset of classes while others may be specialized to recognize other subsets of classes.
- each of the classifiers may be trained to recognize all the sets of classes. Many of such variations are possible and recognizable to one of skill in the art. All of these variations are contemplated as part of the invention and fall under the scope of the current application.
- There are many ways to train a classifier based on one or more Position Specific Scoring Matrices.
- FIG. 3 is a simplified diagram illustrating a technique for determining a test data point based on Position Specific Scoring Matrices that take into account dependencies of adjacent positions.
- the training of a classifier is based on determining one or more Position Specific Score Matrices.
- An exemplary technique includes creating a first Position Specific Score Matrix based on the training set of the first class wherein individual entries of each matrix take into account dependencies between positions on a prototypical sequence (of the first class) (3110).
- the technique includes creating a second Position Specific Score Matrix based on the training set of the second class wherein individual entries of each matrix take into account dependencies between positions on a prototypical sequence (of the second class) (3120).
- the technique includes processing a first score associated with the first Position Specific Score Matrix and processing a second score associated with the second Position Specific Score Matrix (3130). The technique then includes determining a classification score for a test data point based on a ratio between the first score and the second score (3140).
- Figs. 4A, 4B, and 4C illustrate three exemplary mathematical models that can be used to create the first Position Specific Score Matrix and the second Position Specific Score Matrix for predicting tropism.
- the PSSM method involves modeling the different amino acids in the V3 loop as statistically independent entities constituting a whole sequence.
- such a PSSM method models the likelihood of a sequence such as the V3 loop sequence $\{s_1, s_2, \ldots, s_N\}$ as $\prod_{i=1}^{N} f_i^{X4}(s_i)$ (or $\prod_{i=1}^{N} f_i^{R5}(s_i)$), where $N$ is the number of elements in a sequence,
- a score $S$ for determining the likelihood that a sample virus can (or cannot) use CXCR4 as co-receptor can be calculated by a formula such as $S = \sum_{i=1}^{N} \log \frac{f_i^{X4}(s_i)}{f_i^{R5}(s_i)}$,
- where $f_i^{X4}$ and $f_i^{R5}$ are estimated from the training data.
- the method requires that all sequences be of identical lengths.
- all sequences can be aligned against a common reference HIV V3 loop sequence compiled from sequences in the LANL database. Insertions can be removed such that only the remaining amino acids are considered.
- gaps can be inserted when necessary, and contribute a 0 to the score. In this model, the elements occupying the various positions on a sequence are considered to be independent of each other.
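The independent-position model described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the applicants' implementation: sequences are assumed to be pre-aligned to a common length, `-` is assumed to mark a gap, the per-position frequencies are smoothed with a small pseudo count (0.1 is used here as an arbitrary small number), and all function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"  # assumed gap symbol in the aligned sequences


def position_frequencies(seqs, pseudo=0.1):
    """Estimate a per-position amino acid distribution from aligned sequences."""
    table = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs if s[i] != GAP)
        total = sum(counts[a] for a in AMINO_ACIDS) + pseudo * len(AMINO_ACIDS)
        table.append({a: (counts[a] + pseudo) / total for a in AMINO_ACIDS})
    return table


def pssm_score(seq, f_x4, f_r5):
    """Sum of per-position log ratios; gap positions contribute 0."""
    score = 0.0
    for i, aa in enumerate(seq):
        if aa == GAP:
            continue  # a gap adds nothing to the score
        score += math.log(f_x4[i][aa] / f_r5[i][aa])
    return score


# Toy two-residue "sequences", for illustration only.
f_x4 = position_frequencies(["RA", "KA", "RA"])
f_r5 = position_frequencies(["SA", "GA", "SA"])
print(pssm_score("RA", f_x4, f_r5) > 0)  # True: resembles the first class
print(pssm_score("SA", f_x4, f_r5) < 0)  # True: resembles the second class
```

A positive score favors the first class and a negative score the second; the decision threshold itself would be chosen from the desired specificity.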
- a moderate amount of dependency of adjacent amino acids can be introduced to better model the joint distribution of the sequences in each class or training set.
- a Markov probability model can be assumed where each position depends on the position before it.
- a more dedicated Markov model can also be created that assumes that each position depends on its immediate neighbors, as depicted in Fig. 4A for CXCR4 and Fig. 4B for CCR5.
- $f_1^{X4}$ and $f_{35}^{X4}$ can be defined as in traditional PSSM (for the beginning and ending elements), but for intermediate elements, $g_i^{X4}(s_i \mid s_{i-1}, s_{i+1})$ can be used to represent the probability that an element (e.g., amino acid) at the $i$-th position is $s_i$ given that the corresponding virus can use CXCR4 as the co-receptor and the surrounding amino acids are $s_{i-1}$ and $s_{i+1}$, respectively.
- $f_1^{R5}$ and $f_{35}^{R5}$ are defined as in PSSM, and $g_i^{R5}(s_i \mid s_{i-1}, s_{i+1})$, similarly, can be defined to represent the probability that an element (e.g., amino acid) at the $i$-th position is $s_i$ given that the corresponding virus can use CCR5 as the co-receptor and the surrounding amino acids are $s_{i-1}$ and $s_{i+1}$, respectively.
- a ratio of the pseudo-likelihood scores based on CXCR4 and CCR5 introduced in Figs. 4A and 4B can be determined such as that shown in Fig. 4C.
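The figures themselves are not reproduced in this text, so the following is only one plausible way to write out the quantities they describe, assuming the end positions use the ordinary per-position frequencies $f$ and every interior position uses the neighbor-conditional probabilities $g$. The symbols $\tilde{L}^{X4}$ and $\tilde{L}^{R5}$ for the two pseudo-likelihoods are introduced here for illustration.

```latex
\begin{align}
\tilde{L}^{X4} &= f_1^{X4}(s_1)\,
  \Bigl[\prod_{i=2}^{N-1} g_i^{X4}(s_i \mid s_{i-1}, s_{i+1})\Bigr]\,
  f_N^{X4}(s_N) \\
\tilde{L}^{R5} &= f_1^{R5}(s_1)\,
  \Bigl[\prod_{i=2}^{N-1} g_i^{R5}(s_i \mid s_{i-1}, s_{i+1})\Bigr]\,
  f_N^{R5}(s_N) \\
S = \log\frac{\tilde{L}^{X4}}{\tilde{L}^{R5}}
  &= \log\frac{f_1^{X4}(s_1)}{f_1^{R5}(s_1)}
   + \sum_{i=2}^{N-1}
     \log\frac{g_i^{X4}(s_i \mid s_{i-1}, s_{i+1})}
              {g_i^{R5}(s_i \mid s_{i-1}, s_{i+1})}
   + \log\frac{f_N^{X4}(s_N)}{f_N^{R5}(s_N)}
\end{align}
```

Written this way the score stays a sum of per-position log ratios, so it reduces to the independent-position score whenever $g_i$ does not actually depend on the neighbors.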
- the scores can be further modified with a weighting factor $w_i$ to allow different positions to be weighted differently in accordance with predetermined knowledge relating to the relative importance of the positions with respect to determining tropism, as shown in Fig. 4D.
- Positions 5, 11 and 25 have been shown to be particularly important in the literature and can be given a weight of 3, according to one embodiment.
- Positions 7, 8, 10, 13, 18-22, 24, 27 and 32 have been shown in Resch et al. 2001 also to be important and can be given a weight of 2, according to this embodiment.
- Other positions can be assigned to have a regular weight of 1.
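The weighting scheme of this embodiment can be written as a small lookup. The sketch below assumes a sequence length of 35 positions (a typical V3 loop length; the length is a parameter) and assumes the weight simply multiplies each position's term in the score; the function names are hypothetical.

```python
def position_weights(length=35):
    """Per-position weights (1-based positions) for the embodiment above."""
    weights = {pos: 1 for pos in range(1, length + 1)}  # regular weight
    for pos in (7, 8, 10, 13, 18, 19, 20, 21, 22, 24, 27, 32):
        weights[pos] = 2  # positions reported as important by Resch et al. 2001
    for pos in (5, 11, 25):
        weights[pos] = 3  # positions reported as particularly important
    return weights


def weighted_score(per_position_terms, weights):
    """Weighted sum of per-position score terms (terms listed from position 1)."""
    return sum(
        weights[pos] * term
        for pos, term in enumerate(per_position_terms, start=1)
    )


w = position_weights()
print(w[11], w[20], w[1])             # 3 2 1
print(weighted_score([1.0] * 35, w))  # 53.0 = 3*3 + 12*2 + 20*1
```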
- Figs. 5A and 5B illustrate two embodiments for categorizing amino acids according to an embodiment adapted to simplify the calculation needed for determining PSSM.
- To estimate the conditional distribution $g_i(s_i \mid s_{i-1}, s_{i+1})$ at position $i$, 20 × 20 = 400 distributions need to be estimated from the training data, corresponding to the 400 possible combinations of $(s_{i-1}, s_{i+1})$.
- the model can be simplified by merging amino acids into specific predetermined categories.
- the 20 amino acids can be grouped into 4 predetermined categories according to their physico-chemical properties: H, K and R as category 1; A, F, I, L, M, P, V and W as category 2; C, G, N, Q, S, T and Y as category 3; and D and E as category 4, as shown in Fig. 5A.
- the 20 amino acids can be grouped into 12 smaller predetermined categories: A and P as category 1; F and W as category 2; I, L and V as category 3; M as category 4; H as category 5; K and R as category 6; D as category 7; E as category 8; N, S and T as category 9; Q as category 10; C and G as category 11; and Y as category 12, as shown in Fig. 5B.
- $g_i(s_i \mid s_{i-1}, s_{i+1})$ can be defined on the 12 small categories instead of the 20 amino acids for $s_i$, further simplifying the calculations.
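The saving from categorization is simple arithmetic, sketched below. The pairing used in the last line (4 categories for the two neighbors, 12 categories for the central element) is an assumption about how the two schemes are combined; the function names are hypothetical.

```python
def num_context_distributions(n_neighbor_symbols):
    """One conditional distribution per (previous, next) neighbor pair."""
    return n_neighbor_symbols ** 2


def num_table_entries(n_neighbor_symbols, n_center_symbols):
    """Probabilities to estimate per position: contexts x central symbols."""
    return num_context_distributions(n_neighbor_symbols) * n_center_symbols


print(num_context_distributions(20))  # 400 with raw amino acids
print(num_context_distributions(4))   # 16 with the 4-category scheme
print(num_table_entries(20, 20))      # 8000 entries per position
print(num_table_entries(4, 12))       # 192 entries per position (assumed pairing)
```

Fewer distributions per position means more training sequences contribute to each estimate, which matters when the training sets are small.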
- a pseudo count can be set to 0.1, an arbitrary small number.
- A constant can be used to adjust the contribution of the input training sequences. This factor has been set to 10 for many of the data sets tested.
- the pseudo counts represent the best guess of the conditional probabilities without the actual information. Generally, without looking at the actual combination of $(a, a_{-1}, a_{+1})$, the best guess is just the marginal distribution. In these embodiments, the larger this constant is, the smaller the contribution from each sequence with $a_{-1}$ at position $i-1$ and $a_{+1}$ at position $i+1$ is.
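The exact estimator is not recoverable from this text, so the sketch below shows one common form that matches the description: the observed counts in a given neighbor context are blended with the marginal distribution at that position, and a constant (called `strength` here, defaulting to 10) controls how strongly the marginal dominates. The formula, the function name, the parameter names and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def smoothed_conditional(seqs, i, prev_sym, next_sym, alphabet,
                         strength=10.0, pseudo=0.1):
    """Estimate P(symbol at i | prev_sym at i-1, next_sym at i+1).

    Assumed form: (context count + strength * marginal) / (n_context + strength).
    Requires 1 <= i <= len(seq) - 2 so that both neighbors exist.
    """
    # Marginal distribution at position i, smoothed with a small pseudo count.
    centre_counts = Counter(s[i] for s in seqs)
    total = sum(centre_counts[a] for a in alphabet) + pseudo * len(alphabet)
    marginal = {a: (centre_counts[a] + pseudo) / total for a in alphabet}

    # Counts restricted to sequences showing the requested neighbor context.
    ctx_counts = Counter(
        s[i] for s in seqs if s[i - 1] == prev_sym and s[i + 1] == next_sym
    )
    n_ctx = sum(ctx_counts[a] for a in alphabet)

    return {
        a: (ctx_counts[a] + strength * marginal[a]) / (n_ctx + strength)
        for a in alphabet
    }


toy = ["KRA", "KRA", "GSA", "KSA"]  # toy sequences, illustration only
est = smoothed_conditional(toy, 1, "K", "A", alphabet="RS")
print(est)  # R: 7/13, S: 6/13 (raw context frequencies would be 2/3 and 1/3)
```

With an unseen context the estimate falls back to the marginal distribution exactly, and raising `strength` pulls every estimate toward the marginal, which is the behavior described for the constant.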
- When $s_i$ is an actual amino acid and a neighboring position is a gap, two possibilities can exist. In a first case, only one of $s_{i-1}$ and $s_{i+1}$ is a gap; in a second case, both $s_{i-1}$ and $s_{i+1}$ are gaps.
- a log ratio of the partial conditional distributions, i.e., $g_i^{X4}(a \mid a_{-1} \text{ at position } i-1)$ (or $g_i^{X4}(a \mid a_{+1} \text{ at position } i+1)$, depending on which of $s_{i-1}$ and $s_{i+1}$ is a gap), can be used to replace the corresponding term in the score calculation.
- The partial conditional distributions can be estimated in a similar manner, except that according to one embodiment the constant can be set to 5, as the number of sequences with $a_{-1}$ at position $i-1$ or $a_{+1}$ at position $i+1$ is larger.
- the marginal distribution can be used to replace the conditional distributions to calculate the log ratio.
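The gap handling above amounts to choosing which distribution supplies the log-ratio term at each position. A minimal sketch of that selection follows, assuming `-` marks a gap; the function name and the returned labels are hypothetical.

```python
GAP = "-"  # assumed gap symbol


def term_kind(prev_sym, centre_sym, next_sym):
    """Name the distribution used for the log-ratio term at one position."""
    if centre_sym == GAP:
        return "zero"                # a gap contributes 0 to the score
    prev_gap = prev_sym == GAP
    next_gap = next_sym == GAP
    if prev_gap and next_gap:
        return "marginal"            # both neighbors are gaps
    if prev_gap:
        return "conditional_on_next"  # only the succeeding neighbor is usable
    if next_gap:
        return "conditional_on_prev"  # only the preceding neighbor is usable
    return "conditional_on_both"     # the full neighbor-conditional term


print(term_kind("K", "R", "A"))  # conditional_on_both
print(term_kind("-", "R", "A"))  # conditional_on_next
print(term_kind("-", "R", "-"))  # marginal
```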
- Another aspect of the invention relates to classifying a data point based on forming a multitude of classifiers.
- the method includes subdividing a data set into a set of data subsets, forming a plurality of training sets each created by sampling one data point from each data subset, training a plurality of classifiers based on the plurality of training sets, and taking a vote of the decisions made by the plurality of classifiers.
- a set of training data is obtained from several patients.
- Multiple classifiers can be built. Data from each patient can be defined as forming an individual data subset. Multiple training sets can then be derived by randomly drawing one data point from each data subset (i.e., each patient). According to the embodiment, information from the other sequences of the same patient's data subset is thus ignored within any one training set. Next, each classifier can be trained on one of the training sets. A final prediction result can then be derived based on a vote of the pool of classifiers.
- Fig. 6 is a simplified flow chart illustrating techniques for classifying a data point based on a vote of a multitude of classifiers.
- a technique 6000 for more fully utilizing the information contained in a training set includes providing a data set (6010), subdividing the data set into a set of data subsets (6020), determining a set of training sets each formed by selecting one data point from each data subset (6030), and creating a set of classifiers each trained on one of the training sets (6040).
- the technique further includes determining a set of tentative categorizations for a test data point based on the set of classifiers (6050) and determining a categorization for the test data point based on a vote of the set of tentative categorizations (6060).
- a majority vote scheme may be proposed to fully utilize such information.
- Other types of schemes can be used, such as a 2/3 vote for example.
- the scheme may involve a dynamic evaluation of specific data sets without deviating from the scope and spirit of the invention.
- a set of data may be obtained for patients from three geographic locations, for example Location 1, Location 2, and Location 3.
- the set of data need not include an equal number of points from each location, but may include, for example, 5 data points from Location 1, 8 data points from Location 2, and 10 data points from Location 3.
- several training data sets may be formed, where each training data set is formed by randomly selecting one data point from Location 1, randomly selecting one data point from Location 2, and randomly selecting one data point from Location 3.
- the training data sets are then used to train a multitude of classifiers, one per training data set. To categorize a test data point, a simple majority vote of the decisions made by the multitude of classifiers is taken to derive a final categorization for the test data point.
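The location example above can be sketched as follows. This is a minimal illustration, not the applicants' implementation: the function names are hypothetical, the data points are placeholder integers, the trained classifiers are stood in for by plain callables, and the number of votes required (`needed`) defaults to a simple majority.

```python
import random
from collections import Counter


def sample_training_sets(subsets, n_sets, rng):
    """Form n_sets training sets, each with one random point per data subset."""
    return [
        [rng.choice(points) for points in subsets.values()]
        for _ in range(n_sets)
    ]


def vote(tentative, needed=None):
    """Return the winning label, or None if no label reaches `needed` votes.

    By default `needed` is a simple majority (more than half the votes).
    """
    if needed is None:
        needed = len(tentative) // 2 + 1
    label, count = Counter(tentative).most_common(1)[0]
    return label if count >= needed else None


def classify_by_vote(test_point, classifiers, needed=None):
    """Collect one tentative categorization per classifier, then vote."""
    return vote([clf(test_point) for clf in classifiers], needed)


# Placeholder data: 5, 8 and 10 points from the three locations.
subsets = {
    "Location 1": list(range(5)),
    "Location 2": list(range(100, 108)),
    "Location 3": list(range(200, 210)),
}
training_sets = sample_training_sets(subsets, n_sets=7, rng=random.Random(0))
print(len(training_sets), len(training_sets[0]))  # 7 3

# Stand-ins for classifiers trained on the sets above.
classifiers = [lambda x: "X4", lambda x: "R5", lambda x: "X4"]
print(classify_by_vote(None, classifiers))  # X4
```

A stricter scheme such as a 2/3 vote can be expressed by passing `needed=math.ceil(2 * n / 3)` for `n` classifiers (with `math` imported).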
- a technique for processing data obtained from multiple closely related sources is disclosed.
- multiple HIV sequences are obtained from a patient.
- many sequences are obtained from different patients.
- sample-space overlap of viral sequences from different patients may occur, for example, when patients are infected from the same source, resulting in samples from different patients sharing a high degree of similarity.
- including data sets from separate patients may also unnecessarily introduce bias.
- a sequence-reweighting procedure is employed for a sequence in a training set.
- a corresponding weight may be assigned relating to a distance of a specified sequence to a reference data set. The distance may be calculated from the specified data point to an average of a reference data set - where distance in general may be defined as a degree of similarity between positions on two sequences.
- data points from under-sampled sources can be emphasized while the data points from over-sampled sources can be de-emphasized.
- Fig. 7 is a simplified flow diagram illustrating techniques for weighting a training set according to an embodiment of the invention.
- the technique includes obtaining a set of data points sampled from various sources, some sources being over-sampled relative to other sources (7010), determining a reference plurality of data points for each of the set of data points (7020), determining a distance from a specified data point to an average of a reference plurality of data points (7030), weighting each data point in accordance with a distance from a specified data point to some average of a reference plurality of data points (7040), and forming a training set in accordance with the weighting of the data points (7050).
- the reference plurality of data points may be the same for all data points or different for each data point. According to an embodiment in which the reference plurality of data points are the same, the reference plurality of data points is simply the entire data set. According to an embodiment where the reference plurality of data points for each data point is different, the reference plurality of data points for a specified data point may be the entire data set excluding the specified data point.
- the estimation of probabilistic distributions such as $g_i^{X4}(s_i \mid s_{i-1}, s_{i+1})$ can be weighted in proportion to the distance metrics associated with the data making up the training set.
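One way to realize the reweighting steps above is sketched below. The distance measure is an assumption chosen for illustration: the "average" of the reference set is represented as a per-position residue frequency profile, and a sequence's distance is one minus its mean agreement with that profile. The function names, the leave-one-out default and the toy sequences are likewise hypothetical.

```python
from collections import Counter


def profile(seqs):
    """Per-position residue counts of equal-length, aligned sequences."""
    return [Counter(s[i] for s in seqs) for i in range(len(seqs[0]))]


def distance_to_average(seq, reference):
    """1 minus the mean frequency, in the reference, of seq's own residues."""
    prof = profile(reference)
    n = len(reference)
    agreement = sum(prof[i][aa] / n for i, aa in enumerate(seq)) / len(seq)
    return 1.0 - agreement


def sequence_weights(seqs, leave_one_out=True):
    """Weights proportional to distance from the reference average, summing to 1."""
    raw = []
    for k, s in enumerate(seqs):
        # Reference set: everything except the sequence itself, or everything.
        reference = seqs[:k] + seqs[k + 1:] if leave_one_out else seqs
        raw.append(distance_to_average(s, reference))
    total = sum(raw)
    if total == 0:  # all sequences identical: fall back to equal weights
        return [1.0 / len(seqs)] * len(seqs)
    return [d / total for d in raw]


# Three near-duplicates (an over-sampled source) and one distinct sequence.
weights = sequence_weights(["AAAA", "AAAA", "AAAA", "CCCC"])
print(weights)  # the distinct sequence receives the largest weight
```

In this toy case the three identical sequences share half of the total weight and the single distinct sequence receives the other half, so the over-sampled source no longer dominates the estimates.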
- FIG. 8 is a simplified block diagram of a computer system 100 that can be used to practice an embodiment of the various inventions described in this application.
- computer system 100 includes a processor 102 that communicates with a number of peripheral subsystems via a bus subsystem 104.
- peripheral subsystems can include a storage subsystem 106, comprising a memory subsystem 108 and a file storage subsystem 110, user interface input devices 112, user interface output devices 114, and a network interface subsystem 116.
- Bus subsystem 104 provides a mechanism for letting the various components and subsystems of computer system 100 communicate with each other as intended. Although bus subsystem 104 is shown schematically as a single bus, alternative embodiments of the bus subsystem may utilize multiple busses.
- Network interface subsystem 116 provides an interface to other computer systems, networks, and portals.
- Network interface subsystem 116 serves as an interface for receiving data from and transmitting data to other systems from computer system 100.
- User interface input devices 112 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a barcode scanner, a touch screen incorporated into the display, audio input devices such as voice recognition systems, microphones, and other types of input devices.
- use of the term "input device" is intended to include all possible types of devices and mechanisms for inputting information to computer system 100.
- User interface output devices 114 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices, etc.
- the display subsystem may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or a projection device, and the like.
- use of the term "output device" is intended to include all possible types of devices and mechanisms for outputting information from computer system 100.
- Storage subsystem 106 can be configured to store the basic programming and data constructs that provide the functionality of the present invention.
- Software (code modules or instructions) that provides the functionality of the present invention can be stored in storage subsystem 106. These software modules or instructions can be executed by processor(s) 102.
- Storage subsystem 106 may also provide a repository for storing data used in accordance with the present invention.
- Storage subsystem 106 can comprise memory subsystem 108 and file/disk storage subsystem 110.
- Memory subsystem 108 can include a number of memories including a main random access memory (RAM) 118 for storage of instructions and data during program execution and a read only memory (ROM) 120 in which fixed instructions are stored.
- File storage subsystem 110 provides persistent (non-volatile) storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a Compact Disk Read Only Memory (CD-ROM) drive, an optical drive, removable media cartridges, and other like storage media.
- Computer system 100 can be of various types including a personal computer, a portable computer, a workstation, a network computer, a mainframe, a kiosk, a server or any other data processing system. Due to the ever-changing nature of computers and networks, the description of computer system 100 depicted in Fig. 8 is intended only as an example for purposes of illustrating the preferred embodiment of the computer system. Many other configurations having more or fewer components than the system depicted in Fig. 8 are possible.
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Data Mining & Analysis (AREA)
- Public Health (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Bioethics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Investigating Or Analysing Biological Materials (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP09821264A EP2347255A1 (en) | 2008-10-17 | 2009-10-15 | Techniques for predicting hiv viral tropism and classifying amino acid sequences |
JP2011532258A JP2012506099A (en) | 2008-10-17 | 2009-10-15 | Techniques for predicting the directionality of HIV viruses and techniques for classifying amino acid sequences |
CA2740879A CA2740879A1 (en) | 2008-10-17 | 2009-10-15 | Techniques for predicting hiv viral tropism and classifying amino acid sequences |
CN2009801413850A CN102203603A (en) | 2008-10-17 | 2009-10-15 | Techniques for predicting HIV viral tropism and classifying amino acid sequences |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10640508P | 2008-10-17 | 2008-10-17 | |
US61/106,405 | 2008-10-17 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010045475A1 (en) | 2010-04-22 |
Family
ID=42106896
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2009/060871 WO2010045475A1 (en) | 2008-10-17 | 2009-10-15 | Techniques for predicting hiv viral tropism and classifying amino acid sequences |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP2347255A1 (en) |
JP (1) | JP2012506099A (en) |
CN (1) | CN102203603A (en) |
CA (1) | CA2740879A1 (en) |
WO (1) | WO2010045475A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020183934A1 (en) * | 1999-01-19 | 2002-12-05 | Sergey A. Selifonov | Methods for making character strings, polynucleotides and polypeptides having desired characteristics |
US20040043430A1 (en) * | 2000-02-10 | 2004-03-04 | Dahiyat Bassil I. | Protein design automation for protein libraries |
2009
- 2009-10-15 CA CA2740879A patent/CA2740879A1/en not_active Abandoned
- 2009-10-15 CN CN2009801413850A patent/CN102203603A/en active Pending
- 2009-10-15 WO PCT/US2009/060871 patent/WO2010045475A1/en active Application Filing
- 2009-10-15 JP JP2011532258A patent/JP2012506099A/en not_active Withdrawn
- 2009-10-15 EP EP09821264A patent/EP2347255A1/en not_active Withdrawn
Non-Patent Citations (2)
Title |
---|
HOFFMAN ET AL.: "Variability in the Human Immunodeficiency Virus Type 1 gp120 Env Protein Linked to Phenotype-Associated Changes in the V3 Loop.", JOURNAL OF VIROLOGY, vol. 76, no. 8, April 2002 (2002-04-01), pages 3852 - 3864, XP002433828 * |
JENSEN ET AL.: "Improved Coreceptor Usage Prediction and Genotypic Monitoring of R5-to-X4 Transition by Motif Analysis of Human Immunodeficiency Virus Type 1 env V3 Loop Sequences.", JOURNAL OF VIROLOGY, vol. 77, no. 24, December 2003 (2003-12-01), pages 13376 - 13388, XP002433825 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110379464A (en) * | 2019-07-29 | 2019-10-25 | 桂林电子科技大学 | The prediction technique of DNA transcription terminator in a kind of bacterium |
CN113299345A (en) * | 2021-06-30 | 2021-08-24 | 中国人民解放军军事科学院军事医学研究院 | Virus gene classification method and device and electronic equipment |
CN113299345B (en) * | 2021-06-30 | 2024-05-07 | 中国人民解放军军事科学院军事医学研究院 | Virus gene classification method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
JP2012506099A (en) | 2012-03-08 |
EP2347255A1 (en) | 2011-07-27 |
CN102203603A (en) | 2011-09-28 |
CA2740879A1 (en) | 2010-04-22 |
Legal Events
Code | Title | Description |
---|---|---|
WWE | WIPO information: entry into national phase | Ref document number: 200980141385.0; Country of ref document: CN |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 09821264; Country of ref document: EP; Kind code of ref document: A1 |
WWE | WIPO information: entry into national phase | Ref document number: 2740879, Country of ref document: CA; Ref document number: 2011532258, Country of ref document: JP |
NENP | Non-entry into the national phase | Ref country code: DE |
WWE | WIPO information: entry into national phase | Ref document number: 1998/KOLNP/2011; Country of ref document: IN |
WWE | WIPO information: entry into national phase | Ref document number: 2009821264; Country of ref document: EP |