WO2005000226A2 - Mixed bed multi-dimensional chromatography systems and methods of making and using them - Google Patents

Info

Publication number
WO2005000226A2
Authority
WO
WIPO (PCT)
Prior art keywords
reverse phase
bed
chromatography system
rpc
sequence
Prior art date
Application number
PCT/US2004/017647
Other languages
French (fr)
Other versions
WO2005000226A3 (en)
Inventor
Jing Wei
Martin Latterich
Original Assignee
Diversa Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Diversa Corporation
Publication of WO2005000226A2
Publication of WO2005000226A3

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00 Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02 Column chromatography
    • G01N30/26 Conditioning of the fluid carrier; Flow patterns
    • G01N30/38 Flow patterns
    • G01N30/46 Flow patterns using more than one column
    • G01N30/461 Flow patterns using more than one column with serial coupling of separation columns
    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K1/00 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
    • C07K1/13 Labelling of peptides
    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K1/00 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
    • C07K1/14 Extraction; Separation; Purification
    • C07K1/16 Extraction; Separation; Purification by chromatography
    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K1/00 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
    • C07K1/14 Extraction; Separation; Purification
    • C07K1/16 Extraction; Separation; Purification by chromatography
    • C07K1/18 Ion-exchange chromatography
    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K1/00 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
    • C07K1/14 Extraction; Separation; Purification
    • C07K1/36 Extraction; Separation; Purification by a combination of two or more processes of different types

Definitions

  • TECHNICAL FIELD: This invention relates to proteomics and mass spectrometry technology.
  • the invention provides novel systems and methods for determining polypeptide profiles and protein expression variations, as with proteome analyses.
  • the present invention provides systems and methods for simultaneously identifying and quantifying individual proteins in complex protein mixtures by selective differential labeling of amino acid residues followed by chromatographic and mass spectrographic analysis.
  • the invention also provides computer program products and computer implemented methods for practicing the systems and methods of the invention.
  • Biochemical pathways and metabolic networks can also be analyzed by globally and quantitatively measuring protein expression in various cell types and biological states (see, e.g., Ideker (2001) Science 292:929-934).
  • State-of-the-art techniques such as liquid-chromatography-electrospray-ionization tandem mass spectrometry have, in conjunction with database-searching computer algorithms, revolutionized the analysis of biochemical species from complex biological mixtures.
  • ICATs isotope-coded affinity tags
  • tandem mass spectrometry: The method labels multiple cysteinyl residues and uses stable isotope dilution techniques. For example, Gygi (1999) Nat. Biotechnol. 10:994-999, compared protein expression in yeast using ethanol or galactose as a carbon source.
  • Parent proteins of methylated peptides are identified by correlative database searching of fragment ion spectra using computer-program-assisted paradigms or automated de novo sequencing that compares all tandem mass spectra of d0- and d3-methylated peptide ion pairs. In Goodlett (2000) supra, ratios of proteins in two different mixtures were calculated for d0- to d3-methylated peptide pairs.
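The ratio calculation described for d0- and d3-methylated peptide pairs can be sketched as follows. This is a minimal illustration, not the procedure used in the cited work: the input tuples, the function name `protein_ratios`, and the choice of the median as the per-protein summary are all assumptions made for the example.

```python
from statistics import median

def protein_ratios(pairs):
    """Return the median d0/d3 intensity ratio for each parent protein.

    `pairs` is a hypothetical list of (protein_id, d0_intensity, d3_intensity)
    tuples, one per peptide ion pair already assigned to a parent protein.
    """
    by_protein = {}
    for protein_id, d0, d3 in pairs:
        if d0 <= 0 or d3 <= 0:
            continue  # skip pairs where one partner was not detected
        by_protein.setdefault(protein_id, []).append(d0 / d3)
    # Summarize each protein by the median of its peptide-pair ratios.
    return {protein: median(ratios) for protein, ratios in by_protein.items()}

# Made-up intensities for illustration only.
pairs = [
    ("P1", 200.0, 100.0),
    ("P1", 300.0, 100.0),
    ("P1", 400.0, 100.0),
    ("P2", 50.0, 100.0),
]
ratios = protein_ratios(pairs)
print(ratios)
```

A ratio above 1 indicates that the protein is more abundant in the d0-labeled mixture, and a ratio below 1 that it is more abundant in the d3-labeled mixture.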
  • Screening markers include, for example, luciferase, beta-galactosidase, and green fluorescent protein. Screening can also be done by observing a cell holistically including but not limited to utilizing methods pertaining to genomics, RNA profiling, proteomics, metabolomics, and lipidomics as well as observing such aspects of growth as colony size, halo formation, etc. Additionally, screening for production of a desired compound, such as a therapeutic drug or "designer chemical" can be accomplished by observing binding of cell products to a receptor or ligand, such as on a solid support or on a column. Such screening can additionally be accomplished by binding to antibodies, as in an ELISA. In some instances the screening process can be automated so as to allow screening of suitable numbers of colonies or cells.
  • FACS fluorescence activated cell sorting
  • Selection is a form of screening in which identification and physical separation are achieved simultaneously, for example, by expression of a selectable marker, which, in some genetic circumstances, allows cells expressing the marker to survive while other cells die (or vice versa).
  • Selectable markers can include, for example, drug, toxin resistance, or nutrient synthesis genes. Selection is also done by such techniques as growth on a toxic substrate to select for hosts having the ability to detoxify a substrate, growth on a new nutrient source to select for hosts having the ability to utilize that nutrient source, competitive growth in culture based on ability to utilize a nutrient source, etc.
  • uncloned but differentially expressed proteins can be screened by differential display (Appleyard et al. Mol. Gen. Genet. 247:338-342 (1995)). Hopwood (Phil Trans R. Soc. Lond B 324:549-562) provides a review of screens for antibiotic production.
  • Omura Microbiol. Rev. 50:259-279 (1986) and Nisbet (Ann Rev. Med. Chem.
  • Tagged substrates can also be used.
  • Lipases and esterases can be screened using different lengths of fatty acids linked to umbelliferyl. The action of lipases or esterases removes this tag from the fatty acid, resulting in a quenching or enhancement of umbelliferyl fluorescence. These enzymes can be screened in microtiter plates by a robotic device.
  • Genomics: Genomics can refer to various investigative techniques that are broad in scope, but often refers to measuring gene expression for multitudes of genes simultaneously. For a review see Lockhart, D.J. and Winzeler, E.A. 2000. Genomics, gene expression and DNA arrays. Nature, 405 (6788): 827-36. Biological Chips, General Considerations: In some systems, an oligonucleotide probe is tethered, i.e., by covalent attachment, to a solid support, and arrays of oligonucleotide probes immobilized on solid supports have been used to detect specific nucleic acid sequences in a target nucleic acid.
  • bioinformatics involves studying an organism's genome to determine the sequence and placement of its genes and their relationship to other sequences and genes within the genome or to genes in other organisms. Another use of bioinformatics involves studying genes differentially or commonly expressed in different tissues or cell lines (e.g. normal and cancerous tissue). Such information is of significant interest in biomedical and pharmaceutical research, for instance to assist in the evaluation of drug efficacy and resistance.
  • the sequence tag method involves generation of a large number (e.g., thousands) of Expressed Sequence Tags ("ESTs") from cDNA libraries (each produced from a different tissue or sample). ESTs are partial transcript sequences that may cover different parts of the cDNA(s) of a gene, depending on cloning and sequencing strategy.
  • Each EST includes about 50 to 300 nucleotides. If it is assumed that the number of tags is proportional to the abundance of transcripts in the tissue or cell type used to make the cDNA library, then any variation in the relative frequency of those tags, stored in computer databases, can be used to detect the differential abundance and potentially the expression of the corresponding genes.
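The tag-counting idea above can be illustrated with a short sketch. It assumes each EST has already been assigned to a gene; the function names, the pseudocount, and the two toy libraries are hypothetical and serve only to show how relative tag frequencies are compared between libraries.

```python
from collections import Counter

def tag_frequencies(tags):
    """Relative frequency of each gene's tags within one cDNA library."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

def fold_changes(library_a, library_b, pseudocount=1e-6):
    """Ratio of relative tag frequencies (library_a / library_b) per gene.

    The pseudocount is an assumption of this sketch; it only prevents
    division by zero when a gene has no tags in library_b.
    """
    freq_a = tag_frequencies(library_a)
    freq_b = tag_frequencies(library_b)
    genes = sorted(set(freq_a) | set(freq_b))
    return {
        g: (freq_a.get(g, 0.0) + pseudocount) / (freq_b.get(g, 0.0) + pseudocount)
        for g in genes
    }

# Toy libraries: each entry is the gene an EST was assigned to.
normal = ["geneA", "geneA", "geneB", "geneB"]
tumor = ["geneA", "geneA", "geneA", "geneB"]
changes = fold_changes(tumor, normal)
print(changes)
```

As the text notes, this only reflects expression if tag counts are proportional to transcript abundance in the source tissue.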
  • To make genomic and EST information manipulation easy to perform and understand, sophisticated computer database systems have been developed. In one database system, developed by Incyte Pharmaceuticals, Inc. of Palo Alto, CA, genomic sequence data and the abundance levels of mRNA species represented in a given sample are electronically recorded and annotated with information available from public sequence databases such as GenBank. Examples of such databases include GenBank (NCBI) and TIGR.
  • the resulting information is stored in a relational database that may be employed to determine relationships between sequences and genes within and among genomes and establish a cDNA profile for a given tissue and to evaluate changes in gene expression caused by disease progression, pharmacological treatment, aging, etc.
  • In a relational database developed by Incyte Pharmaceuticals, Inc. of Palo Alto, Calif., abundance levels of mRNA species represented in a given sample are electronically recorded and annotated with information available from public sequence databases such as GenBank.
  • the resulting information is stored in a relational database that may be employed to establish a cDNA profile for a given tissue and to evaluate changes in gene expression caused by disease progression, pharmacological treatment, aging, etc. Genetic information for a number of organisms has been catalogued in computer databases.
  • Bioinformatics includes the development of methods to search databases quickly, to analyze nucleic acid sequence information, and to predict protein sequence and structure from DNA sequence data.
  • High throughput genomics refers to application of genomic or genetic data or analysis techniques that use microarrays or other genomic technologies to rapidly identify large numbers of genes or proteins, or distinguish their structure, expression or function from normal or abnormal cells or tissues.
  • an observer can be a person viewing a slide with a microscope or an observer who views digital images.
  • an observer can be a computer-based image analysis system, which automatically observes, analyses and quantitates biological arrayed samples with or without user interaction.
  • the present invention provides for the use of arrays of oligonucleotide probes immobilized in microfabricated patterns on silica chips for analyzing molecular interactions of biological interest.
  • the invention provides several strategies employing immobilized arrays of probes for comparing a reference sequence of known sequence with a target sequence showing substantial similarity with the reference sequence, but differing in the presence of, e.g., mutations.
  • the invention provides a tiling strategy employing an array of immobilized oligonucleotide probes comprising at least two sets of probes.
  • a first probe set comprises a plurality of probes, each probe comprising a segment of at least three nucleotides exactly complementary to a subsequence of the reference sequence, the segment including at least one interrogation position complementary to a corresponding nucleotide in the reference sequence.
  • a second probe set comprises a corresponding probe for each probe in the first probe set, the corresponding probe in the second probe set being identical to a sequence comprising the corresponding probe from the first probe set or a subsequence of at least three nucleotides thereof that includes the at least one interrogation position, except that the at least one interrogation position is occupied by a different nucleotide in each of the two corresponding probes from the first and second probe sets.
  • the probes in the first probe set have at least two interrogation positions corresponding to two contiguous nucleotides in the reference sequence. One interrogation position corresponds to one of the contiguous nucleotides, and the other interrogation position to the other.
  • the invention provides a tiling strategy employing an array comprising four probe sets.
  • a first probe set comprises a plurality of probes, each probe comprising a segment of at least three nucleotides exactly complementary to a subsequence of the reference sequence, the segment including at least one interrogation position complementary to a corresponding nucleotide in the reference sequence.
  • Second, third and fourth probe sets each comprise a corresponding probe for each probe in the first probe set.
  • the probes in the second, third and fourth probe sets are identical to a sequence comprising the corresponding probe from the first probe set or a subsequence of at least three nucleotides thereof that includes the at least one interrogation position, except that the at least one interrogation position is occupied by a different nucleotide in each of the four corresponding probes from the four probe sets.
  • the first probe can have at least 100 interrogation positions corresponding to 100 contiguous nucleotides in the reference sequence.
  • the first probe set can have an interrogation position corresponding to every nucleotide in the reference sequence.
  • the segment of complementarity within the probe set is usually about 9 to 21 nucleotides.
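The four-probe-set tiling strategy can be sketched in a few lines. This is an illustration under stated assumptions, not the patent's implementation: probes are written aligned to the reference (complemented but not reversed), the interrogation position is placed at the center of an odd-length probe, and the function name `tile` and the short probe length are chosen only for the example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def tile(reference, probe_len=5):
    """Return {reference_index: {base: probe}} for every tileable position.

    For each position there are four probes that are identical except at the
    central interrogation position, which holds A, C, G or T. Exactly one of
    the four is perfectly complementary to the reference (the first probe
    set); the other three correspond to the second, third and fourth sets.
    """
    if probe_len % 2 == 0:
        raise ValueError("this sketch assumes an odd probe length")
    half = probe_len // 2
    tiling = {}
    for i in range(half, len(reference) - half):
        window = reference[i - half : i + half + 1]
        perfect = "".join(COMPLEMENT[b] for b in window)
        tiling[i] = {
            base: perfect[:half] + base + perfect[half + 1 :] for base in "ACGT"
        }
    return tiling

tiling = tile("ACGTACG", probe_len=5)
for position, probes in tiling.items():
    print(position, probes)
```

With the 9 to 21 nucleotide segments mentioned above, `probe_len` would be set accordingly; positions within half a probe length of either end are not tiled in this simplified version.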
  • the invention provides immobilized arrays of probes tiled for multiple reference sequences. One such array comprises at least one pair of first and second probe groups, each group comprising first and second sets of probes as defined in the first aspect.
  • Each probe in the first probe set from the first group is exactly complementary to a subsequence of a first reference sequence
  • each probe in the first probe set from the second group is exactly complementary to a subsequence of a second reference sequence.
  • the first group of probes are tiled with respect to a first reference sequence and the second group of probes with respect to a second reference sequence.
  • Each group of probes can also include third and fourth sets of probes as defined in the second aspect.
  • the second reference sequence is a mutated form of the first reference sequence.
  • the invention provides arrays for block tiling.
  • Block tiling is a species of the general tiling strategies described above.
  • the usual unit of a block tiling array is a group of probes comprising a wildtype probe, a first set of three mutant probes and a second set of three mutant probes.
  • the wildtype probe comprises a segment of at least three nucleotides exactly complementary to a subsequence of a reference sequence.
  • the segment has at least first and second interrogation positions corresponding to first and second nucleotides in the reference sequence.
  • the probes in the first set of three mutant probes are each identical to a sequence comprising the wildtype probe or a subsequence of at least three nucleotides thereof including the first and second interrogation positions, except in the first interrogation position, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe.
  • the probes in the second set of three mutant probes are each identical to a sequence comprising the wildtype probes or a subsequence of at least three nucleotides thereof including the first and second interrogation positions, except in the second interrogation position, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe.
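A block tiling unit, as described above, reuses one wildtype probe for several interrogation positions. The sketch below is a hypothetical illustration: the function name `block_unit`, the dictionary layout, and the example probe are assumptions, and the probe is treated simply as a string whose listed positions are the interrogation positions.

```python
def block_unit(wildtype, positions):
    """Build one block-tiling group: a wildtype probe plus, for each
    interrogation position, a set of three mutant probes that differ from
    the wildtype only at that position."""
    mutants = {}
    for pos in positions:
        mutants[pos] = [
            wildtype[:pos] + base + wildtype[pos + 1 :]
            for base in "ACGT"
            if base != wildtype[pos]
        ]
    return {"wildtype": wildtype, "mutants": mutants}

unit = block_unit("TGCAT", positions=[1, 2])
probe_count = 1 + sum(len(m) for m in unit["mutants"].values())
print(unit)
print(probe_count)
```

Because the wildtype probe is shared, a block covering n interrogation positions needs 1 + 3n probes, compared with 4n when every position gets its own set of four probes.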
  • the invention provides methods of comparing a target sequence with a reference sequence using arrays of immobilized pooled probes.
  • the arrays employed in these methods represent a further species of the general tiling arrays noted above.
  • variants of a reference sequence differing from the reference sequence in at least one nucleotide are identified and each is assigned a designation.
  • An array of pooled probes is provided, with each pool occupying a separate cell of the array.
  • Each pool comprises a probe comprising a segment exactly complementary to each variant sequence assigned a particular designation.
  • the array is then contacted with a target sequence comprising a variant of the reference sequence.
  • the relative hybridization intensities of the pools in the array to the target sequence are determined.
  • each variant is assigned a designation having at least one digit and at least one value for the digit.
  • each pool comprises a probe comprising a segment exactly complementary to each variant sequence assigned a particular value in a particular digit.
  • n × (m-1) pooled probes are used to assign each variant a designation.
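One way to read the n × (m-1) count is that each variant receives an n-digit designation with m possible values per digit, and a pool exists for every digit and every value except one (here, zero), so that a digit with no hybridizing pool is read as zero. The sketch below follows that reading, which is an assumption rather than something the text states; all names are hypothetical.

```python
def designation(index, n_digits, m_values):
    """Write a variant's index as an n-digit designation in base m."""
    digits = []
    for _ in range(n_digits):
        digits.append(index % m_values)
        index //= m_values
    return tuple(reversed(digits))

def build_pools(variants, n_digits, m_values):
    """One pool per (digit, non-zero value): n * (m - 1) pools in total.

    Each pool holds the variants whose designation has that value at that
    digit; on an array, the pool would contain probes complementary to them.
    """
    pools = {(d, v): set() for d in range(n_digits) for v in range(1, m_values)}
    for index, variant in enumerate(variants):
        for d, v in enumerate(designation(index, n_digits, m_values)):
            if v != 0:
                pools[(d, v)].add(variant)
    return pools

def decode(hybridizing_pools, n_digits):
    """Recover a designation from the pools that bound the target.
    Digits with no hybridizing pool are read as zero (sketch assumption)."""
    digits = [0] * n_digits
    for d, v in hybridizing_pools:
        digits[d] = v
    return tuple(digits)

variants = [f"v{i}" for i in range(9)]  # 9 variants, n = 2 digits, m = 3 values
pools = build_pools(variants, n_digits=2, m_values=3)
hits = sorted(key for key, members in pools.items() if "v5" in members)
print(len(pools), hits, decode(hits, n_digits=2))
```

With n = 2 and m = 3, four pools distinguish nine variants, which is the saving pooling offers over one cell per variant.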
  • the invention provides a pooled probe for trellis tiling, a further species of the general tiling strategy.
  • a pooled trellis probe comprises a segment exactly complementary to a subsequence of a reference sequence except at a first interrogation position occupied by a pooled nucleotide N, a second interrogation position occupied by a pooled nucleotide selected from the group of three consisting of (1) M or K, (2) R or Y and (3) S or W, and a third interrogation position occupied by a second pooled nucleotide selected from the group.
  • the pooled nucleotide occupying the second interrogation position comprises a nucleotide complementary to a corresponding nucleotide from the reference sequence when the second pooled probe and reference sequence are maximally aligned
  • the pooled nucleotide occupying the third interrogation position comprises a nucleotide complementary to a corresponding nucleotide from the reference sequence when the third pooled probe and the reference sequence are maximally aligned.
  • Standard IUPAC nomenclature is used for describing pooled nucleotides.
  • an array comprises at least first, second and third cells, respectively occupied by first, second and third pooled probes, each according to the generic description above.
  • the segment of complementarity, location of interrogation positions, and selection of pooled nucleotide at each interrogation position may or may not differ between the three pooled probes subject to the following constraint.
  • One of the three interrogation positions in each of the three pooled probes must align with the same corresponding nucleotide in the reference sequence. This interrogation position must be occupied by a N in one of the pooled probes, and a different pooled nucleotide in each of the other two pooled probes.
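The pooled nucleotides named above follow the standard IUPAC ambiguity codes (M = A or C, K = G or T, R = A or G, Y = C or T, S = C or G, W = A or T, N = any base). A small sketch shows how a pooled probe expands into its constituent sequences; the function name `expand` and the example probe are illustrative assumptions.

```python
from itertools import product

# Standard IUPAC codes for the pooled nucleotides used in trellis tiling.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "K": "GT", "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "N": "ACGT",
}

def expand(pooled_probe):
    """List every concrete sequence represented by a pooled probe."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in pooled_probe))]

# Hypothetical pooled trellis probe: N at the first interrogation position,
# R and S at the second and third.
members = expand("GNRSA")
print(len(members), members[:4])
```

A probe with one N and two two-fold pooled positions therefore represents 4 × 2 × 2 = 16 sequences in a single cell.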
  • the invention provides arrays for bridge tiling.
  • Bridge tiling is a species of the general tiling strategies noted above, in which probes from the first probe set contain more than one segment of complementarity.
  • a nucleotide in a reference sequence is usually determined from a comparison of four probes.
  • a first probe comprises at least first and second segments, each of at least three nucleotides and each exactly complementary to first and second subsequences of a reference sequence.
  • the segments including at least one interrogation position corresponding to a nucleotide in the reference sequence.
  • first and second subsequences are noncontiguous in the reference sequence, or
  • the arrays of the invention can further comprise second, third and fourth probes, which are identical to a sequence comprising the first probe or a subsequence thereof comprising at least three nucleotides from each of the first and second segments, except in the at least one interrogation position, which differs in each of the probes.
  • the first and second subsequences are separated by one or two nucleotides in the reference sequence.
  • the invention provides arrays of probes for multiplex tiling.
  • Multiplex tiling is a strategy, in which the identity of two nucleotides in a target sequence is determined from a comparison of the hybridization intensities of four probes, each having two interrogation positions.
  • Each of the probes comprises a segment of at least 7 nucleotides that is exactly complementary to a subsequence from a reference sequence, except that the segment may or may not be exactly complementary at two interrogation positions.
  • the nucleotides occupying the interrogation positions are selected by the following rules: (1) the first interrogation position is occupied by a different nucleotide in each of the four probes, (2) the second interrogation position is occupied by a different nucleotide in each of the four probes, (3) in first and second probes, the segment is exactly complementary to the subsequence, except at no more than one of the interrogation positions, (4) in third and fourth probes, the segment is exactly complementary to the subsequence, except at both of the interrogation positions.
  • the invention provides arrays of immobilized probes including helper mutations.
  • Helper mutations are useful for, e.g., preventing self-annealing of probes having inverted repeats.
  • the identity of a nucleotide in a target sequence is usually determined from a comparison of four probes.
  • a first probe comprises a segment of at least 7 nucleotides exactly complementary to a subsequence of a reference sequence except at one or two positions, the segment including an interrogation position not at the one or two positions. The one or two positions are occupied by helper mutations.
  • third and fourth mutant probes are each identical to a sequence comprising the wildtype probe or a subsequence thereof including the interrogation position and the one or two positions, except in the interrogation position, which is occupied by a different nucleotide in each of the four probes.
  • the invention provides arrays of probes comprising at least two probe sets, but lacking a probe set comprising probes that are perfectly matched to a reference sequence. Such arrays are usually employed in methods in which both reference and target sequence are hybridized to the array.
  • the first probe set comprises a plurality of probes, each probe comprising a segment exactly complementary to a subsequence of at least 3 nucleotides of a reference sequence except at an interrogation position.
  • the second probe set comprises a corresponding probe for each probe in the first probe set, the corresponding probe in the second probe set being identical to a sequence comprising the corresponding probe from the first probe set or a subsequence of at least three nucleotides thereof that includes the interrogation position, except that the interrogation position is occupied by a different nucleotide in each of the two corresponding probes and the complement to the reference sequence.
  • the invention provides methods of comparing a target sequence with a reference sequence comprising a predetermined sequence of nucleotides using any of the arrays described above. The methods comprise hybridizing the target nucleic acid to an array and determining which probes, relative to one another, in the array bind specifically to the target nucleic acid.
  • the relative specific binding of the probes indicates whether the target sequence is the same or different from the reference sequence.
  • the target sequence has a substituted nucleotide relative to the reference sequence in at least one undetermined position, and the relative specific binding of the probes indicates the location of the position and the nucleotide occupying the position in the target sequence.
  • a second target nucleic acid is also hybridized to the array. The relative specific binding of the probes then indicates both whether the target sequence is the same or different from the reference sequence, and whether the second target sequence is the same or different from the reference sequence.
  • the relative specific binding of probes in the first group indicates whether the target sequence is the same or different from the first reference sequence.
  • the relative specific binding of probes in the second group indicates whether the target sequence is the same or different from the second reference sequence.
  • Such methods are particularly useful for analyzing heterologous alleles of a gene. Some methods entail hybridizing both a reference sequence and a target sequence to any of the arrays of probes described above. Comparison of the relative specific binding of the probes to the reference and target sequences indicates whether the target sequence is the same or different from the reference sequence.
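The comparison of relative specific binding can be sketched as a simple base-calling step. This is a hypothetical illustration, not the patent's algorithm: intensities are keyed by the target base each probe would detect, the strongest probe wins, and the `min_ratio` threshold for an ambiguous call is an arbitrary assumption.

```python
def call_base(intensities, min_ratio=1.2):
    """Call the target base at one position from four probe intensities.

    Returns "N" when nothing bound or when the two strongest probes are too
    close to separate (threshold is an assumption of this sketch).
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    if top <= 0:
        return "N"
    if runner_up > 0 and top / runner_up < min_ratio:
        return "N"
    return best

def compare(reference, readings):
    """Call every position and list (index, reference_base, called_base)
    wherever the target differs from the reference."""
    calls = "".join(call_base(r) for r in readings)
    diffs = [
        (i, ref, call)
        for i, (ref, call) in enumerate(zip(reference, calls))
        if call not in (ref, "N")
    ]
    return calls, diffs

# Made-up hybridization intensities for a four-base reference.
readings = [
    {"A": 9.0, "C": 1.0, "G": 1.2, "T": 0.8},
    {"A": 1.0, "C": 8.0, "G": 0.9, "T": 1.1},
    {"A": 7.5, "C": 1.0, "G": 1.5, "T": 0.7},
    {"A": 1.0, "C": 1.1, "G": 0.9, "T": 6.0},
]
calls, diffs = compare("ACGT", readings)
print(calls, diffs)
```

Each reported difference gives both the location of the substituted position and the nucleotide occupying it in the target, as the text describes.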
  • the invention provides arrays of immobilized probes in which the probes are designed to tile a reference sequence from a human immunodeficiency virus.
  • Reference sequences from either the reverse transcriptase gene or protease gene of HIV are of particular interest.
  • Some chips further comprise arrays of probes tiling a reference sequence from a 16S RNA or DNA encoding the 16S RNA from a pathogenic microorganism.
  • the invention further provides methods of using such arrays in analyzing an HIV target sequence. The methods are particularly useful where the target sequence has a substituted nucleotide relative to the reference sequence in at least one position, the substitution conferring resistance to a drug used in treating a patient infected with HIV.
  • the methods reveal the existence of the substituted nucleotide.
  • the methods are also particularly useful for analyzing a mixture of undetermined proportions of first and second target sequences from different HIV variants.
  • the relative specific binding of probes indicates the proportions of the first and second target sequences.
  • the invention provides arrays of probes tiled based on reference sequence from a CFTR gene.
  • An exemplary array comprises at least a group of probes comprising a wildtype probe, and five sets of three mutant probes.
  • the wildtype probe is exactly complementary to a subsequence of a reference sequence from a cystic fibrosis gene, the segment having at least five interrogation positions corresponding to five contiguous nucleotides in the reference sequence.
  • the probes in the first set of three mutant probes are each identical to the wildtype probe, except in a first of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe.
  • the probes in the second set of three mutant probes are each identical to the wildtype probe, except in a second of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe.
  • the probes in the third set of three mutant probes are each identical to the wildtype probe, except in a third of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe.
  • the probes in the fourth set of three mutant probes are each identical to the wildtype probe, except in a fourth of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe.
  • the probes in the fifth set of three mutant probes are each identical to the wildtype probe, except in a fifth of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe.
  • a chip can comprise two such groups of probes.
  • the first group comprises a wildtype probe exactly complementary to a first reference sequence
  • the second group comprises a wildtype probe exactly complementary to a second reference sequence that is a mutated form of the first reference sequence.
  • the invention further provides methods of using the arrays of the invention for analyzing target sequences from a CFTR gene.
  • the methods are capable of simultaneously analyzing first and second target sequences representing heterozygous alleles of a CFTR gene.
  • the invention provides arrays of probes tiling a reference sequence from a p53 gene, an hMLH1 gene and/or an MSH2 gene.
  • the invention further provides methods of using the arrays described above to analyze these genes. The methods are useful, e.g., for diagnosing patients susceptible to developing cancer.
  • the invention provides arrays of probes tiling a reference sequence from a mitochondrial genome.
  • the reference sequence may comprise part or all of the D-loop region, or all, or substantially all, of the mitochondrial genome.
  • the invention further provides methods of using the arrays described above to analyze target sequences from a mitochondrial genome. The methods are useful for identifying mutations associated with disease, and for forensic, epidemiological and evolutionary studies.
  • the invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each
  • the sample of step (a) comprises a cell or a cell extract.
  • the method can further comprise providing two or more samples comprising a polypeptide.
  • One or more of the samples can be derived from a wild type cell and one sample can be derived from an abnormal or a modified cell.
  • the abnormal cell can be a cancer cell.
  • the modified cell can be a cell that is mutagenized and/or treated with a chemical, a physiological factor, or the presence of another organism (including, e.g., a eukaryotic organism, prokaryotic organism, virus, vector, prion, or part thereof), and/or exposed to an environmental factor or change or physical force (including, e.g., sound, light, heat, sonication, and radiation).
  • the modification can be genetic change (including, for example, a change in DNA or RNA sequence or content) or otherwise.
  • the method further comprises purifying or fractionating the polypeptide before the fragmenting of step (c).
  • the method can further comprise purifying or fractionating the polypeptide before the labeling of step (d).
  • the method can further comprise purifying or fractionating the labeled peptide before the chromatography of step (e).
  • the purifying or fractionating comprises a method selected from the group consisting of size exclusion chromatography, HPLC, reverse phase HPLC and affinity purification.
  • the method further comprises contacting the polypeptide with a labeling reagent of step (b) before the fragmenting of step (c).
  • the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: Z A OH and Z B OH, to esterify peptide C-terminals and/or Glu and Asp side chains; Z A NH 2 and Z B NH 2 , to form an amide bond with peptide C-terminals and/or Glu and Asp side chains; and Z A CO 2 H and Z B CO 2 H.
  • Z A and Z B independently of one another comprise the general formula R-Z 1 -A 1 -Z 2 -A 2 -Z 3 -A 3 -Z 4 -A 4 -; Z 1 , Z 2 , Z 3 , and Z 4 independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR 1 , S, SC(O), SC(S), SS, S(O), S(O 2 ), NR, NRR 1+ , C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR 1 , (Si(RR')O)n, SnRR 1 , Sn(RR')O, BR(OR'), BRR 1 , B
  • the alkyl group (see definition below) is selected from the group consisting of an alkenyl, an alkynyl and an aryl group.
  • One or more C-C bonds from (CRR')n can be replaced with a double or a triple bond; thus, in alternative aspects, an R or an R 1 group is deleted.
  • the (CRR')n can be selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein each group has none or up to 6 substituents.
  • the (CRR')n can be selected from the group consisting of a carbocyclic, a bicyclic and a tricyclic fragment, wherein the fragment has up to 8 atoms in the cycle with or without a heteroatom selected from the group consisting of an O atom, a N atom and an S atom.
  • two or more labeling reagents have the same structure but a different isotope composition.
  • Z A has the same structure as Z B .
  • Z A has a different isotope composition than Z B .
  • the isotope is boron-10 and boron-11; carbon-12 and carbon-13; nitrogen-14 and nitrogen-15; and, sulfur-32 and sulfur-34.
  • x is greater than y.
  • x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51.
  • the labeling reagent of step (b) can comprise the general formulae selected from the group consisting of: Z A OH and Z B OH to esterify peptide C-terminals; Z A NH 2 / Z B NH 2 to form an amide bond with peptide C-terminals; and, Z A CO 2 H / Z B CO 2 H to form an amide bond with peptide N-terminals; wherein Z A and Z B have the general formula R-Z 1 -A 1 -Z 2 -A 2 -Z 3 -A 3 -Z 4 -A 4 - ; Z 1 , Z 2 , Z 3 , and Z 4 , independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR 1 , S, SC(O), SC(S), SS, S(O), S(O 2 ), NR, NRR 1+ ,
  • a single C-C bond in a (CRR')n group is replaced with a double or a triple bond; thus, the R and R 1 can be absent.
  • the (CRR')n can comprise a moiety selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein the group has none or up to 6 substituents.
  • the group can comprise a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom.
  • R and R 1 , independently from other R and R 1 in Z 1 - Z 4 and independently from other R and R 1 in A 1 - A 4 , are selected from the group consisting of a hydrogen atom, a halogen and an alkyl group.
  • the alkyl group (see definition below) can be an alkenyl, an alkynyl or an aryl group.
  • the "n" in Z 1 - Z 4 is independent of n in A 1 - A 4 and is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11 and about 6.
  • Z A has the same structure as Z B but Z A further comprises x number of -CH 2 - fragment(s) in one or more A 1 - A 4 fragments, wherein x is an integer. In one aspect, Z A has the same structure as Z B but Z A further comprises x number of -CF 2 - fragment(s) in one or more A 1 - A 4 fragments, wherein x is an integer. In one aspect, Z A comprises x number of protons and Z B comprises y number of halogens in the place of protons, wherein x and y are integers.
  • Z A contains x number of protons and Z B contains y number of halogens, and there are x - y number of protons remaining in one or more A 1 - A 4 fragments, wherein x and y are integers.
  • Z A further comprises x number of -O- fragment(s) in one or more A 1 - A 4 fragments, wherein x is an integer.
  • Z A further comprises x number of -S- fragment(s) in one or more A 1 - A 4 fragments, wherein x is an integer.
  • Z A further comprises x number of -O- fragment(s) and Z B further comprises y number of -S- fragment(s) in the place of -O- fragment(s), wherein x and y are integers.
  • Z A further comprises x - y number of -O- fragment(s) in one or more A 1 - A 4 fragments, wherein x and y are integers.
  • x and y are integers selected from the group consisting of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21; between 1 and about 11; and between 1 and about 6, wherein x is greater than y.
  • n, m and y are integers selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51.
  • two or more labeling reagents have the same structure but a different isotope composition.
  • An exemplary labeling reagent pair is N, N, dimethyl-iodoacetamide and N, N, d6-dimethyl-iodoacetamide, having the structures:
  • the methyl group can be replaced by any lower alkyl group (e.g., ethyl, butyl and the like).
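As a rough illustration of how such a reagent pair is distinguished by mass, the sketch below computes the mass shift between the unlabeled and d6 forms, assuming the only difference is six hydrogen atoms replaced by deuterium. The function names are illustrative and do not come from the source; the isotope masses are standard monoisotopic values.

```python
# Monoisotopic masses in unified atomic mass units (standard values).
MASS_H = 1.00782503
MASS_D = 2.01410178


def label_mass_shift(n_deuterium):
    """Mass difference between a light label and its heavy analog in which
    n_deuterium hydrogens are replaced by deuterium (illustrative helper)."""
    return n_deuterium * (MASS_D - MASS_H)


def mz_spacing(n_deuterium, charge, labels_per_peptide=1):
    """Spacing on the m/z axis between the light and heavy forms of a peptide
    carrying labels_per_peptide labels at the given charge state."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return labels_per_peptide * label_mass_shift(n_deuterium) / charge


shift = label_mass_shift(6)
print(f"d0/d6 mass shift: {shift:.4f} Da")
for z in (1, 2, 3):
    print(f"charge {z}+: m/z spacing {mz_spacing(6, z):.4f}")
```

The spacing shrinks with charge state, so a doubly charged peptide with one label shows its light and heavy peaks about 3.02 m/z units apart.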
  • the separating of step (e) comprises a liquid chromatography system, such as a multidimensional liquid chromatography (e.g., a system of the invention) or a capillary chromatography system.
  • the mass spectrometer comprises a tandem mass spectrometry device or an ion trap mass spectrometer (LCQ or LTQ) or a combination thereof.
  • the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™ (Thermo Electron Corporation, San Jose, CA), or Agilent's LC/MSD Trap (Agilent Technologies, Palo Alto, CA), or an equivalent mass spectrometer.
  • the Agilent LC/MSD Trap is an 1100 series LC/MSD TRAP™, or, the LC/MSD Trap SL™, or, the LC/MSD Trap XCT™ (Agilent Technologies, Palo Alto, CA), or equivalent device.
  • the method further comprises quantifying the amount of each polypeptide or each peptide.
  • the invention provides a method for defining the expressed proteins associated with a given cellular state, the method comprising the following steps: (a) providing a sample comprising a cell in the desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cell into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the a
  • the invention provides a method for quantifying changes in protein expression between at least two cellular states, the method comprising the following steps: (a) providing at least two samples comprising cells in a desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cells into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents, wherein the labels used in one sample are different from the labels used in other samples; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (
  • the invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by multidimensional liquid chromatography to generate an eluate; (f) feeding the eluate of step (e) into a tandem mass spectrometer or an ion trap mass spectrometer or a
  • the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
  • the invention provides a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope.
  • the isotope(s) can be in the first domain or the second domain.
  • the isotope(s) can be in the biotin.
  • the isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope.
  • the chimeric labeling reagent can comprise two or more isotopes.
  • the chimeric labeling reagent reactive group capable of covalently binding to an amino acid can be a succinimide group, an isothiocyanate group or an isocyanate group.
  • the reactive group capable of covalently binding to an amino acid can bind to a lysine or a cysteine.
  • the chimeric labeling reagent can further comprise a linker moiety linking the biotin group and the reactive group.
  • the linker moiety can comprise at least one isotope.
  • the linker is a cleavable moiety that can be cleaved by, e.g., enzymatic digest or by reduction.
  • the invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the small molecule tags are structurally identical but differ in their isotope composition, and the small molecules comprise reactive groups that covalently bind to cysteine or lysine residues or both; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) determining the protein concentrations of each sample in a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof; and, (e) comparing relative protein concentrations of each sample.
  • the sample comprises a complete or a fractionated cellular sample.
  • the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ
  • the differential small molecule tags comprise a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and, (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope.
  • the isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope.
  • the chimeric labeling reagent can comprise two or more isotopes.
  • the reactive group capable of covalently binding to an amino acid can be selected from the group consisting of a succinimide group, an isothiocyanate group and an isocyanate group.
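The comparison step of the method above can be sketched as follows, assuming each peptide of a protein yields one intensity for the light-tagged sample and one for the heavy-tagged sample. The intensities are hypothetical, and summarizing per-peptide ratios by their median is an illustrative choice, not something the source specifies.

```python
from statistics import median


def relative_abundance(peptide_pairs):
    """Return the median heavy/light intensity ratio for one protein.

    peptide_pairs is a list of (light_intensity, heavy_intensity) tuples,
    one per tagged peptide. Pairs with a non-positive light intensity are
    skipped because their ratio is undefined.
    """
    ratios = [heavy / light for light, heavy in peptide_pairs if light > 0]
    if not ratios:
        raise ValueError("no usable light/heavy pairs")
    return median(ratios)


# Hypothetical intensities for three peptides of the same protein.
pairs = [(1000.0, 2100.0), (500.0, 950.0), (800.0, 1600.0)]
print(relative_abundance(pairs))
```

A ratio near 1 would indicate similar concentrations in the two samples; the median is used here only because it is less sensitive to a single outlying peptide than the mean.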
  • the invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the differential small molecule tags comprise a chimeric labeling reagent comprising (i) a first domain comprising a biotin; and, (ii) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) isolating the tagged polypeptides on a biotin-binding column by binding tagged polypeptides to the column, washing non-bound materials off the column, and eluting
  • the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
  • the invention provides chromatography systems comprising a first reverse phase column (RPC) (a first dimension), an ion exchange column (e.g., a cation (CX) or anion exchange column) (a second dimension), a second reverse phase column (RPC) (a third dimension), wherein the first reverse phase column (RPC), the ion exchange column (e.g., a cation (CX) or anion exchange column) and the second reverse phase column (RPC) are connected in series; the first reverse phase column (RPC) has a free distal end and a proximal end connected to the ion exchange column (e.g., a cation (CX) or anion exchange column), or, first reverse phase column (RPC) is configured such that either the distal end or the proximal
  • the second reverse phase column (RPC), or the first reverse phase column (RPC), or both are connected to an analytical device on its distal end such that an eluate can be fed into the analytical device.
  • the analytical device can comprise a mass spectrometer.
  • the mass spectrometer can further comprise a nano-spray apparatus.
  • the mass spectrometer comprises a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof.
  • the ion exchange column (e.g., a cation (CX) or anion exchange column) and the second reverse phase column (RPC) are enclosed in one housing and the first reverse phase column (RPC) is enclosed in a second housing.
  • the three dimensions, or columns are all in different housings, or, the columns are arranged such that they can be easily, and individually, replaced.
  • the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
  • a flow valve, e.g., a low volume flow valve (e.g., a microvalve), and/or an inline microfilter assembly connects the various columns (e.g., the various housings).
  • each dimension, or column, is in a different housing and one, two or all of the housings are connected to each other by a flow valve, e.g., an inline microfilter assembly, and the like.
  • a flow valve separates the first housing and the second housing.
  • a flow valve e.g., a low volume flow valve and/or an inline microfilter assembly connects the first reverse phase column (RPC) to the ion exchange column (e.g., a cation (CX) or anion exchange column) and the second reverse phase column (RPC).
  • the first reverse phase column (RPC), the ion exchange column and the second reverse phase column (RPC) are enclosed in one housing.
  • inputs and/or outputs to or from any or all of the columns are fitted with valves.
  • the flow valve is a one-way, a two-way, a three-way (a "T-valve") or a four way valve.
  • the housing(s) comprise fused silica capillaries.
  • a valve is fitted on the distal end of either reverse phase column (the end not connected to the ion exchange), or both distal ends of the reverse phase columns (this alternative aspect can be in addition to having a valve between the first reverse phase column and the ion exchange/ second reverse phase column assembly).
  • the flow valve is a one-way, a two-way, a three-way (a "T-valve") or a four way valve.
  • this valve or valves are a flow valve, e.g., a low volume flow valve.
  • the valve connection assembly can further comprise an inline microfilter assembly.
  • the system of the invention is fully automated.
  • the system can comprise a sample injector fully integrated with the automated system.
  • the system is integrated to a computer, which can be programmed to run samples on the system, including equilibrating columns, washing, step elution of samples, and the like.
  • an automated system of the invention is used for high throughput proteome profiling with on-line sample collection.
  • the first, second or both reverse phase columns are packed with a reverse phase resin or equivalent.
  • the first, second or both reverse phase resins can comprise a C18 reverse phase resin or equivalent.
  • the ion exchange column can comprise a strong cation exchange (SCX) resin or equivalent.
  • the strong cation exchange (SCX) resin can comprise a polysulfoethyl A strong cation exchange resin.
  • the first reverse phase column (RPC), the second reverse phase column (RPC), or both are connected to an HPLC on a distal end.
  • the first reverse phase column has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more, greater capacity than the second reverse phase column (RPC) (which, in one aspect, is the third dimension in an exemplary 3-D LC-MS/MS or 3D LC LCQ MS/MS)
  • the first reverse phase column has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more, resin than the second reverse phase column (RPC), e.g.
  • the loading capacity is proportional to the column dimension.
  • the loading capacity is approximately 100 µg protein digest per 10 cm × 180 µm C18 column, up to milligram sized sample.
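The proportionality between loading capacity and column dimension can be sketched numerically. The source does not say which dimension is meant, so the example below assumes capacity scales with packed bed volume (length times the square of the inner diameter) and uses the stated 100 µg per 10 cm × 180 µm column as the reference point. The function name and the scaling rule are assumptions.

```python
# Reference point taken from the text: about 100 ug protein digest
# on a 10 cm x 180 um C18 column.
REF_CAPACITY_UG = 100.0
REF_LENGTH_CM = 10.0
REF_INNER_DIAMETER_UM = 180.0


def loading_capacity_ug(length_cm, inner_diameter_um):
    """Estimate loading capacity in micrograms, ASSUMING capacity is
    proportional to bed volume (length * inner_diameter**2)."""
    if length_cm <= 0 or inner_diameter_um <= 0:
        raise ValueError("column dimensions must be positive")
    length_ratio = length_cm / REF_LENGTH_CM
    area_ratio = (inner_diameter_um / REF_INNER_DIAMETER_UM) ** 2
    return REF_CAPACITY_UG * length_ratio * area_ratio


for length, diameter in [(10, 180), (20, 180), (20, 360)]:
    capacity = loading_capacity_ug(length, diameter)
    print(f"{length} cm x {diameter} um: ~{capacity:.0f} ug")
```

Under this assumption, doubling both the length and the inner diameter gives roughly an eightfold capacity, which is one way a capillary format could reach milligram-sized samples.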
  • the chromatography systems can further comprise a computer system operatively linked to the chromatography system, thereby making the chromatography system an automated operation.
  • the chromatography systems can further comprise a computer system operatively linked to the mass spectrometer for quantifying the amount of each peptide by use of data from the mass spectrometer.
  • the chromatography systems can further comprise a computer system operatively linked to the mass spectrometer for generating the sequence of each peptide by use of data from the mass spectrometer.
  • the invention provides mixed bed multi-dimensional liquid chromatographs comprising a first resin bed (a first dimension), a second resin bed (a second dimension) and a third resin bed (a third dimension) connected in series, wherein the first resin bed comprises a reverse phase resin, the second resin bed comprises an ion exchange (e.g., a cation or anion exchange) resin bed and the third resin bed comprises a reverse phase resin, and the reverse phase resin of the first bed has a free distal end and a proximal end connected to the ion exchange bed, or, the reverse phase resin of the first bed is configured such that the distal end and/or the proximal end are connected to the ion exchange column such that a sample can be loaded into and eluted out of first reverse phase column (RPC) to the ion exchange column from the same end (which can be either the distal end or the proximal end), and the reverse phase resin of the third bed has a free distal end and a proximal end connected to the ion
  • the reverse phase resin of the first bed has a greater capacity than the reverse phase resin of the third bed, or, the reverse phase resin of the third bed has a greater capacity than the reverse phase resin of the first bed.
  • the reverse phase resin of the first bed, the reverse phase resin of the third bed, or both can be connected to an analytical device such that an eluate can be fed into the analytical device.
  • the loading capacity is proportional to the column dimension. For example, in one aspect, the loading capacity is approximately 100 µg protein digest per 10 cm × 180 µm C18 column, or equivalent, up to milligram sized sample.
  • the analytical device comprises a mass spectrometer.
  • the mass spectrometer can further comprise a nano-spray apparatus.
  • the mass spectrometer can comprise a tandem mass spectrometer or an ion trap mass spectrometer (LCQ or LTQ) or a combination thereof.
  • the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
  • each resin bed is enclosed in a separate housing (for, in some aspects, easy, independent replacement of any individual resin bed).
  • the second resin bed and a third resin bed are enclosed in one housing and the first resin bed is enclosed in a second housing.
  • a flow valve e.g., a low volume flow valve and/or an inline microfilter assembly, connects each housing to each other and/or to any inputs or outputs.
  • a flow valve connects the first housing and the second housing.
  • the inline microfilter assembly can further comprise a valve, e.g., a one way or two way valve.
  • a flow valve (e.g., a low volume flow valve, or a directional control flow valve, e.g., a one way or two way flow valve) and/or an inline microfilter assembly connects the first bed to the second and third resin beds.
  • the first reverse phase resin bed, the ion exchange resin bed and the second reverse phase resin bed are enclosed in one housing.
  • the mixed bed multi-dimensional liquid chromatographs of the invention are fully automated.
  • the chromatographs can comprise a sample injector fully integrated with the automated system.
  • the chromatographs of the invention are integrated to a computer, which can be programmed to run samples, including equilibrating columns, washing, step elution of samples, and the like.
  • chromatographs of the invention are used for high throughput proteome profiling with on-line sample collection. See Figure 22 for an exemplary automated chromatograph system of the invention.
  • the reverse phase resin of the first bed, the reverse phase resin of the third bed or both reverse phase resin beds are packed with a Cx reverse phase resin or equivalent, wherein X is an integer between five and thirty.
  • the Cx reverse phase resin or equivalent comprises a C18 reverse phase resin or equivalent.
  • the ion exchange bed is packed with a strong cation exchange (SCX) resin or equivalent.
  • the strong cation exchange resin (SCX) can comprise a polysulfoethyl A strong cation exchange resin.
  • the reverse phase resin of the first bed, or the reverse phase resin of the third bed, or both are connected to an HPLC.
  • the first reverse phase resin bed has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more, greater capacity than the second reverse phase resin bed.
  • the mixed bed multi-dimensional liquid chromatographs further comprise a computer system operatively linked to the chromatography system, thereby making the chromatography system an automated operation.
  • the mixed bed multi-dimensional liquid chromatographs further comprise a computer system operatively linked to the mass spectrometer for quantifying the amount of each peptide by use of data from the mass spectrometer.
  • the mixed bed multi-dimensional liquid chromatographs further comprise a computer system operatively linked to the mass spectrometer for generating the sequence of each peptide by use of data from the mass spectrometer.
  • the invention provides methods for separating proteins comprising the following steps: (a) providing a sample comprising a polypeptide; (b) fragmenting the polypeptide into peptide fragments; and (c) separating the peptides by chromatography to generate an eluate using a chromatography system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention.
  • the peptide fragments are loaded into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system.
  • the peptide fragments are eluted through the distal end of the reverse phase resin of the first bed and/or the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph, or the peptide fragments are eluted through the distal end of the first or the second RP column of the chromatography system. In one aspect, the peptide fragments are eluted through the same end from which they were loaded.
  • the peptide fragments can be generated by enzymatic digestion or by non-enzymatic fragmentation. The enzymatic digestion can be by trypsin, endoproteinase or a combination thereof.
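To illustrate the enzymatic fragmentation step, the sketch below performs a simple in-silico tryptic digest. It uses the commonly applied rule that trypsin cleaves on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P). The function name and the example sequence are hypothetical, and missed cleavages are not modeled.

```python
def tryptic_digest(sequence):
    """Split a protein sequence (one-letter codes) into tryptic peptides.

    Cleaves after K or R, except when the following residue is P.
    Missed cleavages are not modeled in this sketch.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Whatever remains after the final cleavage site is the last peptide.
        peptides.append(sequence[start:])
    return peptides


# Hypothetical sequence; the R at position 6 is followed by P, so it is not cleaved.
print(tryptic_digest("MKTAYRPLK"))
```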
  • the peptide fragments are loaded into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system without desalting or removing the detergent, or both.
  • the peptide fragments can be solubilized in a detergent or a denaturing agent before loading into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system, and, in one aspect, loaded without having to remove the detergent.
  • the system is used to analyze membrane proteins, or other hydrophobic proteins or compounds (e.g., organic compounds, e.g., steroids, fats, lipopolysaccharides) by loading samples without removing detergents.
  • the detergent or denaturing agent is SDS or urea.
  • the multi-dimensional chromatographs of the invention are detergent tolerant, and thus are excellent for membrane proteins or any protein or compound needing detergent to be solubilized.
  • the peptide fragments are loaded into reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system using a pressure bomb.
  • the method can further comprise feeding the eluate into a mass spectrometer and quantifying the amount of each peptide.
  • the method can further comprise feeding the eluate into a mass spectrometer and generating the sequence of each peptide by use of the mass spectrometer.
  • the method can further comprise inputting the sequence into a computer program product to compare the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which a sequenced peptide originated.
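The database comparison can be sketched in its simplest form as an exact substring search of a sequenced peptide against stored polypeptide sequences. This is only a toy illustration: the database entries and names below are invented, and real search tools generally score spectra against candidate peptides rather than relying on exact string matches.

```python
def find_source_proteins(peptide, database):
    """Return the sorted names of all database entries whose sequence
    contains the peptide as an exact substring."""
    if not peptide:
        raise ValueError("peptide sequence must not be empty")
    return sorted(name for name, seq in database.items() if peptide in seq)


# Hypothetical database of polypeptide sequences (one-letter codes).
toy_database = {
    "protein_A": "MKTAYRPLKGSW",
    "protein_B": "MSTAYRPLKE",
    "protein_C": "MGGHHLLK",
}

print(find_source_proteins("TAYRPLK", toy_database))
print(find_source_proteins("GHHLLK", toy_database))
```

A peptide that matches more than one entry, as in the first call, cannot by itself identify a unique source protein; additional peptides from the same protein are needed to resolve the ambiguity.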
  • the separating of step (c) comprises (i) loading a labeled peptide mixture into the first reverse phase column (RPC) of the chromatography system or the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph, wherein the first RPC or first reverse phase resin bed absorbs a plurality of peptides; (ii) eluting a fraction of the first RPC-absorbed or first resin bed-absorbed plurality of peptides to the ion exchange column (e.g., a cation (CX) or anion exchange column) of the chromatography system or the ion exchange (CX) resin bed of the mixed bed multidimensional liquid chromatograph, using a reverse phase gradient; (iii) eluting a fraction of the ion exchange column-absorbed or CX resin bed-absorbed plurality of peptides onto the second reverse phase column (RPC) of the chromatography system or the reverse phase resin of the third bed of the mixed bed multi-dimensional
  • the plurality of peptides eluted in step (iv) can be eluted through the distal end of the second reverse phase column (RPC) of the chromatography system or the distal end of the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph.
  • the plurality of peptides eluted in step (iv) is eluted back through the proximal end of the second RPC of the chromatography system or the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph, through the ion exchange column (e.g., a cation (CX) or anion exchange column) of the chromatography system or CX resin bed of the mixed bed multi-dimensional liquid chromatograph, and back through the proximal end of the first RPC of the chromatography system or the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph, and the eluate passes through the distal end of the first RPC or the first reverse phase resin bed.
  • step (iv) the fraction of the second RPC-absorbed or third resin bed-absorbed plurality of peptides are eluted using the same reverse phase gradient used to elute the first RPC-absorbed or first resin bed-absorbed fraction of peptides in step (ii).
  • the method further comprises: after step (iii) is completed and before the step (iv) eluting a fraction of the second RPC-absorbed or second reverse phase resin bed-absorbed plurality of peptides is begun, washing the column free of the salts and buffers used to elute a fraction of the ion exchange column-absorbed or CX resin bed-absorbed plurality of peptides.
  • a discrete fraction of the first RPC-absorbed or first resin bed-absorbed plurality of peptides is eluted to the ion exchange column (e.g., a cation (CX) or anion exchange column) of the chromatography system or the ion exchange (CX) resin bed of the mixed bed multi-dimensional liquid chromatograph using a reverse phase gradient.
  • the reverse phase gradient comprises (X n -X n+1 %B) over 120 minutes with a flow rate of 250 nl/min.
  • the salt gradient steps comprise 12 salt gradient steps comprising 25 mM, 50 mM, 75 mM, 100 mM, 125 mM, 150 mM, 175 mM, 200 mM, 225 mM, 250 mM, and 2 M ammonium acetate, or equivalent.
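The stepwise elution described above can be laid out as a run schedule. The sketch below uses the ammonium acetate concentrations as listed, the 120 minute gradient and the 250 nl/min flow rate from the text. The first-dimension %B segment boundaries are hypothetical placeholders, and the assumption that every salt step is followed by one reverse phase gradient is an illustrative reading of steps (ii) to (iv).

```python
# Ammonium acetate steps as listed in the text, in mM (2 M = 2000 mM).
SALT_STEPS_MM = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 2000]
GRADIENT_MINUTES = 120
FLOW_NL_PER_MIN = 250


def gradient_volume_ul(minutes=GRADIENT_MINUTES, flow_nl_per_min=FLOW_NL_PER_MIN):
    """Mobile phase volume delivered during one gradient, in microliters."""
    return minutes * flow_nl_per_min / 1000.0


def build_schedule(rp_segments, salt_steps_mm=SALT_STEPS_MM):
    """Return an ordered list of (stage, description) tuples.

    rp_segments holds (start_percent_B, end_percent_B) pairs for the first
    dimension; the values passed below are hypothetical.
    """
    schedule = []
    for start_b, end_b in rp_segments:
        schedule.append(("first-dimension RP segment", f"{start_b}-{end_b} %B"))
        for salt in salt_steps_mm:
            schedule.append(("salt step", f"{salt} mM ammonium acetate"))
            schedule.append(
                ("final RP gradient",
                 f"{GRADIENT_MINUTES} min at {FLOW_NL_PER_MIN} nl/min")
            )
    return schedule


runs = build_schedule([(0, 10), (10, 20)])
print(f"{len(runs)} scheduled stages")
print(f"volume per gradient: {gradient_volume_ul():.1f} ul")
```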
  • the method further comprises labeling the peptide fragments before loading them into the chromatography system or the mixed bed multi-dimensional liquid chromatograph.
  • the sample can be derived from a cell, a seed or a spore.
  • the cell can be a prokaryotic cell or a eukaryotic cell.
  • the cell, seed or spore can be derived from a bacterium, a yeast, an insect, a plant, a fungus, a protozoan or a mammal.
  • the mammalian cell can be a human cell or a mouse cell.
  • the bacterial cell or spore can be a Bacillus anthracis.
  • the invention provides methods for separating and detecting proteins by differential labeling of peptides.
  • the method comprises the following steps: (a) providing at least two samples comprising a polypeptide; (b) providing at least two sets of labeling reagents (e.g., at least one pair of labeling reagents), wherein each set of labeling reagent differs in molecular mass from the other sets (e.g., wherein each member of a pair differs in molecular mass from the second member of a pair) and the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptides into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), wherein each sample is labeled with a different labeling reagent, thereby differentially labeling the peptides; (e) separating the labeled peptides by chromatography to generate an eluate using a chromatography system of the invention or
  • the method further comprises a step (f) comprising feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer.
  • the method further comprises providing two or more samples from different sources.
  • one sample is derived from a wild type cell and one sample is derived from an abnormal or a modified cell.
  • the abnormal cell can be a cancer cell.
  • the peptide fragments can be labeled with a reagent comprising a general formula selected from the group consisting of: Z A OH for labeling at least a first sample and Z B OH for labeling at least a second sample, to esterify peptide C-terminals and/or Glu and Asp side chains; Z A NH 2 for labeling at least a first sample and Z B NH 2 for labeling at least a second sample, to form an amide bond with peptide C-terminals and/or Glu and Asp side chains; and Z A CO 2 H for labeling at least a first sample and Z B CO 2 H for labeling at least a second sample to form an amide bond with peptide N-terminals and/or Lys and Arg side chains; wherein Z A and Z B independently of one another comprise the general formula R-Z 1 -A 1 -Z 2 -A 2 -Z 3 -A 3 -Z 4 -A 4 - , Z 1 , Z 2 , Z 3 , and Z 4 independently of
  • the alkyl group is selected from the group consisting of an alkenyl, an alkynyl and an aryl group.
  • one or more C-C bonds from (CRR 1 ) n are replaced with a double or a triple bond.
  • an R and/or an R 1 group are absent.
  • (CRR 1 ) n is selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein each group has none or up to 6 substituents.
  • (CRR 1 ) n is selected from the group consisting of a carbocyclic, a bicyclic and a tricyclic fragment, wherein the fragment has up to 8 atoms in the cycle with or without a heteroatom selected from the group consisting of an O atom, a N atom and an S atom.
  • two or more labeling reagents have the same structure but a different isotope composition.
  • Z A can have the same structure as Z B , but Z A has a different isotope composition than Z B .
  • the isotopes can be boron-10 and boron-11, carbon-12 and carbon-13, nitrogen-14 and nitrogen-15, or sulfur-32 and sulfur-34.
  • the isotope with the lower mass can be x and the isotope with the higher mass is y, and x and y are integers, x is greater than y. In one aspect, x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51.
  • the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: i. Z A OH for labeling at least a first sample and Z B OH for labeling at least a second sample to esterify peptide C-terminals; ii. Z A NH 2 for labeling at least a first sample and Z B NH 2 for labeling at least a second sample to form an amide bond with peptide C-terminals; and iii.
  • a single C-C bond in a (CRR 1 ) n group is replaced with a double or a triple bond.
  • R and R 1 are absent.
  • (CRR 1 ) n comprises a moiety selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein the group has none or up to 6 substituents.
  • the (CRR 1 ) n group comprises a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom.
  • R, R 1 independently from other R and R 1 in Z 1 - Z 4 and independently from other R and R 1 in A 1 - A 4 , are selected from the group consisting of a hydrogen atom, a halogen and an alkyl group.
  • alkyl group is selected from the group consisting of an alkenyl, an alkynyl and an aryl group.
  • n in Z 1 - Z 4 is independent of n in A 1 - A 4 and is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11 and about 6.
  • Z A has the same structure as Z B but Z A further comprises x number of -CH 2 - fragment(s) in one or more A 1 - A 4 fragments, wherein x is an integer. In one aspect, Z A has the same structure as Z B but Z A further comprises x number of -CF 2 - fragment(s) in one or more A 1 - A 4 fragments, wherein x is an integer. In one aspect, Z A comprises x number of protons and Z B comprises y number of halogens in the place of protons, wherein x and y are integers.
  • Z A contains x number of protons and Z B contains y number of halogens, and there are x - y number of protons remaining in one or more A 1 - A 4 fragments, wherein x and y are integers.
  • Z A further comprises x number of -O- fragment(s) in one or more A 1 - A 4 fragments, wherein x is an integer.
  • Z A further comprises x number of -S- fragment(s) in one or more A 1 - A 4 fragments, wherein x is an integer.
  • Z A further comprises x number of -O- fragment(s) and Z B further comprises y number of -S- fragment(s) in the place of -O- fragment(s), wherein x and y are integers.
  • Z A further comprises x - y number of -O- fragment(s) in one or more A 1 - A 4 fragments, wherein x and y are integers.
  • x and y are integers independently selected from the group consisting of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21, between 1 and about 11 and between 1 and about 6, wherein x is greater than y.
  • the labeling reagent pair used in the method is N,N-dimethyl-iodoacetamide and N,N-d6-dimethyl-iodoacetamide. [Structures: N,N-dimethyliodoacetamide and N,N-dimethyl-d6-iodoacetamide]
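To make the mass distinction between a light and a heavy labeling reagent concrete, the sketch below estimates the mass shift introduced by a d6 label relative to its unlabeled counterpart, and the resulting peak spacing at a given charge state. The function names and the `n_labels` parameter (number of labeled sites per peptide) are assumptions made for illustration; the monoisotopic masses of hydrogen and deuterium are standard values.

```python
# Illustrative sketch only; function names and n_labels are hypothetical.
MASS_H = 1.00782503   # monoisotopic mass of protium, Da
MASS_D = 2.01410178   # monoisotopic mass of deuterium, Da


def label_mass_shift(n_deuterium: int, n_labels: int = 1) -> float:
    """Mass difference (Da) between a heavy-labeled and a light-labeled peptide."""
    return n_labels * n_deuterium * (MASS_D - MASS_H)


def mz_spacing(n_deuterium: int, charge: int, n_labels: int = 1) -> float:
    """Spacing in m/z between the light and heavy peaks of a pair at a given charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return label_mass_shift(n_deuterium, n_labels) / charge


if __name__ == "__main__":
    print(f"d6 vs d0, one label: {label_mass_shift(6):.4f} Da")
    print(f"peak spacing at 2+: {mz_spacing(6, 2):.4f} m/z")
```

A singly labeled d6/d0 pair is therefore separated by roughly 6.04 Da, which appears as about 3.02 m/z for a doubly charged ion.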
  • the invention provides methods for separating and detecting a hydrophobic protein (e.g., membrane protein) or a hydrophobic compound, the method comprising the following steps: (a) providing a sample comprising the hydrophobic protein (e.g., membrane protein) or the hydrophobic compound; (b) solubilizing the hydrophobic protein (e.g., membrane protein) or the hydrophobic compound in a detergent or urea; (c) loading the detergent or urea solubilized hydrophobic protein or hydrophobic compound into a chromatography system of the invention or
  • the hydrophobic protein is a membrane protein such as an integral membrane protein, e.g., a protein expressed on the surface of a pathogenic cell or a cancer cell.
  • the hydrophobic compound can be a lipid or a steroid.
  • the invention provides computer program products comprising a computer useable medium having computer program logic recorded thereon for analyzing data generated by a chromatography system, said computer program logic comprising computer program code logic configured to perform operations as set forth in Figure 17, Figure 18, Figure 19, Figure 20 or Figure 21.
  • the invention provides computer program products wherein the chromatography system comprises a system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention.
  • the invention provides computer-implemented methods for analyzing data generated by a chromatography system comprising the following steps: providing a chromatography system capable of outputting data to a computer; providing a computer capable of storing and analyzing data input from the chromatography system comprising a computer program product embodied therein, wherein the computer program product comprises a computer program product of the invention; and, inputting the data from the chromatography system into the computer and analyzing data input from the chromatography system.
  • the chromatography system comprises a system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention.
  • an exemplary computer-implemented method comprises an LC-MS data file operatively linked to a component extraction file, operatively linked to a precursor integration and series reconstruction files, operatively linked to a progression file, as schematically illustrated in Example 17.
  • the component extraction aspect of the computer-implemented method is schematically illustrated in Figure 18.
  • the invention provides quantitative proteomics systems comprising a chromatography system comprising a system of the invention or a mixed bed multidimensional liquid chromatograph of the invention, wherein the system is capable of outputting data to a processor; a processor; and a computer program product of the invention embodied within the processor.
  • the invention provides methods for fractionating a proteome of a cell comprising (a) providing a chromatography system comprising a system as set forth in claim 1 or a mixed bed multi-dimensional liquid chromatograph of claim 25; (b) providing a proteome preparation; and (c) fractionating the proteome preparation with the chromatography system, wherein 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41 %, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51 %, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%
  • the invention also provides methods of the invention comprising use of a computer-implemented method for analyzing data generated by a chromatography system comprising the following steps: (a) providing a chromatography system capable of outputting data to a computer; (b) providing a computer capable of storing and analyzing data input from the chromatography system comprising a computer program product embodied therein, wherein the computer program product comprises a computer program product of the invention; (c) inputting the data from the chromatography system into the computer and analyzing data input from the chromatography system.
  • the invention provides quantitative proteomics systems comprising: (a) a chromatography system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention, and a mass spectrometer, wherein the system is capable of outputting data to a processor; (b) a processor; and (c) a computer program product (e.g., a computer program product of the invention) embodied within the processor.
  • the mass spectrometer comprises an ion trap mass spectrometer, such as a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
  • FIG. 1 illustrates an exemplary process of the invention wherein samples are combined, separated by multidimensional chromatography, and analyzed by mass spectrometry methods, as described in detail, below.
  • Figure 2 is an illustration of a MALDI MS spectrum of peptide pairs, as described in detail, below.
  • Figure 3 illustrates an exemplary 3D LC set-up and process, as described in detail, below.
  • Figure 4 illustrates an exemplary multi-dimensional chromatography apparatus of the invention, as described in detail in Example 3, below.
  • Figure 5 graphically depicts the statistics of an exemplary mixed resin chromatography analysis and protein identification (Figure 5A graphically depicts the # MS/MS spectra; Figure 5B graphically depicts the annotated spectra (%); Figure 5C graphically depicts the # protein ID), as described in detail in Example 3, below.
  • Figure 6 gives a three-dimensional view of proteins identified using the exemplary apparatus and methods of the invention, as described in detail in Example 3, below.
  • Figure 6A shows an overlay of the predicted (and also observed) membrane proteins (solid circles) over the total population (open circles).
  • Certain functional classes are depicted by the overlays in Figures 6B, 6C, and 6D, illustrating the class of proteins belonging to "protein synthesis", "glycolysis" and "protein glycosylation", respectively, as described in detail in Example 3, below.
  • Figure 7 illustrates the sequence of pyruvate decarboxylase set forth in SEQ ID NO: 1 as generated using an exemplary chromatography system and method of the invention, as described in detail in Example 3, below.
  • Figure 8 illustrates an exemplary method of the invention, as described in detail in Examples 3 and 4, below.
  • Figure 9 illustrates an exemplary sample preparation protocol of the invention, see Example 4, below.
  • Figure 10 illustrates the results of salt extraction subfractions in a reverse phase sub-fraction for analysis of the B.
  • Figure 11 illustrates the results of an analysis of a B. anthracis proteome using a chromatography system of the invention, as described in Example 4, below.
  • Figure 12 summarizes a "matrix" of protein distribution from different B. anthracis samples, as described in Example 4, below.
  • Figure 13 summarizes the discovered protein distribution by "role" category.
  • Figure 14 illustrates an exemplary multi-dimensional chromatography apparatus of the invention, as described in detail in Example 3, below.
  • Figure 15 describes the metabolic pathways identified in the yeast proteome using an exemplary multi-dimensional chromatography apparatus and methods of the invention, as described in detail in Example 3, below.
  • Figure 16 illustrates proteins (highlighted in blue) from the glycolysis pathway identified using this system.
  • Figure 17 is a schematic, a flow chart, illustrating an exemplary data analysis algorithm of the invention for quantitative proteomics.
  • Figure 18 is a schematic, a flow chart, illustrating the "component extraction" section of the exemplary data analysis algorithm for quantitative proteomics illustrated in Figure 17.
  • Figure 19 is a schematic, a flow chart, illustrating the "precursor integration" section of the exemplary data analysis algorithm for quantitative proteomics illustrated in Figure 17.
  • Figure 20 is a schematic, a flow chart, illustrating the "spectra comparison" section of the exemplary data analysis algorithm for quantitative proteomics as illustrated in Figure 19.
  • Figure 21 is a schematic, a flow chart, illustrating the "identity and merge of duplicates LC-MS spectra" section of the exemplary data analysis algorithm for quantitative proteomics as illustrated in Figure 19.
  • Figure 22 illustrates an exemplary automated chromatograph system of the invention.
  • Figure 23 illustrates the results of an MS/MS of the separated peptides in a proteome analysis, as discussed in Example 3, below.
  • Figure 24 schematically illustrates the design of an oxidative stress experiment, as discussed in Example 5, below.
  • Figure 25 schematically illustrates the design of a sample preparation protocol used in oxidative stress experiments, as discussed in Example 5, below.
  • Figure 26 graphically illustrates data representing the number of protein identifications from 3D LC-MS/MS analyses in oxidative stress experiments, as discussed in Example 5, below.
  • Figure 27 summarizes data representing differences in the number of proteins identified in non-stressed and stressed cell samples in oxidative stress experiments, as discussed in Example 5, below.
  • Figure 28 summarizes data representing a down-regulation in superoxide reductase ("Sor") protein levels after oxidative stress of Desulfovibrio vulgaris cells, as discussed in Example 5, below.
  • Figure 29 illustrates that after oxidative stress of Desulfovibrio vulgaris cells, a concerted down-regulation of proteins along the polyglucose utilization pathway (schematically illustrated) was found, as discussed in Example 5, below.
  • Figure 30 summarizes the results of proteome analysis from different organisms using an exemplary 3D LC LCQ MS/MS system of the invention, as discussed in Example 6, below.
  • Figure 31 summarizes the results of proteome analysis comparing two exemplary 3D LC LCQ MS/MS systems of the invention: 3D LC LCQ MS/MS versus 3D LC LTQ MS/MS, as discussed in Example 6, below.
  • Figure 32 illustrates the results of an LTQ and LCQ MS/MS Human Embryonic Kidney HEK293 proteome analysis, as discussed in Example 7, below.
  • Like reference symbols in the various drawings indicate like elements.
  • the invention provides a number of strategies for comparing a polynucleotide of known sequence (a reference sequence) with variants of that sequence (target sequences).
  • the comparison can be performed at the level of entire genomes, chromosomes, genes, exons or introns, or can focus on individual mutant sites and immediately adjacent bases.
  • the strategies allow detection of variations, such as mutations or polymorphisms, in the target sequence irrespective of whether a particular variant has previously been characterized.
  • the strategies both define the nature of a variant and identify its location in a target sequence.
  • the strategies employ arrays of oligonucleotide probes immobilized to a solid support.
  • Target sequences are analyzed by determining the extent of hybridization at particular probes in the array.
  • the strategy in selection of probes facilitates distinction between perfectly matched probes and probes showing single-base or other degrees of mismatches.
  • the strategy usually entails sampling each nucleotide of interest in a target sequence several times, thereby achieving a high degree of confidence in its identity. This level of confidence is further increased by sampling nucleotides in the target sequence that are adjacent to the nucleotides of interest.
  • the number of probes on the chip can be quite large (e.g., 10^5-10^6). However, usually only a small proportion of the total number of probes of a given length are represented.
  • Some advantages of the use of only a small proportion of all possible probes of a given length include: (i) each position in the array is highly informative, whether or not hybridization occurs; (ii) nonspecific hybridization is minimized; (iii) it is straightforward to correlate hybridization differences with sequence differences, particularly with reference to the hybridization pattern of a known standard; and (iv) the ability to address each probe independently during synthesis, using high resolution photolithography, allows the array to be designed and optimized for any sequence. For example, the length of any probe can be varied independently of the others.
  • the present tiling strategies result in sequencing and comparison methods suitable for routine large-scale practice with a high degree of confidence in the sequence output.
  • the chips can be designed to contain probes exhibiting complementarity to one or more selected reference sequences whose sequence is known.
  • the chips are used to read a target sequence comprising either the reference sequence itself or variants of that sequence.
  • Target sequences may differ from the reference sequence at one or more positions but show a high overall degree of sequence identity with the reference sequence (e.g., at least 75, 90, 95, 99, 99.9 or 99.99%).
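The notion of overall sequence identity between a target and a reference can be sketched as a simple position-by-position comparison. This is a minimal example that assumes the two sequences are already aligned and of equal length; the function name `percent_identity` is hypothetical, and real comparisons would normally involve an alignment step that is not shown here.

```python
# Illustrative sketch only; assumes pre-aligned, equal-length sequences.
def percent_identity(reference: str, target: str) -> float:
    """Return the percentage of aligned positions at which the two sequences agree."""
    if len(reference) != len(target):
        raise ValueError("sequences must be aligned to the same length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    matches = sum(r == t for r, t in zip(reference.upper(), target.upper()))
    return 100.0 * matches / len(reference)


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    tgt = "ACGTACGTAA"  # differs from ref at the last position
    print(f"{percent_identity(ref, tgt)}% identity")
```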
  • Any polynucleotide of known sequence can be selected as a reference sequence.
  • Reference sequences of interest include sequences known to include mutations or polymorphisms associated with phenotypic changes having clinical significance in human patients.
  • the CFTR gene and P53 gene in humans have been identified as the location of several mutations resulting in cystic fibrosis or cancer respectively.
  • Other reference sequences of interest include those that serve to identify pathogenic microorganisms and/or are the site of mutations by which such microorganisms acquire drug resistance (e.g., the HIV reverse transcriptase gene).
  • Other reference sequences of interest include regions where polymorphic variations are known to occur (e.g., the D-loop region of mitochondrial DNA). These reference sequences have utility for, e.g., forensic or epidemiological studies.
  • Reference sequences of interest include p34 (related to p53), p65 (implicated in breast, prostate and liver cancer), and DNA segments encoding cytochromes P450 (see Meyer et al., Pharmac. Ther. 46, 349-355 (1990)).
  • Reference sequences of interest include those from the genome of pathogenic viruses (e.g., hepatitis A, B, or C, herpes virus (e.g., VZV, HSV-1, HAV-6, HSV-II, and CMV, Epstein Barr virus), adenovirus, influenza virus, flaviviruses, echovirus, rhinovirus, coxsackie virus, coronavirus, respiratory syncytial virus, mumps virus, rotavirus, measles virus, rubella virus, parvovirus, vaccinia virus, HTLV virus, dengue virus, papillomavirus, molluscum virus, poliovirus, rabies virus, JC virus and arboviral encephalitis virus).
  • Other reference sequences of interest are from genomes or episomes of pathogenic bacteria, particularly regions that confer drug resistance or allow phylogenic characterization of the host (e.g., 16S rRNA or corresponding DNA).
  • pathogenic bacteria include Chlamydia, rickettsial bacteria, mycobacteria, staphylococci, streptococci, pneumococci, meningococci and gonococci, klebsiella, proteus, serratia, pseudomonas, legionella, diphtheria, salmonella, bacilli, cholera, tetanus, botulism, anthrax, plague, leptospirosis, and Lyme disease bacteria.
  • reference sequences of interest include those in which mutations result in the following autosomal recessive disorders: sickle cell anemia, beta-thalassemia, phenylketonuria, galactosemia, Wilson's disease, hemochromatosis, severe combined immunodeficiency, alpha-1-antitrypsin deficiency, albinism, alkaptonuria, lysosomal storage diseases and Ehlers-Danlos syndrome.
  • Reference sequences of interest include those in which mutations result in X-linked recessive disorders: hemophilia, glucose-6-phosphate dehydrogenase, agammaglobulinemia, diabetes insipidus, Lesch-Nyhan syndrome, muscular dystrophy, Wiskott-Aldrich syndrome, Fabry's disease and fragile X syndrome.
  • Reference sequences of interest include those in which mutations result in the following autosomal dominant disorders: familial hypercholesterolemia, polycystic kidney disease, Huntington's disease, hereditary spherocytosis, Marfan's syndrome, von Willebrand's disease, neurofibromatosis, tuberous sclerosis, hereditary hemorrhagic telangiectasia, familial colonic polyposis, Ehlers-Danlos syndrome, myotonic dystrophy, muscular dystrophy, osteogenesis imperfecta, acute intermittent porphyria, and von Hippel-Lindau disease.
  • the length of a reference sequence can vary widely from a full-length genome, to an individual chromosome, episome, gene, component of a gene, such as an exon, intron or regulatory sequences, to a few nucleotides.
  • a reference sequence of between about 2, 5, 10, 20, 50, 100, 500, 1000, 5,000 or 10,000, 20,000 or 100,000 nucleotides is common.
  • regions of a sequence e.g., exons of a gene
  • the particular regions can be considered as separate reference sequences or can be considered as components of a single reference sequence, as a matter of arbitrary choice.
  • a reference sequence can be any naturally occurring, mutant, consensus or purely hypothetical sequence of nucleotides, RNA or DNA.
  • sequences can be obtained from computer data bases, publications or can be determined or conceived de novo.
  • a reference sequence is selected to show a high degree of sequence identity to envisaged target sequences.
  • more than one reference sequence is selected.
  • Combinations of wildtype and mutant reference sequences are employed in several applications of the tiling strategy.
  • the basic tiling strategy provides an array of immobilized probes for analysis of target sequences showing a high degree of sequence identity to one or more selected reference sequences.
  • the strategy is first illustrated for an exemplary array that is subdivided into four probe sets, although it will be apparent that in some situations, satisfactory results are obtained from only two probe sets.
  • a first probe set comprises a plurality of probes exhibiting perfect complementarity with a selected reference sequence. The perfect complementarity usually exists throughout the length of the probe. However, probes having a segment or segments of perfect complementarity that is/are flanked by leading or trailing sequences lacking complementarity to the reference sequence can also be used.
  • each probe in the first probe set has at least one interrogation position that corresponds to a nucleotide in the reference sequence. That is, the interrogation position is aligned with the corresponding nucleotide in the reference sequence, when the probe and reference sequence are aligned to maximize complementarity between the two. If a probe has more than one interrogation position, each corresponds with a respective nucleotide in the reference sequence. The identity of an interrogation position and corresponding nucleotide in a particular probe in the first probe set cannot be determined simply by inspection of the probe in the first set.
  • an interrogation position and corresponding nucleotide is defined by the comparative structures of probes in the first probe set and corresponding probes from additional probe sets.
  • a probe can have an interrogation position at each position in the segment complementary to the reference sequence.
  • An interrogation position can be located away from the ends of a segment of complementarity. Interrogation positions may provide more accurate data when located away from the ends of a segment of complementarity.
  • a probe having a segment of complementarity of length x does not contain more than x-2 interrogation positions. Since probes are typically 9-21 nucleotides, and usually all of a probe is complementary, a probe typically has 1-19 interrogation positions.
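The bound on interrogation positions follows from excluding the two terminal positions of the complementary segment. A short sketch, with hypothetical function names, makes the arithmetic explicit for the 9 to 21 nucleotide range mentioned above.

```python
# Illustrative sketch only; function names are hypothetical.
def max_interrogation_positions(segment_length: int) -> int:
    """Upper bound on interrogation positions: both terminal positions are excluded."""
    if segment_length < 3:
        raise ValueError("a segment needs at least 3 nucleotides to have an internal position")
    return segment_length - 2


def internal_positions(segment_length: int) -> list:
    """Zero-based indices of the positions that can serve as interrogation positions."""
    max_interrogation_positions(segment_length)  # validates the length
    return list(range(1, segment_length - 1))


if __name__ == "__main__":
    for length in (9, 21):
        print(length, "nt probe: at most", max_interrogation_positions(length), "positions")
```

For a 21 nucleotide probe this gives the upper figure of 19 quoted in the text; a 9 nucleotide probe allows at most 7.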
  • the probes can contain a single interrogation position, at or near the center of the probe.
  • For each probe in the first set there can be three corresponding probes from three additional probe sets. Thus, there can be four probes corresponding to each nucleotide of interest in the reference sequence. Each of the four corresponding probes has an interrogation position aligned with that nucleotide of interest.
  • the probes from the three additional probe sets can be identical to the corresponding probe from the first probe set with one exception.
  • the corresponding probe from the first probe set has its interrogation position occupied by a T
  • the corresponding probes from the additional three probe sets have their respective interrogation positions occupied by A, C, or G, a different nucleotide in each probe.
  • a probe from the first probe set comprises trailing or flanking sequences lacking complementarity to the reference sequences, these sequences need not be present in corresponding probes from the three additional sets.
  • corresponding probes from the three additional sets can contain leading or trailing sequences outside the segment of complementarity that are not present in the corresponding probe from the first probe set.
  • the probes from the additional three probe sets are identical (with the exception of interrogation position(s)) to a contiguous subsequence of the full complementary segment of the corresponding probe from the first probe set.
  • the subsequence includes the interrogation position and usually differs from the full-length probe only in the omission of one or both terminal nucleotides from the termini of a segment of complementarity. That is, if a probe from the first probe set has a segment of complementarity of length n, corresponding probes from the other sets will usually include a subsequence of the segment of at least length n-2.
  • the subsequence is usually at least 3, 4, 7, 9, 15, 21, or 25 nucleotides long, most typically, in the range of 9-21 nucleotides.
  • the subsequence should be sufficiently long to allow a probe to hybridize detectably more strongly to a variant of the reference sequence mutated at the interrogation position than to the reference sequence.
  • the probes can be oligodeoxyribonucleotides or oligoribonucleotides, or any modified forms of these polymers that are capable of hybridizing with a target nucleic sequence by complementary base-pairing.
  • Complementary base pairing means sequence-specific base pairing which includes e.g., Watson-Crick base pairing as well as other forms of base pairing such as Hoogsteen base pairing.
  • Modified forms include 2'-O-methyl oligoribonucleotides and so-called PNAs, in which oligodeoxyribonucleotides are linked via peptide bonds rather than phosphodiester bonds.
  • the probes can be attached by any linkage to a support (e.g., 3', 5' or via the base). 3' attachment is more usual as this orientation is compatible with a chemistry for solid phase synthesis of oligonucleotides.
  • the number of probes in the first probe set depends on the length of the reference sequence, the number of nucleotides of interest in the reference sequence and the number of interrogation positions per probe.
  • each nucleotide of interest in the reference sequence requires the same interrogation position in the four sets of probes.
  • a reference sequence can have 100 nucleotides, 50 of which are of interest, and probes each having a single interrogation position.
  • the first probe set requires fifty probes, each having one interrogation position corresponding to a nucleotide of interest in the reference sequence.
  • the second, third and fourth probe sets each have a corresponding probe for each probe in the first probe set, and so each also contains a total of fifty probes.
  • each nucleotide of interest in the reference sequence is determined by comparing the relative hybridization signals at four probes having interrogation positions corresponding to that nucleotide from the four probe sets.
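The comparison of hybridization signals across the four corresponding probes can be sketched as follows. This is a deliberately simplified example that assumes the brightest of the four probes is the perfect match; the function name `call_base`, the intensity values and the plain "take the maximum" rule are all assumptions for illustration, and a practical analysis would likely apply thresholds or signal ratios before making a call.

```python
# Illustrative sketch only; the argmax rule and all names are assumptions.
from typing import Mapping

# Watson-Crick complements for DNA probes and targets.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_base(signals: Mapping[str, float]) -> str:
    """Infer the target nucleotide from four hybridization signals.

    `signals` maps the base occupying the interrogation position of each of the
    four corresponding probes to its measured intensity. The target nucleotide
    is taken to be the complement of the interrogation base on the brightest probe.
    """
    if set(signals) != set(COMPLEMENT):
        raise ValueError("expected exactly one signal for each of A, C, G and T")
    probe_base = max(signals, key=lambda base: signals[base])
    return COMPLEMENT[probe_base]


if __name__ == "__main__":
    column = {"A": 0.10, "C": 0.90, "G": 0.20, "T": 0.10}  # made-up intensities
    print("called target base:", call_base(column))
```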
  • every nucleotide is of interest.
  • only certain portions in which variants (e.g., mutations or polymorphisms) are concentrated are of interest.
  • only particular mutations or polymorphisms and immediately adjacent nucleotides are of interest.
  • the first probe set has interrogation positions selected to correspond to at least a nucleotide (e.g., representing a point mutation) and one immediately adjacent nucleotide.
  • the probes in the first set have interrogation positions corresponding to at least 3, 10, 50, 100, 1000, or 20,000 contiguous nucleotides.
  • the probes usually have interrogation positions corresponding to at least 5, 10, 30, 50, 75, 90, 99 or sometimes 100%, of the nucleotides in a reference sequence.
  • the probes in the first probe set can completely span the reference sequence and overlap with one another relative to the reference sequence. For example, in one common arrangement each probe in the first probe set differs from another probe in that set by the omission of a 3' base complementary to the reference sequence and the acquisition of a 5' base complementary to the reference sequence.
  • the probes in a set can be arranged in order of the sequence in a lane across the chip.
  • a lane contains a series of overlapping probes, which represent, or tile across, the selected reference sequence.
  • the components of the four sets of probes are usually laid down in four parallel lanes, collectively constituting a row in the horizontal direction and a series of 4-member columns in the vertical direction. Corresponding probes from the four probe sets (i.e., complementary to the same subsequence of the reference sequence) occupy a column.
  • Each probe in a lane usually differs from its predecessor in the lane by the omission of a base at one end and the inclusion of an additional base at the other end.
  • probe sets can be laid down in lanes such that all probes having an interrogation position occupied by an A form an A-lane, all probes having an interrogation position occupied by a C form a C-lane, all probes having an interrogation position occupied by a G form a G-lane, and all probes having an interrogation position occupied by a T (or U) form a T lane (or a U lane).
  • the probe from the first probe set is laid down in the A-lane, C-lane, A-lane, A-lane and T-lane for the five columns.
  • the interrogation position on a column of probes corresponds to the position in the target sequence whose identity is determined from analysis of hybridization to the probes in that column.
  • the interrogation position can be anywhere in a probe but is usually at or near the central position of the probe to maximize differential hybridization signals between a perfect match and a single-base mismatch. For example, for an 11 mer probe, the central position is the sixth nucleotide.
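The lane-and-column layout described above can be sketched in a few lines. This is a minimal illustration, not the patent's own procedure: the function names and the dictionary layout are hypothetical, DNA bases only are assumed, and lanes are labelled by the base occupying the probe's interrogation position.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def tile(reference: str, probe_len: int = 11) -> list:
    """Build one column of four probes for each window of the reference.

    Each window is shifted by one base relative to its predecessor, so
    neighbouring probes in a lane overlap as described in the text.
    """
    center = probe_len // 2                  # 0-based; the sixth base of an 11 mer
    probe_pos = probe_len - 1 - center       # same site, counted along the probe
    columns = []
    for start in range(len(reference) - probe_len + 1):
        window = reference[start:start + probe_len]
        wildtype = reverse_complement(window)  # perfect match to the reference
        probes = {
            lane: wildtype[:probe_pos] + lane + wildtype[probe_pos + 1:]
            for lane in "ACGT"
        }
        columns.append({
            "target_pos": start + center,          # reference position being read
            "wildtype_lane": wildtype[probe_pos],  # lane holding the perfect match
            "probes": probes,
        })
    return columns


columns = tile("ACGTACGTACGTACG", probe_len=11)
print(len(columns), columns[0]["wildtype_lane"])
```

Within a column the four probes differ only at the interrogation position, and the lane that holds the perfectly matched probe changes from column to column with the reference sequence.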
  • Although the array of probes is usually laid down in rows and columns as described above, such a physical arrangement of probes on the chip is not essential.
  • the data from the probes can be collected and processed to yield the sequence of a target irrespective of the physical arrangement of the probes on a chip.
  • the hybridization signals from the respective probes can be reassorted into any conceptual array desired for subsequent data reduction whatever the physical arrangement of probes on the chip.
  • a range of lengths of probes can be employed in the chips.
  • a probe may consist exclusively of a complementary segment, or may have one or more complementary segments juxtaposed by flanking, trailing and/or intervening segments.
  • the total length of complementary segment(s) is more important than the length of the probe.
  • the complementarity segment(s) of the first probe sets should be sufficiently long to allow the probe to hybridize detectably more strongly to a reference sequence compared with a variant of the reference including a single base mutation at the nucleotide corresponding to the interrogation position of the probe.
  • the complementarity segment(s) in corresponding probes from additional probe sets can be sufficiently long to allow a probe to hybridize detectably more strongly to a variant of the reference sequence having a single nucleotide substitution at the interrogation position relative to the reference sequence.
  • a probe can have a single complementary segment having a length of at least 3 nucleotides, and more usually at least 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 bases exhibiting perfect complementarity (other than possibly at the interrogation position(s) depending on the probe set) to the reference sequence.
  • each segment provides at least three complementary nucleotides to the reference sequence and the combined segments provide at least two segments of three or a total of six complementary nucleotides.
  • the combined length of complementary segments is typically from 6-30 nucleotides, or, from about 9-21 nucleotides. The two segments are often approximately the same length.
  • the probes (or segment of complementarity within probes) have an odd number of bases, so that an interrogation position can occur in the exact center of the probe.
  • all probes are the same length.
  • Other chips employ different groups of probe sets, in which case the probes are of the same size within a group, but differ between different groups. For example, some chips have one group comprising four sets of probes as described above in which all the probes are 11 mers, together with a second group comprising four sets of probes in which all of the probes are 13 mers. Of course, additional groups of probes can be added. Thus, some chips contain, e.g., four groups of probes having sizes of 11 mers, 13 mers, 15 mers and 17 mers.
  • the probes in the first set can vary in length independently of each other. Probes in the other sets are usually the same length as the probe occupying the same column from the first set. However, occasionally different lengths of probes can be included at the same column position in the four lanes.
  • the different length probes are included to equalize hybridization signals from probes irrespective of whether A-T or C-G bonds are formed at the interrogation position.
  • the length of probe can be important in distinguishing between a perfectly matched probe and probes showing a single-base mismatch with the target sequence. The discrimination is usually greater for short probes. Shorter probes are usually also less susceptible to formation of secondary structures.
  • the absolute amount of target sequence bound, and hence the signal, is greater for larger probes.
  • the probe length representing the optimum compromise between these competing considerations may vary depending on inter alia the GC content of a particular region of the target DNA sequence, secondary structure, synthesis efficiency and cross-hybridization. In some regions of the target, depending on hybridization conditions, short probes (e.g., 11 mers) may provide information that is inaccessible from longer probes (e.g., 19 mers) and vice versa.
  • Maximum sequence information can be read by including several groups of different sized probes on the chip as noted above. However, for many regions of the target sequence, such a strategy provides redundant information in that the same sequence is read multiple times from the different groups of probes.
  • Equivalent information can be obtained from a single group of different sized probes in which the sizes are selected to maximize readable sequence at particular regions of the target sequence.
  • the strategy of customizing probe length within a single group of probe sets minimizes the total number of probes required to read a particular target sequence. This leaves ample capacity for the chip to include probes to other reference sequences.
  • the invention provides an optimization block which allows systematic variation of probe length and interrogation position to optimize the selection of probes for analyzing a particular nucleotide in a reference sequence.
  • the block comprises alternating columns of probes complementary to the wildtype target and probes complementary to a specific mutation.
  • the interrogation position is varied between columns and probe length is varied down a column.
  • Hybridization of the chip to the reference sequence or the mutant form of the reference sequence identifies the probe length and interrogation position providing the greatest differential hybridization signal.
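One way to picture the optimization block is as a grid keyed by probe length and by the offset of the interrogation position within the probe. The sketch below is an illustration under stated assumptions: the function names, the dictionary layout and the example signal values are all hypothetical, and sequences are written in the target-sense orientation (the physical probes would be their reverse complements).

```python
def optimization_block(reference: str, site: int, mutant_base: str,
                       lengths, offsets) -> dict:
    """Map (length, offset) to a (wildtype, mutant) pair of windows.

    `site` is the 0-based reference position under study; `offset` is where
    that position falls inside the window. Combinations that do not fit on
    the reference are skipped.
    """
    block = {}
    for length in lengths:            # probe length varies down a column
        for offset in offsets:        # interrogation position varies between columns
            start = site - offset
            if offset >= length or start < 0 or start + length > len(reference):
                continue
            wild = reference[start:start + length]
            mutant = wild[:offset] + mutant_base + wild[offset + 1:]
            block[(length, offset)] = (wild, mutant)
    return block


def best_design(differential_signal: dict):
    """Pick the (length, offset) design with the largest differential signal."""
    return max(differential_signal, key=differential_signal.get)


block = optimization_block("ACGTACGTACGTACGTACGT", site=10, mutant_base="T",
                           lengths=(9, 11), offsets=(4, 5))
print(block[(9, 4)])
# Hypothetical measured differences between matched and mismatched signal:
print(best_design({(9, 4): 3.0, (11, 5): 7.5}))
```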
  • the probes can be designed to be complementary to either strand of the reference sequence (e.g., coding or non-coding). Some chips contain separate groups of probes, one complementary to the coding strand, the other complementary to the noncoding strand. Independent analysis of coding and noncoding strands provides largely redundant information. However, the regions of ambiguity in reading the coding strand are not always the same as those in reading the noncoding strand. Thus, combination of the information from coding and noncoding strands increases the overall accuracy of sequencing. Some chips contain additional probes or groups of probes designed to be complementary to a second reference sequence.
  • the second reference sequence can often be a subsequence of the first reference sequence bearing one or more commonly occurring mutations or interstrain variations.
  • the second group of probes is designed by the same principles as described above except that the probes exhibit complementarity to the second reference sequence.
  • the inclusion of a second group is particularly useful for analyzing short subsequences of the primary reference sequence in which multiple mutations are expected to occur within a short distance commensurate with the length of the probes (i.e., two or more mutations within 9 to 21 bases).
  • the same principle can be extended to provide chips containing groups of probes for any number of reference sequences.
  • the chips may contain additional probe(s) that do not form part of a tiled array as noted above, but rather serve as probe(s) for a conventional reverse dot blot.
  • the presence of a mutation can be detected from binding of a target sequence to a single oligomeric probe harboring the mutation.
  • An additional probe containing the equivalent region of the wildtype sequence can be included as a control.
  • the chips can be read by comparing the intensities of labeled target bound to the probes in an array. In one aspect, a comparison is performed between each lane of probes (e.g., A, C, G and T lanes) at each columnar position (physical or conceptual).
  • the lane showing the greatest hybridization signal is called as the nucleotide present at the position in the target sequence corresponding to the interrogation position in the probes.
  • the corresponding position in the target sequence is that aligned with the interrogation position in corresponding probes when the probes and target are aligned to maximize complementarity.
  • of the four probes in a column, only one can exhibit a perfect match to the target sequence whereas the others usually exhibit at least a one base pair mismatch.
  • the probe exhibiting a perfect match usually produces a substantially greater hybridization signal than the other three probes in the column and is thereby easily identified.
  • a call ratio is established to define the ratio of signal from the best hybridizing probe to the second best hybridizing probe that must be exceeded for a particular target position to be read from the probes.
  • a high call ratio ensures that few if any errors are made in calling target nucleotides, but can result in some nucleotides being scored as ambiguous, which could in fact be accurately read.
  • a lower call ratio can result in fewer ambiguous calls, but can result in more erroneous calls. It has been found that at a call ratio of 1.2 virtually all calls are accurate.
  • a small but significant number of bases may have to be scored as ambiguous.
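The call-ratio rule described above amounts to a comparison of the two strongest lanes in a column. The following is a minimal sketch: the function name, the use of "N" for an ambiguous call and the handling of zero signal are assumptions, and only the 1.2 threshold comes from the text.

```python
def call_base(signals: dict, call_ratio: float = 1.2) -> str:
    """Call the lane with the strongest signal, or "N" when the call is ambiguous.

    `signals` maps a lane label (e.g. "A", "C", "G", "T") to its hybridization
    intensity for one column. The best signal must exceed the second best by
    more than `call_ratio` for the position to be read.
    """
    ranked = sorted(signals.items(), key=lambda item: item[1], reverse=True)
    (best_lane, best), (_, second) = ranked[0], ranked[1]
    if best <= 0:
        return "N"                      # no usable signal in this column
    if second <= 0 or best / second > call_ratio:
        return best_lane
    return "N"                          # ratio not exceeded: score as ambiguous


print(call_base({"A": 100.0, "C": 20.0, "G": 15.0, "T": 10.0}))  # A
print(call_base({"A": 100.0, "C": 90.0, "G": 15.0, "T": 10.0}))  # N
```

Raising `call_ratio` trades erroneous calls for ambiguous ones, which is the compromise the text describes.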
  • Although small regions of the target sequence can sometimes be ambiguous, these regions usually occur at the same or similar segments in different target sequences.
  • An array of probes is most useful for analyzing the reference sequence from which the probes were designed and variants of that sequence exhibiting substantial sequence similarity with the reference sequence (e.g., several single-base mutants spaced over the reference sequence).
  • When an array is used to analyze the exact reference sequence from which it was designed, one probe exhibits a perfect match to the reference sequence, and the other three probes in the same column exhibit single-base mismatches. Thus, discrimination between hybridization signals is usually high and accurate sequence is obtained. High accuracy is also obtained when an array is used for analyzing a target sequence comprising a variant of the reference sequence that has a single mutation relative to the reference sequence, or several widely spaced mutations relative to the reference sequence. At different mutant loci, one probe exhibits a perfect match to the target, and the other three probes occupying the same column exhibit single-base mismatches, the difference (with respect to analysis of the reference sequence) being the lane in which the perfect match occurs.
  • a single group of probes i.e., designed with respect to a single reference sequence
  • Such a comparison does not always allow the target nucleotide corresponding to that columnar position to be called.
  • Deletions in target sequences can be detected by loss of signal from probes having interrogation positions encompassed by the deletion.
  • signal may also be lost from probes having interrogation positions closely proximal to the deletion, resulting in some regions of the target sequence that cannot be read.
  • Target sequence bearing insertions will also exhibit short regions including and proximal to the insertion that usually cannot be read.
  • the presence of short regions of difficult-to-read target because of closely spaced mutations, insertions or deletions does not prevent determination of the remaining sequence of the target, as different regions of a target sequence are determined independently.
  • such ambiguities as might result from analysis of diverse variants with a single group of probes can be avoided by including multiple groups of probe sets on a chip.
  • one group of probes can be designed based on a full-length reference sequence, and the other groups on subsequences of the reference sequence incorporating frequently occurring mutations or strain variations.
  • the sequencing strategy of the invention has the capacity to simultaneously detect and quantify proportions of multiple target sequences. Such capacity is valuable, e.g., for diagnosis of patients who are heterozygous with respect to a gene or who are infected with a virus, such as HIV, which is usually present in several polymorphic forms. Such capacity is also useful in analyzing targets from biopsies of tumor cells and surrounding tissues.
  • the presence of multiple target sequences is detected from the relative signals of the four probes at the array columns corresponding to the target nucleotides at which diversity occurs.
  • the relative signals at the four probes for the mixture under test are compared with the corresponding signals from a homogeneous reference sequence.
  • the extent of the shift in hybridization signals of the probes is related to the proportion of a target sequence in the mixture.
  • Shifts in relative hybridization signals can be quantitatively related to proportions of reference and mutant sequence by prior calibration of the chip with seeded mixtures of the mutant and reference sequences.
  • a chip can be used to detect variant or mutant strains constituting as little as 1, 5, 20, or 25% of a mixture of strains.
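The calibration step described above can be pictured as building a curve from seeded mixtures and reading an unknown sample off that curve. The sketch below assumes piecewise-linear interpolation and a relative signal defined as mutant-lane signal divided by the sum of mutant-lane and reference-lane signal; both are illustrative choices, and the calibration numbers are invented placeholders, not data from the source.

```python
def mutant_fraction(observed: float, calibration: list) -> float:
    """Estimate the mutant proportion from a relative hybridization signal.

    `calibration` is a list of (known_mutant_fraction, relative_signal) pairs
    measured on seeded mixtures. The relative signal is assumed to increase
    strictly with the mutant fraction. Values outside the calibrated range
    are clamped to the nearest calibrated point.
    """
    points = sorted(calibration, key=lambda pair: pair[1])
    if observed <= points[0][1]:
        return points[0][0]
    if observed >= points[-1][1]:
        return points[-1][0]
    for (f0, s0), (f1, s1) in zip(points, points[1:]):
        if s0 <= observed <= s1:
            # Linear interpolation between the two neighbouring seeded mixtures
            return f0 + (f1 - f0) * (observed - s0) / (s1 - s0)
    raise ValueError("calibration points must have distinct signals")


# Placeholder calibration curve (hypothetical values, for illustration only).
curve = [(0.0, 0.05), (0.25, 0.30), (0.50, 0.55), (1.0, 0.95)]
print(mutant_fraction(0.425, curve))
```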
  • Similar principles allow the simultaneous analysis of multiple target sequences even when none is identical to the reference sequence. For example, with a mixture of two target sequences bearing first and second mutations, there would be a variation in the hybridization patterns of probes having interrogation positions corresponding to the first and second mutations relative to the hybridization pattern with the reference sequence.
  • one of the probes having a mismatched interrogation position relative to the reference sequence would show an increase in hybridization signal, and the probe having a matched interrogation position relative to the reference sequence would show a decrease in hybridization signal.
  • Analysis of the hybridization pattern of the mixture of mutant target sequences indicates the presence of two mutant target sequences, the position and nature of the mutation in each strain, and the relative proportions of each strain.
  • the different components in a mixture of target sequences are differentially labeled before being applied to the array. For example, a variety of fluorescent labels emitting at different wavelengths are available.
  • the use of differential labels allows independent analysis of different targets bound simultaneously to the array.
  • the methods permit comparison of target sequences obtained from a patient at different stages of a disease.
  • Omission of Probes. The general strategy of the aspects of the invention outlined above employs four probes to read each nucleotide of interest in a target sequence. One probe (from the first probe set) shows a perfect match to the reference sequence and the other three probes (from the second, third and fourth probe sets) exhibit a mismatch with the reference sequence and a perfect match with a target sequence bearing a mutation at the nucleotide of interest. The provision of three probes from the second, third and fourth probe sets allows detection of each of the three possible nucleotide substitutions of any nucleotide of interest.
  • probes that would detect silent mutations are omitted.
  • the probes from the first probe set are omitted corresponding to some or all positions of the reference sequences.
  • Such chips comprise at least two probe sets.
  • the first probe set has a plurality of probes. Each probe comprises a segment exactly complementary to a subsequence of a reference sequence except in at least one interrogation position.
  • a second probe set has a corresponding probe for each probe in the first probe set.
  • the corresponding probe in the second probe set is identical to a sequence comprising the corresponding probe from the first probe set or a subsequence thereof that includes the at least one (and usually only one) interrogation position except that the at least one interrogation position is occupied by a different nucleotide in each of the two corresponding probes from the first and second probe sets.
  • a third probe set, if present, also comprises a corresponding probe for each probe in the first probe set except at the at least one interrogation position, which
  • the presence of a mutation is detected by a shift in the background hybridization intensity of the reference sequence to a perfectly matched hybridization signal of the target sequence, rather than by a comparison of the hybridization intensities of probes from the first set with corresponding probes from the second, third and fourth sets.
  • Wildtype Probe Lane. When the chips comprise four probe sets, as discussed supra, and the probe sets are laid down in four lanes, an A-lane, a C-lane, a G-lane and a T or U-lane, the probe having a segment exhibiting perfect complementarity to a reference sequence varies between the four lanes from one column to another. This does not present any significant difficulty in computer analysis of the data from the chip.
  • each probe has a segment exhibiting perfect complementarity to the reference sequence.
  • This segment is identical to a segment from one of the probes in the other four lanes (which lane depending on the column position).
  • the extra lane of probes (designated the wildtype lane) hybridizes to a target sequence at all nucleotide positions except those in which deviations from the reference sequence occur.
  • the hybridization pattern of the wildtype lane thereby provides a simple visual indication of mutations.
  • the chips provide an additional probe set specifically designed for analyzing deletion mutations.
  • the additional probe set comprises a probe corresponding to each probe in the first probe set as described above.
  • a probe from the additional probe set differs from the corresponding probe in the first probe set in that the nucleotide occupying the interrogation position is deleted in the probe from the additional probe set.
  • the probe from the additional probe set bears an additional nucleotide at one of its termini relative to the corresponding probe from the first probe set.
  • the probe from the additional probe set will hybridize more strongly than the corresponding probe from the first probe set to a target sequence having a single base deletion at the nucleotide corresponding to the interrogation position.
  • Additional probe sets are provided in which not only the interrogation position, but also an adjacent nucleotide is detected.
  • other chips provide additional probe sets for analyzing insertions.
  • one additional probe set has a probe corresponding to each probe in the first probe set as described above.
  • the probe in the additional probe set has an extra T nucleotide inserted adjacent to the interrogation position.
  • the probe has one fewer nucleotide at one of its termini relative to the corresponding probe from the first probe set.
  • the probe from the additional probe set hybridizes more strongly than the corresponding probe from the first probe set to a target sequence having an A nucleotide inserted in a position adjacent to that corresponding to the interrogation position.
  • Similar additional probe sets are constructed having C, G or T/U nucleotides inserted adjacent to the interrogation position. Usually, four such probe sets, one for each nucleotide, are used in combination.
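The deletion and insertion probe sets described above both keep the probe length constant by adjusting one terminus. The sketch below illustrates that bookkeeping under explicit assumptions: function names are hypothetical, the adjusted terminus is taken to be the right-hand end, and sequences are written in the target-sense orientation, i.e. as the target variant the probe would match (the physical probe would be its reverse complement, which is why a probe carrying an inserted T matches a target carrying an inserted A).

```python
def deletion_variant(reference: str, start: int, length: int, offset: int) -> str:
    """Window with the interrogation base removed and one extra terminal base.

    `offset` is the 0-based position of the interrogation base in the window.
    """
    extended = reference[start:start + length + 1]
    if start < 0 or len(extended) != length + 1 or not 0 <= offset < length:
        raise ValueError("window does not fit on the reference")
    return extended[:offset] + extended[offset + 1:]


def insertion_variant(reference: str, start: int, length: int,
                      offset: int, inserted: str) -> str:
    """Window with one base inserted after the interrogation base.

    One terminal base is dropped so the result keeps the original length.
    """
    shortened = reference[start:start + length - 1]
    if start < 0 or len(shortened) != length - 1 or not 0 <= offset < length - 1:
        raise ValueError("window does not fit on the reference")
    return shortened[:offset + 1] + inserted + shortened[offset + 1:]


reference = "ACGTACGTACGT"
print(deletion_variant(reference, start=0, length=7, offset=3))
# One insertion variant per nucleotide, used in combination as the text suggests:
print([insertion_variant(reference, 0, 7, 3, base) for base in "ACGT"])
```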
  • Other chips provide additional probes (multiple-mutation probes) for analyzing target sequences having multiple closely spaced mutations.
  • a multiple-mutation probe is usually identical to a corresponding probe from the first set as described above, except in the base occupying the interrogation position, and except at one or more additional positions, corresponding to nucleotides in which substitution may occur in the reference sequence.
  • the one or more additional positions in the multiple mutation probe are occupied by nucleotides complementary to the nucleotides occupying corresponding positions in the reference sequence when the possible substitutions have occurred.
  • Block Tiling. As noted in the discussion of the general tiling strategy, in one aspect, a probe in the first probe set can have more than one interrogation position.
  • a probe in the first probe set is sometimes matched with multiple groups of at least one, and usually, three additional probe sets.
  • Three additional probe sets are used to allow detection of the three possible nucleotide substitutions at any one position. If only certain types of substitution are likely to occur (e.g., transitions), only one or two additional probe sets are required (analogous to the use of probes in the basic tiling strategy).
  • a group comprises three additional probe sets
  • a first such group comprises second, third and fourth probe sets, each of which has a probe corresponding to each probe in the first probe set.
  • the corresponding probes from the second, third and fourth probe sets differ from the corresponding probe in the first set at a first of the interrogation positions.
  • the relative hybridization signals from corresponding probes from the first, second, third and fourth probe sets indicate the identity of the nucleotide in a target sequence corresponding to the first interrogation position.
  • a second group of three probe sets (designated fifth, sixth and seventh probe sets), each also have a probe corresponding to each probe in the first probe set. These corresponding probes differ from that in the first probe set at a second interrogation position.
  • the relative hybridization signals from corresponding probes from the first, fifth, sixth, and seventh probe sets indicate the identity of the nucleotide in the target sequence corresponding to the second interrogation position.
  • the probes in the first probe set often have seven or more interrogation positions. If there are seven interrogation positions, there are seven groups of three additional probe sets, each group of three probe sets serving to identify the nucleotide corresponding to one of the seven interrogation positions.
  • Each block of probes allows short regions of a target sequence to be read. For example, for a block of probes having seven interrogation positions, seven nucleotides in the target sequence can be read.
  • a chip can contain any number of blocks depending on how many nucleotides of the target are of interest.
  • the hybridization signals for each block can be analyzed independently of any other block.
  • the block tiling strategy can also be combined with other tiling strategies, with different parts of the same reference sequence being tiled by different strategies.
  • the block tiling strategy offers two advantages over the basic strategy in which each probe in the first set has a single interrogation position. One advantage is that the same sequence information can be obtained from fewer probes.
  • a second advantage is that each of the probes constituting a block (i.e., a probe from the first probe set and a corresponding probe from each of the other probe sets) can have identical 3' and 5' sequences, with the variation confined to a central segment containing the interrogation positions.
  • the identity of 3' sequence between different probes simplifies the strategy for solid phase synthesis of the probes on the chip and results in more uniform deposition of the different probes on the chip, thereby in turn increasing the uniformity of signal to noise ratio for different regions of the chip.
  • a third advantage is that greater signal uniformity is achieved within a block.
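The claim that block tiling reads the same nucleotides with fewer probes can be checked by counting. The sketch below assumes one first-set probe per block and three substitution probes per interrogation position, which is how the probe sets are described above; the function names are hypothetical.

```python
def basic_tiling_probes(n_positions: int) -> int:
    """Basic strategy: four probes (one column) per nucleotide read."""
    return 4 * n_positions


def block_tiling_probes(n_positions: int) -> int:
    """Block strategy: one shared first-set probe plus three
    substitution probes for each interrogation position in the block."""
    return 1 + 3 * n_positions


# A block with seven interrogation positions reads seven target nucleotides.
print(basic_tiling_probes(7))  # 28
print(block_tiling_probes(7))  # 22
```

The saving comes from sharing a single perfectly matched probe across all interrogation positions in the block.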
  • the identity of a nucleotide in a target or reference sequence is determined by comparison of hybridization patterns of one probe having a segment showing a perfect match with that of other probes (usually three other probes) showing a single base mismatch.
  • the identity of at least two nucleotides in a reference or target sequence is determined by comparison of hybridization signal intensities of four probes, two of which have a segment showing perfect complementarity or a single base mismatch to the reference sequence, and two of which have a segment showing perfect complementarity or a double-base mismatch to a segment.
  • the four probes whose hybridization patterns are to be compared each have a segment that is exactly complementary to a reference sequence except at two interrogation positions, in which the segment may or may not be complementary to the reference sequence.
  • the interrogation positions correspond to the nucleotides in a reference or target sequence which are determined by the comparison of intensities.
  • the nucleotides occupying the interrogation positions in the four probes are selected according to the following rule.
  • the first interrogation position is occupied by a different nucleotide in each of the four probes.
  • the second interrogation position is also occupied by a different nucleotide in each of the four probes.
  • the segment is exactly complementary to the reference sequence except at not more than one of the two interrogation positions.
  • one of the interrogation positions is occupied by a nucleotide that is complementary to the corresponding nucleotide from the reference sequence and the other interrogation position may or may not be so occupied.
  • the segment is exactly complementary to the reference sequence except that both interrogation positions are occupied by nucleotides which are non-complementary to the respective corresponding nucleotides in the reference sequence.
  • the conditions noted above are satisfied by each of the interrogation positions in any one of the four probes being occupied by complementary nucleotides.
  • in the first probe, the interrogation positions could be occupied by A and T, in the second probe by C and G, in the third probe by G and C and in the fourth probe by T and A.
  • when the four probes are hybridized to a target that is the same as the reference sequence or differs from the reference sequence at one (but not both) of the interrogation positions, two of the four probes show a double-mismatch with the target and two probes show a single mismatch.
  • the identity of probes showing these different degrees of mismatch can be determined from the different hybridization signals. From the identity of the probes showing the different degrees of mismatch, the nucleotides occupying both of the interrogation positions in the target sequence can be deduced.
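The mismatch pattern produced by the four multiplex probes can be tabulated directly. The sketch below uses the example occupancy given in the text (A/T, C/G, G/C, T/A); the function name is hypothetical, and strand orientation is ignored, so a probe base is simply counted as matched when it is the complement of the target base at that position.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Bases occupying the (first, second) interrogation positions of the four probes.
PROBES = [("A", "T"), ("C", "G"), ("G", "C"), ("T", "A")]


def mismatch_counts(target_first: str, target_second: str) -> list:
    """Number of mismatched interrogation positions for each of the four probes."""
    counts = []
    for first, second in PROBES:
        mismatches = int(first != COMPLEMENT[target_first])
        mismatches += int(second != COMPLEMENT[target_second])
        counts.append(mismatches)
    return counts


# Target with A at the first position and C at the second:
print(mismatch_counts("A", "C"))  # [2, 1, 2, 1]
```

In this example the two single-mismatch probes are the ones whose first base complements the first target nucleotide and whose second base complements the second target nucleotide, which is the information used to deduce both positions. Under this simplified model, a target whose two nucleotides are themselves complementary (e.g. A and T) instead gives one probe with no mismatch and three with a double mismatch.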
  • the multiplex strategy has been initially described for the situation where there are two nucleotides of interest in a reference sequence and only four probes in an array.
  • the strategy can be extended to analyze any number of nucleotides in a target sequence by using additional probes.
  • each pair of interrogation positions is read from a unique group of four probes.
  • helper mutations serve to break up regions of internal complementarity within a probe and thereby prevent annealing.
  • one or two helper mutations are quite sufficient for this purpose.
  • the inclusion of helper mutations can be beneficial in any of the tiling strategies noted above.
  • each probe having a particular interrogation position has the same helper mutation(s).
  • such probes have a segment in common which shows perfect complementarity with a reference sequence, except that the segment contains at least one helper mutation (the same in each of the probes) and at least one interrogation position
  • a probe from the first probe set comprises a segment containing an interrogation position and showing perfect complementarity with a reference sequence except for one or two helper mutations.
  • the corresponding probes from the second, third and fourth probe sets usually comprise the same segment (or sometimes a subsequence thereof including the helper mutation(s) and interrogation position), except that the base occupying the interrogation position varies in each probe.
  • the helper mutation tiling strategy is used in conjunction with one of the tiling strategies described above.
  • the probes containing helper mutations are used to tile regions of a reference sequence otherwise giving low hybridization signal (e.g., because of self-complementarity), and the alternative tiling strategy is used to tile intervening regions.
  • Pooling Strategies. Pooling strategies of the invention can also employ arrays of immobilized probes. Probes can be immobilized in cells of an array, and the hybridization signal of each cell can be determined independently of any other cell. A particular cell may be occupied by a pooled mixture of probes. Although the identity of each probe in the mixture is known, the individual probes in the pool are not separately addressable. Thus, the hybridization signal from a cell is the aggregate of that of the different probes occupying the cell.
  • a cell is scored as hybridizing to a target sequence if at least one probe occupying the cell comprises a segment exhibiting perfect complementarity to the target sequence.
  • a simple strategy to show the increased power of pooled strategies over a standard tiling is to create three cells each containing a pooled probe having a single pooled position, the pooled position being the same in each of the pooled probes. At the pooled position, there are two possible nucleotides, allowing the pooled probe to hybridize to two target sequences. In tiling terminology, the pooled position of each probe is an interrogation position.
  • the identity of the nucleotide in the target sequence corresponding to the interrogation position (i.e., that is matched with the interrogation position when the target sequence and pooled probes are maximally aligned for complementarity).
  • the three cells are assigned probe pools that are perfectly complementary to the target except at the pooled position, which is occupied by a different pooled nucleotide in each probe. With 3 pooled probes, all 4 possible single base pair states (wild and 3 mutants) are detected.
  • a pool hybridizes with a target if some probe contained within that pool is complementary to that target.
  • a cell containing a pair (or more) of oligonucleotides lights up when a target complementary to any of the oligonucleotides in the cell is present.
  • each of the four possible targets yields a unique hybridization pattern among the three cells. Since a different pattern of hybridizing pools is obtained for each possible nucleotide in the target sequence corresponding to the pooled interrogation position in the probes, the identity of the nucleotide can be determined from the hybridization pattern of the pools.
  • whereas a standard tiling requires four cells to detect and identify the possible single-base substitutions at one location, this simple pooled strategy requires only three cells.
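The three-cell scheme above can be checked with a short sketch. The text does not say which two-nucleotide pools occupy the three cells, so the choice of the IUPAC codes M, R and W here is an assumption made for illustration, as are the function names.

```python
# Standard IUPAC two-base ambiguity codes. Using M, R and W for the three
# cells is an assumption; the source only requires three different pools.
IUPAC = {"M": "AC", "R": "AG", "W": "AT"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def hybridization_pattern(target_base, cell_codes=("M", "R", "W")):
    """Return one boolean per cell: True if some probe in that pool pairs with the target base."""
    needed = COMPLEMENT[target_base]  # probe base that pairs with the target base
    return tuple(needed in IUPAC[code] for code in cell_codes)


def decode(pattern, cell_codes=("M", "R", "W")):
    """Return the target base that produces this pattern, or None if no base fits."""
    for base in "ACGT":
        if hybridization_pattern(base, cell_codes) == pattern:
            return base
    return None


# One pattern per possible target base at the interrogation position.
patterns = {base: hybridization_pattern(base) for base in "ACGT"}
```

With these pools, each of the four target bases gives a different on/off pattern across the three cells, so `decode` can recover the base from the pattern alone.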
  • pooling strategy for sequence analysis is the 'Trellis' strategy.
  • each pooled probe has a segment of perfect complementarity to a reference sequence except at three pooled positions.
  • One pooled position is an N pool.
  • the three pooled positions may or may not be contiguous in a probe.
  • the other two pooled positions are selected from the group of three pools consisting of (1) M or K, (2) R or Y and (3) W or S, where the single letters are IUPAC standard ambiguity codes.
  • the sequence of a pooled probe is thus, of the form XXXN[(M/K) or (R/Y) or (W/S)][(M/K) or (R/Y) or (W/S)]XXXXX, where XXX represents bases complementary to the reference sequence.
  • the three pooled positions may be in any order, and may be contiguous or separated by intervening nucleotides. For the two positions occupied by [(M/K) or (R/Y) or (W/S)], two choices must be made. First, one must select one of the following three pairs of pooled nucleotides: (1) M/K, (2) R/Y and (3) W/S.
  • the one of three pooled nucleotides selected may be the same or different at the two pooled positions.
  • This choice should result in selection of a pooled nucleotide comprising a nucleotide that complements the corresponding nucleotide in a reference sequence, when the probe and reference sequence are maximally aligned.
  • the same principle governs the selection between R and Y, and between W and S.
  • a trellis pool probe has one pooled position with four possibilities, and two pooled positions, each with two possibilities.
  • a trellis pool probe comprises a mixture of 16 (4 x 2 x 2) probes.
  • each pooled position includes one nucleotide that complements the corresponding nucleotide from the reference sequence
  • one of these 16 probes has a segment that is the exact complement of the reference sequence.
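As a minimal sketch of the counting argument above, a trellis pooled probe can be expanded into its component probes. The example sequence, the positions of the pooled symbols, and the helper names `pick_code` and `expand_pool` are hypothetical; only the IUPAC code table is standard.

```python
from itertools import product

# Standard IUPAC codes used by the trellis strategy, plus the plain bases.
IUPAC = {
    "N": "ACGT", "M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "A": "A", "C": "C", "G": "G", "T": "T",
}


def pick_code(pair, base):
    """From a pair such as ("M", "K"), pick the code whose pool contains `base`."""
    return next(code for code in pair if base in IUPAC[code])


def expand_pool(pooled_probe):
    """List every concrete probe represented by a pooled probe string."""
    return ["".join(p) for p in product(*(IUPAC[s] for s in pooled_probe))]


# Hypothetical segment that is the exact complement of the reference.
exact_complement = "ACGTACGTACG"

# Pool positions 3, 4 and 5: an N pool, then one choice from M/K and one from R/Y,
# each chosen so that it contains the base of the exact complement.
pooled_probe = (
    exact_complement[:3]
    + "N"
    + pick_code(("M", "K"), exact_complement[4])
    + pick_code(("R", "Y"), exact_complement[5])
    + exact_complement[6:]
)
members = expand_pool(pooled_probe)
```

Here `members` holds 4 x 2 x 2 = 16 distinct probes, and because each pooled symbol was chosen to contain the matching base, the exact complement is one of them.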
  • a target sequence that is the same as the reference sequence i.e., a wildtype target
  • the segment of complementarity should be sufficiently long to permit specific hybridization of a pooled probe to a reference sequence to be detected relative to a variant of that reference sequence.
  • the segment of complementarity is about 9-21 nucleotides.
  • a target sequence is analyzed by comparing hybridization intensities at three pooled probes, each having the structure described above.
  • the segments complementary to the reference sequence present in the three pooled probes show some overlap. Sometimes the segments are identical (other than at the interrogation positions). However, this need not be the case.
  • the segments can tile across a reference sequence in increments of one nucleotide (i.e., one pooled probe differs from the next by the acquisition of one nucleotide at the 5' end and loss of a nucleotide at the 3' end).
  • the three interrogation positions may or may not occur at the same relative positions within each pooled probe (i.e., spacing from a probe terminus).
  • one of the three interrogation positions from each of the three pooled probes aligns with the same nucleotide in the reference sequence, and that this interrogation position is occupied by a different pooled nucleotide in each of the three probes.
  • the interrogation position is occupied by an N.
  • the interrogation position is occupied by one of (M/K) or (R/Y) or (W/S).
  • three pooled probes are used to analyze a single nucleotide in the reference sequence.
  • Still another combination of three pooled probes from the set of five have an interrogation position that aligns with a third nucleotide in the reference sequence and these probes are used to analyze that nucleotide.
  • three nucleotides in the reference sequence are fully analyzed from only five pooled probes.
  • the basic tiling strategy would require 12 probes for a similar analysis.
  • the trellis strategy can employ an array of probes having at least three cells, each of which is occupied by a pooled probe as described above.
  • Three cells are occupied by pooled probes having a pooled interrogation position corresponding to the position of possible substitution in the target sequence, one cell with an N', one cell with one of M' or K', and one cell with R' or Y'.
  • the cell with the N' in the interrogation position lights up for the wildtype sequence and any of the three single base substitutions of the target sequence.
  • a further class of strategies involving pooled probes are termed coding strategies. These strategies assign code words from some set of numbers to variants of a reference sequence. Any number of variants can be coded. The variants can include multiple closely spaced substitutions, deletions or insertions.
  • the designation letters or other symbols assigned to each variant may be any arbitrary set of numbers, in any order. For example, a binary code is often used, but codes to other bases are entirely feasible. The numbers are often assigned such that each variant has a designation having at least one digit and at least one nonzero value for that digit.
  • a variant assigned the number 101 has a designation of three digits, with one possible nonzero value for each digit.
  • the designations of the variants are coded into an array of pooled probes comprising a pooled probe for each nonzero value of each digit in the numbers assigned to the variants. For example, if the variants are assigned successive numbers in a numbering system of base m, and the highest number assigned to a variant has n digits, the array would have about n x (m - 1) pooled probes.
  • log_m(3N+1) probes are required to analyze all variants of N locations in a reference sequence, each having three possible mutant substitutions.
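The pool counts above can be worked through numerically. This is a sketch under stated assumptions: every sequence state, wildtype included, is given a designation with at least one nonzero digit, and the function names are hypothetical. Under that convention the digit count is an integer that can come out slightly above log_m(3N+1).

```python
def digits_needed(num_states, base):
    """Smallest n such that base**n - 1 nonzero designations cover all states."""
    n = 1
    while base ** n - 1 < num_states:
        n += 1
    return n


def pools_needed(num_locations, base=2):
    """Pooled probes needed for N locations, each with three possible substitutions."""
    num_states = 3 * num_locations + 1  # 3 mutants per location plus the wildtype
    n = digits_needed(num_states, base)
    return n * (base - 1)  # one pool per nonzero value of each digit


def basic_tiling_probes(num_locations):
    """Four probes per location, as in the standard tiling described earlier."""
    return 4 * num_locations
```

For example, `pools_needed(10)` gives 5 pools with a binary code, against 40 probes for the basic tiling of the same ten locations, and `pools_needed(1)` gives the three cells of the simple pooled strategy.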
  • each pooled probe has a segment exactly complementary to the reference sequence except that certain positions are pooled.
  • the segment should be sufficiently long to allow specific hybridization of the pooled probe to the reference sequence relative to a mutated form of the reference sequence.
  • segments lengths of 9-21 nucleotides are typical.
  • the probe has no nucleotides other than the 9-21 nucleotide segment.
  • the pooled positions comprise nucleotides that allow the pooled probe to hybridize to every variant assigned a particular nonzero value in a particular digit.
  • the pooled positions further comprise a nucleotide that allows the pooled probe to hybridize to the reference sequence.
  • a wildtype target or reference sequence
  • a target is hybridized to the pools, only those pools comprising a component probe having a segment that is exactly complementary to the target light up.
  • the identity of the target is then decoded from the pattern of hybridizing pools.
  • Each pool that lights up is correlated with a particular value in a particular digit.
  • the aggregate hybridization patterns of each lighting pool reveal the value of each digit in the code defining the identity of the target hybridized to the array.
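The decoding step described above can be sketched as follows. Each pool is modeled as a (digit position, nonzero value) pair. The representation and the function names are assumptions for illustration, and the additional pooling of the reference nucleotide into every pool is left out to keep the sketch short.

```python
def to_digits(number, base, width):
    """Digits of `number` in `base`, least-significant digit first, padded to `width`."""
    digits = []
    for _ in range(width):
        number, digit = divmod(number, base)
        digits.append(digit)
    return digits


def lit_pools(number, base, width):
    """Pools expected to light up for the variant assigned `number`."""
    return {
        (pos, digit)
        for pos, digit in enumerate(to_digits(number, base, width))
        if digit != 0
    }


def decode(lit, base, width):
    """Recover the variant number from the set of lit (position, value) pools."""
    digits = [0] * width
    for pos, value in lit:
        digits[pos] = value
    return sum(digit * base ** pos for pos, digit in enumerate(digits))
```

A variant designated 101 in binary is number 5, so `lit_pools(5, 2, 3)` lights the pools for the first and third digits, and `decode` maps that pattern back to 5.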
  • Probes that contain partial matches to two separate (i.e., non contiguous) subsequences of a target sequence sometimes hybridize strongly to the target sequence. In certain instances, such probes have generated stronger signals than probes of the same length which are perfect matches to the target sequence. It is believed (but not necessary to the invention) that this observation results from interactions of a single target sequence with two or more probes simultaneously.
  • This invention exploits this observation to provide arrays of probes having at least first and second segments, which are respectively complementary to first and second subsequences of a reference sequence. Optionally, the probes may have a third or more complementary segments. These probes can be employed in any of the strategies noted above.
  • the two segments of such a probe can be complementary to disjoint subsequences of the reference sequences or contiguous subsequences. If the latter, the two segments in the probe are inverted relative to the order of the complement of the reference sequence.
  • the two subsequences of the reference sequence each typically comprises about 3 to 30 contiguous nucleotides.
  • the subsequences of the reference sequence are sometimes separated by 0, 1, 2 or 3 bases. Often the subsequences are adjacent and nonoverlapping.
  • the bridging strategy can offer the following advantages: (1) Higher discrimination between matched and mismatched probes, (2) The possibility of using longer probes in a bridging tiling, thereby increasing the specificity of the hybridization, without sacrificing discrimination, (3) The use of probes in which an interrogation position is located very off-center relative to the regions of target complementarity. This may be of particular advantage when, for example, a probe centered about one region of the target gives low hybridization signal. The low signal is overcome by using a probe centered about an adjoining region giving a higher hybridization signal. (4) Disruption of secondary structure that might result in annealing of certain probes (see previous discussion of helper mutations).
  • the invention also provides a deletion tiling strategy.
  • Deletion tiling is related to both the bridging and helper mutant strategies described above.
  • comparisons are performed between probes sharing a common deletion but differing from each other at an interrogation position located outside the deletion.
  • a first probe comprises first and second segments, each exactly complementary to respective first and second subsequences of a reference sequence, wherein the first and second subsequences of the reference sequence are separated by a short distance (e.g., 1 or 2 nucleotides).
  • the order of the first and second segments in the probe is usually the same as that of the complement to the first and second subsequences in the reference sequence.
  • Such tilings sometimes offer superior discrimination in hybridization intensities between the probe having an interrogation position complementary to the target and other probes.
  • the difference between the hybridizations to matched and mismatched targets for the probe set shown above is the difference between a single-base bulge, and a large asymmetric loop (e.g., two bases of target, one of probe). This often results in a larger difference in stability than the comparison of a perfectly matched probe with a probe showing a single base mismatch in the basic tiling strategy.
  • the use of deletion or bridging probes is quite general. These probes can be used in any of the tiling strategies of the invention.
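A deletion tiling probe set of the kind described above might be built as in the following sketch. Sequences are written base-by-base in aligned orientation (no strand reversal) for readability, and the function name, parameters and example reference are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-by-base complement, kept in aligned (not reversed) orientation."""
    return "".join(COMPLEMENT[base] for base in seq)


def deletion_probe_set(reference, first_start, first_len, gap, second_len,
                       interrogation_offset):
    """Build four probes sharing one deletion and differing at one position.

    The two segments complement subsequences of `reference` separated by `gap`
    skipped nucleotides and keep the same order as in the reference.
    `interrogation_offset` indexes into the joined probe, outside the deletion.
    """
    first = reference[first_start:first_start + first_len]
    second_start = first_start + first_len + gap
    second = reference[second_start:second_start + second_len]
    matched = complement(first) + complement(second)
    probes = {}
    for base in "ACGT":
        chars = list(matched)
        chars[interrogation_offset] = base  # vary only the interrogation position
        probes[base] = "".join(chars)
    return matched, probes


matched_probe, probe_set = deletion_probe_set(
    "ACGTACGTACGT", first_start=0, first_len=5, gap=1, second_len=5,
    interrogation_offset=2,
)
```

Exactly one of the four probes equals `matched_probe`. The other three carry a mismatch at the interrogation position in addition to the shared deletion.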
  • the target polynucleotide whose sequence is to be determined, is usually isolated from a tissue sample. If the target is genomic, the sample may be from any tissue (except exclusively red blood cells). For example, whole blood, peripheral blood lymphocytes or PBMC, skin, hair or semen are convenient sources of clinical samples. These sources are also suitable if the target is RNA. Blood and other body fluids are also a convenient source for isolating viral nucleic acids.
  • the sample is obtained from a tissue in which the mRNA is expressed.
  • the polynucleotide in the sample is RNA, it is usually reverse transcribed to DNA.
  • DNA samples or cDNA resulting from reverse transcription are usually amplified, e.g., by PCR.
  • the amplification product can be RNA or DNA. Paired primers are selected to flank the borders of a target polynucleotide of interest. More than one target can be simultaneously amplified by multiplex PCR in which multiple paired primers are employed.
  • the target can be labeled at one or more nucleotides during or after amplification.
  • target polynucleotides e.g., episomal DNA
  • sufficient DNA is present in the tissue sample to dispense with the amplification step.
  • the sense of the strand should of course be complementary to that of the probes on the chip. This is achieved by appropriate selection of primers.
  • the target can be fragmented before application to the chip to reduce or eliminate the formation of secondary structures in the target.
  • the average size of target segments following hybridization is usually larger than the size of probe on the chip.
  • genome sequencing can be accomplished according to the enzymatic/Sanger method (described in F. Sanger, S. Nicklen, and A. R. Coulson, Proc. Natl. Acad. Sci, USA, 74:5463-5467 (1977)) and involve cloning and subcloning (described in U.S. Patent No. 4725677; Chen and Seeburg, DNA 4, 165-170 (1985); Lim et al., Gene Anal., Techn. 5, 32-39 (1988); PCR Protocols- A Guide to Methods and Applications. Innis et al., editors, Academic Press, San Diego (1990); Innis et al., Proc. Nat. Acad. Sci.
  • sequencing can be accomplished according to the chemical/Maxam and Gilbert method which is described in references: A. M. Maxam, and W. Gilbert, Proc. Nat. Acad. of Sci., USA, 74:560-564 (1977) and
  • genome sequencing can be accomplished by methodology described by Guo and Wu (Guo and Wu, Nucleic Acids Res., 10:2065 (1982); and Meth. Enz., 100:60 (1983)) or those methods that utilize 3'hydroxy-protected and labeled nucleotides as exemplified in the following references: Churchich, J.E., Eur. J. Biochem., 231 :736 (1995);
  • sequencing may be read by autoradiography using radioisotopes (as described in Ornstein et al., Biotechniques 2, 476 (1985)) or by using non-radioactive labeling strategies that have been integrated into partly automated DNA sequencing procedures (Smith et al., Nature 321, 674-679 (1986) and EPO Patent No. 873 00998.9; Du Pont De Nemours EPO Application No. 03 59225; Ansorge et al., J. Biochem. Biophys. Method 13, 325-32 (1986); Prober et al.
  • this invention provides for various methods of reading sequencing data such as capillary zone electrophoresis (described in Jorgenson et al., J. Chromatography 352, 337 (1986); Gesteland et al., Nucleic Acids Res. 18, 1415- 1419 (1990)), mass spectrometry (including ES [described in Fenn et al. J. Phys. Chem. 18, 4451-59 (1984); PCT Application No.
  • the invention provides a method of performing whole cell engineering comprising the step of cell screening.
  • the method includes DNA amplification.
  • DNA can be amplified by a variety of procedures including cloning (Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, 1989), polymerase chain reaction (PCR) (C.R. Newton and A. Graham, PCR, BIOS Publishers, 1994; Bevan et al., "Sequencing of PCR-Amplified DNA" PCR Meth. App. 4:222 (1992)), ligase chain reaction (LCR) (F. Barany, Proc. Natl. Acad. Sci. USA 88, 189-93 (1991)), strand displacement amplification (SDA) (G.
  • PCR polymerase chain reaction
  • LCR ligase chain reaction
  • SDA strand displacement amplification
  • this invention provides for additional sequencing methods (as described in Labeit et al., DNA 5, 173-177 (1986); Amersham, PCT Application GB86/00349; Eckstein et al., Nucleic Acids Res. 16, 9947 (1988); Max-Planck-Gesellschaft, DE 3930312 A1; Saiki, R. et al., Science 239:487-491 (1988); Sarkar, G. and Bolander, Mark E., Semi Exponential Cycle Sequencing, Nucleic Acids Research, 1995, Vol. 23, No. 7, p. 1269-1270).
  • This invention also provides for the following sequencing strategies: shotgun sequencing, transposon-mediated directed sequencing (Strathmann, M. et al.
  • the step of genomic sequencing includes constructing ordered clone maps of DNA sequencing (as described in sections of U.S. Patent Publication No. 5604100 and PCT Patent Publication No. WO9627025). This invention provides that the method of genome sequencing be achieved by various steps that may utilize modifications of certain methods mentioned above (described in the following patents: PCT Publication Nos.
  • this invention provides for the use of a relational database system for storing and manipulating biomolecular sequence information and storing and displaying genetic information
  • the database including genomic libraries for a plurality of types of organisms, the libraries having multiple genomic sequences, at least some of which represent open reading frames located along a contiguous sequence on each of the plurality of organisms' genomes, and a user interface capable of receiving a selection of two or more of the genomic libraries for comparison and displaying the results of the comparison.
  • Associated with the database is a software system that allows a user to determine the relative position of a selected gene sequence within a genome. The system allows execution of a method of displaying the genetic locus of a biomolecular sequence.
  • the method involves providing a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome.
  • the system also provides a user interface capable of receiving a selection of one or more probe open reading frames for use in determining homologous matches between such probe open reading frame(s) and the open reading frames in the genomic libraries, and displaying the results of the determination.
  • An open reading frame for the sequence is selected and displayed together with adjacent open reading frames located upstream and downstream in the relative positions in which they occur on the contiguous sequence.
  • the invention provides a relational database system for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to one or more protein function hierarchies.
  • the hierarchies allow searches for sequences based upon a protein's biological function or molecular function. Also disclosed is a mechanism for automatically grouping new sequences into protein function hierarchies. This mechanism uses descriptive information obtained from "external hits" which are matches of stored sequences against gene sequences stored in an external database such as GenBank. The descriptive information provided with the external database is evaluated according to a specific algorithm and used to automatically group the external hits (or the sequences associated with the hits) in the categories. Ultimately, the biomolecular sequences stored in databases of this invention are provided with both descriptive information from the external hit and category information from a relevant hierarchy or hierarchies.
  • a relational database system for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to association with one or more projects for obtaining full-length biomolecular sequences from shorter sequences.
  • the relational database has sequence records containing information identifying one or more projects to which each of the sequence records belong. Each project groups together one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence.
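One way to picture sequence records that identify the projects they belong to is a small relational schema. The sketch below uses Python's standard `sqlite3` module. All table names, column names and sample rows are hypothetical and are not taken from the source.

```python
import sqlite3

# Hypothetical schema: a link table lets one sequence belong to several projects.
SCHEMA = """
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE sequence (
    seq_id   INTEGER PRIMARY KEY,
    label    TEXT NOT NULL,
    residues TEXT NOT NULL
);
CREATE TABLE project_sequence (
    project_id INTEGER NOT NULL REFERENCES project(project_id),
    seq_id     INTEGER NOT NULL REFERENCES sequence(seq_id),
    PRIMARY KEY (project_id, seq_id)
);
"""


def build_demo_db():
    """Create an in-memory database with one project and three made-up sequences."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO project VALUES (1, 'full-length gene X')")
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?, ?)",
        [(1, "seed fragment", "ACGT"),
         (2, "extension clone", "ACGTACGT"),
         (3, "unrelated", "TTTT")],
    )
    conn.executemany("INSERT INTO project_sequence VALUES (?, ?)", [(1, 1), (1, 2)])
    conn.commit()
    return conn


def sequences_in_project(conn, project_name):
    """Labels of all sequences grouped under the named project."""
    rows = conn.execute(
        """SELECT s.label FROM sequence AS s
           JOIN project_sequence AS ps ON ps.seq_id = s.seq_id
           JOIN project AS p ON p.project_id = ps.project_id
           WHERE p.name = ? ORDER BY s.seq_id""",
        (project_name,),
    )
    return [label for (label,) in rows]
```

The link table is a design choice for the sketch: it lets a sequence record identify one or more projects, as the text requires, without duplicating sequence data.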
  • the computer system has a user interface allowing a user to selectively view information regarding one or more projects.
  • the relational database also provides interfaces and methods for accessing and manipulating and analyzing project-based information. Polymer sequences can be assembled into bins. A first number of bins are populated with polymer sequences.
  • the polymer sequences in each bin are assembled into one or more consensus sequences representative of the polymer sequences of the bin.
  • the consensus sequences of the bins are compared to determine relationships, if any, between the consensus sequences of the bins.
  • the bins are modified based on the relationships between the consensus sequences of the bins.
  • the polymer sequences are reassembled in the modified bins to generate one or more modified consensus sequences for each bin representative of the modified bins.
  • sequence similarities and dissimilarities are analyzed in a set of polymer sequences. Pairwise alignment data is generated for pairs of the polymer sequences.
  • the pairwise alignment data defines regions of similarity between the pairs of polymer sequences with boundaries.
  • ANNOTATING - GENERAL METHODOLOGY In one aspect the invention provides relational databases for storing and retrieving biological information. More particularly the invention relates to systems and methods for providing sequences of biological molecules in a relational format allowing retrieval in a client-server environment and for providing full-length cDNA sequences in a relational format allowing retrieval in a client-server environment.
  • the present invention provides relational database systems for storing and analyzing biomolecular sequence information together with biological annotations detailing the source and interpretation of the sequence data.
  • the present invention provides a powerful database tool for drug development and other research and development purposes.
  • the present invention provides relational database systems for storing and analyzing biomolecular sequence information together with biological annotations detailing the source and interpretation of the sequence data.
  • Disclosed are relational database systems for storing and displaying genetic information.
  • a software system that allows a user to determine the relative position of a selected gene sequence within a genome.
  • the system allows execution of a method of displaying the genetic locus of a biomolecular sequence.
  • the method involves providing a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. An open reading frame for the sequence is selected and displayed together with adjacent open reading frames located upstream and downstream in the relative positions in which they occur on the contiguous sequence.
  • the invention provides a method of displaying the genetic locus of a biomolecular sequence.
  • the method involves providing a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome.
  • the method further involves identifying a selected open reading frame, and displaying the selected open reading frame together with adjacent open reading frames located upstream and downstream from the selected open reading frame.
  • the adjacent open reading frames and the selected open reading frame are displayed in the relative positions in which they occur on the contiguous sequence, textually and/or graphically.
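The locus display described above amounts to ordering open reading frames by position and showing the selected one with its neighbors. The following is a minimal textual sketch. The `Orf` record, its coordinates and the `flank` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    name: str
    start: int  # position on the contiguous sequence (hypothetical coordinates)
    end: int


def orf_neighborhood(orfs, selected_name, flank=1):
    """Return the selected ORF with up to `flank` neighbors on each side, in genome order."""
    ordered = sorted(orfs, key=lambda orf: orf.start)
    index = next(i for i, orf in enumerate(ordered) if orf.name == selected_name)
    return ordered[max(0, index - flank): index + flank + 1]


def render(neighborhood, selected_name):
    """Textual display; the selected ORF is marked with square brackets."""
    return " -- ".join(
        f"[{orf.name}]" if orf.name == selected_name else orf.name
        for orf in neighborhood
    )


orfs = [
    Orf("orfC", 900, 1200),
    Orf("orfA", 10, 300),
    Orf("orfB", 350, 800),
    Orf("orfD", 1300, 1500),
]
```

Calling `render(orf_neighborhood(orfs, "orfB"), "orfB")` yields the selected frame between its upstream and downstream neighbors. A scrolling command could be handled by re-running the same lookup with a new selected name.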
  • the method of the invention may be practiced with sequences from microbial organisms, and the sequences may include nucleic acid or protein sequences.
  • the invention also provides a computer system including a database having multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome.
  • the computer system also includes a user interface capable of identifying a selected open reading frame, and displaying the selected open reading frame together with adjacent open reading frames located upstream and downstream from the selected open reading frame. The adjacent open reading frames and the selected open reading frame are displayed in the relative positions in which they occur on the contiguous sequence.
  • the user interface may also be capable of detecting a scrolling command, and based upon the direction and magnitude of the scrolling command, identifying a new selected open reading frame from the contiguous sequence.
  • the invention further provides a computer program product comprising a computer-usable medium having computer-readable program code embodied thereon relating to a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome.
  • the computer program product includes computer-readable program code for identifying a selected open reading frame, and displaying the selected open reading frame together with adjacent open reading frames located upstream and downstream from the selected open reading frame. The adjacent open reading frames and the selected open reading frame are displayed in the relative positions in which they occur on the contiguous sequence.
  • Comparative Genomics is a feature of the database system of the present invention which allows a user to compare the sequence data of sets of different organism types.
  • Comparative searches may be formulated in a number of ways using the Comparative Genomics feature. For example, genes common to a set of organisms may be identified through a "commonality" query, and genes unique to one of a set of organisms may be identified through a "subtraction” query.
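The "commonality" and "subtraction" queries can be pictured as set operations. This sketch assumes each library has already been reduced to a set of gene-family identifiers, so that homologous genes in different organisms share an identifier. That reduction, the identifiers and the function names are assumptions and not part of the source.

```python
def commonality(libraries, selected):
    """Gene identifiers present in every selected library."""
    sets = [set(libraries[name]) for name in selected]
    return set.intersection(*sets)


def subtraction(libraries, target, others):
    """Gene identifiers present in `target` but in none of the `others`."""
    result = set(libraries[target])
    for name in others:
        result -= set(libraries[name])
    return result


# Hypothetical libraries keyed by organism; values are gene-family identifiers.
libraries = {
    "organism_1": {"famA", "famB", "famC"},
    "organism_2": {"famB", "famC", "famD"},
    "organism_3": {"famC", "famE"},
}
```

For these made-up libraries, the commonality query over all three organisms returns only `famC`, and the subtraction query for `organism_1` against the other two returns only `famA`.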
  • Electronic Southern is a feature of the present database system which is useful for identifying genomic libraries in which a given gene or ORF exists.
  • a Southern analysis is a conventional molecular biology technique in which a nucleic acid of known sequence is used to identify matching (complementary) sequences in a sample of nucleic acid to be analyzed.
  • Electronic Southerns may be used to locate homologous matches between a "probe" DNA sequence and a large number of DNA sequences in one or more libraries.
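An Electronic Southern can be sketched as a search of every library for sequences matching a probe. The source does not specify the matching algorithm, so the ungapped sliding-window identity score below is a simple stand-in for a real homology search. The threshold, names and data are hypothetical.

```python
def best_identity(probe, sequence):
    """Highest fraction of matching positions over all ungapped placements of the probe."""
    best = 0.0
    for start in range(len(sequence) - len(probe) + 1):
        window = sequence[start:start + len(probe)]
        matches = sum(a == b for a, b in zip(probe, window))
        best = max(best, matches / len(probe))
    return best


def electronic_southern(probe, libraries, min_identity=0.9):
    """Map each library name to the IDs of its sequences that match the probe."""
    hits = {}
    for library_name, sequences in libraries.items():
        matched = [
            seq_id
            for seq_id, sequence in sequences.items()
            if best_identity(probe, sequence) >= min_identity
        ]
        if matched:
            hits[library_name] = matched
    return hits


# Hypothetical libraries: library name -> {sequence ID: sequence}.
libraries = {
    "lib1": {"s1": "AAACGTACGTTT"},
    "lib2": {"s2": "GGGGGGGGGG"},
}
```

Only libraries with at least one matching sequence appear in the result, which corresponds to identifying the libraries in which a given gene or ORF exists.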
  • the present invention provides a method of comparing genetic complements of different types of organisms. The method involves providing a database having sequence libraries with multiple biomolecular sequences for different types of organisms, where at least some of the sequences represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The method further involves receiving a selection of two or more of the sequence libraries for comparison, determining open reading frames common or unique to the selected sequence libraries, and displaying the results of the determination. The invention also provides a method of comparing genomic complements of different types of organisms.
  • the method involves providing a database having genomic sequence libraries with multiple biomolecular sequences for different types of organisms, where at least some of the sequences represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes.
  • the method further involves receiving a selection of two or more of the sequence libraries for comparison, determining sequences common or unique to the selected sequence libraries, and displaying the results of the determination.
  • the invention further provides a computer system including a database containing genomic libraries for different types of organisms, which libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes.
  • the system also includes a user interface capable of receiving a selection of two or more genomic libraries for comparison and displaying the results of the comparison.
  • Another aspect of the present invention provides a method of identifying libraries in which a given gene exists.
  • the method involves providing a database including genomic libraries for one or more types of organisms.
  • the libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes.
  • the method further involves receiving a selection of one or more probe sequences, determining homologous matches between the selected probe sequences and the sequences in the genomic libraries, and displaying the results of the determination.
  • the invention also provides a computer system including a database including genomic libraries for one or more types of organisms, which libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes.
  • the system also includes a user interface capable of receiving a selection of one or more probe sequences for use in determining homologous matches between one or more probe sequences and the sequences in the genomic libraries, and displaying the results of the determination.
  • a computer program product including a computer- usable medium having computer-readable program code embodied thereon relating to a database including genomic libraries for one or more types of organisms.
  • the libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes.
  • the computer program product includes computer-readable program code for providing, within a computing system, an interface for receiving a selection of two or more genomic libraries for comparison, determining sequences common or unique to the selected genomic libraries, and displaying the results of the determination. Additionally provided is a computer program product including a computer-usable medium having computer-readable program code embodied thereon relating to a database including genomic libraries for one or more types of organisms.
  • the libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes.
  • the computer program product includes computer-readable program code for providing, within a computing system, an interface for receiving a selection of one or more probe open reading frames, determining homologous matches between the probe sequences and the sequences in the genomic libraries, and displaying the results of the determination.
  • the invention further provides a method of presenting the genetic complement of an organism. The method involves providing a database including sequence libraries for a plurality of types of organisms, where the libraries have multiple biomolecular sequences, at least some of which represent open reading
  • the method further involves receiving a selection of one of the sequence libraries, determining open reading frames within the selected sequence library, and displaying the results as one or more unique identifiers for groups of related open reading frames.
  • the present invention provides relational database systems for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to one or more protein function hierarchies.
  • the hierarchies are provided to allow carefully tailored searches for sequences based upon a protein's biological function or molecular function. To make this capability available in large sequence databases, the invention provides a mechanism for automatically grouping new sequences into protein function hierarchies.
  • the mechanism takes advantage of descriptive information obtained from "external hits" which are matches of stored sequences against gene sequences stored in an external database such as GenBank.
  • the descriptive information provided with GenBank is evaluated according to a specific algorithm and used to automatically group the external hits (or the sequences associated with the hits) in the categories.
  • the biomolecular sequences stored in databases of this invention are provided with both descriptive information from the external hit and category information from a relevant hierarchy or hierarchies.
  • the invention provides a computer system having a database containing records pertaining to a plurality of biomolecular sequences. At least some of the biomolecular sequences are grouped into a first hierarchy of protein function categories, the protein function categories specifying biological functions of proteins corresponding to the biomolecular sequences and the first hierarchy.
  • the hierarchy includes a first set of protein function categories specifying biological functions at a cellular level, and a second set of protein function categories specifying biological functions at a level above the cellular level.
  • the computer system of the invention also includes a user interface allowing a user to selectively view information regarding the plurality of biomolecular sequences as it relates to the first hierarchy.
  • the computer system may also include additional protein function categories based, for example, on molecular or enzymatic function of proteins.
  • the biomolecular sequences may include nucleic acid or amino acid sequences. Some of said biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about such projects.
  • the invention also provides a method of using a computer system to present information pertaining to a plurality of biomolecular sequence records stored in a database.
  • the method involves displaying a list of the records or a field for entering information identifying one or more of the records, identifying one or more of the records that a user has selected from the list or field, matching the one or more selected records with one or more protein function categories from a first hierarchy of protein function categories into which at least some of the biomolecular sequence records are grouped, and displaying the one or more categories matching the one or more selected records.
  • the protein function categories specify biological functions of proteins corresponding to the biomolecular sequences and the first hierarchy includes a first set of protein function categories specifying biological functions at a cellular level, and a second set of protein function categories specifying biological functions at a tissue level.
  • the method may also involve matching the records against other protein function hierarchies, such as hierarchies based on molecular and/or enzymatic function, and displaying the results.
  • At least some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about those projects. Additionally, the invention provides a method of using a computer system to present information pertaining to a plurality of biomolecular sequence records stored in a database.
  • the method involves displaying a list of one or more protein biological function categories from a first hierarchy of protein biological function categories into which at least some of the biomolecular sequence records are grouped, identifying one or more of the protein biological function categories that a user has selected from the list, matching the one or more selected protein biological function categories with one or more biomolecular sequence records which are grouped in the selected protein biological function categories, and displaying the one or more sequence records matching the one or more selected protein biological function categories.
  • the protein biological function categories specify biological functions of proteins corresponding to the biomolecular sequences and the first hierarchy includes a first set of protein biological function categories specifying biological functions at a cellular level, and a second set of protein biological function categories specifying biological functions at a tissue level.
  • the method may also involve matching the records against other protein function hierarchies, such as hierarchies based on molecular and/or enzymatic function, and displaying the results.
  • At least some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about those projects.
  • Another aspect of the invention provides a database system having a plurality of internal records.
  • the database includes a plurality of sequence records specifying biomolecular sequences, at least some of which records reference hits to an external database, which hits specify genes having sequences that at least partially match those of the biomolecular sequences.
  • the database also includes a plurality of external hit records specifying the hits to the external database, and at least some of the records reference protein function hierarchy categories which specify at least one of biological functions of proteins or molecular functions of proteins. At least some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about those projects. Further aspects of the present invention provide a method of using a computer system and a computer readable medium having program instructions to automatically categorize biomolecular sequence records into protein function categories in an internal database.
  • the method and program involve receiving descriptive information about a biomolecular sequence in the internal database from a record in an external database pertaining to a gene having a sequence that at least partially matches that of the biomolecular sequence.
  • a determination is made whether the descriptive information contains one or more terms matching one or more keywords associated with a first protein function category, the keywords being terms consistent with a classification in the first protein function category.
  • a determination is made whether the descriptive information contains a term matching one or more anti-keywords associated with the first protein function category, the anti-keywords being terms inconsistent with a classification in the first protein function category.
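The keyword and anti-keyword determinations above can be sketched as follows. This is an illustration only, not the categorization algorithm of the invention: it assumes case-insensitive substring matching, and it assumes that a record is placed in a category only when at least one keyword matches and no anti-keyword matches (the text does not state how the two determinations are combined). The `Category` class, the `categorize` function, and the example terms are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Category:
    name: str
    keywords: Set[str]                                     # terms consistent with the category
    anti_keywords: Set[str] = field(default_factory=set)   # terms inconsistent with it


def categorize(description: str, categories: List[Category]) -> List[str]:
    """Return names of categories whose keywords match and whose anti-keywords do not."""
    text = description.lower()
    matched = []
    for cat in categories:
        has_keyword = any(k in text for k in cat.keywords)
        has_anti_keyword = any(a in text for a in cat.anti_keywords)
        # Assumed combination rule: keyword hit required, anti-keyword hit vetoes.
        if has_keyword and not has_anti_keyword:
            matched.append(cat.name)
    return matched


# Hypothetical categories and external-hit descriptions.
CATEGORIES = [
    Category("kinase", {"kinase"}, {"kinase inhibitor"}),
    Category("transport", {"transporter", "channel"}),
]

print(categorize("Serine/threonine protein kinase", CATEGORIES))
print(categorize("Cyclin-dependent kinase inhibitor 1", CATEGORIES))
```

The anti-keyword guards against descriptions that mention a keyword while describing something else, such as an inhibitor of a kinase rather than a kinase.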
  • the present invention provides relational database systems for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to one or more characteristics.
  • the sequence information of the database is generated by one or more "projects" which are concerned with identifying the full-length coding sequence of a gene (i.e., mRNA).
  • the projects involve the extension of an initial sequenced portion of a clone of a gene of interest (e.g., an EST) by a variety of methods which use conventional molecular biological techniques, recently developed adaptations of these techniques, and certain novel database applications.
  • Data accumulated in these projects may be provided to the database of the present invention throughout the course of the projects and may be available to database users (subscribers) throughout the course of these projects for research, product (i.e., drug) development, and other purposes.
  • the database of the present invention and its associated projects may provide sequence and related data in amounts and forms not previously available.
  • the present invention can make partial and full-length sequence information for a given gene available to a user both during the course of the data acquisition and once the full-length sequence of the gene has been elucidated.
  • the database can provide a variety of tools for analysis and manipulation of the data, including Northern analysis and Expression summaries.
  • the present invention should permit more complete and accurate annotation of sequence data, as well as the study of relationships between genes of different tissues, systems or organisms, and ultimately detailed expression studies of full-length gene sequences.
  • the invention provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belong. Each project groups together one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence.
  • the computer system also has a user interface allowing a user to selectively view information regarding one or more projects.
  • the biomolecular sequences may include nucleic acid or amino acid sequences.
  • the user interface may allow users to view at least three levels of project information including a project information results level listing at least some of the projects in said database, a sequence information results level listing at least some of the sequences associated with a given project, and a sequence retrieval results level sequentially listing monomers which comprise a given sequence.
  • a method of using a computer system and a computer program product to present information pertaining to a plurality of sequence records stored in a database are also provided by the present invention.
  • the sequence records contain information identifying one or more projects to which each of the sequence records belong.
  • Each of the projects groups one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence.
  • the method and program involve providing an interface for entering query information relating to one or more projects, locating data corresponding to the entered query information, and displaying the data corresponding to the entered query information. Additionally, the invention provides a method of using a computer system to present information pertaining to a plurality of sequence records stored in a database.
  • the sequence records contain information identifying one or more projects to which each of the sequence records belong.
  • Each of the projects groups one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence.
  • the method involves displaying a list of one or more project identifiers, determining which project identifier or identifiers from the list is selected by a user, then displaying a second list of one or more biomolecular sequence identifiers associated with the selected project identifier or identifiers, determining which sequence identifier or identifiers from the second list has been selected by a user, and displaying a third list of one or more sequences corresponding to the selected sequence identifier or identifiers. Following the display of the third list, a determination may be made whether and which sequence from the third list has been selected by a user.
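The three successive lists described above (project identifiers, then sequence identifiers, then sequences) can be sketched with an in-memory mapping. The identifiers, the data, and the function names are hypothetical, and a plain dictionary stands in for the relational database.

```python
from typing import Dict, List

# Hypothetical records: project identifier -> {sequence identifier -> sequence}.
PROJECTS: Dict[str, Dict[str, str]] = {
    "PRJ-1": {"SEQ-1": "ATGGCC", "SEQ-2": "ATGGCCTTAGGA"},
    "PRJ-2": {"SEQ-3": "ATGAAACCC"},
}


def project_list() -> List[str]:
    """First list: all project identifiers."""
    return sorted(PROJECTS)


def sequence_id_list(selected_projects: List[str]) -> List[str]:
    """Second list: sequence identifiers belonging to the selected projects."""
    return sorted(sid for pid in selected_projects for sid in PROJECTS[pid])


def sequence_list(selected_ids: List[str]) -> Dict[str, str]:
    """Third list: the sequences for the selected sequence identifiers."""
    flat = {sid: seq for members in PROJECTS.values() for sid, seq in members.items()}
    return {sid: flat[sid] for sid in selected_ids}


print(project_list())
print(sequence_id_list(["PRJ-1"]))
print(sequence_list(["SEQ-2"]))
```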
  • the invention further provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belong, each of said projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence.
  • the system also has a user interface capable of allowing a user to select one or more project identifiers or project member identifiers specifying one or more sequences to be compared with one or more cDNA sequence libraries, and displaying matches resulting from that comparison.
  • a method of using a computer system to present comparative information pertaining to a plurality of sequence records stored in a database is also provided by the present invention.
  • the sequence records contain information identifying one or more projects to which each of the sequence records belong, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence.
  • the method involves providing an interface capable of allowing a user to select one or more project identifiers or project member identifiers specifying one or more sequences, comparing the one or more specified sequences with one or more cDNA sequence libraries, and displaying matches resulting from the comparison.
  • the invention provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belong, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence.
  • the system also has a user interface allowing a user to view expression information pertaining to the projects by selecting one or more expression categories for a query, and displaying the result of the query.
  • a method of using a computer system to view expression information pertaining to one or more projects, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence is also provided by the invention.
  • the computer system includes a database storing a plurality of sequence records, the sequence records containing information identifying one or more projects to which each of the sequence records belong.
  • the method involves providing an interface which allows a user to select one or more expression categories as a query, locating projects belonging to the selected one or more expression categories, and displaying a list of located projects.
  • the present invention provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belong, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence.
  • This computer system has a user interface allowing a user to selectively view information regarding said one or more projects and which displays information to a user in a format common to one or more other sequence databases.
  • Polymer sequences are assembled into bins. A first number of bins are populated with polymer sequences. The polymer sequences in each bin are assembled into one or more consensus sequences representative of the polymer sequences of the bin. The consensus sequences of the bins are compared to determine relationships, if any, between the consensus sequences. The bins are modified based on the relationships between the consensus sequences. The polymer sequences are reassembled in the modified bins to generate one or more modified consensus sequences for each bin representative of the modified bins.
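The bin workflow above (populate bins, build a consensus per bin, compare consensus sequences, modify the bins, reassemble) can be sketched in a deliberately simplified form. The sketch assumes that all sequences are pre-aligned and of equal length, that the consensus is a column-wise majority vote, that "related" means fractional identity at or above a threshold, and that modifying the bins means merging related ones. None of these choices comes from the text; real assemblers work on overlapping reads of differing lengths.

```python
from collections import Counter
from typing import List


def consensus(seqs: List[str]) -> str:
    """Column-wise majority consensus of equal-length, pre-aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def merge_related_bins(bins: List[List[str]], threshold: float = 0.9) -> List[List[str]]:
    """Merge bins whose consensus sequences are related (identity >= threshold)."""
    merged: List[List[str]] = []
    for members in bins:
        cons = consensus(members)
        for target in merged:
            if identity(cons, consensus(target)) >= threshold:
                target.extend(members)  # modify the existing bin
                break
        else:
            merged.append(list(members))  # no related bin found; keep as its own bin
    return merged


# Hypothetical initial bins.
bins = [
    ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC"],
    ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"],
    ["TTTTGGGGCC", "TTTTGGGGCC"],
]
modified = merge_related_bins(bins)
for members in modified:
    # Reassemble: recompute the consensus for each modified bin.
    print(consensus(members), len(members))
```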
  • sequence similarities and dissimilarities are analyzed in a set of polymer sequences.
  • Pairwise alignment data is generated for pairs of the polymer sequences.
  • the pairwise alignment data defines regions of similarity between the pairs of polymer sequences with boundaries. Additional boundaries in particular polymer sequences are determined by applying at least one boundary from at least one pairwise alignment for one pair of polymer sequences to at least one other pairwise alignment for another pair of polymer sequences including one of the particular polymer sequences. Additional regions of similarity are generated based on the boundaries.
  • ANNOTATING - RELATIONAL DATABASES: The present invention provides an improved relational database for storing and manipulating genomic sequence information. While the invention is described in terms of a database optimized for microbial data, it is by no means so limited.
  • the invention may be employed to investigate data from various sources.
  • the invention covers databases optimized for other sources of sequence data, such as animal sequences (e.g., human, primate, rodent, amphibian, insect, etc.), plant sequences and microbial sequences.
  • SAGE (Serial Analysis of Gene Expression)
  • RNA molecules with the ability to bind a predetermined protein or a predetermined dye molecule were selected by alternate rounds of selection and PCR amplification (Tuerk and Gold, 1990; Ellington and Szostak, 1990).
  • Proteomics: This invention, in another aspect, relates to the emerging field of proteomics. Proteomics involves the qualitative and quantitative measurement of gene activity by detecting and quantitating expression at the protein level, rather than at the messenger RNA level. Proteomics also involves the study of non-genome encoded events, including the post-translational modification of proteins (including glycosylation or other modifications), interactions between proteins, and the location of proteins within a cell. The structure, function, and/or level of activity of the proteins expressed by the cell are also of interest.
  • proteomics involves the study of part or all of the status of the total protein contained within or secreted by a cell.
  • Proteomics requires means of separating proteins in complex mixtures and identifying both low- and high-abundance species. Examples of powerful methods currently used to resolve complex protein mixtures are 2D gel electrophoresis, reverse phase HPLC, capillary electrophoresis, isoelectric focusing and related hybrid techniques.
  • Commonly used protein identification techniques include N-terminal Edman and mass spectrometry (electrospray [ESI] or matrix-assisted laser desorption ionization [MALDI] MS) and sophisticated database search programs, such as SEQUEST™ (see, e.g., U.S. Patent Nos.
  • SEQUEST™ correlates uninterpreted tandem mass spectra of peptides with amino acid sequences from protein and nucleotide databases. SEQUEST™ can determine the amino acid sequence and thus the protein(s) and organism(s) that correspond to the mass spectrum being analyzed. SEQUEST™ uses algorithms described in U.S. Patent Nos. 6,017,693 and 5,538,897. Using a computer, the output of the mass spectrometry can be analyzed so as to link a gene and the particular protein for which it codes.
  • the present invention is further directed to a method for generating a selected mutant polynucleotide sequence (or a population of selected polynucleotide sequences) typically in the form of amplified and/or cloned polynucleotides, whereby the selected polynucleotide sequence(s) possess at least one desired phenotypic characteristic (e.g., encodes a polypeptide, promotes transcription of linked polynucleotides, binds a protein, and the like) which can be selected for.
  • One method for identifying hybrid polypeptides that possess a desired structure or functional property involves the screening of a large library of polypeptides for individual library members which possess the desired structure or functional property conferred by the amino acid sequence of the polypeptide.
  • One method of screening peptides involves the display of a peptide sequence, antibody, or other protein on the surface of a bacteriophage particle or cell. Generally, in these methods each bacteriophage particle or cell serves as an individual library member displaying a single species of displayed peptide in addition to the natural bacteriophage or cell protein sequences.
  • Each bacteriophage or cell contains the nucleotide sequence information encoding the particular displayed peptide sequence; thus, the displayed peptide sequence can be ascertained by nucleotide sequence determination of an isolated library member.
  • a well-known peptide display method involves the presentation of a peptide sequence on the surface of a filamentous bacteriophage, typically as a fusion with a bacteriophage coat protein.
  • the bacteriophage library can be incubated with an immobilized, predetermined macromolecule or small molecule (e.g., a receptor) so that bacteriophage particles which present a peptide sequence that binds to the immobilized macromolecule can be differentially partitioned from those that do not present peptide sequences that bind to the predetermined macromolecule.
  • the bacteriophage particles (i.e., library members) which are bound to the immobilized macromolecule are then recovered and replicated to amplify the selected bacteriophage sub-population for a subsequent round of affinity enrichment and phage replication.
  • the bacteriophage library members that are thus selected are isolated and the nucleotide sequence encoding the displayed peptide sequence is determined, thereby identifying the sequence(s) of peptides that bind to the predetermined macromolecule (e.g., receptor).
  • the fusion protein/vector DNA complexes can be screened against a predetermined macromolecule in much the same way as bacteriophage particles are screened in the phage-based display system, with the replication and sequencing of the DNA vectors in the selected fusion protein/vector DNA complexes serving as the basis for identification of the selected library peptide sequence(s).
  • the displayed peptide sequences can be of varying lengths, typically from 3-5000 amino acids long or longer, frequently from 5-100 amino acids long, and often from about 8-15 amino acids long.
  • a library can comprise library members having varying lengths of displayed peptide sequence, or may comprise library members having a fixed length of displayed peptide sequence.
  • Portions or all of the displayed peptide sequence(s) can be random, pseudorandom, defined set kernel, fixed, or the like.
  • the present display methods include methods for in vitro and in vivo display of single-chain antibodies, such as nascent scFv on polysomes or scfv displayed on phage, which enable large-scale screening of scfv libraries having broad diversity of variable region sequences and binding specificities.
  • the present invention also provides random, pseudorandom, and defined sequence framework peptide libraries and methods for generating and screening those libraries to identify useful compounds (e.g., peptides, including single-chain antibodies) that bind to receptor molecules or epitopes of interest or gene products that modify peptides or RNA in a desired fashion.
  • the random, pseudorandom, and defined sequence framework peptides are produced from libraries of peptide library members that comprise displayed peptides or displayed single-chain antibodies attached to a polynucleotide template from which the displayed peptide was synthesized.
  • the mode of attachment may vary according to the specific aspect of the invention selected, and can include encapsulation in a phage particle or incorporation in a cell.
  • Screening that utilizes in vitro translation systems: An aspect of this invention provides for the use of in vitro translation during the step of screening. In vitro translation has been used to synthesize proteins of interest and has been proposed as a method for generating large libraries of peptides.
  • Affinity enrichment: An aspect of this invention provides for the use of affinity enrichment, which allows a very large library of peptides and single-chain antibodies to be screened and the polynucleotide sequence encoding the desired peptide(s) or single-chain antibodies to be selected.
  • the polynucleotide can then be isolated and shuffled to recombine combinatorially the amino acid sequence of the selected peptide(s) (or predetermined portions thereof) or single-chain antibodies (or just VHI, VLI or CDR portions thereof).
  • These methods can identify a peptide or single-chain antibody as having a desired binding affinity for a molecule and can exploit the process of shuffling to converge rapidly to a desired high-affinity peptide or scfv.
  • the peptide or antibody can then be synthesized in bulk by conventional means for any suitable use (e.g., as a therapeutic or diagnostic agent).
  • a significant advantage of the present invention is that no prior information regarding an expected ligand structure is required to isolate peptide ligands or antibodies of interest.
  • the peptide identified can have biological activity, which is meant to include at least specific binding affinity for a selected receptor molecule and, in some instances, will further include the ability to block the binding of other compounds, to stimulate or inhibit metabolic pathways, to act as a signal or messenger, to stimulate or inhibit cellular activity, and the like.
  • the present invention also provides a method for shuffling a pool of polynucleotide sequences selected by affinity screening a library of polysomes displaying nascent peptides (including single-chain antibodies) for library members which bind to a predetermined receptor (e.g., a mammalian proteinaceous receptor such as, for example, a peptidergic hormone receptor, a cell surface receptor, an intracellular protein which binds to other protein(s) to form intracellular protein complexes such as hetero-dimers and the like) or epitope (e.g., an immobilized protein, glycoprotein, oligosaccharide, and the like).
  • the invention also provides peptide libraries comprising a plurality of individual library members of the invention, wherein (1) each individual library member of said plurality comprises a sequence produced by shuffling of a pool of selected sequences, and (2) each individual library member comprises a variable peptide segment sequence or single-chain antibody segment sequence which is distinct from the variable peptide segment sequences or single-chain antibody sequences of other individual library members in said plurality (although some library members may be present in more than one copy per library due to uneven amplification, stochastic probability, or the like).
  • Antibody Display: The present method can be used to shuffle, by in vitro and/or in vivo recombination by any of the disclosed methods, and in any combination, polynucleotide sequences selected by antibody display methods, wherein an associated polynucleotide encodes a displayed antibody which is screened for a phenotype (e.g., for affinity for binding a predetermined antigen (ligand)).
  • Various prokaryotic expression systems have been developed that can be manipulated to produce combinatorial antibody libraries which may be screened for high-affinity antibodies to specific antigens.
  • a bacteriophage antibody display library is screened with a receptor (e.g., polypeptide, carbohydrate, glycoprotein, nucleic acid) that is immobilized (e.g., by covalent linkage to a chromatography resin to enrich for reactive phage by affinity chromatography) and/or labeled (e.g., to screen plaque or colony lifts).
  • scfv (single-chain fragment variable)
  • Intracellular expression of an anti-Rev scfv has been shown to inhibit HIV-1 virus replication in vitro (Duan et al, 1994), and intracellular expression of an anti-p21ras scfv has been shown to inhibit meiotic maturation of Xenopus oocytes (Biocca et al, 1993). Recombinant scfv which can be used to diagnose HIV infection have also been reported, demonstrating the diagnostic utility of scfv (Lilley et al, 1994). Fusion proteins wherein an scFv is linked to a second polypeptide, such as a toxin or fibrinolytic activator protein, have also been reported (Holvost et al, 1992; Nicholls et al, 1993).
  • Enzymatic inverse PCR mutagenesis has been shown to be a simple and reliable method for constructing relatively large libraries of scfv site-directed hybrids (Stemmer et al, 1993), as has error-prone PCR and chemical mutagenesis (Deng et al, 1994).
  • Barbas (Barbas et al, 1992) attempted to circumvent the problem of limited repertoire sizes resulting from using biased variable region sequences by randomizing the sequence in a synthetic CDR region of a human tetanus toxoid-binding Fab.
  • Displayed peptide/polynucleotide complexes which encode a variable segment peptide sequence of interest or a single-chain antibody of interest are selected from the library by an affinity enrichment technique. This is accomplished by means of an immobilized macromolecule or epitope specific for the peptide sequence of interest, such as a receptor, other macromolecule, or other epitope species.
  • the affinity selection procedure provides an enrichment of library members encoding the desired sequences, which may then be isolated for pooling and shuffling, for sequencing, and/or for further propagation and affinity enrichment.
  • the library members without the desired specificity are removed by washing.
  • the degree and stringency of washing required will be determined for each peptide sequence or single-chain antibody of interest and the immobilized predetermined macromolecule or epitope. A certain degree of control can be exerted over the binding characteristics of the nascent peptide/DNA complexes recovered by adjusting the conditions of the binding incubation and the subsequent washing.
  • the temperature, pH, ionic strength, divalent cation concentration, and the volume and duration of the washing will select for nascent peptide/DNA complexes within particular ranges of affinity for the immobilized macromolecule. Selection based on slow dissociation rate, which is usually predictive of high affinity, is often the most practical route. This may be done either by continued incubation in the presence of a saturating amount of free predetermined macromolecule, or by increasing the volume, number, and length of the washes. In each case, the rebinding of dissociated nascent peptide/DNA or peptide/RNA complex is prevented, and with increasing time, nascent peptide/DNA or peptide/RNA complexes of higher and higher affinity are recovered.
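Why longer washes favor slowly dissociating complexes can be illustrated numerically. The sketch assumes simple first-order dissociation with rebinding fully prevented, so the fraction of a complex still bound after a wash of duration t is exp(-k_off * t). The off-rate values and function names are hypothetical and chosen only to show the trend.

```python
import math


def fraction_bound(k_off: float, wash_seconds: float) -> float:
    """Fraction of complexes still bound after washing.

    Assumes first-order dissociation with no rebinding: exp(-k_off * t).
    """
    return math.exp(-k_off * wash_seconds)


def enrichment(k_off_slow: float, k_off_fast: float, wash_seconds: float) -> float:
    """Ratio of retained slow-dissociating to retained fast-dissociating complexes."""
    return fraction_bound(k_off_slow, wash_seconds) / fraction_bound(k_off_fast, wash_seconds)


# Hypothetical dissociation rate constants, in 1/second.
slow, fast = 1e-4, 1e-2
for minutes in (1, 10, 30):
    t = minutes * 60
    print(minutes, fraction_bound(slow, t), fraction_bound(fast, t), enrichment(slow, fast, t))
```

Under these assumptions the enrichment ratio grows exponentially with wash time, at the cost of recovering a smaller absolute number of complexes.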
  • affinities of some peptides are dependent on ionic strength or cation concentration. This is a useful characteristic for peptides that will be used in affinity purification of various proteins when gentle conditions for removing the protein from the peptides are required.
  • One variation involves the use of multiple binding targets (multiple epitope species, multiple receptor species), such that a scfv library can be simultaneously screened for a multiplicity of scfv which have different binding specificities. Given that the size of a scfv library often limits the diversity of potential scfv sequences, it is typically desirable to use scfv libraries of as large a size as possible.
  • multiple predetermined epitope species can be concomitantly screened in a single library, or sequential screening against a number of epitope species can be used.
  • multiple target epitope species, each encoded on a separate bead (or subset of beads), can be mixed and incubated with a polysome-display scfv library under suitable binding conditions.
  • the collection of beads, comprising multiple epitope species, can then be used to isolate, by affinity selection, scfv library members.
  • subsequent affinity screening rounds can include the same mixture of beads, subsets thereof, or beads containing only one or two individual epitope species.
  • This approach affords efficient screening, and is compatible with laboratory automation, batch processing, and high throughput screening methods.
  • Expression systems will typically include an expression control DNA sequence operably linked to the coding sequences, including naturally-associated or heterologous promoter regions.
  • the expression control sequences can be eukaryotic promoter systems in vectors capable of transforming or transfecting eukaryotic host cells. Once the vector has been incorporated into the appropriate host, the host is maintained under conditions suitable for high level expression of the nucleotide sequences, and the collection and purification of the mutant "engineered" antibodies.
  • the DNA sequences will be expressed in hosts after the sequences have been operably linked to an expression control sequence (i.e., positioned to ensure the transcription and translation of the structural gene).
  • These expression vectors are typically replicable in the host organisms either as episomes or as an integral part of the host chromosomal DNA.
  • expression vectors will contain selection markers, e.g., tetracycline or neomycin, to permit detection of those cells transformed with the desired DNA sequences (see, e.g., USPN 4,704,362).
  • mammalian tissue cell culture may also be used to produce the polypeptides of the present invention (see Winnacker, 1987, which is incorporated herein by reference).
  • Eukaryotic cells can be used because a number of suitable host cell lines capable of secreting intact immunoglobulins have been developed in the art, and include the CHO cell lines, various COS cell lines, HeLa cells, and myeloma cell lines, or transformed B cells or hybridomas.
  • Expression vectors for these cells can include expression control sequences, such as an origin of replication, a promoter, an enhancer (Queen et al, 1986), and necessary processing information sites, such as ribosome binding sites, RNA splice sites, polyadenylation sites, and transcriptional terminator sequences.
  • Expression control sequences can be promoters derived from immunoglobulin genes, cytomegalovirus, SV40, Adenovirus, Bovine Papilloma Virus, and the like.
  • Eukaryotic DNA transcription can be increased by inserting an enhancer sequence into the vector.
  • Enhancers are cis-acting sequences of between 10 and 300 bp that increase transcription by a promoter. Enhancers can effectively increase transcription when either 5' or 3' to the transcription unit.
  • viral enhancers are used, including SV40 enhancers, cytomegalovirus enhancers, polyoma enhancers, and adenovirus enhancers. Enhancer sequences from mammalian systems are also commonly used, such as the mouse immunoglobulin heavy chain enhancer. Mammalian expression vector systems will also typically include a selectable marker gene. Examples of suitable markers include the dihydrofolate reductase gene (DHFR), the thymidine kinase gene (TK), or prokaryotic genes conferring drug resistance. The first two marker genes can use mutant cell lines that lack the ability to grow without the addition of thymidine to the growth medium.
  • Transformed cells can then be identified by their ability to grow on non-supplemented media.
  • prokaryotic drug resistance genes useful as markers include genes conferring resistance to G418, mycophenolic acid and hygromycin.
  • the vectors containing the DNA segments of interest can be transferred into the host cell by well-known methods, depending on the type of cellular host. For example, calcium chloride transfection is commonly utilized for prokaryotic cells, whereas calcium phosphate treatment, lipofection, or electroporation may be used for other cellular hosts. Other methods used to transform mammalian cells include the use of Polybrene, protoplast fusion, liposomes, electroporation, and micro-injection (see, generally, Sambrook et al, 1982 and 1989).
  • the antibodies, individual mutated immunoglobulin chains, mutated antibody fragments, and other immunoglobulin polypeptides of the invention can be purified according to standard procedures of the art, including ammonium sulfate precipitation, fraction column chromatography, gel electrophoresis and the like; see, e.g., Scopes, 1982.
  • the polypeptides may then be used therapeutically or in developing and performing assay procedures, immunofluorescent stainings, and the like (see, generally, Lefkovits and Pernis, 1979 and 1981; Lefkovits, 1997).
  • This invention provides a two-hybrid screening system to identify library members which bind a predetermined polypeptide sequence.
  • the selected library members are pooled and shuffled by in vitro and/or in vivo recombination.
  • the shuffled pool can then be screened in a yeast two hybrid system to select library members which bind said predetermined polypeptide sequence (e.g., an SH2 domain) or which bind an alternate predetermined polypeptide sequence (e.g., an SH2 domain from another protein species).
  • Polynucleotides encoding two hybrid proteins, one consisting of the yeast Gal4 DNA-binding domain fused to a polypeptide sequence of a known protein and the other consisting of the Gal4 activation domain fused to a polypeptide sequence of a second protein, are constructed and introduced into a yeast host cell. Intermolecular binding between the two fusion proteins reconstitutes the Gal4 DNA-binding domain with the Gal4 activation domain, which leads to the transcriptional activation of a reporter gene (e.g., lacZ, HIS3) which is operably linked to a Gal4 binding site.
  • the two-hybrid method is used to identify novel polypeptide sequences which interact with a known protein (Silver and Hunt, 1993; Durfee et al, 1993; Yang et al, 1992; Luban et al, 1993; Hardy et al, 1992; Bartel et al, 1993; and Vojtek et al, 1993).
  • variations of the two-hybrid method have been used to identify mutations of a known protein that affect its binding to a second known protein (Li and Fields, 1993; Lalo et al, 1993; Jackson et al, 1993; and Madura et al, 1993).
  • Two-hybrid systems have also been used to identify interacting structural domains of two known proteins (Bardwell et al, 1993; Chakrabarty et al, 1992; Staudinger et al, 1993; and Milne and Weaver 1993) or domains responsible for oligomerization of a single protein (Iwabuchi et al, 1993; Bogerd et al, 1993). Variations of two-hybrid systems have been used to study the in vivo activity of a proteolytic enzyme (Dasmahapatra et al, 1992).
  • Alternatively, an E. coli BCCP interactive screening system (Germino et al, 1993; Guarente, 1993) can be used to identify interacting protein sequences (i.e., protein sequences which heterodimerize or form higher order heteromultimers). Sequences selected by a two-hybrid system can be pooled and shuffled and introduced into a two-hybrid system for one or more subsequent rounds of screening to identify polypeptide sequences which bind to the hybrid containing the predetermined binding sequence. The sequences thus identified can be compared to identify consensus sequence(s) and consensus sequence kernels.
  • Improved methods for cellular engineering, protein expression profiling, differential labeling of peptides.
  • the invention relates to peptide chemistry, proteomics, and mass spectrometry technology.
  • the invention provides novel methods for determining polypeptide profiles and protein expression variations, as with proteome analyses.
  • the present invention provides methods of simultaneously identifying and quantifying individual proteins in complex protein mixtures by selective differential labeling of amino acid residues followed by chromatographic and mass spectrographic analysis.
  • the diagnosis and treatment of a variety of diseases and disorders, as well as the determination of predisposition to them, may often be accomplished through identification and quantitative measurement of polypeptide expression variations between different cell types and cell states.
  • Biochemical pathways and metabolic networks can also be analyzed by globally and quantitatively measuring protein expression in various cell types and biological states (see, e.g., Ideker (2001) Science 292:929-934).
  • State-of-the-art techniques such as liquid-chromatography-electrospray-ionization tandem mass spectrometry have, in conjunction with database-searching computer algorithms, revolutionized the analysis of biochemical species from complex biological mixtures. With these techniques, it is now possible to perform high-throughput protein identification at picomolar to subpicomolar levels from complex mixtures of biological molecules (see, e.g., Dongre (1997) Trends Biotechnol. 15:418-425).
  • ICATs isotope-coded affinity tags
  • tandem mass spectrometry or ion trap mass spectrometry or a combination thereof. The method labels multiple cysteinyl residues and uses stable isotope dilution techniques. For example, Gygi (1999) Nat. Biotechnol. 10:994-999, compared protein expression in a yeast using ethanol or galactose as a carbon source. The measured differences in protein expression correlated with known yeast metabolic function under glucose-repressed conditions.
  • two different protein mixtures for quantitative comparison are digested to peptide mixtures, the peptide mixtures are separately methylated using either d0- or d3-methanol, and the mixtures of methylated peptides are combined and subjected to microcapillary HPLC-MS/MS (see, e.g., Goodlett, D. R., et al., (2000) "Differential stable isotope labeling of peptides for quantitation and de novo sequence derivation," 49th ASMS; Zhou, H; Watts, JD; Aebersold, R. A systematic approach to the analysis of protein phosphorylation.; Comment In: Nat Biotechnol.
  • Parent proteins of methylated peptides are identified by correlative database searching of fragment ion spectra using computer program assisted paradigms or automated de novo sequencing that compares all tandem mass spectra of d0- and d3-methylated peptide ion pairs. In Goodlett (2000) supra, ratios of proteins in two different mixtures were calculated for d0- to d3-methylated peptide pairs.
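The d0/d3 pairing described above can be sketched in a few lines. This is an illustrative sketch only, not the procedure used in the cited work: it assumes complete methyl esterification of the C-terminus and of every Asp and Glu side chain, uses standard monoisotopic masses for hydrogen and deuterium, and the peptide sequence, intensities, and function names are hypothetical.

```python
H_MASS = 1.00782503   # monoisotopic mass of 1H, in Da
D_MASS = 2.01410178   # monoisotopic mass of 2H (deuterium), in Da

# Mass added by one CD3 ester relative to one CH3 ester.
SHIFT_PER_METHYL = 3 * (D_MASS - H_MASS)

def carboxyl_sites(sequence: str) -> int:
    """Esterifiable carboxyl groups: the C-terminus plus Asp (D) and
    Glu (E) side chains. Assumes complete esterification."""
    return 1 + sequence.count("D") + sequence.count("E")

def pair_mz_spacing(sequence: str, charge: int) -> float:
    """Expected m/z gap between the d0 and d3 forms of a peptide ion."""
    return carboxyl_sites(sequence) * SHIFT_PER_METHYL / charge

def d0_to_d3_ratio(d0_intensity: float, d3_intensity: float) -> float:
    """Relative abundance of a peptide in the d0-labeled mixture
    versus the d3-labeled mixture."""
    return d0_intensity / d3_intensity

# Hypothetical peptide with one Asp and one Glu, observed as a 2+ ion.
print(pair_mz_spacing("LVDEK", 2))
print(d0_to_d3_ratio(2.0e5, 1.0e5))
```

In this sketch the spacing identifies which two ion signals belong to the same peptide, and the intensity ratio of that pair estimates the relative amount of the parent protein in the two mixtures.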
  • differential labeling reagents which relied on stable isotopes, which are expensive, and not flexible to differential labeling of more than two mixtures of peptides
  • labeling methods limited only to methylation of carboxy-termini
  • protein expression profiling limited to duplex comparison
  • one dimensional capillary HPLC chromatography was employed to separate peptides, which does not have enough capacity and resolving power for complex mixtures of peptides.
  • this invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differential labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the
  • the sample of step (a) comprises a cell or a cell extract.
  • the method can further comprise providing two or more samples comprising a polypeptide.
  • One or more of the samples can be derived from a wild type cell and one sample can be derived from an abnormal or a modified cell.
  • the abnormal cell can be a cancer cell.
  • the modified cell can be a cell that is mutagenized and/or treated with a chemical, a physiological factor, or the presence of another organism (including, e.g., a eukaryotic organism, prokaryotic organism, virus, vector, prion, or part thereof), and/or exposed to an environmental factor or change or physical force (including, e.g., sound, light, heat, sonication, and radiation).
  • the modification can be genetic change (including, for example, a change in DNA or RNA sequence or content) or otherwise.
  • the method further comprises purifying or fractionating the polypeptide before the fragmenting of step (c).
  • the method can further comprise purifying or fractionating the polypeptide before the labeling of step (d).
  • the method can further comprise purifying or fractionating the labeled peptide before the chromatography of step (e).
  • the purifying or fractionating comprises a method selected from the group consisting of size exclusion chromatography, HPLC, reverse phase HPLC and affinity purification.
  • the method further comprises contacting the polypeptide with a labeling reagent of step (b) before the fragmenting of step (c).
  • the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: $Z^A$OH and $Z^B$OH, to esterify peptide C-terminals and/or Glu and Asp side chains; $Z^A$NH$_2$ and $Z^B$NH$_2$, to form amide bond with peptide C-terminals and/or Glu and Asp side chains; and $Z^A$CO$_2$H and $Z^B$CO$_2$H.
  • $Z^A$ and $Z^B$ independently of one another comprise the general formula R-$Z^1$-$A^1$-$Z^2$-$A^2$-$Z^3$-$A^3$-$Z^4$-$A^4$-; $Z^1$, $Z^2$, $Z^3$, and $Z^4$, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR$^1$, S, SC(O), SC(S), SS, S(O), S(O$_2$), NR, NRR$^1$$^+$, C(O), C(O)O,
  • $R^1$ is an alkyl group
  • $A^1$, $A^2$, $A^3$, and $A^4$, independently of one another, are selected from the group consisting of nothing or (CRR$^1$)$_n$, wherein R, $R^1$, independently from other R and $R^1$ in $Z^1$ to $Z^4$ and independently from other R and $R^1$ in $A^1$ to $A^4$, are selected from the group consisting of a hydrogen atom, a halogen atom and an alkyl group; "n" in $Z^1$ to $Z^4$, independent of n in $A^1$ to $A^4$, is an integer having a value selected from the group consisting of
  • the alkyl group (see definition below) is selected from the group consisting of an alkenyl, an alkynyl and an aryl group.
  • One or more C-C bonds from (CRR$^1$)$_n$ can be replaced with a double or a triple bond; thus, in alternative aspects, an R or an $R^1$ group is deleted.
  • the (CRR$^1$)$_n$ can be selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein each group has none or up to 6 substituents.
  • the (CRR$^1$)$_n$ itself can be selected from the group consisting of a carbocyclic, a bicyclic and a tricyclic fragment, wherein the fragment has up to 8 atoms in the cycle with or without a heteroatom selected from the group consisting of an O atom, a N atom and an S atom.
  • two or more labeling reagents have the same structure but a different isotope composition.
  • $Z^A$ has the same structure as $Z^B$
  • $Z^A$ has a different isotope composition than $Z^B$.
  • the isotope is boron-10 and boron-11; carbon-12 and carbon-13; nitrogen-14 and nitrogen-15; and, sulfur-32 and sulfur-34.
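The mass difference that each of these isotope pairs contributes can be worked out directly. The sketch below is illustrative only: the monoisotopic masses are standard reference values rounded to six decimals, and the function name `label_mass_shift` and the example atom counts are hypothetical rather than taken from the specification.

```python
# (light isotope mass, heavy isotope mass) in Da, rounded standard values.
ISOTOPE_PAIRS = {
    "B": (10.012937, 11.009305),   # boron-10 / boron-11
    "C": (12.000000, 13.003355),   # carbon-12 / carbon-13
    "N": (14.003074, 15.000109),   # nitrogen-14 / nitrogen-15
    "S": (31.972071, 33.967867),   # sulfur-32 / sulfur-34
}

def label_mass_shift(element: str, n_atoms: int) -> float:
    """Mass difference between two otherwise identical reagents when
    n_atoms of the given element are swapped from light to heavy."""
    light, heavy = ISOTOPE_PAIRS[element]
    return n_atoms * (heavy - light)

# Example: a reagent pair differing by six carbon-13 atoms.
print(label_mass_shift("C", 6))
# Example: a reagent pair differing by one sulfur-34 atom.
print(label_mass_shift("S", 1))
```

Because the two reagents are chemically identical, only this mass shift separates the labeled peptide pair in the mass spectrum.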
  • x is greater than y.
  • x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51.
  • the labeling reagent of step (b) can comprise the general formulae selected from the group consisting of: $Z^A$OH and $Z^B$OH to esterify peptide C-terminals; $Z^A$NH$_2$ / $Z^B$NH$_2$ to form an amide bond with peptide C-terminals; and, $Z^A$CO$_2$H / $Z^B$CO$_2$H to form an amide bond with peptide N-terminals; wherein $Z^A$ and $Z^B$ have the general formula R-$Z^1$-$A^1$-$Z^2$-$A^2$-$Z^3$-$A^3$-$Z^4$-$A^4$-; $Z^1$, $Z^2$, $Z^3$, and $Z^4$, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR$^1$, S, SC(O), SC(S), SS, S(O), S(O$_2$), NR, NRR
  • a single C-C bond in a (CRR$^1$)$_n$ group is replaced with a double or a triple bond; thus, the R and $R^1$ can be absent.
  • the (CRR$^1$)$_n$ can comprise a moiety selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein the group has none or up to 6 substituents.
  • the group can comprise a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom.
  • R, $R^1$, independently from other R and $R^1$ in $Z^1$ - $Z^4$ and independently from other R and $R^1$ in $A^1$ - $A^4$, are selected from the group consisting of a hydrogen atom, a halogen and an alkyl group.
  • the alkyl group (see definition below) can be an alkenyl, an alkynyl or an aryl group.
  • the "n" in $Z^1$ - $Z^4$ is independent of n in $A^1$ - $A^4$ and is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11 and about 6.
  • $Z^A$ has the same structure as $Z^B$ but $Z^A$ further comprises x number of -CH$_2$- fragment(s) in one or more $A^1$ - $A^4$ fragments, wherein x is an integer. In one aspect, $Z^A$ has the same structure as $Z^B$ but $Z^A$ further comprises x number of -CF$_2$- fragment(s) in one or more $A^1$ - $A^4$ fragments, wherein x is an integer. In one aspect, $Z^A$ comprises x number of protons and $Z^B$ comprises y number of halogens in the place of protons, wherein x and y are integers.
  • $Z^A$ contains x number of protons and $Z^B$ contains y number of halogens, and there are x - y number of protons remaining in one or more $A^1$ - $A^4$ fragments, wherein x and y are integers.
  • $Z^A$ further comprises x number of -O- fragment(s) in one or more $A^1$ - $A^4$ fragments, wherein x is an integer.
  • Z further comprises x number of -S- fragment(s) in one or more $A^1$ - $A^4$ fragments, wherein x is an integer.
  • $Z^A$ further comprises x number of -O- fragment(s) and $Z^B$ further comprises y number of -S- fragment(s) in the place of -O- fragment(s), wherein x and y are integers.
  • $Z^A$ further comprises x - y number of -O- fragment(s) in one or more $A^1$ - $A^4$ fragments, wherein x and y are integers.
  • x and y are integers selected from the group consisting of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21, between 1 and about 11 and between 1 and about 6, wherein x is greater than y.
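The structural differences listed above (extra -CH2- or -CF2- fragments, halogens in place of protons, -S- in place of -O-) each translate into a fixed mass increment per unit. The following is a minimal sketch under stated assumptions: atomic masses are standard monoisotopic values rounded to six decimals, fluorine is taken as one possible halogen purely for illustration, and the names `UNIT_SHIFT` and `multiplex_offsets` are hypothetical.

```python
# Monoisotopic atomic masses in Da (standard values, rounded).
H, C, F, O, S = 1.007825, 12.000000, 18.998403, 15.994915, 31.972071

# Mass added per unit of each structural difference between two reagents.
UNIT_SHIFT = {
    "CH2": C + 2 * H,    # one extra -CH2- fragment
    "CF2": C + 2 * F,    # one extra -CF2- fragment
    "H_to_F": F - H,     # one proton replaced by fluorine (assumed halogen)
    "O_to_S": S - O,     # one -O- fragment replaced by -S-
}

def multiplex_offsets(kind: str, n_reagents: int) -> list[float]:
    """Mass offsets, relative to the lightest reagent, for a ladder of
    reagents carrying 0, 1, 2, ... extra units of the given kind."""
    return [round(i * UNIT_SHIFT[kind], 6) for i in range(n_reagents)]

# Example: four reagents differing stepwise by one -CH2- fragment.
print(multiplex_offsets("CH2", 4))
```

A ladder like this shows how more than two samples could be distinguished by mass in a single run. Whether reagents that differ structurally still co-elute would need to be confirmed experimentally; the sketch addresses mass only.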
  • n, m and y are integers selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51.
  • the separating of step (e) comprises a liquid chromatography system, such as a multidimensional liquid chromatography or a capillary chromatography system.
  • the mass spectrometer comprises a tandem mass spectrometry device or an ion trap mass spectrometer or a combination thereof.
  • the method further comprises quantifying the amount of each polypeptide or each peptide.
  • the invention provides a method for defining the expressed proteins associated with a given cellular state, the method comprising the following steps: (a) providing a sample comprising a cell in the desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differential labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cell into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c
  • the invention provides a method for quantifying changes in protein expression between at least two cellular states, the method comprising the following steps: (a) providing at least two samples comprising cells in a desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differential labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cells into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation;
  • step (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents, wherein the labels used in one sample are different from the labels used in other samples; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step
  • the invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by multidimensional liquid chromatography to generate an eluate; (f) feeding the eluate of step (e) into a tandem mass spectrometer or an ion trap mass spectrometer or a
  • the invention provides a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope.
  • the isotope(s) can be in the first domain or the second domain.
  • the isotope(s) can be in the biotin.
  • the isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope.
  • the chimeric labeling reagent can comprise two or more isotopes.
  • the chimeric labeling reagent reactive group capable of covalently binding to an amino acid can be a succinimide group, an isothiocyanate group or an isocyanate group.
  • the reactive group capable of covalently binding to an amino acid can bind to a lysine or a cysteine.
  • the chimeric labeling reagent can further comprise a linker moiety linking the biotin group and the reactive group.
  • the linker moiety can comprise at least one isotope.
  • the linker is a cleavable moiety that can be cleaved by, e.g., enzymatic digest or by reduction.
  • the invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the small molecule tags are structurally identical but differ in their isotope composition, and the small molecules comprise reactive groups that covalently bind to cysteine or lysine residues or both; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) determining the protein concentrations of each sample in a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof; and, (e) comparing relative protein concentrations of each sample.
  • the sample comprises a complete or a fractionated cellular sample.
  • the differential small molecule tags comprise a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and, (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope.
  • the isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope.
  • the chimeric labeling reagent can comprise two or more isotopes.
  • the reactive group capable of covalently binding to an amino acid can be selected from the group consisting of a succinimide group, an isothiocyanate group and an isocyanate group.
  • the invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the differential small molecule tags comprise a chimeric labeling reagent comprising (i) a first domain comprising a biotin; and, (ii) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) isolating the tagged polypeptides on a biotin-binding column by binding tagged polypeptides to the column, washing non-bound materials off the column, and eluting tagged polypeptides off the column; (e) determining the protein concentrations of each sample in a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof;
  • the invention provides methods for simultaneously identifying individual proteins in complex mixtures of biological molecules and quantifying the expression levels of those proteins, e.g., proteome analyses.
  • the methods compare two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation.
  • the proteins in the standard and investigated samples are subjected separately to a series of chemical modifications, i.e., differential chemical labeling, and fragmentation, e.g., by proteolytic digestion and/or other enzymatic reactions or physical fragmenting methodologies.
  • the chemical modifications can be done before, or after, or before and after fragmentation/ digestion of the polypeptide into peptides.
  • Peptides derived from the standard and the investigated samples are labeled with chemical residues of different mass, but of similar properties, such that peptides with the same sequence from both samples are eluted together in the separation procedure and their ionization and detection properties regarding the mass spectrometry are very similar.
  • Differential chemical labeling can be performed on reactive functional groups on some or all of the carboxy- and/or amino- termini of proteins and peptides and/or on selected amino acid side chains.
  • a combination of chemical labeling, proteolytic digestion and other enzymatic reaction steps, physical fragmentation and/or fractionation can provide access to a variety of residues to generate different specifically labeled peptides to enhance the overall selectivity of the procedure.
  • Mass spectrometry data is processed by special software, which allows for identification and quantification of peptides and proteins.
  • combined mixtures of peptides are first separated by a chromatography method, such as a multidimensional liquid chromatography system of the invention, before being fed into a coupled mass spectrometry device, such as a tandem mass spectrometry device or an ion trap mass spectrometer or a combination thereof.
  • the combination of multidimensional liquid chromatography and tandem mass spectrometry can be called "LC-LC-MS/MS.”
  • LC-LC-MS/MS was first developed by Link A. and Yates J.
  • Another exemplary system of the invention comprises the combination of multidimensional liquid chromatography and tandem mass spectrometry and an ion trap mass spectrometry, designated 3D LC LCQ MS/MS or 3D LC LTQ MS/MS, as described herein (e.g., comprising Finnigan MDLC LTQ™ or LTQ FT™, Thermo Electron Corporation, San Jose, CA, or Agilent's LC/MSD Trap (Agilent Technologies, Palo Alto, CA), or an equivalent mass spectrometer).
  • proteins can be first substantially or partially isolated from the biological samples of interest.
  • the polypeptides can be treated before selective differential labeling; for example, they can be denatured, reduced, preparations can be desalted, and the like. Conversion of samples of proteins into mixtures of differentially labeled peptides can include preliminary chemical and/or enzymatic modification of side groups and/or termini; proteolytic digestion or fragmentation; post-digestion or post-fragmentation chemical and/or enzymatic modification of side groups and/or termini. The differentially modified polypeptides and peptides are then combined into one or more peptide mixtures. Solvent or other reagents can be removed, neutralized or diluted, if desired or necessary.
  • the buffer can be modified, or, the peptides can be re-dissolved in one or more different buffers, such as a "MudPIT" (see below) loading buffer.
  • the peptide mixture is then loaded onto a chromatography column, such as a liquid chromatography column, a 2D capillary column or a multidimensional chromatography column, to generate an eluate.
  • the eluate is fed into a mass spectrometer, such as a tandem mass spectrometer, an ion trap mass spectrometer (LCQ or LTQ) or a combination thereof.
  • data output is processed by appropriate software using database searching and data analysis.
  • high yields of peptides can be generated for mass spectrograph analysis.
  • Two or more samples can be differentially labeled by selective labeling of each sample.
  • Peptide modifications, i.e., labeling, are stable.
  • Reagents having differing masses or reactive groups can be chosen to maximize the number of reactive groups and differentially labeled samples, thus allowing for a multiplex analysis of sample, polypeptides and peptides.
  • a "MudPIT" protocol is used for peptide analysis, as described herein.
  • the methods of the invention can be fully automated and can essentially analyze every protein in a sample.
  • alkyl is used to refer to a genus of compounds including branched or unbranched, saturated or unsaturated, monovalent hydrocarbon radicals, including substituted derivatives and equivalents thereof.
  • the hydrocarbons have from about 1 to about 100 carbons, about 1 to about 50 carbons or about 1 to about 30 carbons, about 1 to about 20 carbons, about 1 to about 10 carbons.
  • When the alkyl group has from about 1 to 6 carbon atoms, it is referred to as a "lower alkyl."
  • Suitable alkyl radicals include, e.g., structures containing one or more methylene, methine and/or methyne groups arranged in acyclic and/or cyclic forms. Branched structures have a branching motif similar to isopropyl, tert-butyl, isobutyl, 2-ethylpropyl, etc.
  • substituted alkyl refers to alkyl as just described including one or more functional groups such as lower alkyl, aryl, acyl, halogen (i.e., alkylhalos, e.g., CF3), hydroxy, amino, alkoxy, alkylamino, acylamino, thioamido, acyloxy, aryloxy, arylamino, aryloxyalkyl, mercapto, thia, aza, oxo, both saturated and unsaturated cyclic hydrocarbons, heterocycles and the like. These groups may be attached to any carbon of the alkyl moiety. Additionally, these groups may be pendent from, or integral to, the alkyl chain.
  • alkoxy is used herein to refer to the -OR group, where R is a lower alkyl, substituted lower alkyl, aryl, substituted aryl, arylalkyl or substituted arylalkyl, wherein the alkyl, aryl, substituted aryl, arylalkyl and substituted arylalkyl groups are as described herein.
  • Suitable alkoxy radicals include, for example, methoxy, ethoxy, phenoxy, substituted phenoxy, benzyloxy, phenethyloxy, tert-butoxy, etc.
  • aryl is used herein to refer to an aromatic substituent that may be a single aromatic ring or multiple aromatic rings which are fused together, linked covalently, or linked to a common group such as a methylene or ethylene moiety.
  • the common linking group may also be a carbonyl as in benzophenone.
  • the aromatic ring(s) may include phenyl, naphthyl, biphenyl, diphenylmethyl and benzophenone among others.
  • aryl encompasses
  • substituted aryl refers to aryl as just described including one or more functional groups such as lower alkyl, acyl, halogen, alkylhalos (e.g., CF3), hydroxy, amino, alkoxy, alkylamino, acylamino, acyloxy, phenoxy, mercapto and both saturated and unsaturated cyclic hydrocarbons which are fused to the aromatic ring(s), linked covalently or linked to a common group such as a methylene or ethylene moiety.
  • the linking group may also be a carbonyl such as in cyclohexyl phenyl ketone.
  • substituted aryl encompasses “substituted arylalkyl.”
  • arylalkyl is used herein to refer to a subset of “aryl” in which the aryl group is further attached to an alkyl group, as defined herein.
  • biotin refers to any natural or synthetic biotin or variant thereof, which are well known in the art; ligands for biotin, and ways to modify the affinity of biotin for a ligand, are also well known in the art; see, e.g., U.S. Patent Nos.
  • labeling reagents which ... do not differ in ionization and detection properties in mass spectrographic analysis means that the amount and/or mass sequence of the labeling reagents can be detected using the same mass spectrographic conditions and detection devices.
  • polypeptide includes natural and synthetic polypeptides, or mimetics, which can be either entirely composed of synthetic, non-natural analogues of amino acids, or, they can be chimeric molecules of partly natural peptide amino acids and partly non-natural analogs of amino acids.
  • polypeptide as used herein includes proteins and peptides of all sizes.
  • sample as used herein includes any polypeptide-containing sample, including samples from natural sources, or, entirely synthetic samples.
  • column as used herein means any substrate surface, including beads, filaments, arrays, tubes and the like.
  • chromatographic retention properties as used herein means that two compositions have substantially, but not necessarily exactly, the same retention properties in a chromatograph, such as a liquid chromatograph. For example, two compositions do not differ in chromatographic retention properties if they elute together, i.e., they elute in what a skilled artisan would consider the same elution fraction.
  • proteins and peptides are subjected to a series of chemical modifications, i.e., differential chemical labeling.
  • the chemical modifications can be done before, or after, or before and after fragmentation/ digestion of the polypeptide into peptides.
  • Differential labeling reagents can differ in their isotope composition (i.e., isotopical reagents) or in their structural composition (i.e., homologous reagents), but only by a rather small fragment, a change that does not alter the properties stated above; i.e., the labeling reagents differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, and the differences in molecular mass are distinguishable by mass spectrographic analysis.
  • mixtures of polypeptides and/or peptides coming from the "standard" protein sample and the "investigated" protein sample(s) are labeled separately with differential reagents, or, one sample is labeled and the other sample remains unlabeled.
  • differential reagents differ in molecular mass, but do not differ in retention properties regarding the separation method used (e.g., chromatography) and the mass spectrometry methods used will not detect different ionization and detection properties.
  • differential reagents differ either in their isotope composition (i.e., they are isotopical reagents) or they differ structurally by a rather small fragment which change does not alter the properties stated above (i.e., they are homologous reagents).
  • Differential chemical labeling can include esterification of C-termini, amidation of C-termini and/or acylation of N-termini. Esterification targets C-termini of peptides and carboxylic acid groups in amino acid side chains. Amidation targets C-termini of peptides and carboxylic acid groups in amino acid side chains. Amidation may require protection of amine groups first.
  • reagents comprise the general formulae: Z A OH and Z B OH to esterify peptide C-terminals and/or Glu and Asp side chains; Z A NH 2 / Z B NH 2 to form an amide bond with peptide C-terminals and/or Glu and Asp side chains; or Z A CO 2 H / Z B CO 2 H to form an amide bond with peptide N-terminals and/or Lys and Arg side chains; wherein Z A and Z B independently of one another can be R-Z 1 -A 1 -Z 2 -A 2 -Z 3 -A 3 -Z 4 -A 4 - , and Z 1 , Z 2 , Z 3 , and Z 4 independently of one another can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR', S, SC(O), SC(S), SS, S(O), S(O 2 ), NR, NRR'+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR', (Si(RR')O)n, SnRR', Sn(RR')O, BR(OR'), BRR', B(OR)(OR'), OBR(OR'), OBRR', OB(OR)(OR'), or, Z 1 , Z 2 , Z 3 , and Z 4 independently of one another may be
  • some single C-C bonds from (CRR')n may be replaced with double or triple bonds, in which case some groups R and R 1 will be absent
  • (CRR')n can be an o-arylene, an m-arylene, or a p-arylene with up to 6 substituents, carbocyclic, bicyclic, or tricyclic fragments with up to 8 atoms in the cycle with or without heteroatoms (O, N, S) and with or without substituents, or A 1 , A 2 , A 3 , and A 4 independently of one another can be absent;
  • R, R 1 independently from other R and R 1 in Z 1 - Z 4 and independently from other R and R 1 in A 1 - A 4
  • n in Z 1 - Z 4 , independent of n in A 1 - A 4 , is an integer that can have a value from 0 to about 51
  • Z A contains x number of protons
  • Z B may contain y number of deuterons in the place of protons, and, correspondingly, x - y number of protons remaining
  • Z A contains x number of borons-10
  • Z B may contain y number of borons-11 in the place of borons-10, and, correspondingly, x - y number of borons-10 remaining
  • Z A contains x number of carbons-12
  • Z B may contain y number of carbons-13 in the place of carbons-12, and, correspondingly, x - y number of carbons-12 remaining
  • Z A contains x number of nitrogens-14
  • Z B may contain y number of nitrogens-15 in the place of nitrogens-14, and, correspondingly, x - y number of nitrogens-14 remaining
  • Z may contain y number of sulfurs-32
  • Z may contain y number of
  • x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51.
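The isotopic substitutions listed above translate directly into a mass difference between the Z A and Z B labels. The following is a minimal sketch of that arithmetic, not part of the original disclosure; the function name `isotope_mass_shift` and the table `ISOTOPE_MASS` are illustrative assumptions, and the isotope masses are standard monoisotopic values rounded to six decimals.

```python
# Monoisotopic masses in unified atomic mass units (u), rounded to 6 decimals.
# Only the isotope pairs named in the text are listed here.
ISOTOPE_MASS = {
    "1H": 1.007825, "2H": 2.014102,
    "10B": 10.012937, "11B": 11.009305,
    "12C": 12.000000, "13C": 13.003355,
    "14N": 14.003074, "15N": 15.000109,
}


def isotope_mass_shift(light: str, heavy: str, y: int) -> float:
    """Mass added to a label when y atoms of `light` are replaced by `heavy`.

    Hypothetical helper: `y` plays the role of the y in the text, i.e. the
    number of substituted positions in Z B relative to Z A.
    """
    if y < 0:
        raise ValueError("y must be non-negative")
    return y * (ISOTOPE_MASS[heavy] - ISOTOPE_MASS[light])


if __name__ == "__main__":
    for light, heavy in [("1H", "2H"), ("10B", "11B"), ("12C", "13C"), ("14N", "15N")]:
        shift = isotope_mass_shift(light, heavy, 4)
        print(f"{light} -> {heavy}, y = 4: {shift:.4f} u")
```

Because each substitution changes the mass by roughly 1 u, the value of y largely sets how far apart the differentially labeled peptide signals appear in the mass spectrum.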
  • Z A OH and Z B OH to esterify peptide C-terminals;
  • Z A NH 2 / Z B NH 2 to form an amide bond with peptide C-terminals;
  • Z A CO 2 H / Z B CO 2 H to form an amide bond with peptide N-terminals;
  • Z A and Z B can be R-Z 1 -A 1 -Z 2 -A 2 -Z 3 -A 3 -Z 4 -A 4 - and Z 1 , Z 2 , Z 3 , and Z 4 , independently of one another, can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR 1 , S, SC(O), SC(S), SS, S(O), S(O 2 ), NR, NRR 1+ , C(O), C(O)O, C(S), C(S)O, C
  • single C-C bonds in some (CRR')n groups may be replaced with double or triple bonds, in which case some groups R and R' will be absent, or (CRR')n can be an o-arylene, an m-arylene, or a p-arylene with up to 6 substituents, or a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without heteroatoms (e.g., O, N or S atoms), or, with or without substituents, or, A 1 - A 4 independently of one another may be absent;
  • R, R' independently from other R and R' in Z 1 - Z 4 and independently from other R and R' in A 1 - A 4 , can be a hydrogen atom, a halogen or an alkyl group, such as an alkenyl, an alkynyl or an aryl group;
  • n in Z 1 - Z 4 is independent of n
  • Z A has a similar structure to that of Z B , but Z A has x extra -CH 2 - fragment(s) in one or more A 1 - A 4 fragments, and/or Z A has x extra -CF 2 - fragment(s) in one or more A 1 - A 4 fragments.
  • Z A can contain x number of protons and Z B may contain y number of halogens in the place of protons.
  • Z A contains x number of protons and Z B contains y number of halogens
  • x and y are integers that can have a value of between 1 and about 51; of between 1 and about 41; of between 1 and about 31; of between 1 and about 21; of between 1 and about 11; or of between 1 and about 6, such that x is greater than y.
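For the homologous (structurally differing) reagents, the mass difference comes from extra fragments or from halogen-for-proton replacement rather than from isotopes. The sketch below illustrates the arithmetic under stated assumptions: fluorine is used as the example halogen (the text does not single one out), and the function name and parameters are hypothetical.

```python
# Monoisotopic atomic masses (u), rounded to 6 decimals.
ATOM_MASS = {"H": 1.007825, "C": 12.000000, "F": 18.998403}

# Mass of one -CH2- and one -CF2- fragment.
CH2_MASS = ATOM_MASS["C"] + 2 * ATOM_MASS["H"]
CF2_MASS = ATOM_MASS["C"] + 2 * ATOM_MASS["F"]


def homologous_mass_shift(extra_ch2: int = 0, extra_cf2: int = 0, h_to_f: int = 0) -> float:
    """Mass difference between two homologous labels.

    extra_ch2 / extra_cf2: number of additional -CH2- / -CF2- fragments.
    h_to_f: number of protons replaced by fluorine (assumed example halogen).
    """
    if min(extra_ch2, extra_cf2, h_to_f) < 0:
        raise ValueError("counts must be non-negative")
    return (
        extra_ch2 * CH2_MASS
        + extra_cf2 * CF2_MASS
        + h_to_f * (ATOM_MASS["F"] - ATOM_MASS["H"])
    )


if __name__ == "__main__":
    print(f"two extra -CH2-: {homologous_mass_shift(extra_ch2=2):.4f} u")
    print(f"three H -> F:    {homologous_mass_shift(h_to_f=3):.4f} u")
```

Whether a given structural change also leaves chromatographic retention unchanged, as the text requires, cannot be read off the mass alone and would have to be checked experimentally.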
  • a liquid chromatography is used, e.g., a multidimensional liquid chromatography, such as the mixed bed multidimensional liquid chromatograph of the invention.
  • a chromatogram eluate is coupled to a mass spectrometer, such as a tandem mass spectrometry device (e.g., a "3D LC-LC-MS/MS" system of the invention, as described herein), or an ion trap mass spectrometer (e.g., 3D LC LCQ MS/MS or 3D LC LTQ MS/MS systems of the invention, as described herein), or a combination of LC-LTQ-MS/MS or LC-LCQ-MS/MS and LC-LC-MS/MS. Any variation and equivalent thereof can be used to separate and detect peptides.
  • LC-LC-MS/MS was first developed by Link A. and Yates J.
  • LC-LC-MS/MS is as described, e.g., in Link (1999) Nature Biotechnology 17:676-682; Link (1997) Electrophoresis 18:1314-1334.
  • the LC-LC-MS/MS technique is used; it is effective for complex peptide separation and it is easily automated.
  • LC-LC-MS/MS is commonly known by the acronym "MudPIT,” for "Multi-dimensional Protein Identification Technique.”
  • Variations and equivalents of LC-LC-MS/MS and LC-LCQ-MS/MS or LC-LTQ-MS/MS systems of the invention used in the methods of the invention include methodologies involving reverse phase columns coupled to either cation exchange columns (as described, e.g., by Opiteck (1997) Anal. Chem.
  • an LC-LC-MS/MS or LC-LCQ-MS/MS or LC-LTQ-MS/MS technique uses a mixed bed microcapillary column containing strong cation exchange (SCX) and reverse phase (RPC) resins.
  • Other exemplary alternatives include protein fractionation combined with one-dimensional LC-ESI MS/MS or peptide fractionation combined with MALDI MS/MS.
  • any protein fractionation method including size exclusion chromatography, ion exchange chromatography, reverse phase chromatography, or any of the possible affinity purifications, can be introduced prior to labeling and proteolysis. In some circumstances, use of several different methods may be necessary to identify all proteins or specific proteins in a sample.
  • both quantity and sequence identity of the protein from which the modified peptide originated is determined by a mass spectrometry device, such as a "multistage mass spectrometry" (MS), including 3D LC-LC-MS/MS or LC-LCQ-MS/MS or LC-LTQ-MS/MS systems of the invention, as described herein.
  • This can be achieved by the operation of the mass spectrometer in a dual mode in which it alternates in successive scans between measuring the relative quantities of peptides eluting from the capillary column and recording the sequence information of selected peptides.
  • Peptides are quantified by measuring in the MS mode the relative signal intensities for pairs or series of peptide ions of identical sequence that are tagged differentially, which therefore differ in mass by the mass differential encoded within the differential labeling reagents.
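The relative quantification described above can be sketched as a search for peak pairs whose m/z spacing equals the label mass differential divided by the charge state. This is an illustrative sketch only, not the software used with the invention; the `Peak` type, the function `find_label_pairs`, and the default tolerance are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float         # mass-to-charge ratio of the peptide ion
    intensity: float  # signal intensity measured in MS mode


def find_label_pairs(peaks, mass_delta, charge, tol=0.01):
    """Return (light, heavy, heavy/light ratio) for differentially labeled pairs.

    Two peaks are paired when their m/z values differ by mass_delta / charge
    within `tol`. `mass_delta` is the mass differential encoded by the labels.
    """
    spacing = mass_delta / charge
    ordered = sorted(peaks, key=lambda p: p.mz)
    pairs = []
    for i, light in enumerate(ordered):
        for heavy in ordered[i + 1:]:
            diff = heavy.mz - light.mz
            if diff > spacing + tol:
                break  # peaks are sorted, so later peaks are even further away
            if abs(diff - spacing) <= tol and light.intensity > 0:
                pairs.append((light, heavy, heavy.intensity / light.intensity))
    return pairs


if __name__ == "__main__":
    spectrum = [Peak(500.000, 1000.0), Peak(504.025, 2000.0), Peak(610.300, 500.0)]
    for light, heavy, ratio in find_label_pairs(spectrum, mass_delta=8.0502, charge=2):
        print(f"{light.mz:.3f} / {heavy.mz:.3f}: heavy/light = {ratio:.2f}")
```

A real implementation would also need to confirm that both members of a pair share the same peptide sequence and charge state, which this sketch takes as given.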
  • Peptide sequence information can be automatically generated by selecting peptide ions of a particular mass-to-charge (m/z) ratio for collision-induced dissociation (CID) in the mass spectrometer operating in the tandem MS mode, as described, e.g., by Link (1997) Electrophoresis 18:1314-1334; Gygi (1999) Nature Biotechnol. 17:994-999; Gygi (1999) Cell Biol. 19:1720-1730.
  • tandem mass spectra can be correlated to sequence databases to identify the protein from which the sequenced peptide originated.
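Correlating a tandem mass spectrum with a sequence database can be illustrated, in a greatly simplified form, by computing theoretical b- and y-ion m/z values for each candidate peptide and counting how many are observed. The sketch below is an assumption-laden illustration and is not the scoring used by any commercial search program; the function names, the 0.5 u tolerance and the simple match count are all hypothetical, and only singly charged fragments of unmodified peptides are considered.

```python
# Monoisotopic residue masses (u) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # mass of the charge-carrying proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z lists for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:
        running += m
        b_ions.append(running + PROTON)                  # N-terminal fragment
        y_ions.append(total - running + WATER + PROTON)  # complementary C-terminal fragment
    return b_ions, y_ions


def count_matches(observed, peptide, tol=0.5):
    """Count theoretical fragments that have an observed peak within `tol`."""
    b_ions, y_ions = fragment_ions(peptide)
    return sum(any(abs(o - t) <= tol for o in observed) for t in b_ions + y_ions)


def best_candidate(observed, candidates, tol=0.5):
    """Pick the candidate peptide explaining the most observed fragment peaks."""
    return max(candidates, key=lambda p: count_matches(observed, p, tol))


if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDE")
    print(best_candidate(b + y, ["GASPVTK", "PEPTIDE"]))
```

Once the best-matching peptide is found, the database entry containing that peptide identifies the protein of origin.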
  • Exemplary commercially available software includes TURBO SEQUEST™ by Thermo Finnigan, San Jose, CA; MASCOT™ by Matrix Science; and SONAR MS/MS™ by Proteometrics. Routine software modifications may be necessary for automated relative quantification.
  • Mass spectrometry devices can use mass spectrometry to identify and quantify differentially labeled peptides and polypeptides. Any mass spectrometry system can be used.
  • combined mixtures of peptides are separated by a chromatography system of the invention comprising multidimensional liquid chromatography coupled to tandem mass spectrometry, or, "LC-LC-MS/MS," see, e.g., Link (1999) Nature Biotechnology 17:676-682; Link (1997) Electrophoresis 18:1314-1334.
  • combined mixtures of peptides are separated by a chromatography method comprising a multidimensional liquid chromatography system of the invention coupled to a combination tandem mass spectrometry and an ion trap mass spectrometry device of the invention, or, LC-LCQ-MS/MS or LC-LTQ-MS/MS, as described herein.
  • Exemplary ion trap mass spectrometry devices that can be used in the systems and methods of the invention include, for example, the LCQ Deca XP™ electrospray ionization/ion trap mass spectrometer, including a Finnigan LCQ Deca XP™ or LCQ Deca XP MAX™, or MDLC LTQ™, from Thermo Electron Corporation, San Jose, CA, or Agilent's LC/MSD Trap (Agilent Technologies, Palo Alto, CA), or an equivalent mass spectrometer.
  • a sample can be introduced by direct infusion using a syringe pump, by flow injection using an injection valve and an LC pump, or by LC fitted with a column (LC/MS).
  • exemplary mass spectrometry devices also include those incorporating matrix-assisted laser desorption-ionization-time-of-flight (MALDI-TOF) mass spectrometry (see, e.g., Isola (2001) Anal. Chem. 73:2126-2131; Van de Water (2000) Methods Mol. Biol. 146:453-459; Griffin (2000) Trends Biotechnol. 18:77-84; Ross (2000) Biotechniques 29:620-626, 628-629).
  • The inherent high molecular weight resolution of MALDI-TOF MS conveys high specificity and good signal-to-noise ratio for performing accurate quantitation.
  • Use of mass spectrometry, including MALDI-TOF MS, and its use in detecting nucleic acid hybridization and in nucleic acid sequencing, is well known in the art, see, e.g., U.S. Patent Nos. 6,258,538; 6,238,871; 6,238,869; 6,235,478; 6,232,066; 6,228,654; 6,225,450; 6,051,378; 6,043,031.
  • polypeptides can be fragmented, e.g., by proteolytic, i.e., enzymatic, digestion and/or other enzymatic reactions or physical fragmenting methodologies.
  • the fragmentation can be done before and/or after reacting the peptides/polypeptides with the labeling reagents used in the methods of the invention.
  • Methods for proteolytic cleavage of polypeptides are well known in the art, e.g., enzymes include trypsin (see, e.g., U.S. Patent No. 6,177,268; 4,973,554), chymotrypsin (see, e.g., U.S. Patent No.
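As an illustration of proteolytic fragmentation, the sketch below performs an in silico tryptic digest using the commonly cited simplification that trypsin cleaves on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P). This rule and the function name `tryptic_digest` are assumptions introduced for the example; the text itself names trypsin only as one suitable enzyme.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides from a simplified in silico tryptic digest.

    Cleaves after K or R, except when followed by P. `missed_cleavages`
    additionally joins up to that many adjacent fragments.
    """
    sequence = sequence.upper()
    if not sequence:
        return []

    # First pass: split the sequence at every allowed cleavage site.
    fragments, start = [], 0
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    fragments.append(sequence[start:])

    # Second pass: emit each fragment plus any allowed missed-cleavage products.
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("AKRPGRTK"))
    print(tryptic_digest("AKRPGRTK", missed_cleavages=1))
```

Real digests are less clean than this rule suggests, so allowing one or two missed cleavages is a common way to account for incomplete proteolysis.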
  • a chimeric labeling reagent of the invention includes a cleavable linker.
  • cleavable linker sequences include, e.g., Factor Xa or enterokinase (Invitrogen, San Diego CA).
  • purification facilitating domains can be used, such as metal chelating peptides, e.g., polyhistidine tracts and histidine-tryptophan modules that allow purification on immobilized metals, protein A domains that allow purification on immobilized immunoglobulin, and the domain utilized in the FLAGS extension/affinity purification system (Immunex Corp, Seattle WA).
  • the invention provides a method for quantifying changes in protein expression between at least two cellular states, such as, an activated cell versus a resting cell, a normal cell versus a cancerous cell, a stem cell versus a differentiated cell, an injured cell or infected cell versus an uninjured cell or uninfected cell; or, for defining the expressed proteins associated with a given cellular state.
  • Sample can be derived from any biological source, including cells from, e.g., bacteria, insects, yeast, mammals and the like.
  • the proteome of the Bacillus anthracis microbe is analyzed using the mixed bed multi-dimensional liquid chromatographs and methods of the invention.
  • Cells can be harvested from any body fluid or tissue source, or, they can be in vitro cell lines or cell cultures.
  • Detection Devices and Methods can also incorporate in whole or in part designs of detection devices as described, e.g., in U.S. Patent Nos.
  • Protein expression profiling using selective differential labeling
  • The use of mass spectrometry to identify proteins whose sequences are present in either DNA or protein databases is well established and integral to the field of Proteomics. Protein and peptide mass can be determined at high accuracy by several mass spectrometric techniques. Peptides can be further fragmented in a tandem or ion trap mass spectrometer, yielding sequence information for the peptide. Both types of mass information can be used to identify a protein in a sequence database.
  • One goal of Proteomics is to define the expressed proteins associated with a given cellular state and another is to quantify changes in protein expression between cellular states.
  • ICAT isotope-coded affinity tag
  • the method is based on a newly synthesized class of chemical reagents (ICATs) used in combination with tandem mass spectrometry.
  • the ICAT reagent contains a biotin affinity tag and a thiol specific reactive group, which are joined by a spacer domain that is available in two forms: regular and isotopically heavy, which includes eight deuterium atoms.
  • a reduced protein mixture representing one cell state is derivatized with the isotopically light version of the ICAT reagent, while the corresponding reduced protein mixture representing a second cell state is derivatized with the isotopically heavy version of the ICAT reagent.
  • the labeled samples are combined and proteolytically digested to produce peptide fragments.
  • the present invention provides a method for simultaneous identification and quantification of expression levels of individual proteins carrying certain functional groups in their side chains.
  • the proteins may be analyzed in complex mixtures.
  • the method is based on comparison of two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation.
  • the samples of proteins are subjected to a sequence of manipulations including (i) proteolytic digestion into mixtures of peptides, (ii) treatment of the mixtures of peptides with chemical probes, (iii) washing away and discarding the unbound peptides from the mixtures, (iv) cleaving the chemical probes and the consequential release of the peptides still carrying parts of the chemical probes into solution.
  • This sequence of manipulations may also include one or more auxiliary chemical and/or enzymatic modifications of functional groups in side chains and/or in the free termini of the proteins and/or peptides in order to achieve selective and the most favorable modification for the next steps in the protocol.
  • the auxiliary modifications may be performed between any steps of the main sequence.
  • the core structure of the chemical probe consists of (i) a solid support,
  • the chemical probes perform three functions: (i) they attach peptides carrying specific functional groups in their side chains and/or termini to a solid support by forming covalent chemical bonds to the reactive group of the probe, (ii) they provide means for selective cleavage of the attached peptide from the solid support such that a part of the probe still remains attached to the peptide, and (iii) they serve as differential labeling reagents.
  • Differential labeling results from the attachment of chemical moieties of different mass but of similar properties to a protein or a peptide, such that peptides with the same sequence but with different labels are eluted together in the separation procedure and their ionization and detection properties in mass spectrometric analysis are very similar.
  • the differential mass labeling unit remains covalently bound to the peptide after it is cleaved from the solid support part of the probe. Signals corresponding to peptides with the same sequence but marked with differential mass labels are assigned to different original protein samples.
  • the auxiliary chemical and/or enzymatic modification can be used to introduce additional differential mass labels into the peptides.
  • the reactive group on the chemical probe may be activated or modified by a bridging reagent prior to a reaction with mixtures of peptides. Such activation or modification provides for a greater flexibility in design of the chemical probe since the same core structure of a chemical probe may be tuned to increase reactivity and/or selectivity towards different functional groups in side chains and/or in termini of the peptides.
  • the differentially labeled peptide mixtures are combined, subjected to multidimensional chromatographic separation, and analyzed by mass spectrometry methods. Mass spectrometry data is processed by special software, which allows for determination and tracing the composition and sequence of peptides in the mixture to identification of the original proteins and their quantification.
  • This approach can be used for duplex or potentially multiplex protein expression profiling.
  • the complexity of the sample is reduced by targeting peptides containing particular amino acids, which are selected by a reaction with chemical probes.
  • Alternative aspects of this invention include: (i) design of solid phase-based differential mass labeling reagents for selective peptide modification; (ii) design of various kinds of differential mass units; (iii) combination of differential mass probes with various bridge reagents to target certain amino acids specifically; (iv) multiplex analysis; (v) combination of proteolytic digestion and chemical and/or enzymatic modifications in side chains and/or in termini of proteins and peptides in order to achieve selective and the most favorable modifications for the next steps in the protocol; (vi) combination of differential chemical labeling with MudPIT, and possibly all other protein or peptide separation or purification technologies if necessary.
  • One aspect of this invention provides reagents and procedures for quantification of protein expression using a combination of selective differential peptide labeling and the mixed bed multi-dimensional liquid chromatographs of the invention, e.g., 3D LC MS/MS, 3D LC-LC MS/MS or LC-LCQ-MS/MS or LC-LTQ-MS/MS systems of the invention, as described herein.
  • This invention overcomes the limitations inherent in traditional techniques.
  • the basic approach described can be employed for quantitative analysis of protein expression in complex samples (such as cells, tissues, fractions, etc.), the detection and quantitation of specific proteins in complex samples, and quantitative measurement of specific enzymatic activities in complex samples.
  • the solid support part of the chemical probe may consist of any of the following materials or any combination of them: gel, glass beads, magnetic beads, polymers, silicon wafer, membrane, or resin.
  • the spacer between the solid phase part and the cleavable unit of the chemical probe may be included for convenience and improved yields in synthetic preparation of the chemical probe.
  • the spacer may consist of a chain of 2 to 8 atoms, which can be C, O, N, B, Si, S, P, Se ..., covalently bound to each other.
  • the atoms may carry hydrogen atoms, halogens, or one of the following groups containing up to 25 atoms: alkyl, hydroxy, alkoxy, amino, alkylamino...
  • the spacer may contain cyclic moieties with or without heteroatoms and with or without substituents.
  • the cleavable moiety provides means for selective detachment of the solid phase part of the chemical probe from the differential mass label attached to the peptide. It is designed such that it can be cleaved by treating the probe with a chemical reagent or any kind of electromagnetic irradiation, photochemically, enzymatically, or thermally.
  • Differential mass labeling units differ in molecular mass, but do not differ in retention properties regarding the separation method used and in ionization and detection properties regarding the mass spectrometry methods used.
  • Z A can have the same structure as Z B , but they have different isotope composition. For instance, if Z A contains x number of protons, Z B may contain y number of deuterons in the place of protons, and, correspondingly, x - y number of protons remaining; and/or if Z A contains x number of borons-10, Z B may contain y number of borons-11 in the place of borons-10, and, correspondingly, x - y number of borons-10 remaining; and/or if Z A contains x number of carbons-12, Z B may contain y number of carbons-13 in the place of carbons-12, and, correspondingly, x - y number of carbons-12 remaining; and/or if Z A contains x number of nitrogens-14, Z B may contain y number of nitrogens-15 in the place of nitrogens-14, and, correspondingly, x - y number of nitrogens-14 remaining; and/or if Z A contains
  • Z A and Z B have the structure R-Z 1 -A 1 -Z 2 -A 2 -Z 3 -A 3 -Z 4 -A 4 -, in which Z 1 , Z 2 , Z 3 , and Z 4 independently of one another can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR 1 , S, SC(O), SC(S), SS, S(O), S(O 2 ), NR, NRR 1+ , C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR 1 , (Si(
  • OBRR 1 , OB(OR)(OR') or Z 1 - Z 4 may be absent; A 1 , A 2 , A 3 , and A 4 independently of one another can be selected from
  • R and R in which some single C-C bonds may be replaced with double or triple bonds, in which case some groups R and R will be absent, o-arylene, m-arylene, p-arylene with up to 6 substituents, carbocyclic, bicyclic, or tricyclic fragments with up to 8 atoms in the cycle with or without heteroatoms (O, N, S) and with or without substituents, or A 1 - A 4 may be absent; R, R 1 independently from other R and R' in Z 1 - Z 4 and independently from other R and R 1 in A 1 - A 4 is hydrogen, halogen, an alkyl, alkenyl, alkynyl, or aryl group; n in Z 1 - Z 4 is independent of n in A 1 - A 4 and is a whole number that can have value from 0 to 21.
  • Z A can have a similar structure to that of Z B , but Z A has x extra -CH 2 - fragment(s) in one or more A 1 - A 4 fragments, and/or Z A has x extra -CF 2 - fragment(s) in one or more A 1 - A 4 fragments; and/or if Z A contains x number of protons, Z B may contain y number of halogens in the place of protons, and, correspondingly, x - y number of protons remaining in one or more A 1 - A 4 fragments; and/or Z A has x extra -O- fragment(s) in one or more A 1 - A 4 fragments; and/or Z A has x extra -S- fragment(s) in one or more A 1 - A 4 fragments; and/or if Z A contains x number of -O- fragment(s), Z B may contain y number of -S- fragment(s) in the place of -O- fragment(s), and, correspondingly
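The isotope substitutions described for Z A and Z B translate into a predictable mass difference between the two labels. The sketch below is illustrative only: the function names are hypothetical, and the per-substitution mass differences are standard monoisotopic values (rounded) that are not given in the source text.

```python
# Monoisotopic mass differences (heavy minus light isotope), in daltons.
# Assumption: standard rounded atomic masses; these values are not from the source.
ISOTOPE_SHIFT_DA = {
    "H->D": 2.014102 - 1.007825,        # deuteron in place of proton
    "10B->11B": 11.009305 - 10.012937,  # boron-11 in place of boron-10
    "12C->13C": 13.003355 - 12.000000,  # carbon-13 in place of carbon-12
    "14N->15N": 15.000109 - 14.003074,  # nitrogen-15 in place of nitrogen-14
}


def label_mass_shift(substitutions):
    """Mass difference (Da) between Z B and Z A.

    `substitutions` maps a substitution kind to y, the number of atoms replaced.
    """
    return sum(ISOTOPE_SHIFT_DA[kind] * y for kind, y in substitutions.items())


def mz_spacing(mass_shift_da, charge):
    """Spacing between the light and heavy peaks on the m/z axis."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return mass_shift_da / charge


# Hypothetical example: four protons replaced by deuterons, doubly charged ion.
shift = label_mass_shift({"H->D": 4})
spacing = mz_spacing(shift, charge=2)
```

Because the peak spacing shrinks with charge state, the number of substituted atoms y would in practice be chosen large enough that the labeled pair remains resolvable at the charge states expected.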
  • Sequence analysis and quantification: Peptides are quantified by measuring in the MS mode the relative signal intensities for pairs or series of peptide ions of identical sequence that are tagged differentially, which therefore differ in mass by the mass differential encoded within the differential labeling reagents.
  • Peptide sequence information is automatically generated by selecting peptide ions of a particular mass-to-charge (m/z) ratio for collision-induced dissociation (CID) in the mass spectrometer operating in the tandem MS mode.
  • m/z mass-to-charge
  • CID collision-induced dissociation
  • the resulting tandem mass spectra can be correlated to sequence databases to identify the protein from which the sequenced peptide originated.
  • Currently, commercially available software packages include TurboSEQUEST™ by ThermoFinnigan, MASCOT™ by Matrix Science, and SONAR™ MS/MS by Proteometrics. Special software development will be necessary for automated relative quantification.
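As a rough illustration of the relative quantification described above, the following sketch pairs peaks whose m/z spacing matches the mass differential of the labels and reports the heavy-to-light intensity ratio. It is a simplified assumption-laden example, not the software the text says would need to be developed: the function name, the peak-list format, and the tolerance are all hypothetical, and real spectra would also require charge-state assignment and isotope-envelope handling.

```python
def pair_and_quantify(peaks, mass_shift_da, charge, tol=0.01):
    """Pair light/heavy peptide ions and compute heavy/light intensity ratios.

    peaks: list of (mz, intensity) tuples from an MS-mode scan (hypothetical format).
    mass_shift_da: mass differential encoded by the labeling reagents, in Da.
    charge: charge state assumed for both members of a pair.
    tol: allowed deviation (in m/z units) from the expected spacing.
    Returns a list of (light_mz, heavy_mz, ratio) tuples.
    """
    spacing = mass_shift_da / charge
    ordered = sorted(peaks)
    results = []
    for light_mz, light_int in ordered:
        if light_int <= 0:
            continue  # cannot form a ratio against an empty signal
        for heavy_mz, heavy_int in ordered:
            if abs((heavy_mz - light_mz) - spacing) <= tol:
                results.append((light_mz, heavy_mz, heavy_int / light_int))
    return results


# Hypothetical peak list: one labeled pair plus an unrelated ion.
example_peaks = [(500.00, 1000.0), (502.0126, 2000.0), (650.30, 500.0)]
pairs = pair_and_quantify(example_peaks, mass_shift_da=4.0251, charge=2)
```

A ratio above 1 in this sketch would indicate that the peptide is more abundant in the sample carrying the heavier label.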
  • Exemplary approaches for practicing the invention:
      1. Protein sample preparation, which may include protein denaturation, reduction, and proteolytic digestion
      2. Treatment of the probe with a desired activating or bridging reagent
      3. Treatment of the activated probe with a mixture of peptides
      4. Wash off unbound peptides, which don't have the targeted amino acid
      5. Combining modified differential labeled peptide mixture
      6.
  • Metabolomics and lipidomics: The invention also incorporates holistic monitoring approaches, metabolomics and lipidomics, including profiling metabolite pools, carbohydrates, lipids, glycoproteins, and glycolipids. Various chromatographic methods and other qualitative and/or quantitative methods could be utilized to characterize lipid profiles.
  • FACS fluorescence activated cell sorting
  • Desired products can be detected by incubating the encapsulated cells with fluorescent antibodies (Powell et al. Bio/Technology 8:333-337 (1990)). FACS sorting can also be used by this technique to assay resistance to toxic compounds and antibiotics by selecting droplets that contain multiple cells (i.e., the product of continued division in the presence of a cytotoxic compound; Goguen et al. Nature 363:189-190 (1995)). This method can select for any enzyme that can change the fluorescence of a substrate that can be immobilized in the agarose droplet. Reporter molecule: In some aspects of the invention, screening can be accomplished by assaying reactivity with a reporter molecule reactive with a desired feature of, for example, a gene product.
  • Cell-cell indicator: In other aspects of the invention, screening is done with a cell-cell indicator assay. In this assay format, separate library cells (Cell A, the cell being assayed) and reporter cells (Cell B, the assay cell) are used. Only one component of the system, the library cells, is allowed to evolve. The screening is generally carried out in a two-dimensional immobilized format, such as on plates. The products of the metabolic pathways encoded by these genes (in this case, usually secondary metabolites such as antibiotics, polyketides, carotenoids, etc.) diffuse out of the library cell to the reporter cell.
  • the product of the library cell may affect the reporter cell in one of a number of ways.
  • the assay system (indicator cell) can have a simple readout (e.g., green fluorescent protein, luciferase, beta-galactosidase) which is induced by the library cell product but which does not affect the library cell.
  • the desired product can be detected by colorimetric changes in the reporter cells adjacent to the library cell.
  • indicator cells can in turn produce something that modifies the growth rate of the library cells via a feedback mechanism.
  • Growth rate feedback can detect and accumulate very small differences. For example, if the library and reporter cells are competing for nutrients, library cells producing compounds to inhibit the growth of the reporter cells will have more available nutrients, and thus will have more opportunity for growth. This is a useful screen for antibiotics or a library of polyketide synthesis gene clusters where each of the library cells is expressing and exporting a different polyketide gene product.
  • the reporter cell for an antibiotic selection can itself secrete a toxin or antibiotic that inhibits growth of the library cell. Production by the library cell of an antibiotic that is able to suppress growth of the reporter cell will thus allow uninhibited growth of the library cell.
  • the library cell may supply nutrients such as amino acids to an auxotrophic reporter, or growth factors to a growth-factor-dependent reporter. The reporter cell in turn should produce a compound that stimulates the growth of the library cell. Interleukins, growth factors, and nutrients are possibilities.
  • Further possibilities include competition based on ability to kill surrounding cells, and positive feedback loops in which the desired product made by the evolved cell stimulates the indicator cell to produce a positive growth factor for cell A, thus indirectly selecting for increased product formation.
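The statement that growth rate feedback can detect and accumulate very small differences can be illustrated with a short calculation. The sketch below assumes a deliberately simple model that is not part of the source: two populations growing exponentially at constant specific growth rates with no carrying capacity. The function names and numeric values are hypothetical.

```python
import math


def enrichment_fold(rate_a, rate_b, hours):
    """Fold change in the ratio (population A / population B) after `hours`.

    Assumes pure exponential growth: N(t) = N0 * exp(rate * t), rates in 1/hour.
    """
    return math.exp((rate_a - rate_b) * hours)


def fraction_after(initial_fraction, rate_a, rate_b, hours):
    """Fraction of the mixed culture made up of population A after `hours`."""
    if not 0.0 < initial_fraction < 1.0:
        raise ValueError("initial_fraction must be strictly between 0 and 1")
    odds = initial_fraction / (1.0 - initial_fraction)
    odds *= enrichment_fold(rate_a, rate_b, hours)
    return odds / (1.0 + odds)


# Hypothetical numbers: a library cell growing about 1.4% faster than its neighbors.
fold = enrichment_fold(rate_a=0.70, rate_b=0.69, hours=100)
share = fraction_after(0.001, rate_a=0.70, rate_b=0.69, hours=100)
```

Under these assumptions the enrichment grows exponentially with time, so even a rate difference too small to see in a single measurement becomes apparent after enough generations.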
  • a different organism or genetic background
  • markers can be added to DNA constructs used for recursive sequence recombination to make the microorganism dependent on the constructs during the improvement process, even though those markers may be undesirable in the final recombinant microorganism.
  • Evnin et al. selected trypsin variants with altered substrate specificity by requiring that variant trypsin generate an essential amino acid for an arginine auxotroph by cleaving arginine beta-naphthylamide. This is thus a selection for arginine-specific trypsin, with the growth rate of the host being proportional to the enzyme activity.
  • the pool of cells surviving screening and/or selection is enriched for recombinant genes conferring the desired phenotype (e.g. altered substrate specificity, altered biosynthetic ability, etc.).
  • recombinant gene or pool of such genes surviving one round of screening/selection forms one or more of the substrates for a second round of recombination.
  • recombination can be performed in vivo or in vitro by any of the recursive sequence recombination formats described above. If recursive sequence recombination is performed in vitro, the recombinant gene or genes to form the substrate for recombination should be extracted from the cells in which screening/selection was performed. Optionally, a subsequence of such gene or genes can be excised for more targeted subsequent recombination.
  • If the recombinant gene(s) are contained within episomes, their isolation presents no difficulties. If the recombinant genes are chromosomally integrated, they can be isolated by amplification primed from known sequences flanking the regions in which recombination has occurred. Alternatively, whole genomic DNA can be isolated, optionally amplified, and used as the substrate for recombination. Small samples of genomic DNA can be amplified by whole genome amplification with degenerate primers (Barrett et al. Nucleic Acids Research 23:3488-3492 (1995)). These primers result in a large amount of random 3' ends, which can undergo homologous recombination when reintroduced into cells.
  • If the second round of recombination is to be performed in vivo, as is often the case, it can be performed in the cell surviving screening/selection, or the recombinant genes can be transferred to another cell type (e.g., a cell type having a high frequency of mutation and/or recombination).
  • recombination can be effected by introducing additional DNA segment(s) into cells bearing the recombinant genes.
  • the cells can be induced to exchange genetic information with each other by, for example, electroporation.
  • the second round of recombination is performed by dividing a pool of cells surviving screening/selection in the first round into two subpopulations.
  • DNA from one subpopulation is isolated and transfected into the other population, where the recombinant gene(s) from the two subpopulations recombine to form a further library of recombinant genes.
  • the second round of recombination is sometimes performed exclusively among the recombinant molecules surviving selection. However, in other aspects, additional substrates can be introduced.
  • the additional substrates can be of the same form as the substrates used in the first round of recombination, i.e., additional natural or induced mutants of the gene or cluster of genes, forming the substrates for the first round.
  • the additional substrate(s) in the second round of recombination can be exactly the same as the substrate(s) in the first round of recombination.
  • recombinant genes conferring the desired phenotype are again selected. The selection process proceeds essentially as before. If a suicide vector bearing a selective marker was used in the first round of selection, the same vector can be used again. Again, a cell or pool of cells surviving selection is selected. If a pool of cells, the cells can be subject to further enrichment.
  • Screening for various potential applications. Novel drugs: identifying targets
  • the invention relates to procedures that can be applied to identifying compounds that bind to and modulate the function of target components of a cell whose function is known or unknown, and cell components that are not amenable to other screening methods.
  • the invention relates to generating and/or identifying a compound that binds to and modulates (inhibits or enhances) the function of a component of a cell, thereby producing a phenotypic effect in the cell.
  • Such a screen may involve identifying a biomolecule that 1) binds to, in vitro, a component of a cell that has been isolated from other constituents of the cell and that 2) causes, in vivo, as seen in an assay upon intracellular expression of the biomolecule, a phenotypic effect in the cell which is the usual producer and host of the target cell component.
  • intracellular production of the biomolecule can be in cells grown in culture or in cells introduced into an animal.
  • the target cell component in this aspect and in other aspects not limited to pathogens can be one that is found in mammalian cells, especially cells of a type found to cause or contribute to disease or the symptoms of disease (e.g., cells of tumors or cells of other types of hyperproliferative disorders).
  • the invention provides a process for identifying one or more compounds that produce a phenotypic effect on a cell. The process is at the same time a method for target validation.
  • the process is characterized by identifying a biomolecule which binds an isolated target cell component, constructing cells comprising the target cell component and further comprising a gene encoding the biomolecular binder which can be expressed to produce the biomolecular binder, testing the constructed cells for their ability to produce, upon expression of the gene encoding the biomolecular binder, a phenotypic effect in the cells (e.g., inhibition of growth), wherein the test of the constructed cells can be a test of the cells in culture or a test of the cells after introducing them into host animals, or both, and further, identifying, for a biomolecular binder that caused the phenotypic effect, one or more compounds that compete with the biomolecular binder for binding to the target cell component.
  • a test of the constructed cells after introducing them into host animals is especially well-suited to assessing whether a biomolecular binder can produce a particular phenotype by the expression (regulatable by the researcher) of a gene encoding the biomolecular binder.
  • cells are constructed which have a gene encoding the biomolecular binder, and wherein the biomolecular binder can be produced by regulation of expression of the gene.
  • the constructed cells are introduced into a set of animals. Expression of the gene encoding the biomolecular binder is regulated in one group of the animals (test animals) such that the biomolecular binder is produced.
  • in the other group of animals, expression of the gene encoding the biomolecular binder is regulated such that the biomolecular binder is not produced (control animals).
  • the cells in the two groups of animals are monitored for a phenotypic change (for example, a change in growth rate). If the phenotypic change is observed in cells in the test animals and not in the cells in the control animals, or to a lesser extent in the control animals, then the biomolecular binder has been proven to be effective in binding to its target cell component under in vivo conditions.
  • a target cell component of a particular cell type (a "first cell") is essential to producing a phenotypic effect on the first cell
  • the method having the steps: isolating the target component of the first cell; identifying a biomolecular binder of the isolated target component of the first cell; constructing a second type of cells ("second cell") comprising the target component and a regulable, exogenous gene encoding the biomolecular binder; and testing the second cell in culture for an altered phenotypic effect, upon production of the biomolecular binder in the second cell; whereby, if the second cell shows the altered phenotypic effect upon production of the biomolecular binder, then the target component of the first cell is essential to producing the phenotypic effect on the first cell.
  • the target cell component in this aspect and in other aspects not limited to pathogens can be one that is found in mammalian cells, especially cells of a type found to cause or contribute to disease or the symptoms of disease (e.g., cells of tumors or cells of other types of hyperproliferative disorders).
  • One aspect of the invention is a method for identifying a biomolecular inhibitor of growth of pathogen cells by using cell culture techniques, comprising contacting one or more types of biomolecules with isolated target cell component of the pathogen, applying a means of detecting bound complexes of biomolecules and target cell component, whereby, if the bound complexes are detected, one or more types of biomolecules have been identified as a biomolecular binder of the target cell component, constructing a pathogen strain having a regulatable gene encoding the biomolecular binder, regulating expression of the gene encoding the biomolecular binder to express the gene; and monitoring growth of the pathogen cells in culture relative to suitable control cells, whereby, if growth of the pathogen cells is decreased compared to growth of suitable control cells, then the biomolecule is a biomolecular inhibitor of growth of the pathogen cells.
  • Identifying compounds that inhibit infection of a mammal by a pathogen: Another aspect of the invention is a method, employing an animal test, for identifying one or more compounds that inhibit infection of a mammal by a pathogen by binding to a target cell component, comprising constructing a pathogen comprising a regulatable gene encoding a biomolecule which binds to the target cell component, infecting test animals with the pathogen, regulating expression of the regulatable gene to produce the biomolecule, monitoring the test animals and suitable control animals for signs of infection, wherein observing fewer or less severe signs of infection in the test animals than in suitable control animals indicates that the biomolecule is a biomolecular inhibitor of infection, and identifying one or more compounds that compete with the biomolecular inhibitor for binding to the target cell component (as by employing a competitive binding assay), whereby a compound that so competes inhibits infection of a mammal by a pathogen by binding to a target.
  • the competitive binding assay to identify binding analogs of biomolecular binders which have been proven to bind to their targets in an intracellular test of binding, can be applied to any target for which a biomolecular binder has been identified, including targets whose function is unknown or targets for which other types of assays are not easily developed and performed. Therefore, the method of the invention offers the advantage of decreasing assay development time when using a gene product of known function as a target cell component and the advantage of bypassing the major hurdle of gene function identification when using a gene product of unknown function as a target cell component.
  • cells comprising a biomolecule and a target cell component, wherein the biomolecule is produced by expression of a regulable gene, and wherein the biomolecule modulates function of the target cell component, thereby causing a phenotypic change in the cells.
  • cells comprising a biomolecule and a target cell component, wherein the biomolecule is a biomolecular binder of the target cell component, and is encoded by a regulatable gene.
  • the cells can include mammalian cells or cells of a pathogen, for instance, and the phenotypic change can be a change in growth rate.
  • the pathogen can be a species of bacteria, yeast, fungus, or parasite, for example.
  • Intracellular validation of a biomolecule provides methods that result in the identification of compounds that cause a phenotypic effect on a cell.
  • the general steps described herein to find a compound for drug development can be thought of as these: (1) identifying a biomolecule that can bind to an isolated target cell component in vitro, (2) confirming that the biomolecule, when produced in cells with the target cell component, can cause a desired phenotypic effect and (3) identifying, by an in vitro screening method, for example, compounds that compete with the biomolecule for binding to the target cell component.
  • a biomolecule is a gene product (e.g., polypeptide, RNA, peptide or RNA oligonucleotide) of an exogenous gene — a gene which has been introduced in the course of construction of the cell. Biomolecules that bind to and alter the function of a candidate target are identified by various in vitro methods.
  • Upon production of the biomolecule within a cell, either in vitro or within an animal model system, the biomolecule binds to a specific site on the target, alters its intracellular function, and hence produces a phenotypic change (e.g. cessation of growth, cell death).
  • cessation of growth or death of the engineered pathogen cells leads to the clearing of infection and animal survival, demonstrating the importance of the target in infection and thereby validating the target.
  • a further aspect of this invention provides for (1) identifying a biomolecule that produces a phenotypic effect on a cell (wherein the cell can be, for instance, a pathogen cell or a mammalian cell) and (2) simultaneous intracellular target validation.
  • the invention includes methods for identifying compounds that inhibit the growth of cells having a target cell component.
  • the target cell component can first be identified as essential to the growth of the cells in culture and/or under conditions in which it is desired that the growth of the cells be inhibited. These methods can be applied, for example, to various types of cells that undergo abnormal or undesirable proliferation, including cells of neoplasms (tumors or growths, either benign or malignant) which, as known in the art, can originate from a variety of different cell types. Such cells can be referred to, for example, as being from adenomas, carcinomas, lymphomas or leukemias.
  • the method can also be applied to cells that proliferate abnormally in certain other diseases, such as arthritis, psoriasis or autoimmune diseases. If intracellular expression of the biomolecular binder inhibits the function of a target essential for growth (presumably by binding to the target at a biologically relevant site) cells monitored in step (2) will exhibit a slow growth or no growth phenotype. Targets found to be essential for growth by these methods are validated starting points for drug discovery, and can be incorporated into assays to identify more stable compounds that bind to the same site on the target as the biomolecule. Where the cells are pathogen cells and the desired phenotypic change to be monitored is inhibition of growth, the invention provides a procedure to examine the activity of target (pathogen) cell components in an animal infection model.
  • a target cell component a gene product of a particular cell type (e.g., a type of pathogenic bacteria), wherein the target cell component is already known as being encoded by a characterized gene, as a potential target for a modulator to be identified.
  • the target cell component can be isolated directly from the cell type of interest, assuming suitable culture methods are available to grow a sufficient number of cells, using methods appropriate to the type of cell component to be isolated (e.g., protein purification methods such as differential precipitation, ion exchange chromatography, gel chromatography, affinity chromatography, HPLC).
  • Alternatively, the target cell component can be produced recombinantly, which requires that the gene encoding the target cell component be isolated from the cell type of interest. This can be done by any number of methods, for example known methods such as PCR, using template DNA isolated from the pathogen or a DNA library produced from the pathogen DNA, and using primers based on known sequences or combinations of known and unknown sequences within or external to the chosen gene. See, for example, methods described in "The Polymerase Chain Reaction," Chapter 15 of Current Protocols in Molecular Biology, (Ausubel, F.M. et al., eds), John Wiley & Sons, New York, 1998.
  • Other methods include cloning a gene from a DNA library (e.g., a cDNA library from a eukaryotic pathogen) into a vector (e.g., plasmid, phage, phagemid, virus, etc.) and applying a means of selection or screening, to clones resulting from a transformation of vectors (including a population of vectors now having inserted genes) into appropriate host cells.
  • the screening method can take advantage of properties given to the host cells by the expression of the inserted chosen gene (e.g., detection of the gene product by antibodies directed against it, detection of an enzymatic activity of the gene product), or can detect the presence of the gene itself (for instance, by methods employing nucleic acid hybridization).
  • Target proteins can be expressed with E. coli or other prokaryotic gene expression systems, or in eukaryotic gene expression systems. Since many eukaryotic proteins carry unique modifications that are required for their activities, e.g. glycosylation and methylation, protein expression can in some cases be better carried out in eukaryotic systems, such as yeast, insect, or mammalian cells that can perform these modifications. Examples of these expression systems have been reviewed in the following literature: Methods in Enzymology, Volume 185, eds D.V.
  • the gene can be identified and cloned by a method such as that used in Shiba et al., US 5,759,833, Shiba et al., US 5,629,188, Martinis et al., US 5,656,470 and Sassanfar et al., US 5,756,327.
  • It is an advantage of the target validation method that it can be used with target cell components which have not been previously isolated or characterized and whose functions are unknown.
  • a segment of DNA containing an open reading frame (ORF; a cDNA can also be used, as appropriate to a eukaryotic cell) which has been isolated from a cell of a type that is to be an object of drug action (e.g., tumor cell, pathogen cell) can be cloned into a vector, and the target gene product of the ORF can be produced in host cells harboring the vector.
  • the gene product can be purified and further studied in a manner similar to that of a gene product that has been previously isolated and characterized.
  • the open reading frame (in some cases, cDNA) can be isolated from a source of DNA of the cells of interest (genomic DNA or a library, as appropriate), and inserted into a fusion protein or fusion polypeptide construct.
  • This construct can be a vector comprising a nucleic acid sequence which provides a control region (e.g., promoter, ribosome binding site) and a region which encodes a peptide or polypeptide portion of the fusion polypeptide wherein the polypeptide encoded by the fusion vector endows the fusion polypeptide with one or more properties that allow for the purification of the fusion polypeptide.
  • the vector can be one from the pGEX series of plasmids (Pharmacia) designed to produce fusions with glutathione S-transferase.
  • Host cells: The isolated DNA having an open reading frame, whether encoding a known or an as yet unidentified gene product, when inserted into an expression construct, can be expressed to produce the target cell component in host cells.
  • Host cells can be, for example, Gram-negative or Gram-positive bacterial cells such as Escherichia coli or Bacillus subtilis, respectively, e.g., Bacillus anthracis, or yeast cells such as Saccharomyces cerevisiae, Schizosaccharomyces pombe or Pichia pastoris.
  • the target cell component to be used in target validation studies can be produced in a host that is genetically related to the pathogen from which the gene encoding it was isolated.
  • an E. coli host is preferred over a Pichia pastoris host.
  • the target cell component so produced can then be isolated from the host cells.
  • Many protein purification methods are known that separate proteins on the basis of, for instance, size, charge, or affinity for a binding partner (e.g., for an enzyme, a binding partner can be a substrate or substrate analog), and these methods can be combined in a sequence of steps by persons of skill in the art to produce an effective purification scheme.
  • An isolated cell component or a fusion protein comprising the cell component can be used in a test to identify one or more biomolecular binders of the isolated product (general step (1)).
  • a biomolecular binder of a target cell component can be identified by in vitro assays that test for the formation of complexes of target and biomolecular binder noncovalently bound to each other.
  • the isolated target can be contacted with one or more types of biomolecules under conditions conducive to binding, the unbound biomolecules can be removed from the targets, and a means of detecting bound complexes of biomolecules and targets can be applied.
  • the detection of the bound complexes can be facilitated by having either the potential biomolecular binders or the target labeled or tagged with an adduct that allows detection or separation (e.g., radioactive isotope or fluorescent label; streptavidin, avidin or biotin affinity label).
  • both the potential biomolecular binders and the target can be differentially labeled. For examples of such methods see, e.g., WO 98/19162.
  • Biomolecules to be tested and means for detection: The biomolecules to be tested for binding to a target can be from a library of candidate biomolecular binders (e.g., a peptide or oligonucleotide library).
  • a peptide library can be displayed on the coat protein of a phage (see, for examples of the use of genetic packages such as phage display libraries, Koivunen, E. et al., J. Biol. Chem. 268:20205-20210 (1993)).
  • the biomolecules can be detected by means of a chemical tag or label attached to or integrated into the biomolecules before they are screened for binding properties.
  • the label can be a radioisotope, a biotin tag, or a fluorescent label.
  • Those molecules that are found to bind to the target molecule can be called biomolecular binders.
  • Fusion proteins: An isolated target cell component, an antigenically similar portion thereof, or a suitable fusion protein comprising all or a portion of the target can be used in a method to select and identify biomolecules which bind specifically to the target.
  • the target cell component comprises a protein
  • fusion proteins comprising all of, or a portion of, the target linked to a second moiety not occurring in the target as found in nature, can be prepared for use in another aspect of the method.
  • Suitable fusion proteins for this purpose include those in which the second moiety comprises an affinity ligand (e.g., an enzyme, antigen, epitope).
  • the fusion proteins can be produced by the insertion of a gene encoding a target or a suitable portion of such gene into a suitable expression vector, which encodes an affinity ligand (e.g., pGEX-4T-2 and pET-15b, encoding glutathione S-transferase and His-Tag affinity ligands, respectively).
  • the expression vector can be introduced into a suitable host cell for expression.
  • Host cells are lysed and the lysate, containing fusion protein, can be bound to a suitable affinity matrix by contacting the lysate with an affinity matrix under conditions sufficient for binding of the affinity ligand portion of the fusion protein to the affinity matrix.
  • Fusion protein can be immobilized
  • the fusion protein can be immobilized on a suitable affinity matrix under conditions sufficient to bind the affinity ligand portion of the fusion protein to the matrix, and is contacted with one or more candidate biomolecules (e.g., a mixture of peptides) to be tested as biomolecular binders, under conditions suitable for binding of the biomolecules to the target portion of the bound fusion protein.
  • the affinity matrix with bound fusion protein can be washed with a suitable wash buffer to remove unbound biomolecules and non-specifically bound biomolecules. Biomolecules which remain bound can be released by contacting the affinity matrix with fusion protein bound thereto with a suitable elution buffer. Wash buffer can be formulated to permit binding of the fusion protein to the affinity matrix, without significantly disrupting binding of specifically bound biomolecules. In this aspect, elution buffer can be formulated to permit retention of the fusion protein by the affinity matrix, but can be formulated to interfere with binding of the test biomolecule(s) to the target portion of the fusion protein.
  • a change in the ionic strength or pH of the elution buffer can lead to release of biomolecules
  • the elution buffer can comprise a release component or components designed to disrupt binding of biomolecules to the target portion of the fusion protein.
  • Immobilization can be performed prior to, simultaneous with, or after contacting, the fusion protein with biomolecule, as appropriate.
  • Various permutations of the method are possible, depending upon factors such as the biomolecules tested, the affinity matrix-ligand pair selected, and elution buffer formulation.
  • a suitable elution buffer can be a matrix elution buffer, such as glutathione for a GST fusion.
  • the fusion protein comprises a cleavable linker, such as a thrombin cleavage site
  • cleavage from the affinity ligand can release a portion of the fusion with the biomolecules bound thereto.
  • Bound biomolecule can then be released from the fusion protein or its cleavage product by an appropriate method, such as extraction.
  • one or more candidate biomolecular binders can be tested simultaneously. Where a mixture of biomolecules is tested, the biomolecules selected by the foregoing processes can be separated (as appropriate) and identified by suitable methods (e.g., PCR, sequencing, chromatography).
  • Random sequence RNA libraries can also be screened according to the present method to select RNA molecules which bind to a target. Where biomolecules selected from a combinatorial library by the present method carry unique tags, identification of individual biomolecules by chromatographic methods is possible.
  • biomolecules do not carry tags
  • chromatographic separation followed by mass spectrometry to ascertain structure
  • Other methods to identify biomolecular binders of a target cell component can be used.
  • the two-hybrid system or interaction trap is an in vivo system that can be used to identify polypeptides, peptides or proteins (candidate biomolecular binders) that bind to a target protein.
  • both candidate biomolecular binders and target cell component proteins are produced as fusion proteins.
  • the two-hybrid system and variations on it have been described (US 5,283,173 and US 5,468,614; Golemis, E.A.
  • Once one or more biomolecular binders of a cell component have been identified, further steps can be combined with those taken to identify the biomolecular binder, to identify those biomolecular binders that produce a phenotypic effect on a cell (where "a cell" can mean cells of a cell strain or cell line).
  • a method for identifying a biomolecule that produces a phenotypic effect on a first cell can comprise the steps of identifying a biomolecular binder of an isolated target cell component of the first cell, constructing a second cell comprising the target cell component and a regulable exogenous gene encoding the biomolecular binder, and testing the second cell for the phenotypic effect, upon production of the biomolecular binder in the second cell, where the second cell can be maintained in culture or introduced into an experimental animal. If the second cell shows the phenotypic effect upon intracellular production of the biomolecular binder, then a biomolecule that produces a phenotypic effect on the first cell has been identified.
  • Host cells (also, "second cells" in the terminology used above) of the cell type (e.g., species of pathogenic bacteria) from which the target was isolated (or from which the gene encoding the target was originally isolated, if the target is produced by recombinant methods) can be engineered to harbor a gene that can regulatably express the biomolecular binder (e.g., under an inducible or repressible promoter). The ability to regulate the expression of the biomolecular binder is desirable because constitutive expression of the biomolecular binder could be lethal to the cell.
  • inducible or regulated expression gives the researcher the ability to control if and when the biomolecular binder is expressed.
  • the gene expressing the biomolecular binder can be present in one or more copies, either on an extrachromosomal structure, such as on a single or multicopy plasmid, or integrated into the host cell genome. Plasmids that provide an inducible gene expression system in pathogenic organisms can be used. For example, plasmids allowing tetracycline-inducible expression of a gene in Staphylococcus aureus have been developed.
  • For intracellular expression of a biomolecule to be tested for its phenotypic effect in a eukaryotic cell (e.g., mammalian cell), the genes for expression can be carried on plasmid-based or virus-based vectors, or on a linear piece of DNA or RNA.
  • the genetic material can be introduced into cells using a variety of techniques, including whole cell or protoplast transformation, electroporation, calcium phosphate-DNA precipitation or DEAE-dextran transfection, liposome-mediated DNA or RNA transfer, or transduction with recombinant viral or retroviral vectors.
  • Expression of the gene can be constitutive (e.g., ADH1 promoter for expression in S. cerevisiae (Bennetzen, J.L. and Hall, B.D., J. Biol. Chem. 257:3026-3031 (1982)), or CMV immediate early promoter and RSV LTR for mammalian expression) or inducible, as the inducible GAL1 promoter in yeast (Davis, L.I.
  • E. coli Lac repressor/operator system and Tn10 Tet repressor/operator systems have been engineered to govern regulated expression in organisms from bacterial to mammalian cells. Regulated gene expression can also be achieved by activation.
  • gene expression governed by HIV LTR can be activated by HIV or SIV Tat proteins in human cells;
  • GAL4 promoter can be activated by galactose in a nonglucose-containing medium.
  • the location of the biomolecule binder genes can be extra chromosomal or chromosomally integrated.
  • the chromosome integration can be mediated through homologous or nonhomologous recombinations.
  • For proper localization in the cells, it may be desirable to tag the biomolecule binders with certain peptide signal sequences (for example, nuclear localization signal (NLS) sequences, mitochondria localization sequences).
  • For presentation of the biomolecular binders in the intracellular system, they can be fused N-terminally, C-terminally, or internally in a carrier protein (if the biomolecular binder is a peptide), and can be fused (5', 3' or internally) in a carrier RNA or DNA molecule (if the biomolecular binder is a nucleic acid).
  • the biomolecular binder can be presented with a protein or nucleic acid structural scaffold.
  • Certain linkages (e.g., a 4-glycine linker for a peptide or a stretch of A's for an RNA) can be inserted between the biomolecular binder and the carrier proteins or nucleic acids.
  • the effect of this biomolecular binder on the phenotype of the cells can be tested, as a manifestation of the binding (implying binding to a functionally relevant site, thus, an activator, or more likely, an inhibitory) effect of the biomolecular binder on the target used in an in vitro binding assay as described above.
  • An intracellular test can not only determine which biomolecular binders have a phenotypic effect on the cells, but at the same time can assess whether the target in the cells is essential for maintaining the normal phenotype of the cells.
  • a culture of the engineered cells expressing a biomolecular binder can be divided into two aliquots.
  • the first aliquot (“test” cells) can be treated in a suitable manner to regulate (e.g., induce or release repression of, as appropriate) the gene encoding the biomolecular binder, such that the biomolecular binder is produced in the cells.
  • the second aliquot (“control” cells) can be left untreated so that the biomolecular binder is not produced in the cells.
  • a different strain of cells not having a gene that can express the biomolecular binder, can be used as control cells.
  • the phenotype of the cells in each culture ("test" and "control" cells grown under the same conditions, other than the expression of the biomolecular binder) can then be monitored by a suitable means (e.g., measuring enzymatic activity, monitoring a product of a biosynthetic pathway, using an antibody to test for the presence of a cell surface antigen, etc.).
  • the growth of the cells in each culture (“test” and “control” cells grown under the same conditions, other than the expression of the biomolecular binder), can be monitored by a suitable means (e.g., turbidity of liquid cultures, cell count, etc). If the extent of growth, or rate of growth of the test cells is less than the extent of growth or rate of growth of the control cells, then the biomolecular binder can be concluded to be an inhibitor of the growth of the cells, or a biomolecular inhibitor. If the phenotype of the test cells is altered relative to that of the control cells, then the biomolecular binder can be concluded to be one that causes a phenotypic effect.
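  • As an illustration only, the comparison of test and control growth described above can be sketched in Python. The sketch assumes turbidity (optical density) readings taken at matching time points during exponential growth; the function names and the 0.8 cutoff are hypothetical choices that do not come from the text, which only requires that test growth be less than control growth.

```python
import math

def growth_rate(times_h, od_values):
    """Least-squares slope of ln(OD) versus time, in units of 1/hour."""
    logs = [math.log(v) for v in od_values]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    return sxy / sxx

def is_growth_inhibitor(times_h, test_od, control_od, min_fraction=0.8):
    """True if test cells grow at less than min_fraction of the control rate.

    min_fraction is an assumed cutoff for illustration, not a value from the text.
    """
    return growth_rate(times_h, test_od) < min_fraction * growth_rate(times_h, control_od)

# Synthetic readings: control doubles faster than the induced "test" culture.
times = [0, 1, 2, 3, 4]
control = [0.05 * math.exp(0.7 * t) for t in times]
test = [0.05 * math.exp(0.3 * t) for t in times]
inhibited = is_growth_inhibitor(times, test, control)
```

    In practice the cutoff would be set from replicate cultures and a statistical test rather than a fixed fraction.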
  • isolated target cell component having a known function can be tested for modulation of this known function in the presence of biomolecular binder under conditions conducive to binding of the biomolecular binder to the target cell component. Positive results in these tests should encourage the investigator to continue in the drug discovery process with efforts to find a more stable compound (than a peptide, polypeptide or RNA biomolecule) that mimics the binding properties of the biomolecular binder on the tested target cell component.
  • A further test can, again, employ an engineered strain of cells that comprise both the target cell component and one or more genes encoding a biomolecule tested to be a biomolecular binder of the target cell component.
  • the cells of the cell strain can be tested in animals to see if regulable expression of the biomolecular binder in the engineered cells produces an observable or testable change in phenotype of the cells.
  • Both the "in culture” test for the effect of intracellular expression of the biomolecular binder and the “in animal” test (described below) for the effect of intracellular expression of the biomolecular binder can be applied not only towards drug discovery in the categories of antimicrobials and anticancer agents, but also towards the discovery of therapeutic agents to treat inflammatory diseases, cardiovascular diseases, diseases associated with metabolic pathways, and diseases associated with the central nervous system, for example.
  • the object of the test is to see whether production of the biomolecular binder in the engineered strain inhibits growth of these cells after their introduction into an animal by the engineered pathogen.
  • Such a test can not only determine which biomolecular binders are inhibitors of growth of the cells, but at the same time can assess whether the target in the cells is essential for maintaining growth of the cells (infection, for a pathogenic organism) in a host mammal.
  • Suitable animals for such an experiment are, for example, mammals such as mice, rats, rabbits, guinea pigs, dogs, pigs, and the like. Small mammals can be used for reasons of convenience.
  • the engineered cells are introduced into one or more animals ("test” animals) and into one or more animals in a separate group (“control” animals) by a route appropriate to cause symptoms of systemic or local growth of the engineered cells.
  • the route of introduction may be, for example, by oral feeding, by inhalation, by subdermal, intramuscular, intravenous, or intraperitoneal injection as appropriate to the desired result.
  • expression of the gene encoding the biomolecular binder is regulated to allow production of the biomolecular binder in the engineered pathogen cells.
  • the treatment to express the gene encoding the biomolecular binder can be the administration of an inducer substance (where expression of the biomolecular binder gene is under the control of an inducible promoter) or the functional removal of a repressor substance (where expression of the biomolecular binder gene is under the control of a repressible promoter).
  • the animals can be monitored for signs of infection (as the simplest endpoint, death of the animal, but also e.g., lethargy, lack of grooming behavior, hunched posture, not eating, diarrhea or other discharges; bacterial titer in samples of blood or other cultured fluids or tissues).
  • the test and control animals can be monitored for the development of tumors or for other indicators of the proliferation of the introduced engineered cells.
  • the biomolecule can be also called a biomolecular inhibitor of growth, or biomolecular inhibitor of infection, as appropriate, as it can be concluded that the expression in vivo of the biomolecular inhibitor is the cause of the relative reduction in growth of the introduced cells in the test animals.
  • further steps of the procedure involve in vitro assays to identify one or more compounds that have binding and activating or inhibitory properties that are similar to those of the biomolecules which have been found to have a phenotypic effect, such as inhibition of growth. That is, compounds that compete for binding to a target cell component with the biomolecule would then be structural analogs of the biomolecules. Assays to identify such compounds can take advantage of known methods to identify competing molecules in a binding assay. These steps comprise general step (3) of the method.
  • a biomolecular inhibitor (or activator) can be contacted with the isolated target-cell component to allow binding, one or more compounds can be added to the milieu comprising the biomolecular inhibitor and the cell component under conditions that allow interaction and binding between the cell component and the biomolecular inhibitor, and any biomolecular inhibitor that is released from the cell component can be detected.
  • One suitable system that allows the detection of released biomolecular inhibitor (or activator) is one in which fluorescence polarization of molecules in the milieu can be measured.
  • the biomolecular inhibitor can have bound to it a fluorescent tag or label such as fluorescein or fluorescein attached to a linker.
  • Assays for inhibition of the binding of the biomolecular inhibitor to the cell component can be done in microtiter plates to conveniently test a set of compounds at the same time.
  • a majority of the fluorescently labeled biomolecular inhibitor must bind to the protein in the absence of competitor compound to allow for the detection of small changes in the bound versus free probe population when a compound which is a competitor with a biomolecular inhibitor is added (B.A. Lynch, et al., Analytical Biochemistry 247:77-82 (1997)). If a compound competes with the biomolecular inhibitor for a binding site on the target cell component, then fluorescently labeled biomolecular inhibitor is released from the target cell component, lowering the polarization measured in the milieu.
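  • A minimal sketch of how such a plate readout might be scored, assuming each well reports parallel and perpendicular emission intensities. It uses the standard polarization definition P = (I_par - G*I_perp) / (I_par + G*I_perp); the well names, intensity values, G factor of 1.0, and the 30 mP drop threshold are hypothetical and are not taken from the text.

```python
def polarization_mP(i_parallel, i_perpendicular, g_factor=1.0):
    """Fluorescence polarization in millipolarization units (mP)."""
    num = i_parallel - g_factor * i_perpendicular
    den = i_parallel + g_factor * i_perpendicular
    return 1000.0 * num / den

def find_competitors(wells, no_competitor_mP, min_drop_mP=30.0):
    """Names of compounds whose wells show a polarization drop of at least
    min_drop_mP relative to the no-competitor reference well.

    min_drop_mP is an assumed cutoff chosen for illustration only.
    """
    hits = []
    for name, (i_par, i_perp) in wells.items():
        drop = no_competitor_mP - polarization_mP(i_par, i_perp)
        if drop >= min_drop_mP:
            hits.append(name)
    return hits

# Reference well: labeled inhibitor bound to target, no competitor added.
reference = polarization_mP(1500.0, 1000.0)

# Hypothetical test wells: (parallel, perpendicular) intensities.
wells = {
    "compound_A": (1520.0, 1000.0),  # polarization essentially unchanged
    "compound_B": (1100.0, 1000.0),  # polarization falls: probe released
}
hits = find_competitors(wells, reference)
```

    A fixed threshold is the simplest possible rule; a real screen would normally derive the cutoff from the spread of control wells on the same plate.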
  • the target cell component can be attached to a solid support, contacted with one or more compounds, and contacted with the biomolecular inhibitor.
  • One or more washing steps can be employed to remove biomolecular inhibitor and compound not bound to the cell component. Either the biomolecular inhibitor bound to the target cell component or the compound bound to the target cell component can be measured.
  • Detection of biomolecular inhibitor or compound bound to the cell component can be facilitated by the use of a label on either molecule type, wherein the label can be, for instance, a radioactive isotope either incorporated into the molecule itself or attached as an adduct, streptavidin or biotin, a fluorescent label, or a substrate for an enzyme that can produce from the substrate a colored or fluorescent product.
  • a scintillation counter can be used to measure radioactivity.
  • Radiolabeled streptavidin or biotin can be allowed to bind to biotin or streptavidin, respectively, and the resulting complexes detected in a scintillation counter.
  • Alkaline phosphatase conjugated to streptavidin can be added to a biotin-labeled biomolecular inhibitor or compound.
  • Detection and quantitation of a biotin-labeled complex can then be by addition of pNPP, a substrate of alkaline phosphatase, and detection by spectrophotometry of a product which absorbs light at a wavelength of 405 nm.
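  • For illustration, an absorbance reading at 405 nm can be converted to a product concentration with the Beer-Lambert law, A = epsilon * c * l. The sketch below is not part of the described method; the default molar absorptivity (18,000 per M per cm, a commonly quoted figure for the p-nitrophenol product under alkaline conditions) and the 1 cm path length are assumptions that should be replaced by values calibrated for the actual buffer and plate reader.

```python
def product_concentration_uM(a405_sample, a405_blank,
                             epsilon_M_cm=18000.0, path_cm=1.0):
    """Product concentration in micromolar from blank-corrected A405.

    epsilon_M_cm and path_cm are assumed defaults, not values from the text.
    """
    corrected = a405_sample - a405_blank
    return corrected / (epsilon_M_cm * path_cm) * 1e6

# Hypothetical reading: sample A405 of 0.41 against a blank of 0.05.
conc = product_concentration_uM(0.41, 0.05)
```

    Microtiter wells rarely have a 1 cm path length, so a standard curve of known product concentrations is usually the more reliable route.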
  • a fluorescent label can also be used, in which case detection of fluorescent complexes can be by a fluorometer.
  • the method for identifying compounds comprises attaching the target cell component to a solid support, contacting the biomolecular inhibitor with the target cell component under conditions suitable for binding of the biomolecular inhibitor to the cell component, removing unbound biomolecular inhibitor from the solid support, contacting one or more compounds (e.g., a mixture of compounds) with the cell component under conditions suitable for binding of the biomolecular inhibitor to the cell component, and testing for unbound biomolecular inhibitor released from the cell component, whereby if unbound biomolecular inhibitor is detected, one or more compounds that displace or compete with the biomolecular inhibitor for a particular site on the target cell component have been identified.
  • Derivatives of these compounds having modifications to confer improved solubility, stability, etc. can also be tested for a desired phenotypic effect.
  • Combining steps for testing the phenotypic effects of a biomolecule, as can be produced in an intracellular test, with steps for identifying compounds that compete with the biomolecule for sites on a target cell component, yields a method for identifying a compound which is a functional analog of a biomolecule which produces a phenotypic effect on a cell.
  • steps can be to test, for the phenotypic effect, either in culture or in an animal model, or in both, a cell which produces a biomolecule by regulatable expression of an exogenous gene in the cell, and to identify, if the biomolecule caused the phenotypic effect, one or more compounds that compete with the biomolecule for binding to a target cell component. If a compound is found to compete with the biomolecule for binding to the target cell component, then the compound is a functional analog of a biomolecule which produces a phenotypic effect on the cell. Such a functional analog can cause qualitatively a similar effect on the cell, but to a similar degree, lesser degree or greater degree than the biomolecule.
  • a further aspect of the invention is a method for determining whether a target component of a cell is essential to producing a phenotypic effect on the cell, comprising isolating the target component from the cell, identifying a biomolecular binder of the isolated target component of the cell, constructing a second cell comprising the target component and a regulable, exogenous gene encoding the biomolecular binder, and testing the second cell in culture for an altered phenotypic effect, upon production of the biomolecular binder in the second cell, whereby, if the second cell shows the altered phenotypic effect upon production of the biomolecular binder, then the target component of the first cell is essential to producing the phenotypic effect on the first cell.
  • DHFR: mammalian dihydrofolate reductase
  • MTX: methotrexate
  • NIH 3T3 is a mouse fibroblast cell line that is able to develop spontaneous transformed cells when cultured in low concentration (2%) of calf serum in molecular, cellular and developmental biology medium 402 (MCDB) (M. Chow and H. Rubin, Proc. Natl. Acad. Sci. USA 95(8):4550-4555 (1998)).
  • the transformed cells which can be selectively inhibited by MTX (Chow and Rubin), are isolated.
  • Both the normal and transformed NIH 3T3 cells are transfected with pTet-On plasmid (Clontech; Palo Alto, CA).
  • Stable cell lines that express high levels of reverse tetracycline-controlled activator (rtTA) are isolated and characterized for their normal or transformed phenotype (Chow and Rubin).
  • the DHFR gene (Genbank Accession # L26316) from the NIH 3T3 cell line is amplified by reverse transcription-PCR (RT-PCR) using poly A+ RNA isolated from NIH 3T3 cells (Sambrook, J. et al., Molecular Cloning: A Laboratory Manual, 2nd edition, Cold Spring Harbor Laboratory Press, 1989). Active DHFR is expressed using the BacPAK Baculovirus Expression System (Clontech) or other appropriate systems.
  • the expressed DHFR is purified and biotinylated and subjected to peptide binder identification as exemplified for bacterial proteins.
  • the identified peptides are biochemically characterized for in vitro inhibition of DHFR activity.
  • Peptides that inhibit DHFR are identified.
  • a nucleic acid encoding each peptide can be cloned into a vector such as pGEX-4T2 (Pharmacia) to yield a vector which encodes a fusion polypeptide having the peptide fused to the N-terminus of GST. This can also be done by PCR amplification as exemplified herein for the peptide Pro-3.
  • the fusion genes are cloned into plasmid pTRE (Clontech) for regulated expression.
  • the constructed plasmid or the vector is co-transfected with pTK-Hyg into the stable NIH 3T3 cell line that expresses rtTA.
  • 3T3N-VITA: normal 3T3 cells that express rtTA and the DHFR inhibitory peptides
  • 3T3T-VITA: transformed 3T3 cells that express rtTA and the DHFR inhibitory peptides
  • 3T3T-VITA control: transformed 3T3 cells that express rtTA and GST
  • 10^2 to 10^[…] 3T3T-VITA or 3T3T-VITA control cells are mixed with 10^5 3T3N-VITA cells and are grown in MCDB 402 medium with 10% calf serum at 37°C for three days.
  • Tetracycline is added to the medium to a final concentration of 0 to 1 µg/ml. In a control, 200 nM of MTX is added. The cultures are incubated for an additional eight days, and the number of foci formed is counted as described by M. Chow and H. Rubin, Proc. Natl. Acad. Sci. USA 95(8):4550-4555 (1998). Peptides that specifically inhibit foci formation of 3T3 transformed cells are identified. A murine model of fibroblastoma (Kogerman, P. et al., Oncogene
  • 3T3T-VITA or 3T3T-VITA control cells (10^3, 10^4, 10^5, 10^6 cells) are injected subcutaneously into 5 groups (10 in each group) of athymic nude mice (4-6 weeks old, 18-22 g) to determine the minimal dose needed for development of fibroblastomas in all of the tested animals.
  • 6 groups of athymic nude mice (10 each) are injected subcutaneously (s.c.) with the minimal tumorigenic dose of 3T3T-VITA or 3T3T-VITA control cells to develop fibroblastoma.
  • One week after injection, group 1 mice start receiving MTX s.c. at 2 mg/kg/day as positive control, groups 2 to 5 start receiving 1, 2, 5, or 10 mg/kg/day of tetracycline, and group 6 starts receiving saline (vehicle) as control. Five weeks after the introduction of cells, all of the mice are sacrificed and tumors are removed from them. Tumor mass is measured and compared among the groups. An effective peptide identified by these in vivo experiments can be used for screening libraries of compounds to identify those compounds that competitively bind to DHFR.
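  • The six-group dosing scheme above can be written down as a small lookup table, shown here as a sketch. The group numbers, agents, and mg/kg/day values follow the text; the function names and the example body weight and tumor masses are hypothetical illustration data, not results.

```python
# Group number -> (agent, dose in mg per kg body weight per day), per the text.
GROUPS = {
    1: ("MTX", 2.0),            # positive control
    2: ("tetracycline", 1.0),
    3: ("tetracycline", 2.0),
    4: ("tetracycline", 5.0),
    5: ("tetracycline", 10.0),
    6: ("saline", 0.0),         # vehicle control
}

def daily_dose_mg(group, body_weight_g):
    """Agent and absolute daily dose (mg) for one mouse of the given weight."""
    agent, mg_per_kg = GROUPS[group]
    return agent, mg_per_kg * body_weight_g / 1000.0

def mean_tumor_mass(masses_by_group):
    """Mean tumor mass per group, for comparison among groups."""
    return {g: sum(m) / len(m) for g, m in masses_by_group.items()}

# A 20 g mouse in group 1 receives 0.04 mg of MTX per day.
agent, dose = daily_dose_mg(1, 20.0)

# Made-up tumor masses (grams), used only to show the comparison step.
means = mean_tumor_mass({5: [0.2, 0.4], 6: [1.0, 2.0, 3.0]})
```

    Comparing group means is only the first step; with 10 animals per group, an analysis of variance or a nonparametric equivalent would normally follow.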
  • One mechanism of tumorigenesis is overexpression of proto-oncogenes such as Ha-ras (Reviewed by Suarez, H.G., Anticancer Research 9(5):1331-1343 (1989)).
  • Transgenic mice that overexpress human Ha-ras have been produced. Such transgenic mice develop salivary and/or mammary adenocarcinomas (Nielsen, L.L. et al, In Vivo 8(5):1331-1343 (1994)). Secondary transgenic mice that express rtTA can be generated using the pTet-On plasmid from Clontech.
  • Human Ha-ras open reading frame cDNA (Genbank Accession #GO0277) is amplified by RT-PCR using poly A+ RNA isolated from human mammary gland or other tissues. Active Ha-ras is expressed using the BacPAK Baculovirus Expression System (Clontech) or other appropriate systems. The expressed Ha-ras is purified and biotinylated and subjected to peptide binder identification as exemplified herein for bacterial proteins as target cell components. The identified peptides are biochemically characterized for in vitro inhibition of Ha-ras GTPase activity.
  • Peptides that inhibit Ha-ras are cloned into plasmid pTRE (Clontech) for regulated expression as an N-terminal fusion of GST. Such constructs are used to generate tertiary transgenic mice using the secondary transgenic mice. Transgenic mice that are able to overexpress peptide genes are identified by Northern and Western analysis. Control mice that express GST are also identified. Various doses of tetracycline are administered to the tertiary transgenic mice by s.c. or i.p. injection before or after tumor onset. Prevention or regression of tumors resulting from expression of the peptide genes is analyzed as described above for murine fibroblastoma.
  • Peptides found to be effective in in vivo experiments will be used to screen compounds that inhibit human Ha-ras activity for cancer therapy.
  • the method of the invention can be applied more generally to mammalian diseases caused by: (1) loss or gain of protein function, (2) over-expression or loss of regulation of protein activity. In each case the starting point is the identification of a putative protein target or metabolic pathway involved in the disease.
  • the protocol can sometimes vary with the disease indication, depending on the availability of cell culture and animal model systems to study the disease. In all cases the process can deliver a validated target and assay combination to support the initiation of drug discovery.
  • Appropriate disease indications include, but are not limited to, Alzheimer's, arthritis, cancer, cardiovascular diseases, central nervous system disorders, diabetes, depression, hypertension, inflammation, obesity and pain.
  • Appropriate protein targets putatively linked to disease indications include, but are not limited to (1) the leptin protein, putatively linked to obesity and diabetes; (2) a mitogen-activated protein kinase putatively linked to arthritis, osteoporosis and atherosclerosis; (3) the interleukin-1 beta converting protein putatively linked to arthritis, asthma and inflammation; (4) the caspase proteins putatively linked to neurodegenerative diseases such as Alzheimer's, Parkinson's and stroke, and (5) the tumor necrosis factor protein putatively linked to obesity and diabetes.
  • Appropriate protein targets include also, but are not limited to, enzymes catalyzing the following types of reactions: (1) oxido-reductases, (2) transferases, (3) hydrolases, (4) lyases, (5) isomerases, and (6) ligases.
  • the arachidonic acid pathway constitutes one of the main mechanisms for the production of pain and inflammation.
  • the pathway produces different classes of end products, including the prostaglandins, thromboxane and leukotrienes.
  • Prostaglandins, an end product of cyclooxygenase metabolism, modulate immune function, mediate vascular phases of inflammation and are potent vasodilators.
  • COX: cyclooxygenase
  • Anti-inflammatory potencies of different NSAIDs have been shown to be proportional to their action as COX inhibitors. It has also been shown that COX inhibition produces toxic side effects such as erosive gastritis and renal toxicity. The knowledge base regarding the toxic side effects of COX inhibitors has been gained through years of monitoring human therapies and human suffering. Two kinds of COX enzymes are now known to exist, with inhibition of COX1 related to toxicity, and inhibition of COX2 related to reduction of inflammation. Thus, selective COX2 inhibition is a desirable characteristic of new anti-inflammatory drugs.
  • the method of the invention can provide a route from identification of potential drug targets to validating these targets (for example, COX1 and COX2) as playing a role in disease (pain and inflammation) to an examination of the phenotype for the inhibition of one or both target isozymes without human suffering. Importantly, this information can be collected in vivo.
  • the method of the invention can be used to define the phenotype of "genes of unknown function" obtained from various human genome sequencing projects or to assess the phenotype resulting from inhibition of one isozyme subtype or one member of a family of related protein targets.
  • Target (also, "target component of a cell," or "target cell component"): a constituent of a cell which contributes to and is necessary for the production or maintenance of a phenotype of the cell in which it is found.
  • a target can be a single type of molecule or can be a complex of molecules.
  • a target can be the product of a single gene, but can also be a complex comprising more than one gene product (for example, an enzyme comprising alpha and beta subunits, mRNA, tRNA, ribosomal RNA or a ribonucleoprotein particle such as a snRNP).
  • Targets can be the product of a characterized gene (gene of known function) or the product of an uncharacterized gene (gene of unknown function).
  • Target Validation: the process of determining whether a target is essential to the maintenance of a phenotype of the cell type in which the target normally occurs. For example, for pathogenic bacteria, researchers developing antimicrobials want to know if a compound which is potentially an antimicrobial agent not only binds to a target in vitro, but also binds to, and modulates the function of, a target in the bacteria in vivo, and especially under the conditions in which the bacteria are producing an infection, that is, those conditions under which the antimicrobial agent must work to inhibit bacterial growth in an infected animal or human.
  • Phenotypic Effect: a change in an observable characteristic of a cell which can include, e.g., growth rate, level or activity of an enzyme produced by the cell, sensitivity to various agents, antigenic characteristics, and level of various metabolites of the cell.
  • a phenotypic effect can be a change away from wild type (normal) phenotype, or can be a change towards wild type phenotype, for example.
  • a phenotypic effect can be the causing or curing of a disease state, especially where mammalian cells are referred to herein.
  • a phenotypic effect can be the slowing of growth rate or cessation of growth.
  • Biomolecule: a molecule which can be produced as a gene product in cells that have been appropriately constructed to comprise one or more genes encoding the biomolecule. Production of the biomolecule can be turned on, when desired, by an inducible promoter.
  • a biomolecule can be a peptide, polypeptide, or an RNA or RNA oligonucleotide, a DNA or DNA oligonucleotide, but is preferably a peptide.
  • biomolecules can also be made synthetically.
  • peptides see Merrifield, J., J. Am. Chem. Soc. 85: 2140-2154 (1963).
  • an Applied Biosystems 431A Peptide Synthesizer (Perkin Elmer)
  • Biomolecules produced as gene products intracellularly are tested for their interaction with a target in the intracellular steps described herein (tests performed with cells in culture and tests performed with cells that have been introduced into animals).
  • the same biomolecules produced synthetically are tested for their binding to an isolated target in an initial in vitro method described herein.
  • Synthetically produced biomolecules can also be used for a final step of the method for finding compounds that are competitive binders of the target.
  • Biomolecular Binder (of a target): a biomolecule which has been tested for its ability to bind to an isolated target cell component in vitro and has been found to bind to the target.
  • Biomolecular Inhibitor of Growth: a biomolecule which has been tested for its ability to inhibit the growth of cells constructed to produce the biomolecule in an "in culture" test of the effect of the biomolecule on growth of the cells, and has been found, in fact, to inhibit the growth of the cells in this test in culture.
  • Biomolecular Inhibitor of Infection: a biomolecule which has been tested for its ability to ameliorate the effects of infection, and has been found to do so.
  • pathogen cells constructed to regulably express the biomolecule are introduced into one or more animals, the gene encoding the biomolecule is regulated so as to allow production of the biomolecule in the cells, and the effects of production of the biomolecule are observed in the infected animals compared to one or more suitable control animals.
  • Isolated: term used herein to indicate that the material in question exists in a physical milieu distinct from that in which it occurs in nature.
  • an isolated target cell component of the invention may be substantially isolated with respect to the complex cellular milieu in which it naturally occurs.
  • the absolute level of purity is not critical, and those skilled in the art can readily determine appropriate levels of purity according to the use to which the material is to be put.
  • the isolated material will form part of a composition (for example, a more or less crude extract containing other substances), buffer system or reagent mix.
  • the material may be purified to essential homogeneity, for example as determined by PAGE or column chromatography (for example, HPLC).
  • Pathogen or Pathogenic Organism: an organism which is capable of causing disease, detectable by signs of infection or symptoms characteristic of disease.
  • Pathogens can include prokaryotes (which include, for example, medically significant Gram-positive bacteria such as Streptococcus pneumoniae, Enterococcus faecalis and Staphylococcus aureus, Gram-negative bacteria such as Escherichia coli, Pseudomonas aeruginosa and Klebsiella pneumoniae, and "acid-fast" bacteria such as Mycobacteria, especially M. tuberculosis), eukaryotes such as yeast and fungi (for example, Candida albicans and Aspergillus fumigatus) and parasites.
  • pathogens can include such organisms as soil-dwelling organisms and "normal flora" of the skin, gut and orifices, if such organisms colonize and cause symptoms of infection in a human or other mammal, by abnormal proliferation or by growth at a site from which the organism cannot usually be cultured.
  • compositions e.g., mixed bed multidimensional liquid chromatographs
  • methods for simultaneously identifying individual proteins in complex mixtures of biological molecules and quantifying the expression levels of those proteins e.g., proteome analyses.
  • the methods compare two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation.
  • the proteins in the standard and investigated samples are subjected separately to a series of chemical modifications, i.e., differential chemical labeling, and fragmentation, e.g., by proteolytic digestion and/or other enzymatic reactions or physical fragmenting methodologies.
  • the chemical modifications can be done before, or after, or before and after fragmentation/ digestion of the polypeptide into peptides.
  • Peptides derived from the standard and the investigated samples are labeled with chemical residues of different mass, but of similar properties, such that peptides with the same sequence from both samples are eluted together in the separation procedure and their ionization and detection properties regarding the mass spectrometry are very similar.
  • Differential chemical labeling can be performed on reactive functional groups on some or all of the carboxy- and/or amino- termini of proteins and peptides and/or on selected amino acid side chains.
  • a combination of chemical labeling, proteolytic digestion and other enzymatic reaction steps, physical fragmentation and/or fractionation can provide access to a variety of residues to generate different specifically labeled peptides to enhance the overall selectivity of the procedure.
  • the standard and the investigated samples are combined, subjected to multidimensional chromatographic separation, and analyzed by mass spectrometry methods. Mass spectrometry data is processed by special software, which allows for identification and quantification of peptides and proteins.
  • LC-LC-MS/MS: The combination of multidimensional liquid chromatography and tandem mass spectrometry can be called "LC-LC-MS/MS." LC-LC-MS/MS was first developed by Link A. and Yates J.
  • proteins can be first substantially or partially isolated from the biological samples of interest.
  • the polypeptides can be treated before selective differential labeling; for example, they can be denatured, reduced, preparations can be desalted, and the like.
  • Conversion of samples of proteins into mixtures of differentially labeled peptides can include preliminary chemical and/or enzymatic modification of side groups and/or termini; proteolytic digestion or fragmentation; post-digestion or post-fragmentation chemical and/or enzymatic modification of side groups and/or termini.
  • the differentially modified polypeptides and peptides are then combined into one or more peptide mixtures. Solvent or other reagents can be removed, neutralized or diluted, if desired or necessary.
  • the buffer can be modified, or, the peptides can be redissolved in one or more different buffers, such as a "MudPIT" (see below) loading buffer.
  • the peptide mixture is then loaded onto a chromatography column, such as a liquid chromatography column, a 2D capillary column or a multidimensional chromatography column, to generate an eluate.
  • the eluate is fed into a mass spectrograph, such as a tandem mass spectrograph.
  • an LC ESI MS and MS/MS analysis is complete.
  • peptides can be generated for mass spectrograph analysis.
  • Two or more samples can be differentially labeled by selective labeling of each sample.
  • Peptide modifications, i.e., labeling, are stable.
  • Reagents having differing masses or reactive groups can be chosen to maximize the number of reactive groups and differentially labeled samples, thus allowing for a multiplex analysis of sample, polypeptides and peptides.
  • a "MudPIT" protocol is used for peptide analysis, as described herein.
  • the methods of the invention can be fully automated and can essentially analyze every protein in a sample.
  • the invention provides apparatus (e.g., mixed bed multi-dimensional liquid chromatographs) and methods for high throughput, comparative proteome characterization.
  • the invention provides a broad-based method for global profiling of protein expression, which is a combination of differential peptide labeling and multidimensional chromatography coupled with mass spectrometry for separation, identification and quantification. Proteins are identified in complex mixtures with rapid speed, high sensitivity and accurate quantitative information. Using sets of labeling tags and modification methods, proteins are differentially and efficiently modified with stable and flexible labeling.
  • the invention provides methods for accurate and sensitive comparative proteomics in complex systems.
  • the invention provides compositions (e.g., mixed bed multidimensional liquid chromatographs) and methods for high throughput, comparative proteome characterization.
  • the goal is to provide a broad-based method for global profiling of protein expression, which is a combination of differential peptide labeling and multi-dimensional chromatography coupled with mass spectrometry for separation, identification and quantification. This method significantly improves over traditional methods. Proteins are identified in complex mixtures with rapid speed, high sensitivity and accurate quantitative information.
  • the invention provides novel approaches for modifying proteins differentially and efficiently with stable and flexible labeling.
  • the methods provide the speed and sensitivity for accurate comparative proteomics in complex systems.
  • the invention provides: differential peptide labeling; comparison of various modifications and identification of the top candidate(s); optimization of reaction conditions for the desired peptide/protein modification; and method validation.
  • MudPIT: Multi-dimensional Protein Identification Technique
  • the invention provides a high throughput proteomics technology with high speed, high efficiency and accurate quantitation, which can be employed for quantitative analysis of global protein expression in complex samples, and the detection and quantitation of specific proteins in complex samples.
  • An exemplary high throughput, comparative proteomics method uses a model pathway study of Streptomyces diversa (S. diversa).
  • the use of mass spectrometry to identify proteins whose sequences are present in either DNA or protein databases is well established and integrated into the field of Proteomics.
  • One goal of Proteomics is to define the expressed proteins associated with a given cellular state, and another goal is to quantify changes in protein expression between cellular states. Many techniques have been developed to achieve these goals (see below).
  • the present invention provides a non-gel based method of identifying individual proteins in complex protein mixtures simultaneously and quantifying protein expression level globally. It overcomes the limitations inherent in traditional techniques.
  • Comparative Proteomics Techniques: 2D gel electrophoresis (2D GE) is the most commonly used technique in proteomics.
  • in 2D GE, proteins are separated by isoelectric focusing according to their pI differences in the first dimension and by electrophoretic mobility according to their molecular weight differences in the second dimension. Separated proteins are usually visualized by staining. Quantitation is achieved by comparing the spot density.
  • for spot identification, the method involves spot cutting, in-gel digestion and peptide extraction. The next stage is analyzing these peptides using mass spectrometry or tandem mass spectrometry and database searching for identifications.
  • the disadvantages of the 2D GE approach are that it is very time consuming and labor intensive, and it does not work well for hydrophobic proteins, proteins with extreme pI, and non-abundant proteins.
  • Isotope-coded affinity tag (ICAT) is one of the new non-gel based methodologies that have a great impact on proteome research (ref. 1).
  • the method is based on a newly synthesized class of chemical reagents (ICAT) used in combination with tandem mass spectrometry.
  • the ICAT reagent contains a biotin affinity tag and a thiol specific reactive group (cysteine side chain), which are joined by a spacer domain available in two forms: regular (light), and isotopically heavy which includes eight deuterium atoms.
  • a reduced protein mixture representing one cell state is derivatized with the isotopically light version of the ICAT reagent, while the corresponding reduced protein mixture representing a second cell state is derivatized with the isotopically heavy version of the ICAT reagent.
  • the labeled samples are combined and proteolytically digested to produce peptide fragments.
  • the tagged cysteine containing peptide fragments are isolated by avidin affinity chromatography.
  • the isolated tagged peptides are separated and analyzed by microcapillary tandem mass spectrometry.
  • Differential isotopic labeling of peptides for global quantification of proteins (ref. 2) is another method used currently, in which two different protein mixtures for quantitative comparison were digested to peptide mixtures.
  • the peptide mixtures were separately methylated using either d0- or d3-methanol, the mixtures of methylated peptides were combined, and subjected to microcapillary HPLC-MS/MS.
  • Parent proteins of methylated peptides were identified by correlative database searching of fragment ion spectra using SEQUEST or automated de novo sequencing that compared all tandem mass spectra of d0- and d3-methylated peptide ion pairs.
  • Ratios of proteins in the two original mixtures were calculated by normalization of the area under the curve for d0- to d3-methylated peptide pairs.
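The ratio calculation described above can be sketched as follows. This is a minimal illustration, not the published procedure: the input layout, the use of the median to summarize peptide pairs, and the median-centering normalization across proteins are all assumptions introduced here.

```python
from statistics import median

def protein_ratio(peptide_pairs):
    """Estimate a d0/d3 abundance ratio for one protein.

    peptide_pairs: list of (area_d0, area_d3) tuples, one per methylated
    peptide pair assigned to the protein (hypothetical input layout).
    """
    ratios = [a0 / a3 for a0, a3 in peptide_pairs if a0 > 0 and a3 > 0]
    if not ratios:
        raise ValueError("no usable peptide pairs")
    # Assumed choice: the median damps the effect of outlier peptide pairs.
    return median(ratios)

def normalize(ratios_by_protein):
    """Divide every protein ratio by the median ratio across proteins.

    Assumes most proteins are unchanged between the two mixtures, so the
    median ratio reflects loading differences rather than biology.
    """
    centre = median(ratios_by_protein.values())
    return {name: r / centre for name, r in ratios_by_protein.items()}
```

For example, `protein_ratio([(2.0, 1.0), (4.0, 2.0), (9.0, 3.0)])` summarizes three peptide pairs of one protein into a single ratio.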
  • differential labeling reagents relied on stable isotopes, which are expensive and not flexible enough for differential labeling of more than two mixtures of peptides;
  • labeling methods are limited to methylation of C-termini;
  • protein expression profiling is limited to duplex comparison;
  • one dimensional capillary HPLC chromatography was employed to separate peptides, which doesn't have enough capacity and resolving power for complex mixtures of peptides.
  • the invention overcomes the shortcomings of the currently available quantitative proteomics methods described above.
  • the technology of the present method has speed, high efficiency and accurate quantitation, which is employed for quantitative analysis of global protein expression in complex samples.
  • the basic approach described is employed for: (i) quantitative analysis of global protein expression in complex samples (such as cells, tissues, fractions, etc.), (ii) the detection and quantitation of specific proteins in complex samples, and (iii) quantitative measurement of specific enzymatic activities in complex samples.
  • Novelties of this approach include: (i) design of differential labeling reagents for peptides and methods for efficient peptide modification; (ii) multiplex analysis; (iii) combination of labeling by chemical modifications of termini and/or side chains of peptides; (iv) combination of chemical modification and proteolytic digestions in order to achieve the most favorable and selective chemical modification of peptides; (v) improvement of multidimensional chromatography for better protein peptide separation and identification.
  • Experimental Design and Methods: The present application provides a non-gel based method of identifying individual proteins in complex protein mixtures simultaneously and quantifying protein expression level globally. It overcomes the limitations inherent in traditional techniques.
  • two or more samples of proteins are compared, one of which is considered as the standard sample and all others are considered as samples under investigation.
  • the proteins in the standard and investigated samples are subjected to a sequence of proteolytic digestion and/or other enzymatic reactions in separate tubes. Then, these digested peptides are modified (novel differential chemical labeling). Peptides derived from the standard and the investigated samples are labeled with chemical residues of different mass, but they have similar properties such that the differentially labeled peptides are eluted together in the separation procedure and their ionization and fragmentation properties regarding the mass spectrometry are very similar.
  • the samples are combined, separated by multidimensional chromatography, and analyzed by mass spectrometry methods.
  • the combined mixtures of peptides are separated by an improved version of a current chromatography method called Multidimensional Protein Identification Technique (MudPIT) (ref. 3).
  • Chemical transformations involved in differential labeling: (1) Esterification of C-termini of the peptides and carboxylic acid groups in the side chains; (2) Amidation of C-termini of the peptides and carboxylic acid groups in the side chains (might require protection of amine groups first); (3) Acylation of N-termini of the peptides and amino and hydroxyl groups in the side chains.
  • the esterification, amidation, and acylation reactions are performed on the mixtures of peptides in a fashion similar to other reactions of the types already described in the previous part, or modified as needed in each particular case.
  • Reagents for differential labeling: Mixtures of peptides coming from the standard protein samples and the investigated protein samples are labeled separately with differential reagents. These differential reagents differ in molecular mass, but do not differ in retention properties regarding the separation method used and in ionization and detection properties regarding the mass spectrometry methods used. Thus, these differential reagents differ either in their isotope composition (isotopic reagents) or they differ structurally by a rather small fragment, a change that does not alter the properties stated above (homologous reagents). The obvious choices for such reagents are aliphatic alcohols, aliphatic amines, and aliphatic acids.
  • Isotopic reagents based on aliphatic alcohols, amines, or acids contain different numbers of hydrogen (protium) and deuterium atoms in different reagents, e.g., CH3CH2OH and CD3CD2OH (mass difference is 5 Da) or CH3CH2CO2H and CD3CD2CO2H (mass difference is 5 Da).
  • the homologous reagents differ from each other by the number of CH2 moieties in their molecules, e.g., CH3OH and CH3CH2OH (mass difference is 14 Da) or CH3CO2H and CH3CH2CO2H (mass difference is 14 Da).
  • the alcohol reagents esterify peptide C-termini and/or Glu and Asp side chains, the amines form amide bonds with peptide C-termini and/or Glu and Asp side chains, and the acids form amide bonds with peptide N-termini and/or Lys and Arg side chains.
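The nominal 5 Da and 14 Da differences quoted for the isotopic and homologous reagent pairs can be checked with a short calculation. This is only a sketch: the helper name and the dictionary layout are illustrative, and the monoisotopic masses are standard rounded values.

```python
# Monoisotopic masses in Da (standard values, rounded).
MASS = {"C": 12.0, "H": 1.007825, "D": 2.014102, "O": 15.994915}

def mono_mass(composition):
    """Sum the monoisotopic mass of a composition such as {"C": 2, "H": 6, "O": 1}."""
    return sum(MASS[element] * count for element, count in composition.items())

ethanol = {"C": 2, "H": 6, "O": 1}             # CH3CH2OH
ethanol_d5 = {"C": 2, "H": 1, "D": 5, "O": 1}  # CD3CD2OH
methanol = {"C": 1, "H": 4, "O": 1}            # CH3OH

# Isotopic pair: five H replaced by five D, about 5.03 Da.
isotopic_shift = mono_mass(ethanol_d5) - mono_mass(ethanol)
# Homologous pair: one extra CH2 group, about 14.02 Da.
homologous_shift = mono_mass(ethanol) - mono_mass(methanol)
```

These are shifts per labeled site; a peptide carrying several labeled groups (for example a C-terminus plus Asp or Glu side chains) would show a multiple of the single-site shift.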
  • Substituents may be introduced into the mass-labeling reagents in order to tune their retention, ionization, and detection properties.
  • Differential labeling progress: The peptide esterification is performed using different alcohols.
  • Figure 2 shows one example: a peptide is differentially labeled by one of the homologous reagent pairs, in this case methanol and ethanol. The physical/chemical properties of those differentially labeled peptide pairs were further tested, and they were found to be very similar in terms of reverse phase LC elution and ionization efficiency. Differentially labeled peptide pairs that differ by one CH2 group serve as ideal mutual internal standards for quantification.
  • FIG. 1 is an illustration of a MALDI MS spectrum of a peptide pair.
  • peptides are differentially esterified by either methanol or ethanol. They have the identical sequence before the labeling.
  • Methods for peptide/protein separation, detection and analysis a. Peptide separation and detection
  • the cutting edge methodology that represents a significant step forward in proteome analysis is the use of multidimensional liquid chromatography coupled to tandem mass spectrometry (LC-LC-MS/MS), which was first developed by Link A. and Yates J. R. (refs. 4, 5, 6) and further improved by Washburn M., Wolters D., and Yates J. R. (ref. 3).
  • the existence and further improvement of this technique are critical factors in the present approach for the application of complex peptide separation and full automation, which makes it the most ideal technology for high throughput proteomics.
  • MudPIT has been previously reported in various incarnations involving reverse phase columns coupled to either cation exchange columns or size exclusion columns (ref. 8). However, it was only when the technique was employed with a mixed bed microcapillary column containing strong cation exchange (SCX) and reverse phase chromatography (RPC) resins that the true utility of MudPIT was demonstrated.
  • a discrete fraction of the adsorbed peptides is displaced from the SCX column onto the RPC column using a step gradient of salt, causing the peptides to be retained on the RPC column while contaminating salts and buffers are washed through.
  • Peptides are then eluted from the RPC column using an acetonitrile gradient, and analyzed by MS/MS. This process is repeated using increasing salt concentration to displace additional fractions from the SCX column. This is applied in an iterative manner, typically involving 10-20 steps, and the MS/MS data from all of the fractions are analyzed by database searching (refs. 9, 10) and combined to give an overall picture of the protein components present in the initial sample.
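Combining the per-fraction search results into one overall picture can be sketched as below. The function name and the input layout (one list of protein/peptide hits per salt step) are hypothetical; real database search output carries scores and other fields that are ignored here.

```python
from collections import defaultdict

def combine_fractions(fraction_hits):
    """Merge per-fraction search results into one protein -> peptides map.

    fraction_hits: list with one entry per salt step; each entry is a list
    of (protein_id, peptide_sequence) tuples (hypothetical layout).
    """
    proteins = defaultdict(set)
    for hits in fraction_hits:
        for protein_id, peptide in hits:
            # A peptide seen in several fractions is counted once.
            proteins[protein_id].add(peptide)
    return {pid: sorted(peptides) for pid, peptides in proteins.items()}
```

Using a set per protein means that a peptide spilling over into two adjacent salt steps does not inflate the evidence for its parent protein.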
  • the MudPIT technique can be run in a fully automated system.
  • the three-dimensional microcapillary columns of the invention are operably linked to tandem mass spectrographs (3D LC LC MS/MS), ion trap mass spectrographs or a combination of tandem mass spectrographs and ion trap mass spectrographs (LC-LCQ-MS/MS or LC-LTQ-MS/MS), as described herein.
  • a three-dimensional microcapillary system of the invention can provide rapid metabolite identification and proteomic profiling to accelerate drug discovery and development. See Example 3, below, and Figures 4, 14 and 22 for exemplary 3D LC apparatus of the invention.
  • the novel three-dimensional microcapillary columns of the invention can be used to improve on MudPIT techniques.
  • the three-dimensional microcapillary columns of the invention also comprise tandem mass spectrometers ("3D LC LC MS/MS", as described herein), an ion trap mass spectrometer (LCQ or LTQ), such as a Finnigan LCQ Deca XP™ or MDLC LTQ™ (Thermo Electron Corporation, San Jose, CA) ion trap mass spectrometer, or Agilent's LC/MSD Trap (Agilent Technologies, Palo Alto, CA), or an equivalent mass spectrometer, or a combination of tandem mass spectrometry and ion trap mass spectrometry ("3D LC LCQ MS/MS" or "3D LC LTQ MS/MS", as described herein).
  • the MDLC LTQ™ is the Finnigan LTQ FT
  • the Agilent LC/MSD Trap is an 1100 series LC/MSD TRAP™, or the LC/MSD Trap SL™, or the LC/MSD Trap XCT™ (Agilent Technologies, Palo Alto, CA).
  • using the 3D LC MS/MS apparatus and methods of the invention, the invention provides a rapid one-fraction protocol for protein extraction, e.g., a rapid one-fraction protocol for extraction, fractionation and/or isolation of proteins of a proteome.
  • the 3D LC MS/MS, 3D LC LCQ MS/MS or 3D LC LTQ MS/MS apparatus and methods of the invention can be used to fractionate/ isolate 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%,
  • the 3D LC MS/MS, 3D LC LCQ MS/MS or 3D LC LTQ MS/MS apparatus and methods of the invention provide a one-fraction protocol to fractionate/ isolate 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21 %, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%,
  • a denatured and reduced protein mixture is digested with trypsin to produce peptide fragments. Without desalting, the mixture is directly loaded onto a microcapillary column containing RPC resin, SCX resin and RPC resin, in that order, and eluted directly into a tandem mass spectrometer. A discrete fraction of the adsorbed peptides is displaced from the first RPC section to the SCX section using a reverse phase gradient (0-X%).
  • This fraction of peptides is retained on the SCX section and then sub-fractionated from the SCX section onto the last RPC section using a step gradient of salt, causing part of the peptides to be eluted and retained on the last RPC section while contaminating salts and buffers are washed through.
  • Peptides are then eluted from the RPC column using the same reverse phase gradient (0-X%), and analyzed by MS/MS. This process is repeated using increasing salt concentration to displace additional sub-fractions from the SCX column, following each step by a reverse phase gradient. Once the whole sequence of salt steps is complete, the next cycle begins with a higher reverse phase gradient (0-Y%, Y>X).
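The nested structure of the three-dimensional run (an outer loop over reverse phase gradient ceilings, an inner loop over salt steps) can be laid out as a simple schedule generator. This is a sketch under stated assumptions: the function name, the tuple layout and the example numbers are hypothetical, and no instrument control is implied.

```python
def three_d_lc_schedule(rp_ceilings_pct, salt_steps_mM):
    """Yield (cycle, step, action, setting) tuples for an RPC-SCX-RPC column.

    rp_ceilings_pct: increasing reverse phase gradient ceilings (X, Y, ...).
    salt_steps_mM:   increasing salt concentrations used within each cycle.
    """
    if list(rp_ceilings_pct) != sorted(rp_ceilings_pct):
        raise ValueError("reverse phase ceilings must increase (Y > X)")
    if list(salt_steps_mM) != sorted(salt_steps_mM):
        raise ValueError("salt steps must increase")
    for cycle, ceiling in enumerate(rp_ceilings_pct, start=1):
        # Move one fraction from the first RPC section onto the SCX section.
        yield (cycle, 0, "rp_gradient", (0, ceiling))
        for step, salt in enumerate(salt_steps_mM, start=1):
            # Displace a sub-fraction from SCX onto the last RPC section.
            yield (cycle, step, "salt_step", salt)
            # Elute the last RPC section into the mass spectrometer.
            yield (cycle, step, "rp_gradient", (0, ceiling))
```

With two ceilings and two salt steps, `list(three_d_lc_schedule([20, 40], [50, 100]))` contains ten entries: one loading gradient plus two salt-step/gradient pairs per cycle.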
  • FIG. 3 illustrates 3D LC set-up and process.
  • the mixed bed multi-dimensional liquid chromatographs of the invention (designated 3D LC MS, or, 3D LC MS/MS; see Example 3, below, and Figures 3, 4, 14 and 22 for exemplary 3D LC apparatus of the invention) are fully automated apparatus techniques using LC in combination with mass spectrometry and database search for highly complex mixtures.
  • the 3D LC MS, or, 3D LC MS/MS of the invention is competitive with the 2D GE technique in the following terms. It is universal, and identifies proteins with extremes in pI and MW and a wide variety of protein classes. It can access hydrophobic proteins. It has high sensitivity and peak capacity, and gives a dynamic range greater than 10,000 to 1. It is time and labor efficient with its automatic workflow.
  • the mixed bed multi-dimensional liquid chromatographs (e.g., 3D LC) of the invention play an important role in both qualitative proteomics and quantitative proteomics in combination with the novel tagging method (see Examples 3, 4, and 5, below).
  • the chromatographs and methods of the invention are used to analyze the entire proteome of a cell, e.g., a microorganism, such as Bacillus anthracis and Desulfovibrio vulgaris.
  • Sequence analysis and quantification: Both the quantity and the sequence identity of the protein from which the modified peptide originated are determined by multistage MS. This is achieved by the operation of the mass spectrometer in a dual mode in which it alternates in successive scans between measuring the relative quantities of peptides eluting from the capillary column and recording the sequence information of selected peptides.
  • Peptides are quantified by measuring in the MS mode the relative signal intensities for pairs or series of peptide ions of identical sequence that are tagged differentially, which therefore differ in mass by the mass differential encoded within the differential labeling reagents.
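Pairing differentially tagged ions by their encoded mass differential can be sketched as a search over one MS scan. The function name, the default charge state and the m/z tolerance are assumptions for illustration, and the sketch handles a single labeled site per peptide only.

```python
def find_labeled_pairs(peaks, mass_shift, charge=1, tol=0.02):
    """Pair light and heavy peptide ions in one MS scan.

    peaks:      list of (mz, intensity) tuples.
    mass_shift: mass difference encoded by the labels in Da
                (e.g. about 14.0157 for one CH2 group).
    charge:     assumed charge state of both ions (hypothetical default).
    tol:        assumed m/z tolerance (hypothetical default).
    Returns (light_mz, heavy_mz, heavy_to_light_ratio) tuples.
    """
    delta = mass_shift / charge  # expected spacing on the m/z axis
    peaks = sorted(peaks)
    pairs = []
    for i, (mz_light, int_light) in enumerate(peaks):
        for mz_heavy, int_heavy in peaks[i + 1:]:
            gap = mz_heavy - mz_light
            if gap > delta + tol:
                break  # peaks are sorted, so later ones are even further away
            if abs(gap - delta) <= tol and int_light > 0:
                pairs.append((mz_light, mz_heavy, int_heavy / int_light))
    return pairs
```

In practice the ratio would be taken from integrated chromatographic areas rather than from a single scan; the single-scan intensity ratio is used here only to keep the example short.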
  • Peptide sequence information is automatically generated by selecting peptide ions of a particular mass-to-charge (m/z) ratio for collision-induced dissociation (CID) in the mass spectrometer operating in the tandem MS mode (refs. 6, 11, 12).
  • a program such as SEQUEST™ (Thermo Finnigan, San Jose, CA) or equivalent, e.g., U.S. Patent Nos. 6,017,693 and 5,538,897, can be used to inspect/analyze the spectra with multiple peaks (e.g., more than 7 peaks/spectrum) for potential duplicates (see discussion in Example 3, below).
  • the spectra comparisons are carried out using a dot-product criterion, e.g., as in Stein (1997) Am. Soc. Mass Spectrom. 5:859, in combination with the retention time, precursor m/z constraints, and index-peak matching.
  • data acquired from the differentially labeled peptides are subjected to the following exemplary data analysis algorithm of the invention: 1.
  • Component extraction comprising the following sub-steps: a. For every MS spectrum from the beginning of the LC elution, select the "significant" ions, which are above the local noise background and contain predominantly 12C isotopes. b. For every "significant" ion, generate a "selected ion chromatogram" using the neighboring MS spectra. In one aspect, the width of the region is at least 2X the expected width of the peptide elution (D0). c. Determine the peak location, quality, area and baseline level based on the "selected ion chromatogram". d.
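Sub-steps b and c (building a selected ion chromatogram and deriving peak location, baseline and area from it) can be sketched as follows. The function names, the m/z window and the crude minimum-based baseline are assumptions made for illustration; the figures referenced by the source may define these quantities differently.

```python
def selected_ion_chromatogram(scans, target_mz, tol=0.5):
    """Build (time, intensity) points for one ion across neighboring MS scans.

    scans: list of (retention_time, [(mz, intensity), ...]) in time order.
    tol:   assumed half-width of the m/z window (hypothetical default).
    """
    trace = []
    for rt, peaks in scans:
        signal = sum(inten for mz, inten in peaks if abs(mz - target_mz) <= tol)
        trace.append((rt, signal))
    return trace

def peak_summary(trace):
    """Return (apex_time, baseline, baseline-subtracted area) for a trace."""
    baseline = min(inten for _, inten in trace)  # crude baseline: lowest point
    apex_rt = max(trace, key=lambda point: point[1])[0]
    area = 0.0
    for (t0, i0), (t1, i1) in zip(trace, trace[1:]):
        # Trapezoid rule on the baseline-subtracted signal.
        area += 0.5 * ((i0 - baseline) + (i1 - baseline)) * (t1 - t0)
    return apex_rt, baseline, area
```

The scans passed in would be restricted to the region around the ion's detection time, which the text sets to at least twice the expected elution width.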
  • the spectra equivalency is declared if the spectra pair satisfies the following requirements: 1. Their precursor m/z values are within the pre-defined tolerance; 2. Their elution times are within a pre-defined tolerance; 3. Their "signature" peaks achieve a pre-defined degree of match; and 4. Their "dot-products" in both forward and backward directions exceed pre-defined thresholds, b.
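The four requirements can be sketched as a single predicate. Several details are assumptions, not taken from the source: "signature" peaks are read as the most intense binned peaks, the "forward" and "backward" dot products are read as cosine similarities restricted to the peaks of the first and second spectrum respectively, and every threshold and the spectrum dictionary layout are hypothetical.

```python
from math import sqrt

def binned(peaks, width=1.0):
    """Collapse (mz, intensity) peaks into {bin_index: summed intensity}."""
    out = {}
    for mz, inten in peaks:
        key = round(mz / width)
        out[key] = out.get(key, 0.0) + inten
    return out

def directed_dot(va, vb):
    """Cosine similarity computed over the bins occupied in va only."""
    num = sum(v * vb.get(k, 0.0) for k, v in va.items())
    den = sqrt(sum(v * v for v in va.values())) * sqrt(
        sum(vb.get(k, 0.0) ** 2 for k in va)
    )
    return num / den if den else 0.0

def signature(vec, n):
    """Bins of the n most intense peaks (assumed meaning of 'signature')."""
    return set(sorted(vec, key=vec.get, reverse=True)[:n])

def equivalent(a, b, mz_tol=1.0, rt_tol=60.0, n_sig=3, sig_min=0.6, dot_min=0.7):
    """Apply the four requirements to spectra a and b (dicts with
    'precursor_mz', 'rt' and 'peaks'); all thresholds are hypothetical."""
    if abs(a["precursor_mz"] - b["precursor_mz"]) > mz_tol:   # requirement 1
        return False
    if abs(a["rt"] - b["rt"]) > rt_tol:                       # requirement 2
        return False
    va, vb = binned(a["peaks"]), binned(b["peaks"])
    if not va or not vb:
        return False
    shared = len(signature(va, n_sig) & signature(vb, n_sig))
    if shared / n_sig < sig_min:                              # requirement 3
        return False
    # Requirement 4: both directions must pass.
    return directed_dot(va, vb) >= dot_min and directed_dot(vb, va) >= dot_min
```

Checking the cheap precursor and elution-time constraints first avoids computing dot products for spectra that cannot be duplicates.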
  • the duplicated spectra are merged based on the m/z position of the peaks.
  • the elution times of the first (T1) and last (T2) spectra are stored as a part of the description of the merged spectrum.
  • the intensity of the precursor ions is calculated from the MS1 spectra by integrating the region where the precursor ions are detected.
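Merging a group of duplicate spectra by peak position while recording T1 and T2 can be sketched as below. The bin width, the summing of intensities within a bin, the averaging of precursor m/z and the dictionary layout are assumptions made for illustration.

```python
def merge_spectra(spectra, width=1.0):
    """Merge spectra already judged equivalent into one record.

    spectra: list of dicts with 'precursor_mz', 'rt' and 'peaks'
             (hypothetical layout).
    width:   assumed m/z bin width used to decide that two peaks share
             a position.
    """
    ordered = sorted(spectra, key=lambda s: s["rt"])
    bins = {}
    for spec in ordered:
        for mz, inten in spec["peaks"]:
            key = round(mz / width)
            bins[key] = bins.get(key, 0.0) + inten  # sum co-located peaks
    return {
        "precursor_mz": sum(s["precursor_mz"] for s in ordered) / len(ordered),
        "peaks": [(key * width, inten) for key, inten in sorted(bins.items())],
        "T1": ordered[0]["rt"],    # elution time of the first spectrum
        "T2": ordered[-1]["rt"],   # elution time of the last spectrum
    }
```

The stored T1 and T2 bound the time window over which the precursor ion intensity can later be integrated from the MS1 scans.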
  • FIG. 17 is a schematic, a flow chart, illustrating an exemplary data analysis algorithm of the invention for quantitative proteomics.
  • Figure 18 is a schematic, a flow chart, illustrating the "component extraction" section of the exemplary data analysis algorithm for quantitative proteomics illustrated in Figure 17.
  • Figure 19 is a schematic, a flow chart, illustrating the "precursor integration" section of the exemplary data analysis algorithm for quantitative proteomics illustrated in Figure 17.
  • Figure 20 is a schematic, a flow chart, illustrating the "spectra comparison" section of the exemplary data analysis algorithm for quantitative proteomics as illustrated in Figure 19.
  • Figure 21 is a schematic, a flow chart, illustrating the "identity and merge of duplicates LC-MS spectra" section of the exemplary data analysis algorithm for quantitative proteomics as illustrated in Figure 19.
  • the invention provides data analysis algorithms as illustrated in Figure 17, and further described in Figures 18 to 21, in whole, and/or, in part.
  • the data analysis algorithm described in Figure 17, and further described in Figures 18 to 21, in whole, or, in part, can be used to analyze data generated by the systems and methods of the invention. For example, this analysis can be used to reconstruct a series of differentially labeled peptides based on a predictable elution behavior in combination with the predicted mass differences, which can be generated by the systems and methods of the invention.
  • this algorithm in whole, or, in part, can be used to analyze data generated by other applications, e.g., to analyze data generated by any LC, MS, LC-MS or other analytical system.
  • Computer Systems and computer program products: In one aspect, the invention provides computer program products comprising computer-implemented methods and/or programs comprising data analysis algorithms as described in Figure 17, and further described in Figures 18 to 21, in whole, or, in part.
  • the invention provides computer systems, e.g., comprising computer program products, operably linked to the multidimensional columns of the invention, or the 3D LC LC MS/MS or 3D LC LCQ MS/MS systems of the invention.
  • the invention provides a storage medium (e.g., a diskette, a tape, a CD, a hard drive, a memory chip) with a computer program of the invention (e.g., a computer-implemented method, a data-analysis algorithm of the invention) stored thereon.
  • the invention provides computer program products comprising a computer useable medium having computer program logic recorded thereon, where computer program code logic is configured to perform operations comprising the computer-implemented methods, the data-analysis algorithms, of the invention.
  • the invention provides computer systems comprising a processor and a computer program product of the invention.
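
The reconstruction of a series of differentially labeled peptides from predictable elution behavior and predicted mass differences, mentioned above, can be illustrated with a minimal sketch. This is not the algorithm of Figures 17 to 21, which are not reproduced in this text. The function name `find_labeled_series`, the greedy chaining, the tolerances, and the assumption that each successive label adds a fixed mass and elutes slightly later are all hypothetical choices made for illustration.

```python
def find_labeled_series(peaks, mass_step, mass_tol=0.02, max_rt_shift=2.0, min_members=2):
    """Group peaks into candidate series of differentially labeled peptides.

    peaks: list of (mass, retention_time) tuples.
    mass_step: assumed mass difference between successive labels (hypothetical).
    A peak extends a series when its mass exceeds the previous member by
    mass_step (within mass_tol) and it elutes no earlier than, and at most
    max_rt_shift after, the previous member (an assumed elution rule).
    """
    peaks = sorted(peaks)  # ascending mass
    used = set()
    series_list = []
    for i in range(len(peaks)):
        if i in used:
            continue
        chain = [i]
        current_mass, current_rt = peaks[i]
        while True:
            next_index = None
            for j in range(chain[-1] + 1, len(peaks)):
                if j in used:
                    continue
                mass, rt = peaks[j]
                mass_ok = abs(mass - current_mass - mass_step) <= mass_tol
                rt_ok = 0 <= rt - current_rt <= max_rt_shift
                if mass_ok and rt_ok:
                    next_index = j
                    break
            if next_index is None:
                break
            chain.append(next_index)
            current_mass, current_rt = peaks[next_index]
        if len(chain) >= min_members:
            used.update(chain)
            series_list.append([peaks[k] for k in chain])
    return series_list


# Illustrative values only; 14.01565 is the mass of one CH2 unit, used here as an assumed step.
example_peaks = [
    (1000.50, 20.0),
    (1014.52, 20.8),
    (1028.53, 21.5),
    (1200.00, 35.0),
    (1014.51, 40.0),  # right mass, wrong elution time: not part of the series
]
example_series = find_labeled_series(example_peaks, mass_step=14.01565)
```

Combining both criteria matters in this sketch: the peak at retention time 40.0 matches the expected mass step but is excluded because its elution time is inconsistent with the rest of the series.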

Landscapes

  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention provides mixed bed multi-dimensional liquid chromatographs and methods of making and using them. The invention provides systems comprising the mixed bed multi-dimensional liquid chromatographs of the invention operatively linked to mass spectrometry devices. The invention provides novel systems and methods for determining polypeptide profiles and protein expression variations, as with proteome analyses. The present invention provides systems and methods for simultaneously identifying and quantifying individual proteins in complex protein mixtures by selective differential labeling of amino acid residues followed by chromatographic and mass spectrographic analysis. The invention also provides computer program products and computer implemented methods for practicing the systems and methods of the invention.

Description

MIXED BED MULTI-DIMENSIONAL CHROMATOGRAPHY SYSTEMS AND METHODS OF MAKING AND USING THEM
RELATED APPLICATIONS This application claims the benefit of priority under 35 U.S.C. § 119(e) of U.S. Provisional Application Nos. 60/496,540, filed June 6, 2003, and 60/492,027, filed August 01, 2003. Each of the aforementioned applications is explicitly incorporated herein by reference in its entirety and for all purposes.
TECHNICAL FIELD This invention relates to proteomics and mass spectrometry technology. In particular, the invention provides novel systems and methods for determining polypeptide profiles and protein expression variations, as with proteome analyses. The present invention provides systems and methods for simultaneously identifying and quantifying individual proteins in complex protein mixtures by selective differential labeling of amino acid residues followed by chromatographic and mass spectrographic analysis. The invention also provides computer program products and computer implemented methods for practicing the systems and methods of the invention.
BACKGROUND The predisposition for, or diagnosis and treatment of, a variety of diseases and disorders may often be accomplished through identification and quantitative measurement of polypeptide expression variations between different cell types and cell states. Biochemical pathways and metabolic networks can also be analyzed by globally and quantitatively measuring protein expression in various cell types and biological states (see, e.g., Ideker (2001) Science 292:929-934). State-of-the-art techniques such as liquid-chromatography-electrospray-ionization tandem mass spectrometry have, in conjunction with database-searching computer algorithms, revolutionized the analysis of biochemical species from complex biological mixtures. With these techniques, it is now possible to perform high-throughput protein identification at picomolar to subpicomolar levels from complex mixtures of biological molecules (see, e.g., Dongre (1997) Trends Biotechnol. 15:418-425). One such method is based on a class of chemical reagents termed isotope-coded affinity tags (ICATs) and tandem mass spectrometry. The method labels multiple cysteinyl residues and uses stable isotope dilution techniques. For example, Gygi (1999) Nat. Biotechnol. 17:994-999, compared protein expression in a yeast using ethanol or galactose as a carbon source. The measured differences in protein expression correlated with known yeast metabolic function under glucose-repressed conditions. In another technique, two different protein mixtures for quantitative comparison are digested to peptide mixtures, the peptide mixtures are separately methylated using either d0- or d3-methanol, and the mixtures of methylated peptides are combined and subjected to microcapillary HPLC-MS/MS (see, e.g., Goodlett, David R., et al., (2000) "Differential Isotopic Labeling of Peptides for Global Quantification of Proteins and de novo Sequence Derivation," 49th ASMS).
Parent proteins of methylated peptides are identified by correlative database searching of fragment ion spectra using computer program-assisted paradigms or automated de novo sequencing that compares all tandem mass spectra of d0- and d3-methylated peptide ion pairs. In Goodlett (2000) supra, ratios of proteins in two different mixtures were calculated for d0- to d3-methylated peptide pairs. However, there are several limitations to this approach, including: the use of differential labeling reagents that rely on stable isotopes, which are expensive and not flexible enough for differential labeling of more than two mixtures of peptides; labeling methods limited to methylation of carboxy-termini; protein expression profiling limited to duplex comparison; and the use of one-dimensional capillary HPLC to separate peptides, which does not have enough capacity and resolving power for complex mixtures of peptides.
Screening and Selection
Overview of screening and selection
Screening is, in general, a two-step process in which one first determines which cells do and do not express a screening marker and then physically separates the cells having the desired property. Screening markers include, for example, luciferase, beta-galactosidase, and green fluorescent protein. Screening can also be done by observing a cell holistically, including but not limited to utilizing methods pertaining to genomics, RNA profiling, proteomics, metabolomics, and lipidomics, as well as observing such aspects of growth as colony size, halo formation, etc. Additionally, screening for production of a desired compound, such as a therapeutic drug or "designer chemical", can be accomplished by observing binding of cell products to a receptor or ligand, such as on a solid support or on a column. Such screening can additionally be accomplished by binding to antibodies, as in an ELISA.
In some instances the screening process can be automated so as to allow screening of suitable numbers of colonies or cells. Some examples of automated screening devices include fluorescence activated cell sorting (FACS), especially in conjunction with cells immobilized in agarose (see Powell et al. Bio/Technology 8:333-337 (1990); Weaver et al. Methods 2:234-247 (1991)), automated ELISA assays, scintillation proximity assays (Hart, H.E. et al., Molecular Immunol. 16:265-267 (1979)) and the formation of fluorescent, colored or UV absorbing compounds on agar plates or in microtiter wells (Krawiec, S., Devel. Indust. Microbiology 31:103-114 (1990)). Selection is a form of screening in which identification and physical separation are achieved simultaneously, for example, by expression of a selectable marker, which, in some genetic circumstances, allows cells expressing the marker to survive while other cells die (or vice versa). Selectable markers can include, for example, drug resistance, toxin resistance, or nutrient synthesis genes. Selection is also done by such techniques as growth on a toxic substrate to select for hosts having the ability to detoxify a substrate, growth on a new nutrient source to select for hosts having the ability to utilize that nutrient source, competitive growth in culture based on ability to utilize a nutrient source, etc. In particular, uncloned but differentially expressed proteins (e.g., those induced in response to new compounds, such as biodegradable pollutants in the medium) can be screened by differential display (Appleyard et al. Mol. Gen. Genet. 247:338-342 (1995)). Hopwood (Phil Trans R. Soc. Lond B 324:549-562) provides a review of screens for antibiotic production. Omura (Microbiol. Rev. 50:259-279 (1986)) and Nisbet (Ann. Rev. Med. Chem. 21:149-157 (1986)) disclose screens for antimicrobial agents, including supersensitive bacteria, detection of beta-lactamase and D,D-carboxypeptidase inhibition, beta-lactamase induction, chromogenic substrates and monoclonal antibody screens. Antibiotic targets can also be used as screening targets in high throughput screening. Antifungals are typically screened by inhibition of fungal growth. Pharmacological agents can be identified as enzyme inhibitors using plates containing the enzyme and a chromogenic substrate, or by automated receptor assays. Hydrolytic enzymes (e.g., proteases, amylases) can be screened by including the substrate in an agar plate and scoring for a hydrolytic clear zone or by using a colorimetric indicator (Steele et al. Ann. Rev. Microbiol. 45:89-106 (1991)). This can be coupled with the use of stains to detect the effects of enzyme action (such as congo red to detect the extent of degradation of celluloses and hemicelluloses). Tagged substrates can also be used. For example, lipases and esterases can be screened using different lengths of fatty acids linked to umbelliferyl. The action of lipases or esterases removes this tag from the fatty acid, resulting in a quenching or enhancement of umbelliferyl fluorescence. These enzymes can be screened in microtiter plates by a robotic device.
High-throughput cellular screening: utilizing various types of "omics"
Functional genomics seeks to discover gene function once nucleotide sequence information is available. Proteomics (the study of protein properties such as expression, post-translational modifications, interactions, etc.) and metabolomics (analysis of metabolite pools) are fast-emerging fields complementing functional genomics that provide a global, integrated view of cellular processes. The variety of techniques and methods used in this effort includes the use of bioinformatics, gene-array chips, mRNA differential display, disease models, protein discovery and expression, and target validation. The ultimate goal of many of these efforts has been to develop high-throughput screens for genes of unknown function. For review see Greenbaum D. et al. Genome Res, 11(9): 1463-8 (2001).
Genomics
Genomics can refer to various investigative techniques that are broad in scope but often refers to measuring gene expression for multitudes of genes simultaneously. For a review see Lockhart, D.J. and Winzeler, E.A. 2000. Genomics, gene expression and DNA arrays. Nature, 405 (6788): 827-36.
Biological Chips
General considerations
In some systems, an oligonucleotide probe is tethered, i.e., by covalent attachment, to a solid support, and arrays of oligonucleotide probes immobilized on solid supports have been used to detect specific nucleic acid sequences in a target nucleic acid. See, e.g., PCT patent publication Nos. WO 89/10977 and 89/11548. Others have proposed the use of large numbers of oligonucleotide probes to provide the complete nucleic acid sequence of a target nucleic acid but failed to provide an enabling method for using arrays of immobilized probes for this purpose. See U.S. Patent Nos. 5,202,231 and 5,002,867 and PCT patent publication No. WO 93/17126. See U.S. Patent No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092.
Microfabricated arrays of large numbers of oligonucleotide probes, called "DNA chips", offer great promise for a wide variety of applications. New methods and reagents are required to realize this promise.
Informatics
Informatics is the study and application of computer and statistical techniques to the management of information. In genome projects, bioinformatics includes the development of methods to search databases quickly, to analyze nucleic acid sequence information, and to predict protein sequence, structure and function from DNA sequence data. Increasingly, molecular biology is shifting from the laboratory bench to the computer desktop. Today's researchers require advanced quantitative analyses, database comparisons, and computational algorithms to explore the relationships between sequence and phenotype. Thus, by all accounts, researchers cannot and will not be able to avoid using computer resources to explore gene expression, gene sequencing and molecular structure. One use of bioinformatics involves studying an organism's genome to determine the sequence and placement of its genes and their relationship to other sequences and genes within the genome or to genes in other organisms. Another use of bioinformatics involves studying genes differentially or commonly expressed in different tissues or cell lines (e.g., normal and cancerous tissue). Such information is of significant interest in biomedical and pharmaceutical research, for instance to assist in the evaluation of drug efficacy and resistance. The sequence tag method involves generation of a large number (e.g., thousands) of Expressed Sequence Tags ("ESTs") from cDNA libraries (each produced from a different tissue or sample). ESTs are partial transcript sequences that may cover different parts of the cDNA(s) of a gene, depending on cloning and sequencing strategy. Each EST includes about 50 to 300 nucleotides.
If it is assumed that the number of tags is proportional to the abundance of transcripts in the tissue or cell type used to make the cDNA library, then any variation in the relative frequency of those tags, stored in computer databases, can be used to detect the differential abundance and potentially the expression of the corresponding genes. To make genomic and EST information manipulation easy to perform and understand, sophisticated computer database systems have been developed. In one database system, developed by Incyte Pharmaceuticals, Inc. of Palo Alto, CA, genomic sequence data and the abundance levels of mRNA species represented in a given sample are electronically recorded and annotated with information available from public sequence databases such as GenBank. Examples of such databases include GenBank (NCBI) and TIGR. The resulting information is stored in a relational database that may be employed to determine relationships between sequences and genes within and among genomes, to establish a cDNA profile for a given tissue, and to evaluate changes in gene expression caused by disease progression, pharmacological treatment, aging, etc. Genetic information for a number of organisms has been catalogued in computer databases. Genetic databases for organisms such as Escherichia coli, Haemophilus influenzae, Mycoplasma genitalium, and Mycoplasma pneumoniae, among others, are publicly available.
At present, however, complete sequence data is available for relatively few species, and the ability to manipulate sequence data within and between species and databases is limited. While genetic data processing and relational database systems such as those developed by Incyte Pharmaceuticals, Inc. provide great power and flexibility in analyzing genetic information and gene expression information, this area of technology is still in its infancy and further improvements in genetic data processing and relational database systems and their content will help accelerate biological research for numerous applications.
SUMMARY The invention provides methods for cellular screening, including cellular screening in genomics, e.g., as in high throughput genomics. "High throughput genomics" refers to application of genomic or genetic data or analysis techniques that use microarrays or other genomic technologies to rapidly identify large numbers of genes or proteins, or distinguish their structure, expression or function from normal or abnormal cells or tissues. In the methods of the invention, an observer can be a person viewing a slide with a microscope or an observer who views digital images. Alternatively, an observer can be a computer-based image analysis system, which automatically observes, analyses and quantitates biological arrayed samples with or without user interaction. The present invention provides for the use of arrays of oligonucleotide probes immobilized in microfabricated patterns on silica chips for analyzing molecular interactions of biological interest. The invention provides several strategies employing immobilized arrays of probes for comparing a reference sequence of known sequence with a target sequence showing substantial similarity with the reference sequence, but differing in the presence of, e.g., mutations. In one aspect, the invention provides a tiling strategy employing an array of immobilized oligonucleotide probes comprising at least two sets of probes.
A first probe set comprises a plurality of probes, each probe comprising a segment of at least three nucleotides exactly complementary to a subsequence of the reference sequence, the segment including at least one interrogation position complementary to a corresponding nucleotide in the reference sequence. A second probe set comprises a corresponding probe for each probe in the first probe set, the corresponding probe in the second probe set being identical to a sequence comprising the corresponding probe from the first probe set or a subsequence of at least three nucleotides thereof that includes the at least one interrogation position, except that the at least one interrogation position is occupied by a different nucleotide in each of the two corresponding probes from the first and second probe sets. The probes in the first probe set have at least two interrogation positions corresponding to two contiguous nucleotides in the reference sequence. One interrogation position corresponds to one of the contiguous nucleotides, and the other interrogation position to the other. In another aspect, the invention provides a tiling strategy employing an array comprising four probe sets. A first probe set comprises a plurality of probes, each probe comprising a segment of at least three nucleotides exactly complementary to a subsequence of the reference sequence, the segment including at least one interrogation position complementary to a corresponding nucleotide in the reference sequence. Second, third and fourth probe sets each comprise a corresponding probe for each probe in the first probe set. 
The probes in the second, third and fourth probe sets are identical to a sequence comprising the corresponding probe from the first probe set or a subsequence of at least three nucleotides thereof that includes the at least one interrogation position, except that the at least one interrogation position is occupied by a different nucleotide in each of the four corresponding probes from the four probe sets. The first probe set can have at least 100 interrogation positions corresponding to 100 contiguous nucleotides in the reference sequence. The first probe set can have an interrogation position corresponding to every nucleotide in the reference sequence. The segment of complementarity within the probe set is usually about 9 to 21 nucleotides. Although probes may contain leading or trailing sequences in addition to the 9-21 nucleotide segment, many probes consist exclusively of a 9-21 nucleotide segment of complementarity. In another aspect, the invention provides immobilized arrays of probes tiled for multiple reference sequences. One such array comprises at least one pair of first and second probe groups, each group comprising first and second sets of probes as defined in the first aspect. Each probe in the first probe set from the first group is exactly complementary to a subsequence of a first reference sequence, and each probe in the first probe set from the second group is exactly complementary to a subsequence of a second reference sequence. Thus, the first group of probes is tiled with respect to a first reference sequence and the second group of probes with respect to a second reference sequence. Each group of probes can also include third and fourth sets of probes as defined in the second aspect. In some arrays of this type, the second reference sequence is a mutated form of the first reference sequence. In another aspect, the invention provides arrays for block tiling. Block tiling is a species of the general tiling strategies described above.
The usual unit of a block tiling array is a group of probes comprising a wildtype probe, a first set of three mutant probes and a second set of three mutant probes. The wildtype probe comprises a segment of at least three nucleotides exactly complementary to a subsequence of a reference sequence. The segment has at least first and second interrogation positions corresponding to first and second nucleotides in the reference sequence. The probes in the first set of three mutant probes are each identical to a sequence comprising the wildtype probe or a subsequence of at least three nucleotides thereof including the first and second interrogation positions, except in the first interrogation position, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. The probes in the second set of three mutant probes are each identical to a sequence comprising the wildtype probes or a subsequence of at least three nucleotides thereof including the first and second interrogation positions, except in the second interrogation position, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. In another aspect, the invention provides methods of comparing a target sequence with a reference sequence using arrays of immobilized pooled probes. The arrays employed in these methods represent a further species of the general tiling arrays noted above. In these methods, variants of a reference sequence differing from the reference sequence in at least one nucleotide are identified and each is assigned a designation. An array of pooled probes is provided, with each pool occupying a separate cell of the array. Each pool comprises a probe comprising a segment exactly complementary to each variant sequence assigned a particular designation. The array is then contacted with a target sequence comprising a variant of the reference sequence. 
The relative hybridization intensities of the pools in the array to the target sequence are determined. The identity of the target sequence is deduced from the pattern of hybridization intensities. Often, each variant is assigned a designation having at least one digit and at least one value for the digit. In this case, each pool comprises a probe comprising a segment exactly complementary to each variant sequence assigned a particular value in a particular digit. When variants are assigned successive numbers in a numbering system of base m having n digits, n x (m-1) pooled probes are used to assign each variant a designation. In another aspect, the invention provides a pooled probe for trellis tiling, a further species of the general tiling strategy. In trellis tiling, the identity of a nucleotide in a target sequence is determined from a comparison of hybridization intensities of three pooled trellis probes. A pooled trellis probe comprises a segment exactly complementary to a subsequence of a reference sequence except at a first interrogation position occupied by a pooled nucleotide N, a second interrogation position occupied by a pooled nucleotide selected from the group of three consisting of (1) M or K, (2) R or Y and (3) S or W, and a third interrogation position occupied by a second pooled nucleotide selected from the group. The pooled nucleotide occupying the second interrogation position comprises a nucleotide complementary to a corresponding nucleotide from the reference sequence when the second pooled probe and reference sequence are maximally aligned, and the pooled nucleotide occupying the third interrogation position comprises a nucleotide complementary to a corresponding nucleotide from the reference sequence when the third pooled probe and the reference sequence are maximally aligned. Standard IUPAC nomenclature is used for describing pooled nucleotides.
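
As a brief illustration of the IUPAC pooled-nucleotide codes named above, the sketch below expands a pooled probe into its member sequences and shows that one code taken from each of two different pairs (M/K, R/Y, S/W) shares exactly one base. This property is offered only as a plausible reason why comparing pooled probes can resolve a single nucleotide; it is not presented as the procedure of the invention. The helper names `expand_pooled_probe` and `shared_bases` are hypothetical.

```python
from itertools import product

# Standard IUPAC codes for the pooled nucleotides mentioned in the text.
IUPAC = {
    "M": {"A", "C"}, "K": {"G", "T"},
    "R": {"A", "G"}, "Y": {"C", "T"},
    "S": {"C", "G"}, "W": {"A", "T"},
    "N": {"A", "C", "G", "T"},
}
PAIRS = [("M", "K"), ("R", "Y"), ("S", "W")]


def expand_pooled_probe(probe):
    """List every individual sequence represented by a pooled probe."""
    choices = [sorted(IUPAC.get(symbol, {symbol})) for symbol in probe]
    return ["".join(combo) for combo in product(*choices)]


def shared_bases(code_a, code_b):
    """Return the bases common to two pooled-nucleotide codes."""
    return IUPAC[code_a] & IUPAC[code_b]


example_pool = expand_pooled_probe("ANR")  # 1 x 4 x 2 = 8 member sequences
example_shared = shared_bases("M", "R")    # M = A/C, R = A/G
```

Each pair splits the four bases into two complementary halves, so two codes drawn from different pairs always overlap in a single base.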
In trellis tiling, an array comprises at least first, second and third cells, respectively occupied by first, second and third pooled probes, each according to the generic description above. However, the segment of complementarity, location of interrogation positions, and selection of pooled nucleotide at each interrogation position may or may not differ among the three pooled probes, subject to the following constraint. One of the three interrogation positions in each of the three pooled probes must align with the same corresponding nucleotide in the reference sequence. This interrogation position must be occupied by an N in one of the pooled probes, and a different pooled nucleotide in each of the other two pooled probes. In another aspect, the invention provides arrays for bridge tiling. Bridge tiling is a species of the general tiling strategies noted above, in which probes from the first probe set contain more than one segment of complementarity. In bridge tiling, a nucleotide in a reference sequence is usually determined from a comparison of four probes. A first probe comprises at least first and second segments, each of at least three nucleotides and exactly complementary to first and second subsequences, respectively, of a reference sequence. The segments include at least one interrogation position corresponding to a nucleotide in the reference sequence. Either
(1) the first and second subsequences are noncontiguous in the reference sequence, or
(2) the first and second subsequences are contiguous and the first and second segments are inverted relative to the first and second subsequences. The arrays of the invention can further comprise second, third and fourth probes, which are identical to a sequence comprising the first probe or a subsequence thereof comprising at least three nucleotides from each of the first and second segments, except in the at least one interrogation position, which differs in each of the probes. In a species of bridge tiling, referred to as deletion tiling, the first and second subsequences are separated by one or two nucleotides in the reference sequence. In another aspect, the invention provides arrays of probes for multiplex tiling. Multiplex tiling is a strategy in which the identity of two nucleotides in a target sequence is determined from a comparison of the hybridization intensities of four probes, each having two interrogation positions. Each of the probes comprises a segment of at least 7 nucleotides that is exactly complementary to a subsequence from a reference sequence, except that the segment may or may not be exactly complementary at two interrogation positions. The nucleotides occupying the interrogation positions are selected by the following rules: (1) the first interrogation position is occupied by a different nucleotide in each of the four probes, (2) the second interrogation position is occupied by a different nucleotide in each of the four probes, (3) in first and second probes, the segment is exactly complementary to the subsequence, except at no more than one of the interrogation positions, (4) in third and fourth probes, the segment is exactly complementary to the subsequence, except at both of the interrogation positions. In another aspect, the invention provides arrays of immobilized probes including helper mutations. Helper mutations are useful for, e.g., preventing self-annealing of probes having inverted repeats.
In this strategy, the identity of a nucleotide in a target sequence is usually determined from a comparison of four probes. A first probe comprises a segment of at least 7 nucleotides exactly complementary to a subsequence of a reference sequence except at one or two positions, the segment including an interrogation position not at the one or two positions. The one or two positions are occupied by helper mutations. Second, third and fourth mutant probes are each identical to a sequence comprising the wildtype probe or a subsequence thereof including the interrogation position and the one or two positions, except in the interrogation position, which is occupied by a different nucleotide in each of the four probes. In another aspect, the invention provides arrays of probes comprising at least two probe sets, but lacking a probe set comprising probes that are perfectly matched to a reference sequence. Such arrays are usually employed in methods in which both reference and target sequence are hybridized to the array. The first probe set comprises a plurality of probes, each probe comprising a segment exactly complementary to a subsequence of at least 3 nucleotides of a reference sequence except at an interrogation position. The second probe set comprises a corresponding probe for each probe in the first probe set, the corresponding probe in the second probe set being identical to a sequence comprising the corresponding probe from the first probe set or a subsequence of at least three nucleotides thereof that includes the interrogation position, except that the interrogation position is occupied by a different nucleotide in each of the two corresponding probes and the complement to the reference sequence. In another aspect, the invention provides methods of comparing a target sequence with a reference sequence comprising a predetermined sequence of nucleotides using any of the arrays described above.
The methods comprise hybridizing the target nucleic acid to an array and determining which probes, relative to one another, in the array bind specifically to the target nucleic acid. The relative specific binding of the probes indicates whether the target sequence is the same or different from the reference sequence. In some such methods, the target sequence has a substituted nucleotide relative to the reference sequence in at least one undetermined position, and the relative specific binding of the probes indicates the location of the position and the nucleotide occupying the position in the target sequence. In some methods, a second target nucleic acid is also hybridized to the array. The relative specific binding of the probes then indicates both whether the target sequence is the same or different from the reference sequence, and whether the second target sequence is the same or different from the reference sequence. In some methods, when the array comprises two groups of probes tiled for first and second reference sequences, respectively, the relative specific binding of probes in the first group indicates whether the target sequence is the same or different from the first reference sequence. The relative specific binding of probes in the second group indicates whether the target sequence is the same or different from the second reference sequence. Such methods are particularly useful for analyzing heterologous alleles of a gene. Some methods entail hybridizing both a reference sequence and a target sequence to any of the arrays of probes described above. Comparison of the relative specific binding of the probes to the reference and target sequences indicates whether the target sequence is the same or different from the reference sequence. In another aspect, the invention provides arrays of immobilized probes in which the probes are designed to tile a reference sequence from a human immunodeficiency virus. 
Reference sequences from either the reverse transcriptase gene or protease gene of HIV are of particular interest. Some chips further comprise arrays of probes tiling a reference sequence from a 16S RNA or DNA encoding the 16S RNA from a pathogenic microorganism. The invention further provides methods of using such arrays in analyzing an HIV target sequence. The methods are particularly useful where the target sequence has a substituted nucleotide relative to the reference sequence in at least one position, the substitution conferring resistance to a drug used in treating a patient infected with an HIV virus. The methods reveal the existence of the substituted nucleotide. The methods are also particularly useful for analyzing a mixture of undetermined proportions of first and second target sequences from different HIV variants. The relative specific binding of probes indicates the proportions of the first and second target sequences. In another aspect, the invention provides arrays of probes tiled based on a reference sequence from a CFTR gene. An exemplary array comprises at least a group of probes comprising a wildtype probe, and five sets of three mutant probes. The wildtype probe is exactly complementary to a subsequence of a reference sequence from a cystic fibrosis gene, the segment having at least five interrogation positions corresponding to five contiguous nucleotides in the reference sequence. The probes in the first set of three mutant probes are each identical to the wildtype probe, except in a first of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. The probes in the second set of three mutant probes are each identical to the wildtype probe, except in a second of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe.
The probes in the third set of three mutant probes are each identical to the wildtype probe, except in a third of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. The probes in the fourth set of three mutant probes are each identical to the wildtype probe, except in a fourth of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. The probes in the fifth set of three mutant probes are each identical to the wildtype probe, except in a fifth of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. A chip can comprise two such groups of probes. The first group comprises a wildtype probe exactly complementary to a first reference sequence, and the second group comprises a wildtype probe exactly complementary to a second reference sequence that is a mutated form of the first reference sequence. The invention further provides methods of using the arrays of the invention for analyzing target sequences from a CFTR gene. The methods are capable of simultaneously analyzing first and second target sequences representing heterozygous alleles of a CFTR gene. In another aspect, the invention provides arrays of probes tiling a reference sequence from a p53 gene, an hMLH1 gene and/or an MSH2 gene. The invention further provides methods of using the arrays described above to analyze these genes. The methods are useful, e.g., for diagnosing patients susceptible to developing cancer. In another aspect, the invention provides arrays of probes tiling a reference sequence from a mitochondrial genome. The reference sequence may comprise part or all of the D-loop region, or all, or substantially all, of the mitochondrial genome.
The invention further provides methods of using the arrays described above to analyze target sequences from a mitochondrial genome. The methods are useful for identifying mutations associated with disease, and for forensic, epidemiological and evolutionary studies. The invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated. In one aspect, the sample of step (a) comprises a cell or a cell extract. The method can further comprise providing two or more samples comprising a polypeptide. One or more of the samples can be derived from a wild type cell and one sample can be derived from an abnormal or a modified cell. The abnormal cell can be a cancer cell. The modified cell can be a cell that is mutagenized and/or treated with a chemical, a physiological factor, or the presence of another organism (including, e.g.
a eukaryotic organism, prokaryotic organism, virus, vector, prion, or part thereof), and/or exposed to an environmental factor or change or physical force (including, e.g., sound, light, heat, sonication, and radiation). The modification can be a genetic change (including, for example, a change in DNA or RNA sequence or content) or otherwise. In one aspect, the method further comprises purifying or fractionating the polypeptide before the fragmenting of step (c). The method can further comprise purifying or fractionating the polypeptide before the labeling of step (d). The method can further comprise purifying or fractionating the labeled peptide before the chromatography of step (e). In alternative aspects, the purifying or fractionating comprises a method selected from the group consisting of size exclusion chromatography, HPLC, reverse phase HPLC and affinity purification. In one aspect, the method further comprises contacting the polypeptide with a labeling reagent of step (b) before the fragmenting of step (c). In one aspect, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: ZAOH and ZBOH, to esterify peptide C-terminals and/or Glu and Asp side chains; ZANH2 and ZBNH2, to form amide bond with peptide C-terminals and/or Glu and Asp side chains; and ZACO2H and ZBCO2H,
to form amide bond with peptide N-terminals and/or Lys and Arg side chains; wherein ZA and ZB independently of one another comprise the general formula R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-, Z1, Z2, Z3, and Z4 independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR')O)n, SnRR1, Sn(RR')O, BR(OR'), BRR1, B(OR)(OR'), OBR(OR'), OBRR1, and OB(OR)(OR'), and R and R1 are each an alkyl group, A1, A2, A3, and A4 independently of one another, are selected from the group consisting of nothing or (CRR')n, wherein R, R1, independently from other R and R1 in Z1 to Z4 and independently from other R and R1 in A1 to A4, are selected from the group consisting of a hydrogen atom, a halogen atom and an alkyl group; "n" in Z1 to Z4, independent of n in A1 to A4, is an integer having a value selected from the group consisting of 0 to about 51; 0 to about 41; 0 to about 31; 0 to about 21, 0 to about 11 and 0 to about 6. In one aspect, the alkyl group (see definition below) is selected from the group consisting of an alkenyl, an alkynyl and an aryl group. One or more C-C bonds from (CRR')n can be replaced with a double or a triple bond; thus, in alternative aspects, an R or an R1 group is deleted. The (CRR')n can be selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein each group has none or up to 6 substituents. The (CRR')n can be selected from the group consisting of a carbocyclic, a bicyclic and a tricyclic fragment, wherein the fragment has up to 8 atoms in the cycle with or without a heteroatom selected from the group consisting of an O atom, a N atom and an S atom. In one aspect, two or more labeling reagents have the same structure but a different isotope composition. For example, in one aspect, ZA has the same structure as ZB, while ZA has a different isotope composition than ZB.
In alternative aspects, the isotope is boron-10 and boron-11; carbon-12 and carbon-13; nitrogen-14 and nitrogen-15; and, sulfur-32 and sulfur-34. In one aspect, where the isotope with the lower mass is x and the isotope with the higher mass is y, and x and y are integers, x is greater than y. In alternative aspects, x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51. In one aspect, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: CD3(CD2)nOH / CH3(CH2)nOH, to esterify peptide C-terminals, where n = 0, 1, 2 or y; CD3(CD2)nNH2 / CH3(CH2)nNH2, to form amide bond with peptide C-terminals, where n = 0, 1, 2 or y; and, D(CD2)nCO2H / H(CH2)nCO2H, to form amide bond with peptide N-terminals, where n = 0, 1, 2 or y; wherein D is a deuterium atom, and y is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51. In one aspect, the labeling reagent of step (b) can comprise the general formulae selected from the group consisting of: ZAOH and ZBOH to esterify peptide C-terminals; ZANH2 / ZBNH2 to form an amide bond with peptide C-terminals; and, ZACO2H / ZBCO2H to form an amide bond with peptide N-terminals; wherein ZA and ZB have the general formula R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-; Z1, Z2, Z3, and Z4, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR')O)n, SnRR1, Sn(RR')O, BR(OR'), BRR1, B(OR)(OR'), OBR(OR'), OBRR1, and OB(OR)(OR'); A1, A2, A3, and A4, independently of one another, are selected from the group consisting of nothing and the general formulae (CRR')n; and, R and R1 are each an alkyl group.
In one aspect, a single C-C bond in a (CRR')n group is replaced with a double or a triple bond; thus, the R and R1 can be absent. The (CRR')n can comprise a moiety selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein the group has none or up to 6 substituents. The group can comprise a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom. In one aspect, R, R1, independently from other R and R1 in Z1 - Z4 and independently from other R and R1 in A1 - A4, are selected from the group consisting of a hydrogen atom, a halogen and an alkyl group. The alkyl group (see definition below) can be an alkenyl, an alkynyl or an aryl group. In one aspect, the "n" in Z1 - Z4 is independent of n in A1 - A4 and is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11 and about 6. In one aspect, ZA has the same structure as ZB but ZA further comprises x number of -CH2- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA has the same structure as ZB but ZA further comprises x number of -CF2- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA comprises x number of protons and ZB comprises y number of halogens in the place of protons, wherein x and y are integers. In one aspect, ZA contains x number of protons and ZB contains y number of halogens, and there are x - y number of protons remaining in one or more A1 - A4 fragments, wherein x and y are integers. In one aspect, ZA further comprises x number of -O- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA further comprises x number of -S- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer.
In one aspect, ZA further comprises x number of -O- fragment(s) and ZB further comprises y number of -S- fragment(s) in the place of -O- fragment(s), wherein x and y are integers. In one aspect, ZA further comprises x - y number of -O- fragment(s) in one or more A1 - A4 fragments, wherein x and y are integers. In alternative aspects, x and y are integers selected from the group consisting of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21, between 1 and about 11 and between 1 and about 6, wherein x is greater than y. In one aspect, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: CH3(CH2)nOH / CH3(CH2)n+mOH, to esterify peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; CH3(CH2)nNH2 / CH3(CH2)n+mNH2, to form amide bond with peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; and, H(CH2)nCO2H / H(CH2)n+mCO2H, to form amide bond with peptide N-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; wherein n, m and y are integers. In one aspect, n, m and y are integers selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51. As noted above, in one aspect, two or more labeling reagents have the same structure but a different isotope composition. An exemplary labeling reagent pair is N,N-dimethyliodoacetamide and N,N-dimethyl-d6-iodoacetamide, having the structures:
[Figure imgf000021_0001: structures of N,N-dimethyliodoacetamide and N,N-dimethyl-d6-iodoacetamide]
In one aspect, the methyl group can be replaced by any lower alkyl group (e.g., ethyl, butyl and the like). In one aspect, the separating of step (e) comprises a liquid chromatography system, such as a multidimensional liquid chromatography (e.g., a system of the invention) or a capillary chromatography system. In one aspect, the mass spectrometer comprises a tandem mass spectrometry device or an ion trap mass spectrometer (LCQ or LTQ) or a combination thereof. In one aspect, the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™ (Thermo Electron Corporation, San Jose, CA), or Agilent's LC/MSD Trap (Agilent Technologies, Palo Alto, CA), or an equivalent mass spectrometer. In one aspect, the Agilent LC/MSD Trap is an 1100 series LC/MSD TRAP™, or, the LC/MSD Trap SL™, or, the LC/MSD Trap XCT™ (Agilent Technologies, Palo Alto, CA), or equivalent device. In one aspect, the method further comprises quantifying the amount of each polypeptide or each peptide.
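As an illustration of how an isotopic labeling pair such as the exemplary d0/d6 dimethyliodoacetamide pair can be distinguished by mass, the following Python sketch computes the expected mass shift between the two labels and the resulting m/z spacing for a labeled peptide ion. The monoisotopic masses of hydrogen and deuterium are standard values; the function names and the `labels_per_peptide` parameter are illustrative assumptions and are not taken from the specification.

```python
# Monoisotopic masses in daltons (standard reference values, rounded).
MASS_H = 1.00782503  # protium (1H)
MASS_D = 2.01410178  # deuterium (2H)


def label_mass_shift(n_deuterium: int) -> float:
    """Mass difference (Da) between a label carrying n deuterium atoms
    and the otherwise identical all-protium label."""
    if n_deuterium < 0:
        raise ValueError("n_deuterium must be non-negative")
    return n_deuterium * (MASS_D - MASS_H)


def mz_spacing(n_deuterium: int, charge: int, labels_per_peptide: int = 1) -> float:
    """Expected m/z gap between the heavy and light forms of one peptide ion.

    labels_per_peptide is a hypothetical parameter: the number of label
    groups attached to the peptide (e.g., one per labeled residue).
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return labels_per_peptide * label_mass_shift(n_deuterium) / charge


if __name__ == "__main__":
    # A d6 label versus its d0 twin, for singly to triply charged ions.
    for z in (1, 2, 3):
        print(f"charge {z}+: m/z spacing = {mz_spacing(6, z):.4f}")
```

Under these assumptions a single d6 label shifts the peptide mass by about 6.04 Da, so the light and heavy forms of a doubly charged ion would appear about 3.02 m/z units apart.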
The invention provides a method for defining the expressed proteins associated with a given cellular state, the method comprising the following steps: (a) providing a sample comprising a cell in the desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cell into peptide fragments by enzymatic digestion or by non- enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated, thereby defining the expressed proteins associated with the cellular state. 
The invention provides a method for quantifying changes in protein expression between at least two cellular states, the method comprising the following steps: (a) providing at least two samples comprising cells in a desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cells into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents, wherein the labels used in one sample are different from the labels used in other samples; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which identifies from which sample each peptide was derived, compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated, and compares the amount of each polypeptide in each sample, thereby quantifying changes in protein expression between at least two cellular states.
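The comparison performed in step (g) of the method above can be sketched as follows. This is a minimal illustration, assuming that each sample's peptides have already been identified and that a single intensity value per peptide is available for the light-labeled and heavy-labeled samples. The function name, the dictionary layout and the use of a log2 ratio are assumptions made for this example and are not prescribed by the specification.

```python
from math import log2
from typing import Dict


def expression_ratios(light: Dict[str, float], heavy: Dict[str, float]) -> Dict[str, float]:
    """Return log2(heavy / light) for every peptide quantified in both samples.

    light and heavy map a peptide sequence to its measured intensity in the
    sample tagged with the lighter or heavier labeling reagent, respectively.
    Peptides missing from either sample, or with non-positive intensity,
    are skipped rather than guessed at.
    """
    ratios: Dict[str, float] = {}
    for peptide, light_intensity in light.items():
        heavy_intensity = heavy.get(peptide)
        if heavy_intensity is None:
            continue
        if light_intensity <= 0 or heavy_intensity <= 0:
            continue
        ratios[peptide] = log2(heavy_intensity / light_intensity)
    return ratios


if __name__ == "__main__":
    # Hypothetical intensities, for illustration only.
    state_one = {"PEPTIDEK": 100.0, "SAMPLER": 50.0, "ONLYLIGHTK": 10.0}
    state_two = {"PEPTIDEK": 400.0, "SAMPLER": 50.0}
    for peptide, ratio in expression_ratios(state_one, state_two).items():
        print(peptide, ratio)
```

A positive value indicates a peptide that is more abundant in the heavy-labeled sample, a negative value one that is more abundant in the light-labeled sample, and a value near zero an unchanged peptide. Aggregating peptide ratios into a per-polypeptide value is left out of this sketch.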
The invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by multidimensional liquid chromatography to generate an eluate; (f) feeding the eluate of step (e) into a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof, and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer(s); (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated. In one aspect, the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™. The invention provides a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope. The isotope(s) can be in the first domain or the second domain. For example, the isotope(s) can be in the biotin.
In alternative aspects, the isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope. The chimeric labeling reagent can comprise two or more isotopes. The chimeric labeling reagent reactive group capable of covalently binding to an amino acid can be a succinimide group, an isothiocyanate group or an isocyanate group. The reactive group capable of covalently binding to an amino acid can bind to a lysine or a cysteine. The chimeric labeling reagent can further comprise a linker moiety linking the biotin group and the reactive group. The linker moiety can comprise at least one isotope. In one aspect, the linker is a cleavable moiety that can be cleaved by, e.g., enzymatic digest or by reduction. The invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the small molecule tags are structurally identical but differ in their isotope composition, and the small molecules comprise reactive groups that covalently bind to cysteine or lysine residues or both; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) determining the protein concentrations of each sample in a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof; and, (e) comparing relative protein concentrations of each sample. In one aspect, the sample comprises a complete or a fractionated cellular sample. In one aspect, the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ
In one aspect of the method, the differential small molecule tags comprise a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and, (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope. The isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope. The chimeric labeling reagent can comprise two or more isotopes. The reactive group capable of covalently binding to an amino acid can be selected from the group consisting of a succinimide group, an isothiocyanate group and an isocyanate group. The invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the differential small molecule tags comprise a chimeric labeling reagent comprising (i) a first domain comprising a biotin; and, (ii) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) isolating the tagged polypeptides on a biotin-binding column by binding tagged polypeptides to the column, washing non-bound materials off the column, and eluting tagged polypeptides off the column; (e) determining the protein concentrations of each sample in a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof; and, (f) comparing relative protein concentrations of each sample. In one aspect, the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
The invention provides chromatography systems comprising a first reverse phase column (RPC) (a first dimension), an ion exchange column (e.g., a cation (CX) or anion exchange column) (a second dimension), a second reverse phase column (RPC) (a third dimension), wherein the first reverse phase column (RPC), the ion exchange column (e.g., a cation (CX) or anion exchange column) and the second reverse phase column (RPC) are connected in series; the first reverse phase column (RPC) has a free distal end and a proximal end connected to the ion exchange column (e.g., a cation (CX) or anion exchange column), or, first reverse phase column (RPC) is configured such that either the distal end or the proximal end are connected to the ion exchange column such that a sample can be loaded into and eluted out of first reverse phase column (RPC) to the ion exchange column from the same end (which can be the distal end or the proximal end); and, the second reverse phase column (RPC) has a free distal end and a proximal end connected to the ion exchange column, and the first reverse phase column (RPC) has a greater capacity than the second reverse phase column (RPC). In one aspect, the second reverse phase column (RPC), or the first reverse phase column (RPC), or both, are connected to an analytical device on its distal end such that an eluate can be fed into the analytical device. The analytical device can comprise a mass spectrometer. The mass spectrometer can further comprise a nano-spray apparatus. In one aspect, the mass spectrometer comprises a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof. In one aspect, the ion exchange column (e.g., a cation (CX) or anion exchange column) and the second reverse phase column (RPC) are enclosed in one housing and the first reverse phase column (RPC) is enclosed in a second housing. 
In one aspect, the three dimensions, or columns, are all in different housings, or, the columns are arranged such that they can be easily, and individually, replaced. In one aspect, the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™. In one aspect, a flow valve, e.g., a low volume flow valve (e.g., a microvalve) and/or an inline microfilter assembly connects the various columns (e.g., the various housings). For example, in one aspect, each dimension, or column, is in a different housing and one, two or all of the housings are connected to each other by a flow valve, e.g., an inline microfilter assembly, and the like. In one aspect, a flow valve separates the first housing and the second housing. In one aspect, a flow valve, e.g., a low volume flow valve and/or an inline microfilter assembly connects the first reverse phase column (RPC) to the ion exchange column (e.g., a cation (CX) or anion exchange column) and the second reverse phase column (RPC). In one aspect, the first reverse phase column (RPC), the ion exchange column and the second reverse phase column (RPC) are enclosed in one housing. In one aspect, inputs and/or outputs to or from any or all of the columns are fitted with valves. In one aspect, the flow valve is a one-way, a two-way, a three-way (a "T-valve") or a four-way valve. In one aspect the housing(s) comprise fused silica capillaries. In one aspect, a valve is fitted on the distal end of either reverse phase column (the end not connected to the ion exchange), or both distal ends of the reverse phase columns (this alternative aspect can be in addition to having a valve between the first reverse phase column and the ion exchange/second reverse phase column assembly). In one aspect, the flow valve is a one-way, a two-way, a three-way (a "T-valve") or a four-way valve. In one aspect, this valve or valves are a flow valve, e.g., a low volume flow valve.
In one aspect, the valve connection assembly can further comprise an inline microfilter assembly. In one aspect, the system of the invention is fully automated. The system can comprise a sample injector fully integrated with the automated system. In one aspect, the system is integrated to a computer, which can be programmed to run samples on the system, including equilibrating columns, washing, step elution of samples, and the like. In one aspect, an automated system of the invention is used for high throughput proteome profiling with on-line sample collection. In one aspect, the first, second or both reverse phase columns are packed with a reverse phase resin or equivalent. The first, second or both reverse phase resins can comprise a C18 reverse phase resin or equivalent. The ion exchange column can comprise a strong cation exchange (SCX) resin or equivalent. The strong cation exchange (SCX) resin can comprise a polysulfoethyl A strong cation exchange resin. In one aspect, the first reverse phase column (RPC), the second reverse phase column (RPC), or both are connected to an HPLC on a distal end. In one aspect, the first reverse phase column (RPC) has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more, greater capacity than the second reverse phase column (RPC) (which, in one aspect, is the third dimension in an exemplary 3-D LC-MS/MS or 3D LC LCQ MS/MS or 3D LC LTQ MS/MS system of the invention (e.g., comprising a Finnigan MDLC LTQ™ or LTQ FT™, Thermo Electron Corporation, San Jose, CA, or Agilent's LC/MSD Trap (Agilent Technologies, Palo Alto, CA), or an equivalent mass spectrometer), or a combination thereof).
In one aspect, the Agilent LC/MSD Trap is an 1100 series LC/MSD TRAP™, or, the LC/MSD Trap SL™, or, the LC/MSD Trap XCT™ (Agilent Technologies, Palo Alto, CA), or equivalent device. In one aspect, the first reverse phase column (RPC) has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more, more resin than the second reverse phase column (RPC), e.g., where in one aspect, the first RPC comprises the same, or equivalent, resin as the second RPC (which, in one aspect, is the third dimension of the exemplary 3-D LC-MS/MS system of the invention or 3D LC LCQ MS/MS or 3D LC LTQ MS/MS system (including, e.g., Finnigan MDLC LTQ™ or LTQ FT™, or one of Agilent's LC/MSD Trap devices) of the invention, or a combination thereof). In one aspect, the loading capacity is proportional to the column dimension. For example, in one aspect, the loading capacity is approximately 100 µg protein digest per 10 cm × 180 µm C18 column, up to milligram sized sample. In one aspect, the chromatography systems can further comprise a computer system operatively linked to the chromatography system, thereby making the chromatography system an automated operation. The chromatography systems can further comprise a computer system operatively linked to the mass spectrometer for quantifying the amount of each peptide by use of data from the mass spectrometer. The chromatography systems can further comprise a computer system operatively linked to the mass spectrometer for generating the sequence of each peptide by use of data from the mass spectrometer.
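The statement that loading capacity is proportional to the column dimension can be illustrated with a short calculation. The sketch below takes the reference point given above (approximately 100 µg of protein digest on a 10 cm × 180 µm C18 column) and scales it by packed bed volume, i.e., by length times the square of the inner diameter. Scaling by bed volume is an assumption made for this example, since the text says only that capacity is proportional to the column dimension; the function and constant names are likewise hypothetical.

```python
# Reference point stated in the text: about 100 ug of protein digest
# on a 10 cm x 180 um C18 column.
REF_LENGTH_CM = 10.0
REF_INNER_DIAMETER_UM = 180.0
REF_CAPACITY_UG = 100.0


def loading_capacity_ug(length_cm: float, inner_diameter_um: float) -> float:
    """Estimate loading capacity (ug) of a column packed with the same resin.

    Assumption for this sketch: capacity scales with packed bed volume,
    which for a cylindrical bed is proportional to length * diameter**2.
    """
    if length_cm <= 0 or inner_diameter_um <= 0:
        raise ValueError("column dimensions must be positive")
    length_ratio = length_cm / REF_LENGTH_CM
    area_ratio = (inner_diameter_um / REF_INNER_DIAMETER_UM) ** 2
    return REF_CAPACITY_UG * length_ratio * area_ratio


if __name__ == "__main__":
    # Compare a larger first-dimension column with a smaller third-dimension one.
    first_rpc = loading_capacity_ug(20.0, 180.0)
    second_rpc = loading_capacity_ug(10.0, 180.0)
    extra_percent = 100.0 * (first_rpc - second_rpc) / second_rpc
    print(f"first RPC: {first_rpc:.0f} ug, second RPC: {second_rpc:.0f} ug")
    print(f"first RPC has {extra_percent:.0f}% greater capacity")
```

Under this assumption, doubling the length of the first reverse phase column relative to the second gives it 100% greater capacity, which falls within the range of capacity differences recited above.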
The invention provides mixed bed multi-dimensional liquid chromatographs comprising a first resin bed (a first dimension), a second resin bed (a second dimension) and a third resin bed (a third dimension) connected in series, wherein the first resin bed comprises a reverse phase resin, the second resin bed comprises an ion exchange (e.g., a cation or anion exchange) resin bed and the third resin bed comprises a reverse phase resin, and the reverse phase resin of the first bed has a free distal end and a proximal end connected to the ion exchange bed, or, the reverse phase resin of the first bed is configured such that the distal end and/or the proximal end are connected to the ion exchange column such that a sample can be loaded into and eluted out of first reverse phase column (RPC) to the ion exchange column from the same end (which can be either the distal end or the proximal end), and the reverse phase resin of the third bed has a free distal end and a proximal end connected to the ion exchange bed. In one aspect, the reverse phase resin of the first bed has a greater capacity than the reverse phase resin of the third bed, or, the reverse phase resin of the third bed has a greater capacity than the reverse phase resin of the first bed. The reverse phase resin of the first bed, the reverse phase resin of the third bed, or both, can be connected to an analytical device such that an eluate can be fed into the analytical device. In one aspect, the loading capacity is proportional to the column dimension. For example, in one aspect, the loading capacity is approximately 100 ug protein digest per 10 cm X 180 um C18 column, or equivalent, up to milligram sized sample. In one aspect, the analytical device comprises a mass spectrometer.
The mass spectrometer can further comprise a nano-spray apparatus. The mass spectrometer can comprise a tandem mass spectrometer or an ion trap mass spectrometer (LCQ or LTQ) or a combination thereof. In one aspect, the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™. In one aspect, each resin bed is enclosed in a separate housing (for, in some aspects, easy, independent replacement of any individual resin bed). In one aspect, the second resin bed and a third resin bed are enclosed in one housing and the first resin bed is enclosed in a second housing. In one aspect, a flow valve, e.g., a low volume flow valve and/or an inline microfilter assembly, connects each housing to each other and/or to any inputs or outputs. In one aspect, a flow valve connects the first housing and the second housing. The inline microfilter assembly can further comprise a valve, e.g., a one way or two way valve. In one aspect, a flow valve (e.g., a low volume flow valve, or directional control flow valve, e.g., a one way or two way flow valve) and/or an inline microfilter assembly connects the first bed to the second and third resin beds. In one aspect, the first reverse phase resin bed, the ion exchange resin bed and the second reverse phase resin bed are enclosed in one housing. In one aspect, the mixed bed multi-dimensional liquid chromatographs of the invention are fully automated. The chromatographs can comprise a sample injector fully integrated with the automated system. In one aspect, the chromatographs of the invention are integrated to a computer, which can be programmed to run samples, including equilibrating columns, washing, step elution of samples, and the like. In one aspect, chromatographs of the invention are used for high throughput proteome profiling with on-line sample collection. See Figure 22 for an exemplary automated chromatograph system of the invention. 
In one aspect, the reverse phase resin of the first bed, the reverse phase resin of the third bed or both reverse phase resin beds are packed with a Cx reverse phase resin or equivalent, wherein x is an integer between five and thirty. In one aspect, the Cx reverse phase resin or equivalent comprises a C18 reverse phase resin or equivalent. In one aspect, the ion exchange bed is packed with a strong cation exchange (SCX) resin or equivalent. The strong cation exchange resin (SCX) can comprise a polysulfoethyl A strong cation exchange resin. In one aspect, the reverse phase resin of the first bed, or the reverse phase resin of the third bed, or both, are connected to an HPLC. In one aspect, the first reverse phase resin bed has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more, greater capacity than the second reverse phase resin bed. In one aspect, the mixed bed multi-dimensional liquid chromatographs further comprise a computer system operatively linked to the chromatography system, thereby making the chromatography system an automated operation. In one aspect, the mixed bed multi-dimensional liquid chromatographs further comprise a computer system operatively linked to the mass spectrometer for quantifying the amount of each peptide by use of data from the mass spectrometer. In one aspect, the mixed bed multi-dimensional liquid chromatographs further comprise a computer system operatively linked to the mass spectrometer for generating the sequence of each peptide by use of data from the mass spectrometer.
The invention provides methods for separating proteins comprising the following steps: (a) providing a sample comprising a polypeptide; (b) fragmenting the polypeptide into peptide fragments; and (c) separating the peptides by chromatography to generate an eluate using a chromatography system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention. In one aspect, the peptide fragments are loaded into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system. In one aspect, the peptide fragments are eluted through the distal end of the reverse phase resin of the first bed and/or the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph, or the peptide fragments are eluted through the distal end of the first or the second RP column of the chromatography system. In one aspect, the peptide fragments are eluted through the same end from which they were loaded. The peptide fragments can be generated by enzymatic digestion or by non-enzymatic fragmentation. The enzymatic digestion can be by trypsin, endoproteinase or a combination thereof. In one aspect, the peptide fragments are loaded into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system without desalting or removing the detergent, or both. The peptide fragments can be solubilized in a detergent or a denaturing agent before loading into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system, and, in one aspect, loaded without having to remove the detergent. 
In one aspect, the system is used to analyze membrane proteins, or other hydrophobic proteins or compounds (e.g., organic compounds, e.g., steroids, fats, lipopolysaccharides) by loading samples without removing detergents. In one aspect, the detergent or denaturing agent is SDS or urea. Thus, in one aspect, the multi-dimensional chromatographs of the invention are detergent tolerant, and thus are excellent for membrane proteins or any protein or compound needing detergent to be solubilized. In one aspect, the peptide fragments are loaded into reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system using a pressure bomb. The method can further comprise feeding the eluate into a mass spectrometer and quantifying the amount of each peptide. The method can further comprise feeding the eluate into a mass spectrometer and generating the sequence of each peptide by use of the mass spectrometer. The method can further comprise inputting the sequence into a computer program product to compare the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which a sequenced peptide originated. 
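The database comparison step described above can be pictured as a lookup of each sequenced peptide against a collection of polypeptide sequences. The following is a minimal sketch and not the computer program product of the invention: it assumes an exact substring match is sufficient, and the function name, protein names and sequences are made-up toy values for illustration only.

```python
from typing import Dict, List


def find_source_proteins(peptide: str, database: Dict[str, str]) -> List[str]:
    """Return the names of all database entries whose sequence contains the peptide.

    ASSUMPTION: exact substring matching; real search tools score MS/MS
    spectra against candidate sequences and tolerate modifications.
    """
    peptide = peptide.strip().upper()
    return [name for name, sequence in database.items() if peptide in sequence.upper()]


# Toy database with hypothetical entries (not real annotated proteins).
toy_database = {
    "toy_protein_1": "MKTAYIAKQRQISFVKSHFSRQ",
    "toy_protein_2": "MSEITLGKYLFERLKQVNVNT",
}

if __name__ == "__main__":
    for sequenced_peptide in ["QISFVK", "YLFER", "AAAAAA"]:
        hits = find_source_proteins(sequenced_peptide, toy_database)
        print(sequenced_peptide, "->", hits if hits else "no match")
```

A peptide that occurs in several entries is reported for all of them, which mirrors the ambiguity that arises when a short peptide is shared between polypeptides.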
In one aspect of the methods of the invention, the separating of step (c) comprises (i) loading a labeled peptide mixture into the first reverse phase column (RPC) of the chromatography system or the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph, wherein the first RPC or first reverse phase resin bed absorbs a plurality of peptides; (ii) eluting a fraction of the first RPC-absorbed or first resin bed-absorbed plurality of peptides to the ion exchange column (e.g., a cation (CX) or anion exchange column) of the chromatography system or the ion exchange (CX) resin bed of the mixed bed multi-dimensional liquid chromatograph, using a reverse phase gradient; (iii) eluting a fraction of the ion exchange column-absorbed or CX resin bed-absorbed plurality of peptides onto the second reverse phase column (RPC) of the chromatography system or the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph using a salt gradient; and (iv) eluting a fraction of the second RPC-absorbed or second reverse phase resin bed-absorbed plurality of peptides. The plurality of peptides eluted in step (iv) can be eluted through the distal end of the second reverse phase column (RPC) of the chromatography system or the distal end of the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph.
In one aspect, the plurality of peptides eluted in step (iv) is eluted back through the proximal end of the second RPC of the chromatography system or the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph, through the ion exchange column (e.g., a cation (CX) or anion exchange column) of the chromatography system or CX resin bed of the mixed bed multi-dimensional liquid chromatograph, and back through the proximal end of the first RPC of the chromatography system or the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph, and the eluate passes through the distal end of the first RPC or the first reverse phase resin bed. In one aspect, in step (iv) the fraction of the second RPC-absorbed or third resin bed-absorbed plurality of peptides is eluted using the same reverse phase gradient used to elute the first RPC-absorbed or first resin bed-absorbed fraction of peptides in step (ii). In one aspect, the method further comprises: after step (iii) is completed and before the step (iv) eluting a fraction of the second RPC-absorbed or second reverse phase resin bed-absorbed plurality of peptides is begun, washing the column free of the salts and buffers used to elute a fraction of the ion exchange column-absorbed or CX resin bed-absorbed plurality of peptides. In one aspect, a discrete fraction of the first RPC-absorbed or first resin bed-absorbed plurality of peptides is eluted to the ion exchange column (e.g., a cation (CX) or anion exchange column) of the chromatography system or the ion exchange (CX) resin bed of the mixed bed multi-dimensional liquid chromatograph using a reverse phase gradient. In one aspect, the reverse phase gradient comprises (Xn-Xn+1 %B) over 120 minutes with a flow rate of 250 nl/min, and B comprises a buffer B comprising 80% ACN/0.1% formic acid, or equivalent, and n is an integer, n=0, 1, 2, 3, etc.
In one aspect, upon the completion of a series of salt elution steps, the entire elution sequence is repeated, employing a higher reverse phase gradient comprising Xn+1-Xn+2 %B, Xn+2-Xn+3 %B, n=0, 1, 2, 3, etc. The separation can comprise 5 reverse phase cycles comprising X0%=0%B, X1%=8%B, X2%=15%B, X3%=30%B, X4%=50%B, and X5%=100%B, each one followed by a salt gradient step. In one aspect, the salt gradient steps comprise 12 salt gradient steps comprising 25 mM, 50 mM, 75 mM, 100 mM, 125 mM, 150 mM, 175 mM, 200 mM, 225 mM, 250 mM, and 2 M ammonium acetate, or equivalent. In one aspect, the method further comprises labeling the peptide fragments before loading them into the chromatography system or the mixed bed multi-dimensional liquid chromatograph. The sample can be derived from a cell, a seed or a spore. The cell can be a prokaryotic cell or a eukaryotic cell. The cell, seed or spore can be derived from a bacterium, a yeast, an insect, a plant, a fungus, a protozoan or a mammal. The mammalian cell can be a human cell or a mouse cell. The bacterial cell or spore can be a Bacillus anthracis. The invention provides methods for separating and detecting proteins by differential labeling of peptides.
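The reverse phase cycles and salt steps listed above can be laid out as an explicit run schedule. The sketch below is a hedged illustration: it assumes that every salt step within a cycle is followed by the same 120 minute, 250 nl/min reverse phase gradient for that cycle, and it uses only the eleven ammonium acetate concentrations actually named in the text (the text refers to 12 steps; the twelfth value is not given and is not guessed here). The names are hypothetical.

```python
from typing import Dict, List

# Break points X0..X5 in percent buffer B, as listed in the text.
RP_BREAKPOINTS_PCT_B = [0, 8, 15, 30, 50, 100]

# Ammonium acetate steps in mM as named in the text (2 M written as 2000 mM).
# Only eleven values are listed in the source; none is invented here.
SALT_STEPS_MM = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 2000]

GRADIENT_MINUTES = 120   # duration of each reverse phase gradient
FLOW_NL_PER_MIN = 250    # flow rate during the gradient


def build_schedule() -> List[Dict[str, int]]:
    """Enumerate (reverse phase window, salt step) pairs in run order.

    ASSUMPTION: within each reverse phase cycle the salt steps run from
    lowest to highest concentration, each followed by one gradient.
    """
    schedule = []
    for n in range(len(RP_BREAKPOINTS_PCT_B) - 1):
        start_pct = RP_BREAKPOINTS_PCT_B[n]
        end_pct = RP_BREAKPOINTS_PCT_B[n + 1]
        for salt_mm in SALT_STEPS_MM:
            schedule.append({
                "cycle": n + 1,
                "rp_start_pct_b": start_pct,
                "rp_end_pct_b": end_pct,
                "salt_mm": salt_mm,
                "minutes": GRADIENT_MINUTES,
                "flow_nl_min": FLOW_NL_PER_MIN,
            })
    return schedule


if __name__ == "__main__":
    runs = build_schedule()
    total_hours = sum(run["minutes"] for run in runs) / 60
    print(len(runs), "gradient runs,", total_hours, "hours of gradient time")
```

Laying the schedule out this way makes the cost of the full matrix visible: the number of runs is the product of the number of reverse phase windows and the number of salt steps.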
In one aspect, the method comprises the following steps: (a) providing at least two samples comprising a polypeptide; (b) providing at least two sets of labeling reagents (e.g., at least one pair of labeling reagents), wherein each set of labeling reagent differs in molecular mass from the other sets (e.g., wherein each member of a pair differs in molecular mass from the second member of a pair) and the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptides into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), wherein each sample is labeled with a different labeling reagent, thereby differentially labeling the peptides; (e) separating the labeled peptides by chromatography to generate an eluate using a chromatography system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention. In one aspect, the method further comprises a step (f) comprising feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer. In one aspect, the method further comprises providing two or more samples from different sources. In one aspect, one sample is derived from a wild type cell and one sample is derived from an abnormal or a modified cell. The abnormal cell can be a cancer cell. 
The peptide fragments can be labeled with a reagent comprising a general formula selected from the group consisting of: ZAOH for labeling at least a first sample and ZBOH for labeling at least a second sample, to esterify peptide C-terminals and/or Glu and Asp side chains; ZANH2 for labeling at least a first sample and ZBNH2 for labeling at least a second sample, to form amide bond with peptide C-terminals and/or Glu and Asp side chains; and ZACO2H for labeling at least a first sample and ZBCO2H for labeling at least a second sample to form amide bond with peptide N-terminals and/or Lys and Arg side chains; wherein ZA and ZB independently of one another comprise the general formula R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-; Z1, Z2, Z3, and Z4, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR1)O)n, SnRR1, Sn(RR1)O, BR(OR1), BRR1, B(OR)(OR1), OBR(OR1), OBRR1, and OB(OR)(OR1), and R and R1 are each an alkyl group; A1, A2, A3, and A4, independently of one another, are selected from the group consisting of nothing or (CRR1)n, wherein R, R1, independently from other R and R1 in Z1 to Z4 and independently from other R and R1 in A1 to A4, are selected from the group consisting of a hydrogen atom, a halogen atom and an alkyl group; n in Z1 to Z4, independent of n in A1 to A4, is an integer having a value selected from the group consisting of 0 to about 51; 0 to about 41; 0 to about 31; 0 to about 21, 0 to about 11 and 0 to about 6.
In one aspect of the method, the alkyl group is selected from the group consisting of an alkenyl, an alkynyl and an aryl group. In one aspect, one or more C-C bonds from (CRR1)n are replaced with a double or a triple bond. In one aspect, an R and/or an R1 group are absent. In one aspect, (CRR1)n is selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein each group has none or up to 6 substituents. In one aspect, (CRR1)n is selected from the group consisting of a carbocyclic, a bicyclic and a tricyclic fragment, wherein the fragment has up to 8 atoms in the cycle with or without a heteroatom selected from the group consisting of an O atom, a N atom and an S atom. In one aspect of the method, two or more labeling reagents have the same structure but a different isotope composition. ZA can have the same structure as ZB, but ZA has a different isotope composition than ZB. The isotope can be boron-10 and boron-11, carbon-12 and carbon-13, nitrogen-14 and nitrogen-15, sulfur-32 and sulfur-34. The isotope with the lower mass can be x and the isotope with the higher mass is y, and x and y are integers, x is greater than y.
In one aspect, x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51. In one aspect of the method, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: i. CD3(CD2)nOH for labeling at least a first sample and CH3(CH2)nOH for labeling at least a second sample, to esterify peptide C-terminals, where n = 0, 1, 2 or y; ii. CD3(CD2)nNH2 for labeling at least a first sample and CH3(CH2)nNH2, to form amide bond with peptide C-terminals for labeling at least a second sample, where n = 0, 1, 2 or y; and iii. D(CD2)nCO2H for labeling at least a first sample and H(CH2)nCO2H for labeling at least a second sample, to form amide bond with peptide N-terminals, where n = 0, 1, 2 or y; wherein D is a deuteron atom, and y is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51. In one aspect of the method, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: i. ZAOH for labeling at least a first sample and ZBOH for labeling at least a second sample to esterify peptide C-terminals; ii. ZANH2 for labeling at least a first sample and ZBNH2 for labeling at least a second sample to form an amide bond with peptide C-terminals; and iii. 
ZACO2H for labeling at least a first sample and ZBCO2H for labeling at least a second sample to form an amide bond with peptide N-terminals; wherein ZA and ZB have the general formula R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-; Z1, Z2, Z3, and Z4, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR1)O)n, SnRR1, Sn(RR1)O, BR(OR1), BRR1, B(OR)(OR1), OBR(OR1), OBRR1, and OB(OR)(OR1); A1, A2, A3, and A4, independently of one another, are selected from the group consisting of nothing and the general formulae (CRR1)n, and, R and R1 are each an alkyl group. In one aspect of the method, a single C-C bond in a (CRR1)n group is replaced with a double or a triple bond. In one aspect, R and R1 are absent. In one aspect, (CRR1)n comprises a moiety selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein the group has none or up to 6 substituents. In one aspect, the (CRR1)n group comprises a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom. In one aspect, R, R1, independently from other R and R1 in Z1 - Z4 and independently from other R and R1 in A1 - A4, are selected from the group consisting of a hydrogen atom, a halogen and an alkyl group. In one aspect, the alkyl group is selected from the group consisting of an alkenyl, an alkynyl and an aryl group. In one aspect of the method, n in Z1 - Z4 is independent of n in A1 - A4 and is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11 and about 6. In one aspect, ZA has the same structure as ZB but ZA further comprises x number of -CH2- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer.
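For the deuterated and non-deuterated reagent pairs described above (CD3(CD2)nOH versus CH3(CH2)nOH, CD3(CD2)nNH2 versus CH3(CH2)nNH2, and D(CD2)nCO2H versus H(CH2)nCO2H), the mass difference that the mass spectrometer must resolve follows from the number of deuterium atoms carried into the labeled peptide. The sketch below is an illustrative calculation, not part of the disclosed method: the function names are hypothetical, monoisotopic masses are rounded to eight decimals, and it assumes the alkyl or acyl group of the reagent is transferred intact with all of its deuterium atoms.

```python
# Monoisotopic masses in unified atomic mass units (rounded).
MASS_H = 1.00782503   # protium
MASS_D = 2.01410178   # deuterium
DELTA_D_MINUS_H = MASS_D - MASS_H


def deuterium_count(n: int, reagent: str) -> int:
    """Number of D atoms transferred to the peptide by one label.

    'alcohol' -> CD3(CD2)n- group from CD3(CD2)nOH  (2n + 3 deuteriums)
    'amine'   -> CD3(CD2)n- group from CD3(CD2)nNH2 (2n + 3 deuteriums)
    'acid'    -> D(CD2)nCO- group from D(CD2)nCO2H  (2n + 1 deuteriums)
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if reagent in ("alcohol", "amine"):
        return 2 * n + 3
    if reagent == "acid":
        return 2 * n + 1
    raise ValueError(f"unknown reagent type: {reagent}")


def pair_mass_shift(n: int, reagent: str, labels_per_peptide: int = 1) -> float:
    """Mass difference between heavy- and light-labeled forms of one peptide."""
    return deuterium_count(n, reagent) * DELTA_D_MINUS_H * labels_per_peptide


def pair_mz_spacing(n: int, reagent: str, charge: int, labels_per_peptide: int = 1) -> float:
    """Spacing of the peptide pair on the m/z axis for a given charge state."""
    return pair_mass_shift(n, reagent, labels_per_peptide) / charge


if __name__ == "__main__":
    # Methyl esterification (n = 0) of a peptide with two labeled carboxyl groups.
    print(round(pair_mass_shift(0, "alcohol", labels_per_peptide=2), 4))
    print(round(pair_mz_spacing(0, "alcohol", charge=2, labels_per_peptide=2), 4))
```

Because esterification can also label Glu and Asp side chains, the number of labels per peptide varies with sequence, so the observed spacing of a pair is a multiple of the single-label shift divided by the charge.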
In one aspect, ZA has the same structure as ZB but ZA further comprises x number of -CF2- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA comprises x number of protons and ZB comprises y number of halogens in the place of protons, wherein x and y are integers. In one aspect, ZA contains x number of protons and ZB contains y number of halogens, and there are x - y number of protons remaining in one or more A1 - A4 fragments, wherein x and y are integers. In one aspect, ZA further comprises x number of -O- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA further comprises x number of -S- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA further comprises x number of -O- fragment(s) and ZB further comprises y number of -S- fragment(s) in the place of -O- fragment(s), wherein x and y are integers. In one aspect, ZA further comprises x - y number of -O- fragment(s) in one or more A1 - A4 fragments, wherein x and y are integers. In one aspect, x and y are integers independently selected from the group consisting of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21, between 1 and about 11 and between 1 and about 6, wherein x is greater than y. In one aspect of the method, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: i. CH3(CH2)nOH for labeling at least a first sample and CH3(CH2)n+mOH for labeling at least a second sample, to esterify peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; ii. CH3(CH2)nNH2 for labeling at least a first sample and CH3(CH2)n+mNH2 for labeling at least a second sample, to form amide bond with peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; and, iii. 
H(CH2)nCO2H for labeling at least a first sample and H(CH2)n+mCO2H for labeling at least a second sample, to form amide bond with peptide N-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; wherein n, m and y are integers. In one aspect, the labeling reagent pair used in the method is N,N-dimethyl-iodoacetamide and N,N-d6-dimethyl-iodoacetamide [chemical structure drawings of N,N-dimethyliodoacetamide and N,N-dimethyl-d6-iodoacetamide]. In one aspect, the invention provides methods for separating and detecting a hydrophobic protein (e.g., membrane protein) or a hydrophobic compound, the method comprising the following steps: (a) providing a sample comprising the hydrophobic protein (e.g., membrane protein) or the hydrophobic compound; (b) solubilizing the hydrophobic protein (e.g., membrane protein) or the hydrophobic compound in a detergent or urea; (c) loading the detergent or urea solubilized hydrophobic protein or hydrophobic compound into a chromatography system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention; and (d) separating the hydrophobic proteins or the hydrophobic compounds by chromatography to generate an eluate using the chromatography system or the mixed bed multi-dimensional liquid chromatograph of the invention. In one aspect, the hydrophobic protein is a membrane protein such as an integral membrane protein, e.g., a protein expressed on the surface of a pathogenic cell or a cancer cell. The hydrophobic compound can be a lipid or a steroid. The invention provides computer program products comprising a computer useable medium having computer program logic recorded thereon for analyzing data generated by a chromatography system, said computer program logic comprising computer program code logic configured to perform operations as set forth in Figure 17, Figure 18, Figure 19, Figure 20 or Figure 21.
The invention provides computer program products wherein the chromatography system comprises a system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention. The invention provides computer-implemented methods for analyzing data generated by a chromatography system comprising the following steps: providing a chromatography system capable of outputting data to a computer; providing a computer capable of storing and analyzing data input from the chromatography system comprising a computer program product embodied therein, wherein the computer program product comprises a computer program product of the invention; and, inputting the data from the chromatography system into the computer and analyzing data input from the chromatography system. In one aspect, the chromatography system comprises a system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention. In one aspect, an exemplary computer-implemented method comprises an LC-MS data file operatively linked to a component extraction file, operatively linked to precursor integration and series reconstruction files, operatively linked to a progression file, as schematically illustrated in Figure 17. The component extraction aspect of the computer-implemented method is schematically illustrated in Figure 18. The precursor integration aspect of the computer-implemented method is schematically illustrated in Figure 19, where LC-MS and MS/MS spectra data are compared and merged to generate the i-th spectrum; and, spectra are also ranged (RT = T1 - D0/2, T2 + D0/2) and sum intensities in each spectrum are ranged (m/z = p - Δp, p + Δp) and recorded into the precursor (i-th spectrum). An exemplary spectra comparison is schematically illustrated in Figures 20 and 21.
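The precursor integration step can be pictured as summing signal inside a retention-time window and an m/z window. The sketch below is one possible reading and not the algorithm of Figure 19: it takes the retention-time range as T1 - D0/2 to T2 + D0/2 and the m/z range as p - Δp to p + Δp, and it assumes that T1 and T2 bound the elution of the precursor, that D0 is a padding duration, and that p is the precursor m/z. The data layout and function names are hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[float, float]            # (m/z, intensity)
Spectrum = Tuple[float, List[Peak]]   # (retention time, peaks)


def integrate_precursor(
    spectra: List[Spectrum],
    t1: float,
    t2: float,
    d0: float,
    p: float,
    delta_p: float,
) -> List[Tuple[float, float]]:
    """Return (retention time, summed intensity) for each spectrum in range.

    ASSUMPTION: windows are closed intervals
      retention time in [t1 - d0/2, t2 + d0/2]
      m/z            in [p - delta_p, p + delta_p]
    """
    rt_low, rt_high = t1 - d0 / 2.0, t2 + d0 / 2.0
    mz_low, mz_high = p - delta_p, p + delta_p
    trace = []
    for retention_time, peaks in spectra:
        if rt_low <= retention_time <= rt_high:
            summed = sum(intensity for mz, intensity in peaks if mz_low <= mz <= mz_high)
            trace.append((retention_time, summed))
    return trace


def total_precursor_intensity(trace: List[Tuple[float, float]]) -> float:
    """Collapse the per-spectrum sums into a single precursor abundance."""
    return sum(intensity for _, intensity in trace)


# Small made-up data set: three LC-MS scans with a few peaks each.
example_spectra: List[Spectrum] = [
    (10.0, [(500.20, 100.0), (500.30, 50.0), (600.00, 999.0)]),
    (10.5, [(500.25, 200.0), (501.50, 10.0)]),
    (14.0, [(500.25, 300.0)]),
]

if __name__ == "__main__":
    trace = integrate_precursor(example_spectra, t1=10.0, t2=10.5, d0=1.0, p=500.25, delta_p=0.1)
    print(trace, total_precursor_intensity(trace))
```

Comparing such totals for the two members of a differentially labeled peptide pair is one simple way a relative abundance could be derived, although the text does not specify that calculation.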
The invention provides quantitative proteomics systems comprising a chromatography system comprising a system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention, wherein the system is capable of outputting data to a processor; a processor; and a computer program product of the invention embodied within the processor. The invention provides methods for fractionating a proteome of a cell comprising (a) providing a chromatography system comprising a system as set forth in claim 1 or a mixed bed multi-dimensional liquid chromatograph of claim 25; (b) providing a proteome preparation; and (c) fractionating the proteome preparation with the chromatography system, wherein 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%, or more of the proteome is fractionated. In one aspect of the method, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%, or more of the proteome is fractionated in a one-fraction protocol.
The invention also provides methods of the invention comprising use of a computer-implemented method for analyzing data generated by a chromatography system comprising the following steps: (a) providing a chromatography system capable of outputting data to a computer; (b) providing a computer capable of storing and analyzing data input from the chromatography system comprising a computer program product embodied therein, wherein the computer program product comprises a computer program product of the invention; (c) inputting the data from the chromatography system into the computer and analyzing data input from the chromatography system. The invention provides quantitative proteomics systems comprising: (a) a chromatography system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention, and a mass spectrometer, wherein the system is capable of outputting data to a processor; (b) a processor; and (c) a computer program product (e.g., a computer program product of the invention) embodied within the processor. In one aspect, the mass spectrometer comprises an ion trap mass spectrometer, such as a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims. All publications, patents and patent applications cited herein are hereby expressly incorporated by reference for all purposes.
DESCRIPTION OF DRAWINGS
The following drawings are illustrative of aspects of the invention and are not meant to limit the scope of the invention as encompassed by the claims. The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Figure 1 illustrates an exemplary process of the invention wherein samples are combined, separated by multidimensional chromatography, and analyzed by mass spectrometry methods, as described in detail, below.
Figure 2 is an illustration of a MALDI MS spectrum of peptide pairs, as described in detail, below.
Figure 3 illustrates an exemplary 3D LC set-up and process, as described in detail, below.
Figure 4 illustrates an exemplary multi-dimensional chromatography apparatus of the invention, as described in detail in Example 3, below.
Figure 5 graphically depicts the statistics of an exemplary mixed resin chromatography analysis and protein identification (Figure 5A graphically depicts the # MS/MS spectra; Figure 5B graphically depicts the annotated spectra (%); Figure 5C graphically depicts the # protein ID), as described in detail in Example 3, below.
Figure 6 gives a three-dimensional view of proteins identified using the exemplary apparatus and methods of the invention, as described in detail in Example 3, below. Figure 6A shows an overlay of the predicted (and also observed) membrane proteins (solid circles) over the total population (open circles). Certain functional classes are depicted by the overlays in Figures 6B, 6C, and 6D, illustrating the class of proteins belonging to "protein synthesis", "glycolysis" and "protein glycosylation", respectively, as described in detail in Example 3, below.
Figure 7 illustrates the sequence of pyruvate decarboxylase set forth in SEQ ID NO: 1 as generated using an exemplary chromatography system and method of the invention, as described in detail in Example 3, below.
Figure 8 illustrates an exemplary method of the invention, as described in detail in Examples 3 and 4, below.
Figure 9 illustrates an exemplary sample preparation protocol of the invention, see Example 4, below.
Figure 10 illustrates the results of salt extraction subfractions in a reverse phase sub-fraction for analysis of the B. anthracis proteome, as described in Examples 3 and 4, below.
Figure 11 illustrates the results of an analysis of a B. anthracis proteome using a chromatography system of the invention, as described in Example 4, below.
Figure 12 summarizes a "matrix" of protein distribution from different B. anthracis samples, as described in Example 4, below.
Figure 13 summarizes the discovered protein distribution by "role" category.
Figure 14 illustrates an exemplary multi-dimensional chromatography apparatus of the invention, as described in detail in Example 3, below.
Figure 15 describes the metabolic pathways identified in the yeast proteome using an exemplary multi-dimensional chromatography apparatus and methods of the invention, as described in detail in Example 3, below.
Figure 16 illustrates proteins (highlighted in blue) from the glycolysis pathway identified using this system.
Figure 17 is a schematic, a flow chart, illustrating an exemplary data analysis algorithm of the invention for quantitative proteomics.
Figure 18 is a schematic, a flow chart, illustrating the "component extraction" section of the exemplary data analysis algorithm for quantitative proteomics illustrated in Figure 17.
Figure 19 is a schematic, a flow chart, illustrating the "precursor integration" section of the exemplary data analysis algorithm for quantitative proteomics illustrated in Figure 17.
Figure 20 is a schematic, a flow chart, illustrating the "spectra comparison" section of the exemplary data analysis algorithm for quantitative proteomics as illustrated in Figure 19. Figure 21 is a schematic, a flow chart, illustrating the "identity and merge of duplicates LC-MS spectra" section of the exemplary data analysis algorithm for quantitative proteomics as illustrated in Figure 19. Figure 22 illustrates an exemplary automated chromatograph system of the invention. Figure 23 illustrates the results of an MS/MS of the separated peptides in a proteome analysis, as discussed in Example 3, below. Figure 24 schematically illustrates the design of an oxidative stress experiment, as discussed in Example 5, below. Figure 25 schematically illustrates the design of a sample preparation protocol used in oxidative stress experiments, as discussed in Example 5, below. Figure 26 graphically illustrates data representing the number of protein identifications in 3D LC-MS/MS analyses in oxidative stress experiments, as discussed in Example 5, below. Figure 27 summarizes data representing differences in the number of proteins identified in non-stressed and stressed cell samples in oxidative stress experiments, as discussed in Example 5, below. Figure 28 summarizes data representing a down-regulation in superoxide reductase ("Sor") protein levels after oxidative stress of Desulfovibrio vulgaris cells, as discussed in Example 5, below. Figure 29 illustrates that after oxidative stress of Desulfovibrio vulgaris cells a concerted down-regulation of proteins along the polyglucose utilization pathway (schematically illustrated) was found, as discussed in Example 5, below. Figure 30 summarizes the results of proteome analysis from different organisms using an exemplary 3D LC LCQ MS/MS system of the invention, as discussed in Example 6, below. 
Figure 31 summarizes the results of proteome analysis comparing two exemplary 3D LC LCQ MS/MS systems of the invention: 3D LC LCQ MS/MS versus 3D LC LTQ MS/MS, as discussed in Example 6, below. Figure 32 illustrates the results of an LTQ and LCQ MS/MS Human Embryonic Kidney HEK293 proteome analysis, as discussed in Example 7, below. Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
Specific Strategies for utilizing nucleic acid arrays
The invention provides a number of strategies for comparing a polynucleotide of known sequence (a reference sequence) with variants of that sequence (target sequences). The comparison can be performed at the level of entire genomes, chromosomes, genes, exons or introns, or can focus on individual mutant sites and immediately adjacent bases. The strategies allow detection of variations, such as mutations or polymorphisms, in the target sequence irrespective of whether a particular variant has previously been characterized. The strategies both define the nature of a variant and identify its location in a target sequence. The strategies employ arrays of oligonucleotide probes immobilized to a solid support. Target sequences are analyzed by determining the extent of hybridization at particular probes in the array. The strategy in selection of probes facilitates distinction between perfectly matched probes and probes showing single-base or other degrees of mismatches. The strategy usually entails sampling each nucleotide of interest in a target sequence several times, thereby achieving a high degree of confidence in its identity. This level of confidence is further increased by sampling nucleotides in the target sequence that are adjacent to the nucleotides of interest. The number of probes on the chip can be quite large (e.g., 10^5 to 10^6). However, usually only a small proportion of the total number of possible probes of a given length is represented. 
Some advantages of using only a small proportion of all possible probes of a given length include: (i) each position in the array is highly informative, whether or not hybridization occurs; (ii) nonspecific hybridization is minimized; (iii) it is straightforward to correlate hybridization differences with sequence differences, particularly with reference to the hybridization pattern of a known standard; and (iv) the ability to address each probe independently during synthesis, using high resolution photolithography, allows the array to be designed and optimized for any sequence. For example, the length of any probe can be varied independently of the others. The present tiling strategies result in sequencing and comparison methods suitable for routine large-scale practice with a high degree of confidence in the sequence output.
General Tiling Strategies
Selection of Reference Sequence
The chips can be designed to contain probes exhibiting complementarity to one or more selected reference sequences whose sequences are known. The chips are used to read a target sequence comprising either the reference sequence itself or variants of that sequence. Target sequences may differ from the reference sequence at one or more positions but show a high overall degree of sequence identity with the reference sequence (e.g., at least 75, 90, 95, 99, 99.9 or 99.99%). Any polynucleotide of known sequence can be selected as a reference sequence. Reference sequences of interest include sequences known to include mutations or polymorphisms associated with phenotypic changes having clinical significance in human patients. For example, the CFTR gene and P53 gene in humans have been identified as the location of several mutations resulting in cystic fibrosis or cancer respectively. Other reference sequences of interest include those that serve to identify pathogenic microorganisms and/or are the site of mutations by which such microorganisms acquire drug resistance (e.g., the HIV reverse transcriptase gene). Other reference sequences of interest include regions where polymorphic variations are known to occur (e.g., the D-loop region of mitochondrial DNA). These reference sequences have utility for, e.g., forensic or epidemiological studies. Other reference sequences of interest include p34 (related to p53), p65 (implicated in breast, prostate and liver cancer), and DNA segments encoding cytochromes P450 (see Meyer et al., Pharmac. Ther. 46, 349-355 (1990)). 
Other reference sequences of interest include those from the genome of pathogenic viruses (e.g., hepatitis A, B, or C, herpes virus (e.g., VZV, HSV-1, HAV-6, HSV-II, CMV, and Epstein Barr virus), adenovirus, influenza virus, flaviviruses, echovirus, rhinovirus, coxsackie virus, coronavirus, respiratory syncytial virus, mumps virus, rotavirus, measles virus, rubella virus, parvovirus, vaccinia virus, HTLV virus, dengue virus, papillomavirus, molluscum virus, poliovirus, rabies virus, JC virus and arboviral encephalitis virus). Other reference sequences of interest are from genomes or episomes of pathogenic bacteria, particularly regions that confer drug resistance or allow phylogenic characterization of the host (e.g., 16S rRNA or corresponding DNA). For example, such bacteria include Chlamydia, rickettsial bacteria, mycobacteria, staphylococci, streptococci, pneumococci, meningococci and gonococci, klebsiella, proteus, serratia, pseudomonas, legionella, diphtheria, salmonella, bacilli, cholera, tetanus, botulism, anthrax, plague, leptospirosis, and Lyme disease bacteria. Other reference sequences of interest include those in which mutations result in the following autosomal recessive disorders: sickle cell anemia, beta-thalassemia, phenylketonuria, galactosemia, Wilson's disease, hemochromatosis, severe combined immunodeficiency, alpha-1-antitrypsin deficiency, albinism, alkaptonuria, lysosomal storage diseases and Ehlers-Danlos syndrome. Other reference sequences of interest include those in which mutations result in X-linked recessive disorders: hemophilia, glucose-6-phosphate dehydrogenase deficiency, agammaglobulinemia, diabetes insipidus, Lesch-Nyhan syndrome, muscular dystrophy, Wiskott-Aldrich syndrome, Fabry's disease and fragile X syndrome. 
Other reference sequences of interest include those in which mutations result in the following autosomal dominant disorders: familial hypercholesterolemia, polycystic kidney disease, Huntington's disease, hereditary spherocytosis, Marfan's syndrome, von Willebrand's disease, neurofibromatosis, tuberous sclerosis, hereditary hemorrhagic telangiectasia, familial colonic polyposis, Ehlers-Danlos syndrome, myotonic dystrophy, muscular dystrophy, osteogenesis imperfecta, acute intermittent porphyria, and von Hippel-Lindau disease. The length of a reference sequence can vary widely from a full-length genome, to an individual chromosome, episome, gene, component of a gene, such as an exon, intron or regulatory sequences, to a few nucleotides. A reference sequence of about 2, 5, 10, 20, 50, 100, 500, 1,000, 5,000, 10,000, 20,000 or 100,000 nucleotides is common. Sometimes only particular regions of a sequence (e.g., exons of a gene) are of interest. In such situations, the particular regions can be considered as separate reference sequences or can be considered as components of a single reference sequence, as a matter of arbitrary choice. A reference sequence can be any naturally occurring, mutant, consensus or purely hypothetical sequence of nucleotides, RNA or DNA. For example, sequences can be obtained from computer databases or publications, or can be determined or conceived de novo. Usually, a reference sequence is selected to show a high degree of sequence identity to envisaged target sequences. Often, particularly where a significant degree of divergence is anticipated between target sequences, more than one reference sequence is selected. Combinations of wildtype and mutant reference sequences are employed in several applications of the tiling strategy.
Chip Design
Basic Tiling Strategy
The basic tiling strategy provides an array of immobilized probes for analysis of target sequences showing a high degree of sequence identity to one or more selected reference sequences. The strategy is first illustrated for an exemplary array that is subdivided into four probe sets, although it will be apparent that in some situations, satisfactory results are obtained from only two probe sets. A first probe set comprises a plurality of probes exhibiting perfect complementarity with a selected reference sequence. The perfect complementarity usually exists throughout the length of the probe. However, probes having a segment or segments of perfect complementarity that is/are flanked by leading or trailing sequences lacking complementarity to the reference sequence can also be used. Within a segment of complementarity, each probe in the first probe set has at least one interrogation position that corresponds to a nucleotide in the reference sequence. That is, the interrogation position is aligned with the corresponding nucleotide in the reference sequence, when the probe and reference sequence are aligned to maximize complementarity between the two. If a probe has more than one interrogation position, each corresponds with a respective nucleotide in the reference sequence. The identity of an interrogation position and corresponding nucleotide in a particular probe in the first probe set cannot be determined simply by inspection of the probe in the first set. As will become apparent, an interrogation position and corresponding nucleotide is defined by the comparative structures of probes in the first probe set and corresponding probes from additional probe sets. A probe can have an interrogation position at each position in the segment complementary to the reference sequence. An interrogation position can be located away from the ends of a segment of complementarity. 
Interrogation positions may provide more accurate data when located away from the ends of a segment of complementarity. A probe having a segment of complementarity of length x typically does not contain more than x-2 interrogation positions. Since probes are typically 9-21 nucleotides, and usually all of a probe is complementary, a probe typically has 1-19 interrogation positions. The probes can contain a single interrogation position, at or near the center of the probe. For each probe in the first set, there can be three corresponding probes from three additional probe sets. Thus, there can be four probes corresponding to each nucleotide of interest in the reference sequence. Each of the four corresponding probes has an interrogation position aligned with that nucleotide of interest. The probes from the three additional probe sets can be identical to the corresponding probe from the first probe set with one exception. The exception is that at least one (and often only one) interrogation position, which occurs in the same position in each of the four corresponding probes from the four probe sets, is occupied by a different nucleotide in the four probe sets. For example, for an A nucleotide in the reference sequence, the corresponding probe from the first probe set has its interrogation position occupied by a T, and the corresponding probes from the additional three probe sets have their respective interrogation positions occupied by A, C, or G, a different nucleotide in each probe. Of course, if a probe from the first probe set comprises trailing or flanking sequences lacking complementarity to the reference sequences, these sequences need not be present in corresponding probes from the three additional sets. Likewise, corresponding probes from the three additional sets can contain leading or trailing sequences outside the segment of complementarity that are not present in the corresponding probe from the first probe set. 
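The relationship between the four corresponding probes described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the invention. The function name `corresponding_probes`, the example reference string and the 11-base probe length are assumptions chosen for the example. Probes are written base-by-base in register with the reference (complemented but not reversed), which is a simplification for readability.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def corresponding_probes(reference, position, length=11):
    """Return the four corresponding probes for one nucleotide of interest.

    `position` is a 0-based index into `reference`. Probes are written
    base-by-base in register with the reference (complemented, not reversed),
    an illustrative simplification rather than a strand convention taken
    from the text. The name and signature are hypothetical.
    """
    half = length // 2
    start = position - half
    if length % 2 == 0 or start < 0 or start + length > len(reference):
        raise ValueError("need an odd-length probe that fits on the reference")
    perfect = [COMPLEMENT[base] for base in reference[start:start + length]]
    probes = {}
    for base in "ACGT":
        probe = list(perfect)
        probe[half] = base  # only the central interrogation position changes
        probes[base] = "".join(probe)
    return probes


# Hypothetical 15-base reference; reference[8] is "A".
reference = "GATTACAGATTACAG"
probes = corresponding_probes(reference, position=8)
# The probe whose interrogation base is T is the perfectly matched
# (first probe set) member for an A in the reference.
first_set_probe = probes[COMPLEMENT[reference[8]]]
```

Here the four probes differ only at the sixth base, and the probe keyed by `"T"` plays the role of the first probe set member, matching the A-to-T example in the text.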
Occasionally, the probes from the additional three probe sets are identical (with the exception of interrogation position(s)) to a contiguous subsequence of the full complementary segment of the corresponding probe from the first probe set. In this case, the subsequence includes the interrogation position and usually differs from the full-length probe only in the omission of one or both terminal nucleotides from the termini of a segment of complementarity. That is, if a probe from the first probe set has a segment of complementarity of length n, corresponding probes from the other sets will usually include a subsequence of the segment of at least length n-2. Thus, the subsequence is usually at least 3, 4, 7, 9, 15, 21, or 25 nucleotides long, most typically, in the range of 9-21 nucleotides. The subsequence should be sufficiently long to allow a probe to hybridize detectably more strongly to a variant of the reference sequence mutated at the interrogation position than to the reference sequence. The probes can be oligodeoxyribonucleotides or oligoribonucleotides, or any modified forms of these polymers that are capable of hybridizing with a target nucleic acid sequence by complementary base-pairing. Complementary base pairing means sequence-specific base pairing which includes e.g., Watson-Crick base pairing as well as other forms of base pairing such as Hoogsteen base pairing. Modified forms include 2'-O-methyl oligoribonucleotides and so-called PNAs, in which oligodeoxyribonucleotides are linked via peptide bonds rather than phosphodiester bonds. The probes can be attached by any linkage to a support (e.g., 3', 5' or via the base). 3' attachment is more usual as this orientation is compatible with a chemistry for solid phase synthesis of oligonucleotides. 
The number of probes in the first probe set (and as a consequence the number of probes in additional probe sets) depends on the length of the reference sequence, the number of nucleotides of interest in the reference sequence and the number of interrogation positions per probe. In general, each nucleotide of interest in the reference sequence requires the same interrogation position in the four sets of probes. For example, a reference sequence can have 100 nucleotides, 50 of which are of interest, with each probe having a single interrogation position. In this situation, the first probe set requires fifty probes, each having one interrogation position corresponding to a nucleotide of interest in the reference sequence. The second, third and fourth probe sets each have a corresponding probe for each probe in the first probe set, and so each also contains a total of fifty probes. The identity of each nucleotide of interest in the reference sequence is determined by comparing the relative hybridization signals at four probes having interrogation positions corresponding to that nucleotide from the four probe sets. In some reference sequences, every nucleotide is of interest. In other reference sequences, only certain portions in which variants (e.g., mutations or polymorphisms) are concentrated are of interest. In other reference sequences, only particular mutations or polymorphisms and immediately adjacent nucleotides are of interest. Usually, the first probe set has interrogation positions selected to correspond to at least a nucleotide (e.g., representing a point mutation) and one immediately adjacent nucleotide. Usually, the probes in the first set have interrogation positions corresponding to at least 3, 10, 50, 100, 1000, or 20,000 contiguous nucleotides. The probes usually have interrogation positions corresponding to at least 5, 10, 30, 50, 75, 90, 99 or sometimes 100%, of the nucleotides in a reference sequence. 
The probes in the first probe set can completely span the reference sequence and overlap with one another relative to the reference sequence. For example, in one common arrangement each probe in the first probe set differs from another probe in that set by the omission of a 3' base complementary to the reference sequence and the acquisition of a 5' base complementary to the reference sequence. The probes in a set can be arranged in order of the sequence in a lane across the chip. A lane contains a series of overlapping probes, which represent, or tile across, the selected reference sequence. The components of the four sets of probes are usually laid down in four parallel lanes, collectively constituting a row in the horizontal direction and a series of 4-member columns in the vertical direction. Corresponding probes from the four probe sets (i.e., complementary to the same subsequence of the reference sequence) occupy a column. Each probe in a lane usually differs from its predecessor in the lane by the omission of a base at one end and the inclusion of an additional base at the other end. However, this orderly progression of probes can be interrupted by the inclusion of control probes or omission of probes in certain columns of the array. Such columns serve as controls to orient the chip, or gauge the background, which can include target sequence nonspecifically bound to the chip. The probe sets can be laid down in lanes such that all probes having an interrogation position occupied by an A form an A-lane, all probes having an interrogation position occupied by a C form a C-lane, all probes having an interrogation position occupied by a G form a G-lane, and all probes having an interrogation position occupied by a T (or U) form a T lane (or a U lane). Note that in this arrangement there is not a unique correspondence between probe sets and lanes. Thus, the probe from the first probe set is laid down in the A-lane, C-lane, A-lane, A-lane and T-lane for the five columns. 
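The lane layout described above can be sketched as follows. This is a hedged illustration under stated assumptions: the function `tile_lanes`, the 5-base probe length and the reference string `GCTGTTAGC` are hypothetical. The string was chosen so that its five tiled positions (T, G, T, T, A) reproduce the A, C, A, A, T lane order quoted in the text. As before, probes are written in register with the reference (complemented, not reversed).

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def tile_lanes(reference, length=5):
    """Tile overlapping probes across `reference` and sort them into lanes.

    Returns (lanes, first_set_lane). `lanes[b]` lists, column by column,
    the probe whose central interrogation position is occupied by base `b`.
    `first_set_lane[i]` names the lane holding the perfectly matched
    (first probe set) probe in column i. Hypothetical helper for illustration.
    """
    half = length // 2
    lanes = {base: [] for base in "ACGT"}
    first_set_lane = []
    for center in range(half, len(reference) - half):
        window = reference[center - half:center + half + 1]
        perfect = [COMPLEMENT[base] for base in window]
        for base in "ACGT":
            probe = list(perfect)
            probe[half] = base  # interrogation position at the probe center
            lanes[base].append("".join(probe))
        first_set_lane.append(perfect[half])
    return lanes, first_set_lane


# Hypothetical reference whose five tiled positions are T, G, T, T, A.
lanes, first_set_lane = tile_lanes("GCTGTTAGC", length=5)
```

Because the first probe set follows the reference, its probes move between lanes from column to column, which is why probe sets and lanes do not correspond one-to-one.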
The interrogation position on a column of probes corresponds to the position in the target sequence whose identity is determined from analysis of hybridization to the probes in that column. The interrogation position can be anywhere in a probe but is usually at or near the central position of the probe to maximize differential hybridization signals between a perfect match and a single-base mismatch. For example, for an 11 mer probe, the central position is the sixth nucleotide. Although the array of probes is usually laid down in rows and columns as described above, such a physical arrangement of probes on the chip is not essential. Provided that the spatial location of each probe in an array is known, the data from the probes can be collected and processed to yield the sequence of a target irrespective of the physical arrangement of the probes on a chip. In processing the data, the hybridization signals from the respective probes can be reassorted into any conceptual array desired for subsequent data reduction whatever the physical arrangement of probes on the chip. A range of lengths of probes can be employed in the chips. As noted above, a probe may consist exclusively of a complementary segment, or may have one or more complementary segments juxtaposed by flanking, trailing and/or intervening segments. In the latter situation, the total length of complementary segment(s) is more important than the length of the probe. In functional terms, the complementarity segment(s) of the first probe sets should be sufficiently long to allow the probe to hybridize detectably more strongly to a reference sequence compared with a variant of the reference including a single base mutation at the nucleotide corresponding to the interrogation position of the probe. 
Similarly, the complementarity segment(s) in corresponding probes from additional probe sets can be sufficiently long to allow a probe to hybridize detectably more strongly to a variant of the reference sequence having a single nucleotide substitution at the interrogation position relative to the reference sequence. A probe can have a single complementary segment having a length of at least 3 nucleotides, and more usually at least 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 bases exhibiting perfect complementarity (other than possibly at the interrogation position(s) depending on the probe set) to the reference sequence. In bridging strategies, where more than one segment of complementarity is present, each segment provides at least three complementary nucleotides to the reference sequence and the combined segments provide at least two segments of three or a total of six complementary nucleotides. As in the other strategies, the combined length of complementary segments is typically from 6-30 nucleotides, or from about 9-21 nucleotides. The two segments are often approximately the same length. Often, the probes (or segment of complementarity within probes) have an odd number of bases, so that an interrogation position can occur in the exact center of the probe. In some chips, all probes are the same length. Other chips employ different groups of probe sets, in which case the probes are of the same size within a group, but differ between different groups. For example, some chips have one group comprising four sets of probes as described above in which all the probes are 11 mers, together with a second group comprising four sets of probes in which all of the probes are 13 mers. Of course, additional groups of probes can be added. Thus, some chips contain, e.g., four groups of probes having sizes of 11 mers, 13 mers, 15 mers and 17 mers. Other chips have different size probes within the same group of four probe sets. 
In these chips, the probes in the first set can vary in length independently of each other. Probes in the other sets are usually the same length as the probe occupying the same column from the first set. However, occasionally different lengths of probes can be included at the same column position in the four lanes. The different length probes are included to equalize hybridization signals from probes irrespective of whether A-T or C-G bonds are formed at the interrogation position. The length of probe can be important in distinguishing between a perfectly matched probe and probes showing a single-base mismatch with the target sequence. The discrimination is usually greater for short probes. Shorter probes are usually also less susceptible to formation of secondary structures. However, the absolute amount of target sequence bound, and hence the signal, is greater for larger probes. The probe length representing the optimum compromise between these competing considerations may vary depending on inter alia the GC content of a particular region of the target DNA sequence, secondary structure, synthesis efficiency and cross-hybridization. In some regions of the target, depending on hybridization conditions, short probes (e.g., 11 mers) may provide information that is inaccessible from longer probes (e.g., 19 mers) and vice versa. Maximum sequence information can be read by including several groups of different sized probes on the chip as noted above. However, for many regions of the target sequence, such a strategy provides redundant information in that the same sequence is read multiple times from the different groups of probes. Equivalent information can be obtained from a single group of different sized probes in which the sizes are selected to maximize readable sequence at particular regions of the target sequence. 
The strategy of customizing probe length within a single group of probe sets minimizes the total number of probes required to read a particular target sequence. This leaves ample capacity for the chip to include probes to other reference sequences. The invention provides an optimization block which allows systematic variation of probe length and interrogation position to optimize the selection of probes for analyzing a particular nucleotide in a reference sequence. The block comprises alternating columns of probes complementary to the wildtype target and probes complementary to a specific mutation. The interrogation position is varied between columns and probe length is varied down a column. Hybridization of the chip to the reference sequence or the mutant form of the reference sequence identifies the probe length and interrogation position providing the greatest differential hybridization signal. The probes can be designed to be complementary to either strand of the reference sequence (e.g., coding or non-coding). Some chips contain separate groups of probes, one complementary to the coding strand, the other complementary to the noncoding strand. Independent analysis of coding and noncoding strands provides largely redundant information. However, the regions of ambiguity in reading the coding strand are not always the same as those in reading the noncoding strand. Thus, combination of the information from coding and noncoding strands increases the overall accuracy of sequencing. Some chips contain additional probes or groups of probes designed to be complementary to a second reference sequence. The second reference sequence can often be a subsequence of the first reference sequence bearing one or more commonly occurring mutations or interstrain variations. The second group of probes is designed by the same principles as described above except that the probes exhibit complementarity to the second reference sequence. 
The inclusion of a second group is particularly useful for analyzing short subsequences of the primary reference sequence in which multiple mutations are expected to occur within a short distance commensurate with the length of the probes (i.e., two or more mutations within 9 to 21 bases). Of course, the same principle can be extended to provide chips containing groups of probes for any number of reference sequences. Alternatively, the chips may contain additional probe(s) that do not form part of a tiled array as noted above, but rather serve as probe(s) for a conventional reverse dot blot. For example, the presence of a mutation can be detected from binding of a target sequence to a single oligomeric probe harboring the mutation. An additional probe containing the equivalent region of the wildtype sequence can be included as a control. The chips can be read by comparing the intensities of labeled target bound to the probes in an array. In one aspect, a comparison is performed between each lane of probes (e.g., A, C, G and T lanes) at each columnar position (physical or conceptual). For a particular columnar position, the lane showing the greatest hybridization signal is called as the nucleotide present at the position in the target sequence corresponding to the interrogation position in the probes. The corresponding position in the target sequence is that aligned with the interrogation position in corresponding probes when the probes and target are aligned to maximize complementarity. Of the four probes in a column, only one can exhibit a perfect match to the target sequence whereas the others usually exhibit at least a one base pair mismatch. The probe exhibiting a perfect match usually produces a substantially greater hybridization signal than the other three probes in the column and is thereby easily identified. However, in some regions of the target sequence, the distinction between a perfect match and a one-base mismatch is less clear. 
Thus, a call ratio is established to define the ratio of signal from the best hybridizing probe to the second best hybridizing probe that must be exceeded for a particular target position to be read from the probes. A high call ratio ensures that few if any errors are made in calling target nucleotides, but can result in some nucleotides being scored as ambiguous, which could in fact be accurately read. A lower call ratio can result in fewer ambiguous calls, but can result in more erroneous calls. It has been found that at a call ratio of 1.2 virtually all calls are accurate. However, a small but significant number of bases (e.g., up to about %) may have to be scored as ambiguous. Although small regions of the target sequence can sometimes be ambiguous, these regions usually occur at the same or similar segments in different target sequences. Thus, for pre-characterized mutations, it is known in advance whether that mutation is likely to occur within a region of unambiguously determinable sequence. An array of probes is most useful for analyzing the reference sequence from which the probes were designed and variants of that sequence exhibiting substantial sequence similarity with the reference sequence (e.g., several single-base mutants spaced over the reference sequence). When an array is used to analyze the exact reference sequence from which it was designed, one probe exhibits a perfect match to the reference sequence, and the other three probes in the same column exhibit single-base mismatches. Thus, discrimination between hybridization signals is usually high and accurate sequence is obtained. High accuracy is also obtained when an array is used for analyzing a target sequence comprising a variant of the reference sequence that has a single mutation relative to the reference sequence, or several widely spaced mutations relative to the reference sequence. 
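The lane comparison and call ratio described above can be sketched as follows. This is an illustrative sketch only. The function names, the use of `"N"` for an ambiguous call and the intensity values are assumptions. Lanes are taken to be named for the probe's interrogation base, as in the lane description above, so the called target nucleotide is the complement of the brightest lane.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_base(signals, call_ratio=1.2):
    """Call the target base for one column, or "N" if the call is ambiguous.

    `signals` maps each lane ("A", "C", "G", "T"), named for the probe's
    interrogation base, to a hybridization intensity. A base is called only
    when best / second-best exceeds `call_ratio`. Hypothetical helper.
    """
    ranked = sorted(signals.items(), key=lambda item: item[1], reverse=True)
    (best_lane, best), (_, second) = ranked[0], ranked[1]
    if best <= 0:
        return "N"  # no usable signal in this column
    if second > 0 and best / second <= call_ratio:
        return "N"  # ratio not exceeded: score as ambiguous
    return COMPLEMENT[best_lane]  # target base pairs with the probe base


def read_target(columns, call_ratio=1.2):
    """Read a target sequence from a list of per-column lane intensities."""
    return "".join(call_base(column, call_ratio) for column in columns)


# Made-up intensities for three columns; the third is a near tie.
columns = [
    {"A": 950.0, "C": 120.0, "G": 90.0, "T": 110.0},
    {"A": 100.0, "C": 800.0, "G": 95.0, "T": 130.0},
    {"A": 400.0, "C": 90.0, "G": 380.0, "T": 85.0},
]
sequence = read_target(columns)
```

Raising `call_ratio` converts more near ties into `"N"`, which mirrors the trade-off in the text between erroneous calls and ambiguous calls.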
At different mutant loci, one probe exhibits a perfect match to the target, and the other three probes occupying the same column exhibit single-base mismatches, the difference (with respect to analysis of the reference sequence) being the lane in which the perfect match occurs. For target sequences showing a high degree of divergence from the reference strain or incorporating several closely spaced mutations from the reference strain, a single group of probes (i.e., designed with respect to a single reference sequence) will not always provide accurate sequence for the highly variant region of this sequence. At some particular columnar positions, it may be that no single probe exhibits perfect complementarity to the target and that any comparison must be based on different degrees of mismatch between the four probes. Such a comparison does not always allow the target nucleotide corresponding to that columnar position to be called. Deletions in target sequences can be detected by loss of signal from probes having interrogation positions encompassed by the deletion. However, signal may also be lost from probes having interrogation positions closely proximal to the deletion, resulting in some regions of the target sequence that cannot be read. Target sequences bearing insertions will also exhibit short regions including and proximal to the insertion that usually cannot be read. The presence of short regions of difficult-to-read target because of closely spaced mutations, insertions or deletions does not prevent determination of the remaining sequence of the target, as different regions of a target sequence are determined independently. Moreover, such ambiguities as might result from analysis of diverse variants with a single group of probes can be avoided by including multiple groups of probe sets on a chip. 
For example, one group of probes can be designed based on a full-length reference sequence, and the other groups on subsequences of the reference sequence incorporating frequently occurring mutations or strain variations. In one aspect, the sequencing strategy of the invention has the capacity to simultaneously detect and quantify proportions of multiple target sequences. Such capacity is valuable, e.g., for diagnosis of patients who are heterozygous with respect to a gene or who are infected with a virus, such as HIV, which is usually present in several polymorphic forms. Such capacity is also useful in analyzing targets from biopsies of tumor cells and surrounding tissues. The presence of multiple target sequences is detected from the relative signals of the four probes at the array columns corresponding to the target nucleotides at which diversity occurs. The relative signals at the four probes for the mixture under test are compared with the corresponding signals from a homogeneous reference sequence. An increase in a signal from a probe that is mismatched with respect to the reference sequence, and a corresponding decrease in the signal from the probe that is matched with the reference sequence, signal the presence of a mutant strain in the mixture. The extent of the shift in hybridization signals of the probes is related to the proportion of a target sequence in the mixture. Shifts in relative hybridization signals can be quantitatively related to proportions of reference and mutant sequence by prior calibration of the chip with seeded mixtures of the mutant and reference sequences. By this means, a chip can be used to detect variant or mutant strains constituting as little as 1, 5, 20, or 25 % of a mixture of strains. Similar principles allow the simultaneous analysis of multiple target sequences even when none is identical to the reference sequence.
For example, with a mixture of two target sequences bearing first and second mutations, there would be a variation in the hybridization patterns of probes having interrogation positions corresponding to the first and second mutations relative to the hybridization pattern with the reference sequence. At each position, one of the probes having a mismatched interrogation position relative to the reference sequence would show an increase in hybridization signal, and the probe having a matched interrogation position relative to the reference sequence would show a decrease in hybridization signal. Analysis of the hybridization pattern of the mixture of mutant target sequences, in some aspects, in comparison with the hybridization pattern of the reference sequence, indicates the presence of two mutant target sequences, the position and nature of the mutation in each strain, and the relative proportions of each strain. In a variation of the above method, the different components in a mixture of target sequences are differentially labeled before being applied to the array. For example, a variety of fluorescent labels emitting at different wavelengths are available. The use of differential labels allows independent analysis of different targets bound simultaneously to the array. For example, the methods permit comparison of target sequences obtained from a patient at different stages of a disease.

Omission of Probes

The general strategy of the aspects of the invention outlined above employs four probes to read each nucleotide of interest in a target sequence. One probe (from the first probe set) shows a perfect match to the reference sequence and the other three probes (from the second, third and fourth probe sets) exhibit a mismatch with the reference sequence and a perfect match with a target sequence bearing a mutation at the nucleotide of interest.
The provision of three probes from the second, third and fourth probe sets allows detection of each of the three possible nucleotide substitutions of any nucleotide of interest. However, in some reference sequences or regions of reference sequences, it is known in advance that only certain mutations are likely to occur. Thus, for example, at one site it might be known that an A nucleotide in the reference sequence may exist as a T mutant in some target sequences but is unlikely to exist as a C or G mutant. Accordingly, for analysis of this region of the reference sequence, one might include only the first and second probe sets, the first probe set exhibiting perfect complementarity to the reference sequence, and the second probe set having an interrogation position occupied by an invariant A residue (for detecting the T mutant). In other situations, one might include the first, second and third probe sets (but not the fourth) for detection of a wildtype nucleotide in the reference sequence and two mutant variants thereof in target sequences. In some chips, probes that would detect silent mutations (i.e., not affecting amino acid sequence) are omitted. In some chips, the probes from the first probe set are omitted corresponding to some or all positions of the reference sequences. Such chips comprise at least two probe sets. The first probe set has a plurality of probes. Each probe comprises a segment exactly complementary to a subsequence of a reference sequence except in at least one interrogation position. A second probe set has a corresponding probe for each probe in the first probe set.
The corresponding probe in the second probe set is identical to a sequence comprising the corresponding probe from the first probe set or a subsequence thereof that includes the at least one (and usually only one) interrogation position, except that the at least one interrogation position is occupied by a different nucleotide in each of the two corresponding probes from the first and second probe sets. A third probe set, if present, also comprises a corresponding probe for each probe in the first probe set except at the at least one interrogation position, which
differs in the corresponding probes from the three sets. Omission of probes having a segment exhibiting perfect complementarity to the reference sequence results in loss of control information, i.e., the detection of nucleotides in a target sequence that are the same as those in a reference sequence. However, similar information can be obtained by hybridizing a chip lacking probes from the first probe set to both target and reference sequences. The hybridization can be performed sequentially, or concurrently, if the target and reference are differentially labeled. In this situation, the presence of a mutation is detected by a shift in the background hybridization intensity of the reference sequence to a perfectly matched hybridization signal of the target sequence, rather than by a comparison of the hybridization intensities of probes from the first set with corresponding probes from the second, third and fourth sets.

Wildtype Probe Lane

When the chips comprise four probe sets, as discussed supra, and the probe sets are laid down in four lanes, an A-lane, a C-lane, a G-lane and a T or U-lane, the probe having a segment exhibiting perfect complementarity to a reference sequence varies between the four lanes from one column to another. This does not present any significant difficulty in computer analysis of the data from the chip. However, visual inspection of the hybridization pattern of the chip is sometimes facilitated by provision of an extra lane of probes, in which each probe has a segment exhibiting perfect complementarity to the reference sequence. This segment is identical to a segment from one of the probes in the other four lanes (which lane depending on the column position). The extra lane of probes (designated the wildtype lane) hybridizes to a target sequence at all nucleotide positions except those in which deviations from the reference sequence occur.
The hybridization pattern of the wildtype lane thereby provides a simple visual indication of mutations.
Deletion, Insertion and Multiple-Mutation Probes

In some aspects, the chips provide an additional probe set specifically designed for analyzing deletion mutations. The additional probe set comprises a probe corresponding to each probe in the first probe set as described above. However, a probe from the additional probe set differs from the corresponding probe in the first probe set in that the nucleotide occupying the interrogation position is deleted in the probe from the additional probe set. Optionally, the probe from the additional probe set bears an additional nucleotide at one of its termini relative to the corresponding probe from the first probe set. The probe from the additional probe set will hybridize more strongly than the corresponding probe from the first probe set to a target sequence having a single base deletion at the nucleotide corresponding to the interrogation position. Additional probe sets are provided in which not only the interrogation position, but also an adjacent nucleotide is deleted. Similarly, other chips provide additional probe sets for analyzing insertions. For example, one additional probe set has a probe corresponding to each probe in the first probe set as described above. However, the probe in the additional probe set has an extra T nucleotide inserted adjacent to the interrogation position. Optionally, the probe has one fewer nucleotide at one of its termini relative to the corresponding probe from the first probe set. The probe from the additional probe set hybridizes more strongly than the corresponding probe from the first probe set to a target sequence having an A nucleotide inserted in a position adjacent to that corresponding to the interrogation position. Similar additional probe sets are constructed having C, G or T/U nucleotides inserted adjacent to the interrogation position. Usually, four such probe sets, one for each nucleotide, are used in combination.
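The construction of deletion and insertion probes from a first-set probe can be sketched as simple string operations. The sketch below is illustrative only: the function names, the 0-based index `pos` for the interrogation position, the example probe sequence, and the choice to insert the extra nucleotide immediately 3' of the interrogation position are all assumptions, and the optional adjustment of probe length at a terminus is omitted.

```python
from typing import Dict

BASES = "ACGT"


def deletion_probe(probe: str, pos: int) -> str:
    """Return the probe with the nucleotide at the interrogation position removed."""
    return probe[:pos] + probe[pos + 1:]


def insertion_probe(probe: str, pos: int, base: str) -> str:
    """Return the probe with `base` inserted right after the interrogation position.

    Inserting on the 3' side is an arbitrary choice made for this sketch.
    """
    return probe[:pos + 1] + base + probe[pos + 1:]


def insertion_probe_family(probe: str, pos: int) -> Dict[str, str]:
    """One insertion probe per nucleotide, to be used in combination."""
    return {base: insertion_probe(probe, pos, base) for base in BASES}


first_set_probe = "ACGTACG"  # hypothetical probe; interrogation position at index 3
print(deletion_probe(first_set_probe, 3))
print(insertion_probe_family(first_set_probe, 3))
```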
Other chips provide additional probes (multiple-mutation probes) for analyzing target sequences having multiple closely spaced mutations. A multiple-mutation probe is usually identical to a corresponding probe from the first set as described above, except in the base occupying the interrogation position, and except at one or more additional positions, corresponding to nucleotides in which substitution may occur in the reference sequence. The one or more additional positions in the multiple-mutation probe are occupied by nucleotides complementary to the nucleotides occupying corresponding positions in the reference sequence when the possible substitutions have occurred.

Block Tiling

As noted in the discussion of the general tiling strategy, in one aspect, a probe in the first probe set can have more than one interrogation position. In this situation, a probe in the first probe set is sometimes matched with multiple groups of at least one, and usually three, additional probe sets. Three additional probe sets are used to allow detection of the three possible nucleotide substitutions at any one position. If only certain types of substitution are likely to occur (e.g., transitions), only one or two additional probe sets are required (analogous to the use of probes in the basic tiling strategy). To illustrate for the situation where a group comprises three additional probe sets, a first such group comprises second, third and fourth probe sets, each of which has a probe corresponding to each probe in the first probe set. The corresponding probes from the second, third and fourth probe sets differ from the corresponding probe in the first set at a first of the interrogation positions. Thus, the relative hybridization signals from corresponding probes from the first, second, third and fourth probe sets indicate the identity of the nucleotide in a target sequence corresponding to the first interrogation position.
Each of a second group of three probe sets (designated the fifth, sixth and seventh probe sets) also has a probe corresponding to each probe in the first probe set. These corresponding probes differ from that in the first probe set at a second interrogation position. The relative hybridization signals from corresponding probes from the first, fifth, sixth, and seventh probe sets indicate the identity of the nucleotide in the target sequence corresponding to the second interrogation position. As noted above, the probes in the first probe set often have seven or more interrogation positions. If there are seven interrogation positions, there are seven groups of three additional probe sets, each group of three probe sets serving to identify the nucleotide corresponding to one of the seven interrogation positions. Each block of probes allows short regions of a target sequence to be read. For example, for a block of probes having seven interrogation positions, seven nucleotides in the target sequence can be read. Of course, a chip can contain any number of blocks depending on how many nucleotides of the target are of interest. The hybridization signals for each block can be analyzed independently of any other block. The block tiling strategy can also be combined with other tiling strategies, with different parts of the same reference sequence being tiled by different strategies. The block tiling strategy offers three advantages over the basic strategy in which each probe in the first set has a single interrogation position. One advantage is that the same sequence information can be obtained from fewer probes. A second advantage is that each of the probes constituting a block (i.e., a probe from the first probe set and a corresponding probe from each of the other probe sets) can have identical 3' and 5' sequences, with the variation confined to a central segment containing the interrogation positions.
The identity of 3' sequence between different probes simplifies the strategy for solid phase synthesis of the probes on the chip and results in more uniform deposition of the different probes on the chip, thereby in turn increasing the uniformity of signal to noise ratio for different regions of the chip. A third advantage is that greater signal uniformity is achieved within a block.

Multiplex Tiling

In one aspect, in the block tiling strategy discussed above, the identity of a nucleotide in a target or reference sequence is determined by comparison of hybridization patterns of one probe having a segment showing a perfect match with that of other probes (usually three other probes) showing a single base mismatch. In multiplex tiling of the invention, the identity of at least two nucleotides in a reference or target sequence is determined by comparison of hybridization signal intensities of four probes, two of which have a segment showing perfect complementarity or a single base mismatch to the reference sequence, and two of which have a segment showing perfect complementarity or a double-base mismatch to a segment. The four probes whose hybridization patterns are to be compared each have a segment that is exactly complementary to a reference sequence except at two interrogation positions, in which the segment may or may not be complementary to the reference sequence. The interrogation positions correspond to the nucleotides in a reference or target sequence that are determined by the comparison of intensities. The nucleotides occupying the interrogation positions in the four probes are selected according to the following rule. The first interrogation position is occupied by a different nucleotide in each of the four probes. The second interrogation position is also occupied by a different nucleotide in each of the four probes.
In two of the four probes, designated the first and second probes, the segment is exactly complementary to the reference sequence except at not more than one of the two interrogation positions. In other words, one of the interrogation positions is occupied by a nucleotide that is complementary to the corresponding nucleotide from the reference sequence and the other interrogation position may or may not be so occupied. In the other two of the four probes, designated the third and fourth probes, the segment is exactly complementary to the reference sequence except that both interrogation positions are occupied by nucleotides that are non-complementary to the respective corresponding nucleotides in the reference sequence. There are a number of ways of satisfying these conditions depending on whether the two nucleotides in the reference sequence corresponding to the two interrogation positions are the same or different. If these two nucleotides are different in the reference sequence (probability 3/4), the conditions are satisfied by each of the two interrogation positions being occupied by the same nucleotide in any given probe. For example, in the first probe, the two interrogation positions would both be A, in the second probe, both would be C, in the third probe, each would be G, and in the fourth probe each would be T or U. If the two nucleotides in the reference sequence corresponding to the two interrogation positions are the same (probability 1/4), the conditions noted above are satisfied by each of the interrogation positions in any one of the four probes being occupied by complementary nucleotides. For example, in the first probe, the interrogation positions could be occupied by A and T, in the second probe by C and G, in the third probe by G and C and in the fourth probe, by T and A.
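The selection rule for the two interrogation positions can be checked with a short sketch. It reads the two cases as follows, which is an interpretation rather than a quotation: when the two reference nucleotides differ, each probe carries the same nucleotide at both positions; when they are the same, each probe carries a complementary pair. The function names and the use of DNA bases only (no U) are assumptions made for illustration.

```python
from typing import List, Tuple

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def multiplex_pairs(ref1: str, ref2: str) -> List[Tuple[str, str]]:
    """Nucleotides placed at the two interrogation positions of the four probes."""
    if ref1 != ref2:
        # Reference nucleotides differ: same nucleotide at both positions.
        return [(b, b) for b in BASES]
    # Reference nucleotides are the same: complementary pair in each probe.
    return [(b, COMPLEMENT[b]) for b in BASES]


def mismatch_count(pair: Tuple[str, str], nt1: str, nt2: str) -> int:
    """Number of interrogation positions at which the probe fails to complement nt1, nt2."""
    return int(pair[0] != COMPLEMENT[nt1]) + int(pair[1] != COMPLEMENT[nt2])


# For every pair of reference nucleotides, two probes should show a single
# mismatch and two a double mismatch.
for r1 in BASES:
    for r2 in BASES:
        counts = sorted(mismatch_count(p, r1, r2) for p in multiplex_pairs(r1, r2))
        assert counts == [1, 1, 2, 2], (r1, r2, counts)
print("rule satisfied for all 16 reference nucleotide pairs")
```

In every case each interrogation position is occupied by a different nucleotide in each of the four probes, so the two stated conditions hold together.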
When the four probes are hybridized to a target that is the same as the reference sequence or differs from the reference sequence at one (but not both) of the interrogation positions, two of the four probes show a double mismatch with the target and two probes show a single mismatch. The identity of probes showing these different degrees of mismatch can be determined from the different hybridization signals. From the identity of the probes showing the different degrees of mismatch, the nucleotides occupying both of the interrogation positions in the target sequence can be deduced. For ease of illustration, the multiplex strategy has been initially described for the situation where there are two nucleotides of interest in a reference sequence and only four probes in an array. Of course, the strategy can be extended to analyze any number of nucleotides in a target sequence by using additional probes. In one variation, each pair of interrogation positions is read from a unique group of four probes. In a block variation, different groups of four probes exhibit the same segment of complementarity with the reference sequence, but the interrogation positions move within a block. The block and standard multiplex tiling variants can of course be used in combination for different regions of a reference sequence. Either or both variants can also be used in combination with any of the other tiling strategies described.

Helper Mutations

Occasionally small regions of a reference sequence give a low hybridization signal as a result of self-annealing of probes. The self-annealing reduces the amount of probe effectively available for hybridizing to the target. Although such regions of the target are generally small and the reduction of hybridization signal is usually not so substantial as to obscure the sequence of this region, this concern can be avoided by the use of probes incorporating helper mutations.
The helper mutation(s) serve to break up regions of internal complementarity within a probe and thereby prevent annealing. Usually, one or two helper mutations are quite sufficient for this purpose. The inclusion of helper mutations can be beneficial in any of the tiling strategies noted above. In general each probe having a particular interrogation position has the same helper mutation(s). Thus, such probes have a segment in common which shows perfect complementarity with a reference sequence, except that the segment contains at least one helper mutation (the same in each of the probes) and at least one interrogation position
(different in all of the probes). For example, in the basic tiling strategy, a probe from the first probe set comprises a segment containing an interrogation position and showing perfect complementarity with a reference sequence except for one or two helper mutations. The corresponding probes from the second, third and fourth probe sets usually comprise the same segment (or sometimes a subsequence thereof including the helper mutation(s) and interrogation position), except that the base occupying the interrogation position varies in each probe. Usually, the helper mutation tiling strategy is used in conjunction with one of the tiling strategies described above. The probes containing helper mutations are used to tile regions of a reference sequence otherwise giving low hybridization signal (e.g., because of self-complementarity), and the alternative tiling strategy is used to tile intervening regions.

Pooling Strategies

Pooling strategies of the invention can also employ arrays of immobilized probes. Probes can be immobilized in cells of an array, and the hybridization signal of each cell can be determined independently of any other cell. A particular cell may be occupied by a pooled mixture of probes. Although the identity of each probe in the mixture is known, the individual probes in the pool are not separately addressable. Thus, the hybridization signal from a cell is the aggregate of that of the different probes occupying the cell. In general, a cell is scored as hybridizing to a target sequence if at least one probe occupying the cell comprises a segment exhibiting perfect complementarity to the target sequence. A simple strategy to show the increased power of pooled strategies over a standard tiling is to create three cells each containing a pooled probe having a single pooled position, the pooled position being the same in each of the pooled probes.
At the pooled position, there are two possible nucleotides, allowing the pooled probe to hybridize to two target sequences. In tiling terminology, the pooled position of each probe is an interrogation position. As will become apparent, comparison of the hybridization intensities of the pooled probes from the three cells reveals the identity of the nucleotide in the target sequence corresponding to the interrogation position (i.e., that is matched with the interrogation position when the target sequence and pooled probes are maximally aligned for complementarity). The three cells are assigned probe pools that are perfectly complementary to the target except at the pooled position, which is occupied by a different pooled nucleotide in each probe. With 3 pooled probes, all 4 possible single base pair states (wild and 3 mutants) are detected. A pool hybridizes with a target if some probe contained within that pool is complementary to that target. A cell containing a pair (or more) of oligonucleotides lights up when a target complementary to any of the oligonucleotides in the cell is present. Using the simple strategy, each of the four possible targets (wild and three mutants) yields a unique hybridization pattern among the three cells. Since a different pattern of hybridizing pools is obtained for each possible nucleotide in the target sequence corresponding to the pooled interrogation position in the probes, the identity of the nucleotide can be determined from the hybridization pattern of the pools. Whereas a standard tiling requires four cells to detect and identify the possible single-base substitutions at one location, this simple pooled strategy only requires three cells. In another aspect, the pooling strategy for sequence analysis is the 'Trellis' strategy. In this strategy, each pooled probe has a segment of perfect complementarity to a reference sequence except at three pooled positions. One pooled position is an N pool.
The three pooled positions may or may not be contiguous in a probe. The other two pooled positions are selected from the group of three pools consisting of (1) M or K, (2) R or Y and (3) W or S, where the single letters are IUPAC standard ambiguity codes. The sequence of a pooled probe is thus of the form XXXN[(M/K) or (R/Y) or (W/S)][(M/K) or (R/Y) or (W/S)]XXXXX, where XXX represents bases complementary to the reference sequence. The three pooled positions may be in any order, and may be contiguous or separated by intervening nucleotides. For the two positions occupied by [(M/K) or (R/Y) or (W/S)], two choices must be made. First, one must select one of the following three pairs of pooled nucleotides: (1) M/K, (2) R/Y and (3) W/S. The one of three pooled nucleotides selected may be the same or different at the two pooled positions. Second, supposing, for example, one selects M/K at one position, one must then choose between M and K. This choice should result in selection of a pooled nucleotide comprising a nucleotide that complements the corresponding nucleotide in a reference sequence, when the probe and reference sequence are maximally aligned. The same principle governs the selection between R and Y, and between W and S. A trellis pool probe has one pooled position with four possibilities, and two pooled positions, each with two possibilities. Thus, a trellis pool probe comprises a mixture of 16 (4 x 2 x 2) probes. Since each pooled position includes one nucleotide that complements the corresponding nucleotide from the reference sequence, one of these 16 probes has a segment that is the exact complement of the reference sequence. A target sequence that is the same as the reference sequence (i.e., a wildtype target) gives a hybridization signal to each probe cell.
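The count of 16 component probes in a trellis pool can be reproduced by expanding the IUPAC ambiguity codes. The sketch below assumes DNA bases only; the helper name `expand_pool` and the example pooled probe are hypothetical, with the flanking bases standing in for the XXX and XXXXX segments that complement the reference sequence.

```python
from itertools import product
from typing import List

# Standard IUPAC ambiguity codes used by the trellis strategy (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "K": "GT",
    "R": "AG", "Y": "CT",
    "W": "AT", "S": "CG",
    "N": "ACGT",
}


def expand_pool(pooled_probe: str) -> List[str]:
    """List every component probe encoded by a pooled probe sequence."""
    choices = (IUPAC[symbol] for symbol in pooled_probe)
    return ["".join(combo) for combo in product(*choices)]


pooled = "ACGNMRTTGCA"          # hypothetical pool: N, then M, then R
members = expand_pool(pooled)
print(len(members))              # 4 x 2 x 2 component probes

# If the exact complement of the reference has A at all three pooled
# positions, that component is present because N, M and R each include A.
print("ACGAAATTGCA" in members)
```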
Here, as in other tiling methods, the segment of complementarity should be sufficiently long to permit specific hybridization of a pooled probe to a reference sequence to be detected relative to a variant of that reference sequence. Typically, the segment of complementarity is about 9-21 nucleotides. A target sequence is analyzed by comparing hybridization intensities at three pooled probes, each having the structure described above. The segments complementary to the reference sequence present in the three pooled probes show some overlap. Sometimes the segments are identical (other than at the interrogation positions). However, this need not be the case. For example, the segments can tile across a reference sequence in increments of one nucleotide (i.e., one pooled probe differs from the next by the acquisition of one nucleotide at the 5' end and loss of a nucleotide at the 3' end). The three interrogation positions may or may not occur at the same relative positions within each pooled probe (i.e., spacing from a probe terminus). All that is required is that one of the three interrogation positions from each of the three pooled probes aligns with the same nucleotide in the reference sequence, and that this interrogation position is occupied by a different pooled nucleotide in each of the three probes. In one of the three probes, the interrogation position is occupied by an N. In the other two pooled probes the interrogation position is occupied by one of (M/K) or (R/Y) or (W/S). In the simplest form of the trellis strategy, three pooled probes are used to analyze a single nucleotide in the reference sequence. Much greater economy of probes is achieved when more pooled probes are included in an array. For example, consider an array of five pooled probes each having the general structure outlined above. Three of these pooled probes have an interrogation position that aligns with the same nucleotide in the reference sequence and are used to read that nucleotide.
A different combination of three probes has an interrogation position that aligns with a different nucleotide in the reference sequence. Comparison of these three probe intensities allows analysis of this second nucleotide. Still another combination of three pooled probes from the set of five has an interrogation position that aligns with a third nucleotide in the reference sequence, and these probes are used to analyze that nucleotide. Thus, three nucleotides in the reference sequence are fully analyzed from only five pooled probes. By comparison, the basic tiling strategy would require 12 probes for a similar analysis. The trellis strategy can employ an array of probes having at least three cells, each of which is occupied by a pooled probe as described above. Consider the use of three such pooled probes for analyzing a target sequence, of which one position may contain any single base substitution to the reference sequence (i.e., there are four possible target sequences to be distinguished). Three cells are occupied by pooled probes having a pooled interrogation position corresponding to the position of possible substitution in the target sequence, one cell with an N', one cell with one of M' or K', and one cell with R' or Y'. An interrogation position corresponds to a nucleotide in the target sequence if it aligns adjacent with that nucleotide when the probe and target sequence are aligned to maximize complementarity. Note that although each of the pooled probes has two other pooled positions, these positions are not relevant for the present illustration. The positions are only relevant when more than one position in the target sequence is to be read, a circumstance that will be considered later. For present purposes, the cell with the N' in the interrogation position lights up for the wildtype sequence and any of the three single base substitutions of the target sequence. A further class of strategies involving pooled probes is termed coding strategies.
These strategies assign code words from some set of numbers to variants of a reference sequence. Any number of variants can be coded. The variants can include multiple closely spaced substitutions, deletions or insertions. The designation letters or other symbols assigned to each variant may be any arbitrary set of numbers, in any order. For example, a binary code is often used, but codes to other bases are entirely feasible. The numbers are often assigned such that each variant has a designation having at least one digit and at least one nonzero value for that digit. For example, in a binary system, a variant assigned the number 101 has a designation of three digits, with one possible nonzero value for each digit. The designations of the variants are coded into an array of pooled probes comprising a pooled probe for each nonzero value of each digit in the numbers assigned to the variants. For example, if the variants are assigned successive numbers in a numbering system of base m, and the highest number assigned to a variant has n digits, the array would have about n x (m - 1) pooled probes. In general, log_m(3N+1) probes are required to analyze all variants of N locations in a reference sequence, each having three possible mutant substitutions. For example, 10 base pairs of sequence may be analyzed with only 5 pooled probes using a binary coding system. Each pooled probe has a segment exactly complementary to the reference sequence except that certain positions are pooled. The segment should be sufficiently long to allow specific hybridization of the pooled probe to the reference sequence relative to a mutated form of the reference sequence. As in other tiling strategies, segment lengths of 9-21 nucleotides are typical. Often the probe has no nucleotides other than the 9-21 nucleotide segment. The pooled positions comprise nucleotides that allow the pooled probe to hybridize to every variant assigned a particular nonzero value in a particular digit.
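The probe counts quoted for the coding strategy can be reproduced with integer arithmetic. This sketch rounds log_m(3N+1) up to a whole number of digits, which is an assumption about how the expression is meant to be applied; the function names and the convention of numbering digits from the left are likewise hypothetical.

```python
from typing import List, Tuple


def digits_needed(n_positions: int, base: int = 2) -> int:
    """Smallest number of base-`base` digits that can label 3N + 1 sequences.

    3N + 1 counts the reference plus three substitutions at each of N positions.
    """
    variants = 3 * n_positions + 1
    digits, capacity = 0, 1
    while capacity < variants:
        capacity *= base
        digits += 1
    return digits


def array_size(n_digits: int, base: int) -> int:
    """Approximate pooled-probe count: one pool per nonzero value of each digit."""
    return n_digits * (base - 1)


def pools_for_code(code: str) -> List[Tuple[int, int]]:
    """(digit index, value) pairs of the pools that encode a variant's code word."""
    return [(i, int(d)) for i, d in enumerate(code) if d != "0"]


n = digits_needed(10, base=2)
print(n, array_size(n, 2))        # 10 positions with a binary code
print(pools_for_code("101"))      # variant 101 maps to the first and third digits
```

For 10 positions and a binary code, 3N + 1 = 31 and 2^5 = 32, which agrees with the five pooled probes stated in the text.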
Usually, the pooled positions further comprise a nucleotide that allows the pooled probe to hybridize to the reference sequence. Thus, a wildtype target (or reference sequence) is immediately recognizable from all the pooled probes being lit. When a target is hybridized to the pools, only those pools comprising a component probe having a segment that is exactly complementary to the target light up. The identity of the target is then decoded from the pattern of hybridizing pools. Each pool that lights up is correlated with a particular value in a particular digit. Thus, the aggregate hybridization patterns of each lighting pool reveal the value of each digit in the code defining the identity of the target hybridized to the array.

Bridging Strategy

Probes that contain partial matches to two separate (i.e., noncontiguous) subsequences of a target sequence sometimes hybridize strongly to the target sequence. In certain instances, such probes have generated stronger signals than probes of the same length that are perfect matches to the target sequence. It is believed (but not necessary to the invention) that this observation results from interactions of a single target sequence with two or more probes simultaneously. This invention exploits this observation to provide arrays of probes having at least first and second segments, which are respectively complementary to first and second subsequences of a reference sequence. Optionally, the probes may have a third or more complementary segments. These probes can be employed in any of the strategies noted above. The two segments of such a probe can be complementary to disjoint subsequences of the reference sequences or contiguous subsequences. If the latter, the two segments in the probe are inverted relative to the order of the complement of the reference sequence. The two subsequences of the reference sequence each typically comprise about 3 to 30 contiguous nucleotides.
The subsequences of the reference sequence are sometimes separated by 0, 1, 2 or 3 bases. Often the subsequences are adjacent and nonoverlapping. The bridging strategy can offer the following advantages: (1) higher discrimination between matched and mismatched probes; (2) the possibility of using longer probes in a bridging tiling, thereby increasing the specificity of the hybridization without sacrificing discrimination; (3) the use of probes in which an interrogation position is located very off-center relative to the regions of target complementarity, which may be of particular advantage when, for example, a probe centered about one region of the target gives a low hybridization signal, the low signal being overcome by using a probe centered about an adjoining region giving a higher hybridization signal; and (4) disruption of secondary structure that might result in annealing of certain probes (see previous discussion of helper mutations).
Deletion Tiling
The invention also provides a deletion tiling strategy. Deletion tiling is related to both the bridging and helper mutant strategies described above. In the deletion strategy, comparisons are performed between probes sharing a common deletion but differing from each other at an interrogation position located outside the deletion. For example, a first probe comprises first and second segments, each exactly complementary to respective first and second subsequences of a reference sequence, wherein the first and second subsequences of the reference sequence are separated by a short distance (e.g., 1 or 2 nucleotides). The order of the first and second segments in the probe is usually the same as that of the complement to the first and second subsequences in the reference sequence. Such tilings sometimes offer superior discrimination in hybridization intensities between the probe having an interrogation position complementary to the target and other probes.
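By way of illustration only, a deletion probe of the kind just described can be sketched as below. The function names, the 0-based coordinates and the use of the reverse complement to represent the probe are assumptions made for this example, not details given in the specification. The probe is built from two reference subsequences separated by a short gap, with the gap nucleotides omitted and the segment order following that of the complement.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def deletion_probe(reference, start, len1, gap, len2):
    """Build a probe complementary to two reference subsequences that are
    separated by `gap` nucleotides; the gap itself is left out of the probe.

    `start` is the 0-based index of the first subsequence (an assumption of
    this sketch). The two segments keep the order of the complement.
    """
    if start < 0 or gap < 0:
        raise ValueError("start and gap must be non-negative")
    sub1 = reference[start:start + len1]
    sub2_start = start + len1 + gap
    sub2 = reference[sub2_start:sub2_start + len2]
    if len(sub1) != len1 or len(sub2) != len2:
        raise ValueError("subsequences run past the end of the reference")
    return reverse_complement(sub1 + sub2)
```

Setting `gap` to 1 or 2 corresponds to the short separations mentioned in the text; a set of such probes differing only at one position outside the deletion would form the comparison group.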
Thermodynamically, the difference between the hybridizations to matched and mismatched targets for the probe set shown above is the difference between a single-base bulge and a large asymmetric loop (e.g., two bases of target, one of probe). This often results in a larger difference in stability than the comparison of a perfectly matched probe with a probe showing a single base mismatch in the basic tiling strategy. The use of deletion or bridging probes is quite general. These probes can be used in any of the tiling strategies of the invention. As well as offering superior discrimination, the use of deletion or bridging strategies is advantageous for certain probes to avoid self-hybridization (either within a probe or between two probes of the same sequence).
Preparation of Target Samples
The target polynucleotide, whose sequence is to be determined, is usually isolated from a tissue sample. If the target is genomic, the sample may be from any tissue (except exclusively red blood cells). For example, whole blood, peripheral blood lymphocytes or PBMC, skin, hair or semen are convenient sources of clinical samples. These sources are also suitable if the target is RNA. Blood and other body fluids are also a convenient source for isolating viral nucleic acids. If the target is mRNA, the sample is obtained from a tissue in which the mRNA is expressed. If the polynucleotide in the sample is RNA, it is usually reverse transcribed to DNA. DNA samples or cDNA resulting from reverse transcription are usually amplified, e.g., by PCR. Depending on the selection of primers and amplifying enzyme(s), the amplification product can be RNA or DNA. Paired primers are selected to flank the borders of a target polynucleotide of interest. More than one target can be simultaneously amplified by multiplex PCR in which multiple paired primers are employed. The target can be labeled at one or more nucleotides during or after amplification.
For some target polynucleotides (depending on size of sample), e.g., episomal DNA, sufficient DNA is present in the tissue sample to dispense with the amplification step. When the target strand is prepared in single-stranded form, as in preparation of target RNA, the sense of the strand should of course be complementary to that of the probes on the chip. This is achieved by appropriate selection of primers. The target can be fragmented before application to the chip to reduce or eliminate the formation of secondary structures in the target. The average size of target segments following hybridization is usually larger than the size of the probes on the chip.
Sequencing
This invention provides a method of performing whole cell engineering that comprises the step of cell screening. In one aspect, the step of cell screening may comprise the step of genomic sequencing. In one exemplification, genome sequencing can be accomplished according to the enzymatic/Sanger method (described in F. Sanger, S. Nicklen, and A. R. Coulson, Proc. Natl. Acad. Sci. USA, 74:5463-5467 (1977)) and involve cloning and subcloning (described in U.S. Patent No. 4725677; Chen and Seeburg, DNA 4, 165-170 (1985); Lim et al., Gene Anal. Techn. 5, 32-39 (1988); PCR Protocols: A Guide to Methods and Applications, Innis et al., editors, Academic Press, San Diego (1990); Innis et al., Proc. Nat. Acad. Sci. USA 85, 9436-9440 (1988)). In another exemplification, sequencing can be accomplished according to the chemical/Maxam and Gilbert method, which is described in references: A. M. Maxam and W. Gilbert, Proc. Nat. Acad. Sci. USA, 74:560-564 (1977) and
Church et al., Proc. Natl. Acad. Sci., 81:1991 (1984). In additional exemplifications, genome sequencing can be accomplished by methodology described by Guo and Wu (Guo and Wu, Nucleic Acids Res., 10:2065 (1982); and Meth. Enz., 100:60 (1983)) or those methods that utilize 3'-hydroxy-protected and labeled nucleotides as exemplified in the following references: Churchich, J.E., Eur. J. Biochem., 231:736 (1995);
Metzker, M.L., et al., Nucleic Acids Research, 22:4259 (1994); Beabealashvilli, R.S. et al., Biochimica et Biophysica Acta, 868:136 (1986); Chidgeavadze, Z.G.; Kukhanova, M.K. et al., Biochimica et Biophysica Acta, 868:145 (1986); Hiratsuka, T., Biochimica et Biophysica Acta, 742:496 (1983); Jeng, S.J. and Guillory, R.J., J. Supramolecular Structure, 3:448 (1975). The invention also provides that sequencing may be read by autoradiography using radioisotopes (as described in Ornstein et al., Biotechniques 2, 476 (1985)) or by using non-radioactive labeling strategies that have been integrated into partly automated DNA sequencing procedures (Smith et al., Nature 321, 674-679 (1986) and EPO Patent No. 873 00998.9; Du Pont De Nemours EPO Application No. 03 59225; Ansorge et al., J. Biochem. Biophys. Method 13, 325-32 (1986); Prober et al., Science 238, 336-41 (1987); Applied Biosystems, PCT Application WO 91/05060; Smith et al., Science 235, G89 (1987); U.S. Patent Nos. 570973 and 689013), Du Pont De Nemours, U.S. Patents Nos. 881372 and 57566, Ansorge et al., Nucleic Acids Res. 15, 4593-4602 (1987) and EMBL Patent Application DE P3724442 and P3805808.1) and Hitachi (JP 1-90844 and DE 4011991 A1; U.S. Patent No. 4,729,947; PCT Application WO 92/02635; U.S. Patent No. 594676; Beck, O'Keefe, Coull and Köster, Nucleic Acids Res. 17, 5115-5123 (1989) and Beck and Köster, Anal. Chem. 62, 2258-2270 (1990); Church et al., Science 240, 185-188 (1988); Köster et al., Nucleic Acids Res. Symposium Ser. No. 24, 318-321 (1991), University of Utah, PCT Application No. WO 90/15883; Smith et al., Nature (1986) 321:674-679; Orion-Yhtyma Oy, U.S. Patent No. 277643; M. Uhlen et al., Nucleic Acids Res. 16, 3025-38 (1988); Cemu Bioteknik, PCT Application No. WO 89/09282 and Medical Research Council, GB, PCT Application No. WO 92/03575; Du Pont De Nemours, PCT Application WO 91/11533).
In addition, this invention provides for various methods of reading sequencing data such as capillary zone electrophoresis (described in Jorgenson et al., J. Chromatography 352, 337 (1986); Gesteland et al., Nucleic Acids Res. 18, 1415-1419 (1990)), mass spectrometry (including ES [described in Fenn et al., J. Phys. Chem. 18, 4451-59 (1984); PCT Application No. WO 90/14148; R.D. Smith et al., Anal. Chem. 62, 882-89 (1990) and B. Ardrey, Electrospray Mass Spectrometry, Spectroscopy Europe 4, 10-18 (1992)] and MALDI [Hillenkamp et al., Matrix Assisted UV-Laser Desorption/Ionization: A New Approach to Mass Spectrometry of Large Biomolecules, Biological Mass Spectrometry (Burlingame and McCloskey, editors), Elsevier Science Publishers, Amsterdam, pp. 49-60 (1990); Williams et al., Science, 246, 1585-87 (1989); Williams et al., Rapid Communications in Mass Spectrometry, 4, 348-351 (1990)]), tube gel electrophoresis and a mass analyzer to sequence (described in EPO Patent Applications No. 0360676 A1 and 0360677). In order to analyze the sequencing data, this invention provides for the use of probes in large arrays (as described in PCT patent Publication No. 92/10588; U.S. Patent No. 5,143,854; U.S. Application Serial No. 07/805,727; U.S. Patent No. 5,202,231; PCT patent Publication No. 89/10977). The invention provides a method of performing whole cell engineering comprising the step of cell screening. In one aspect, the method includes DNA amplification. DNA can be amplified by a variety of procedures including cloning (Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, 1989), polymerase chain reaction (PCR) (C.R. Newton and A. Graham, PCR, BIOS Publishers, 1994; Bevan et al., "Sequencing of PCR-Amplified DNA" PCR Meth. App. 4:222 (1992)), ligase chain reaction (LCR) (F. Barany, Proc. Natl. Acad. Sci. USA 88, 189-93 (1991)), strand displacement amplification (SDA) (G. Terrance Walker et al., Nucleic Acids Res.
22, 2670-77 (1994)) and variations such as RT-PCR (Arens, M., Clin Microbiol Rev, 12(4):612-26 (1999)) and allele-specific amplification (ASA) (Nichols, W.C. et al., Genomics, Oct;5(3):535-40 (1989); Giffard, P.M. et al., Anal Biochem, 292(2):207-15 (2001)). In additional aspects, this invention provides for additional sequencing methods (as described in Labeit et al., DNA 5, 173-177 (1986); Amersham, PCT Application GB86/00349; Eckstein et al., Nucleic Acids Res. 1~, 9947 (1988); Max-Planck-Gesellschaft, DE 3930312 A1; Saiki, R. et al., Science 239:487-491 (1988); Sarkar, G. and Bolander, Mark E., Semi Exponential Cycle Sequencing, Nucleic Acids Research, 1995, Vol. 23, No. 7, p. 1269-1270). This invention also provides for the following sequencing strategies: shotgun sequencing, transposon-mediated directed sequencing (Strathmann, M. et al., Proc. Natl. Acad. Sci. USA (1991) 88:1247-1250), and large scale variations thereof (as exemplified in K. B. Mullis et al., U.S. Pat. Nos. 4,683,202; 7/1987; 435/91; and 4,683,195, 7/1987; 435/6). In alternative aspects, the step of genomic sequencing includes constructing ordered clone maps of DNA sequencing (as described in sections of U.S. Patent Publication No. 5604100 and PCT Patent Publication No. WO9627025). This invention provides that the method of genome sequencing be achieved by various steps that may utilize modifications of certain methods mentioned above (described in the following patents: PCT Publication Nos. WO9737041, WO9742348, WO9627025, WO9831834, WO9500530, and WO9831833; US Patent Publication Nos. US5604100, US5670321, US5453247, US5994058, and US5354656).
Annotating
In one aspect this invention provides for the use of a relational database system for storing and manipulating biomolecular sequence information and storing and displaying genetic information, the database including genomic libraries for a plurality of types of organisms, the libraries having multiple genomic sequences, at least some of which represent open reading frames located along a contiguous sequence on each of the plurality of organisms' genomes, and a user interface capable of receiving a selection of two or more of the genomic libraries for comparison and displaying the results of the comparison. Associated with the database is a software system that allows a user to determine the relative position of a selected gene sequence within a genome. The system allows execution of a method of displaying the genetic locus of a biomolecular sequence. The method involves providing a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. The system also provides a user interface capable of receiving a selection of one or more probe open reading frames for use in determining homologous matches between such probe open reading frame(s) and the open reading frames in the genomic libraries, and displaying the results of the determination. An open reading frame for the sequence is selected and displayed together with adjacent open reading frames located upstream and downstream in the relative positions in which they occur on the contiguous sequence. In one aspect, the invention provides a relational database system for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to one or more protein function hierarchies. The hierarchies allow searches for sequences based upon a protein's biological function or molecular function.
Also disclosed is a mechanism for automatically grouping new sequences into protein function hierarchies. This mechanism uses descriptive information obtained from "external hits" which are matches of stored sequences against gene sequences stored in an external database such as GenBank. The descriptive information provided with the external database is evaluated according to a specific algorithm and used to automatically group the external hits (or the sequences associated with the hits) in the categories. Ultimately, the biomolecular sequences stored in databases of this invention are provided with both descriptive information from the external hit and category information from a relevant hierarchy or hierarchies. Disclosed is a relational database system for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to association with one or more projects for obtaining full-length biomolecular sequences from shorter sequences. The relational database has sequence records containing information identifying one or more projects to which each of the sequence records belong. Each project groups together one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The computer system has a user interface allowing a user to selectively view information regarding one or more projects. The relational database also provides interfaces and methods for accessing and manipulating and analyzing project-based information. Polymer sequences can be assembled into bins. A first number of bins are populated with polymer sequences. The polymer sequences in each bin are assembled into one or more consensus sequences representative of the polymer sequences of the bin. The consensus sequences of the bins are compared to determine relationships, if any, between the consensus sequences of the bins. 
The bins are modified based on the relationships between the consensus sequences of the bins. The polymer sequences are reassembled in the modified bins to generate one or more modified consensus sequences for each bin representative of the modified bins. In another aspect of the invention, sequence similarities and dissimilarities are analyzed in a set of polymer sequences. Pairwise alignment data is generated for pairs of the polymer sequences. The pairwise alignment data defines regions of similarity between the pairs of polymer sequences with boundaries. Additional boundaries in particular polymer sequences are determined by applying at least one boundary from at least one pairwise alignment for one pair of polymer sequences to at least one other pairwise alignment for another pair of polymer sequences including one of the particular polymer sequences. Additional regions of similarity are generated based on the boundaries.
ANNOTATING - GENERAL METHODOLOGY
In one aspect the invention provides relational databases for storing and retrieving biological information. More particularly the invention relates to systems and methods for providing sequences of biological molecules in a relational format allowing retrieval in a client-server environment and for providing full-length cDNA sequences in a relational format allowing retrieval in a client-server environment.
ANNOTATING - EXEMPLARY ASPECTS
The annotation methods of this invention include those described in PCT patent publication Nos. 98/26407, 98/26408, and 99/49403 and United States Patent Nos. 6,023,659 and 5,953,727. Thus, in one aspect, the present invention provides relational database systems for storing and analyzing biomolecular sequence information together with biological annotations detailing the source and interpretation of the sequence data. The present invention provides a powerful database tool for drug development and other research and development purposes.
The present invention provides relational database systems for storing and analyzing biomolecular sequence information together with biological annotations detailing the source and interpretation of the sequence data. Disclosed is a relational database system for storing and displaying genetic information. Associated with the database is a software system that allows a user to determine the relative position of a selected gene sequence within a genome. The system allows execution of a method of displaying the genetic locus of a biomolecular sequence. The method involves providing a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. An open reading frame for the sequence is selected and displayed together with adjacent open reading frames located upstream and downstream in the relative positions in which they occur on the contiguous sequence. The invention provides a method of displaying the genetic locus of a biomolecular sequence. The method involves providing a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. The method further involves identifying a selected open reading frame, and displaying the selected open reading frame together with adjacent open reading frames located upstream and downstream from the selected open reading frame. The adjacent open reading frames and the selected open reading frame are displayed in the relative positions in which they occur on the contiguous sequence, textually and/or graphically. The method of the invention may be practiced with sequences from microbial organisms, and the sequences may include nucleic acid or protein sequences.
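By way of illustration only, the locus display method described above can be sketched with a small relational table. The table layout, column names and function names are hypothetical, and the sketch assumes that "upstream" means a lower start coordinate on the same contiguous sequence, which the specification does not state.

```python
import sqlite3


def build_orf_db(orfs):
    """Create an in-memory table of open reading frames.
    Each row is (name, contig, start, stop); this schema is an assumption."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orf (name TEXT PRIMARY KEY, contig TEXT, "
        "start INTEGER, stop INTEGER)"
    )
    conn.executemany("INSERT INTO orf VALUES (?, ?, ?, ?)", orfs)
    return conn


def locus_view(conn, name, flank=1):
    """Return the selected ORF together with up to `flank` adjacent ORFs on
    each side, listed in the order in which they occur on the contig."""
    row = conn.execute(
        "SELECT contig, start FROM orf WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    contig, start = row
    upstream = conn.execute(
        "SELECT name FROM orf WHERE contig = ? AND start < ? "
        "ORDER BY start DESC LIMIT ?",
        (contig, start, flank),
    ).fetchall()
    downstream = conn.execute(
        "SELECT name FROM orf WHERE contig = ? AND start > ? "
        "ORDER BY start ASC LIMIT ?",
        (contig, start, flank),
    ).fetchall()
    # Upstream rows come back nearest-first, so reverse them for display.
    return [r[0] for r in reversed(upstream)] + [name] + [r[0] for r in downstream]
```

Selecting a different ORF name and calling `locus_view` again gives the textual form of the display; a graphical display would draw the same ordered list to scale.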
The invention also provides a computer system including a database having multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. The computer system also includes a user interface capable of identifying a selected open reading frame, and displaying the selected open reading frame together with adjacent open reading frames located upstream and downstream from the selected open reading frame. The adjacent open reading frames and the selected open reading frame are displayed in the relative positions in which they occur on the contiguous sequence. The user interface may also be capable of detecting a scrolling command, and based upon the direction and magnitude of the scrolling command, identifying a new selected open reading frame from the contiguous sequence. The invention further provides a computer program product comprising a computer-usable medium having computer-readable program code embodied thereon relating to a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. The computer program product includes computer-readable program code for identifying a selected open reading frame, and displaying the selected open reading frame together with adjacent open reading frames located upstream and downstream from the selected open reading frame. The adjacent open reading frames and the selected open reading frame are displayed in the relative positions in which they occur on the contiguous sequence. Comparative Genomics is a feature of the database system of the present invention which allows a user to compare the sequence data of sets of different organism types. Comparative searches may be formulated in a number of ways using the Comparative Genomics feature.
For example, genes common to a set of organisms may be identified through a "commonality" query, and genes unique to one of a set of organisms may be identified through a "subtraction" query. Electronic Southern is a feature of the present database system which is useful for identifying genomic libraries in which a given gene or ORF exists. A Southern analysis is a conventional molecular biology technique in which a nucleic acid of known sequence is used to identify matching (complementary) sequences in a sample of nucleic acid to be analyzed. Like their laboratory counterparts, Electronic Southerns according to the present invention may be used to locate homologous matches between a "probe" DNA sequence and a large number of DNA sequences in one or more libraries. The present invention provides a method of comparing genetic complements of different types of organisms. The method involves providing a database having sequence libraries with multiple biomolecular sequences for different types of organisms, where at least some of the sequences represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The method further involves receiving a selection of two or more of the sequence libraries for comparison, determining open reading frames common or unique to the selected sequence libraries, and displaying the results of the determination. The invention also provides a method of comparing genomic complements of different types of organisms. The method involves providing a database having genomic sequence libraries with multiple biomolecular sequences for different types of organisms, where at least some of the sequences represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes.
The method further involves receiving a selection of two or more of the sequence libraries for comparison, determining sequences common or unique to the selected sequence libraries, and displaying the results of the determination. The invention further provides a computer system including a database containing genomic libraries for different types of organisms, which libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The system also includes a user interface capable of receiving a selection of two or more genomic libraries for comparison and displaying the results of the comparison. Another aspect of the present invention provides a method of identifying libraries in which a given gene exists. The method involves providing a database including genomic libraries for one or more types of organisms. The libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The method further involves receiving a selection of one or more probe sequences, determining homologous matches between the selected probe sequences and the sequences in the genomic libraries, and displaying the results of the determination. The invention also provides a computer system including a database including genomic libraries for one or more types of organisms, which libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The system also includes a user interface capable of receiving a selection of one or more probe sequences for use in determining homologous matches between one or more probe sequences and the sequences in the genomic libraries, and displaying the results of the determination.
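By way of illustration only, the "commonality" and "subtraction" queries of the Comparative Genomics feature can be sketched with set operations. The sketch assumes that each genomic library has already been reduced to a set of identifiers for groups of related open reading frames; how homologous ORFs are grouped is outside its scope, and the function and parameter names are hypothetical.

```python
def commonality(libraries, selected):
    """ORF groups present in every one of the selected libraries.

    `libraries` maps a library name to a set of ORF group identifiers
    (an assumed representation)."""
    if not selected:
        raise ValueError("select at least one library")
    sets = [libraries[name] for name in selected]
    return set(sets[0]).intersection(*sets[1:])


def subtraction(libraries, target, others):
    """ORF groups found in the target library but in none of the others."""
    result = set(libraries[target])
    for name in others:
        result -= libraries[name]
    return result
```

A user interface of the kind described would pass the user's library selection to one of these functions and display the returned identifiers.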
Also provided is a computer program product including a computer-usable medium having computer-readable program code embodied thereon relating to a database including genomic libraries for one or more types of organisms. The libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The computer program product includes computer-readable program code for providing, within a computing system, an interface for receiving a selection of two or more genomic libraries for comparison, determining sequences common or unique to the selected genomic libraries, and displaying the results of the determination. Additionally provided is a computer program product including a computer-usable medium having computer-readable program code embodied thereon relating to a database including genomic libraries for one or more types of organisms. The libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The computer program product includes computer-readable program code for providing, within a computing system, an interface for receiving a selection of one or more probe open reading frames, determining homologous matches between the probe sequences and the sequences in the genomic libraries, and displaying the results of the determination. The invention further provides a method of presenting the genetic complement of an organism. The method involves providing a database including sequence libraries for a plurality of types of organisms, where the libraries have multiple biomolecular sequences, at least some of which represent open reading
frames located along one or more contiguous sequences on each of the organisms' genomes. The method further involves receiving a selection of one of the sequence libraries, determining open reading frames within the selected sequence library, and displaying the results as one or more unique identifiers for groups of related open reading frames. The present invention provides relational database systems for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to one or more protein function hierarchies. The hierarchies are provided to allow carefully tailored searches for sequences based upon a protein's biological function or molecular function. To make this capability available in large sequence databases, the invention provides a mechanism for automatically grouping new sequences into protein function hierarchies. This mechanism takes advantage of descriptive information obtained from "external hits" which are matches of stored sequences against gene sequences stored in an external database such as GenBank. The descriptive information provided with GenBank is evaluated according to a specific algorithm and used to automatically group the external hits (or the sequences associated with the hits) in the categories. Ultimately, the biomolecular sequences stored in databases of this invention are provided with both descriptive information from the external hit and category information from a relevant hierarchy or hierarchies. The invention provides a computer system having a database containing records pertaining to a plurality of biomolecular sequences. At least some of the biomolecular sequences are grouped into a first hierarchy of protein function categories, the protein function categories specifying biological functions of proteins corresponding to the biomolecular sequences and the first hierarchy.
The hierarchy includes a first set of protein function categories specifying biological functions at a cellular level, and a second set of protein function categories specifying biological functions at a level above the cellular level. The computer system of the invention also includes a user interface allowing a user to selectively view information regarding the plurality of biomolecular sequences as it relates to the first hierarchy. The computer system may also include additional protein function categories based, for example, on molecular or enzymatic function of proteins. The biomolecular sequences may include nucleic acid or amino acid sequences. Some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about such projects. The invention also provides a method of using a computer system to present information pertaining to a plurality of biomolecular sequence records stored in a database. The method involves displaying a list of the records or a field for entering information identifying one or more of the records, identifying one or more of the records that a user has selected from the list or field, matching the one or more selected records with one or more protein function categories from a first hierarchy of protein function categories into which at least some of the biomolecular sequence records are grouped, and displaying the one or more categories matching the one or more selected records. The protein function categories specify biological functions of proteins corresponding to the biomolecular sequences, and the first hierarchy includes a first set of protein function categories specifying biological functions at a cellular level, and a second set of protein function categories specifying biological functions at a tissue level.
The method may also involve matching the records against other protein function hierarchies, such as hierarchies based on molecular and/or enzymatic function, and displaying the results. At least some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about those projects. Additionally, the invention provides a method of using a computer system to present information pertaining to a plurality of biomolecular sequence records stored in a database. The method involves displaying a list of one or more protein biological function categories from a first hierarchy of protein biological function categories into which at least some of the biomolecular sequence records are grouped, identifying one or more of the protein biological function categories that a user has selected from the list, matching the one or more selected protein biological function categories with one or more biomolecular sequence records which are grouped in the selected protein biological function categories, and displaying the one or more sequence records matching the one or more selected protein biological function categories. The protein biological function categories specify biological functions of proteins corresponding to the biomolecular sequences, and the first hierarchy includes a first set of protein biological function categories specifying biological functions at a cellular level, and a second set of protein biological function categories specifying biological functions at a tissue level. The method may also involve matching the records against other protein function hierarchies, such as hierarchies based on molecular and/or enzymatic function, and displaying the results.
At least some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about those projects. Another aspect of the invention provides a database system having a plurality of internal records. The database includes a plurality of sequence records specifying biomolecular sequences, at least some of which records reference hits to an external database, which hits specify genes having sequences that at least partially match those of the biomolecular sequences. The database also includes a plurality of external hit records specifying the hits to the external database, and at least some of the records reference protein function hierarchy categories which specify at least one of biological functions of proteins or molecular functions of proteins. At least some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about those projects. Further aspects of the present invention provide a method of using a computer system and a computer readable medium having program instructions to automatically categorize biomolecular sequence records into protein function categories in an internal database. The method and program involve receiving descriptive information about a biomolecular sequence in the internal database from a record in an external database pertaining to a gene having a sequence that at least partially matches that of the biomolecular sequence. Next, a determination is made whether the descriptive information contains one or more terms matching one or more keywords associated with a first protein function category, the keywords being terms consistent with a classification in the first protein function category. 
When at least one keyword is found to match a term in the descriptive information, a determination is made whether the descriptive information contains a term matching one or more anti-keywords associated with the first protein function category, the anti-keywords being terms inconsistent with a classification in the first protein function category. Then, the biomolecular sequence is grouped in the first protein function category when the descriptive information contains a term matching a keyword but contains no term matching an anti-keyword. The present invention provides relational database systems for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to one or more characteristics. The sequence information of the database is generated by one or more "projects" which are concerned with identifying the full-length coding sequence of a gene (i.e., mRNA). The projects involve the extension of an initial sequenced portion of a clone of a gene of interest (e.g., an EST) by a variety of methods which use conventional molecular biological techniques, recently developed adaptations of these techniques, and certain novel database applications. Data accumulated in these projects may be provided to the database of the present invention throughout the course of the projects and may be available to database users (subscribers) throughout the course of these projects for research, product (i.e., drug) development, and other purposes. In one aspect, the database of the present invention and its associated projects may provide sequence and related data in amounts and forms not previously available. The present invention can make partial and full-length sequence information for a given gene available to a user both during the course of the data acquisition and once the full-length sequence of the gene has been elucidated. 
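The keyword and anti-keyword categorization described above can be sketched as follows. This is a minimal illustration only: the class name `FunctionCategory`, the function names, the case-insensitive substring matching, and the example kinase terms are all assumptions made for the sketch and are not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionCategory:
    """Hypothetical protein function category with keyword and anti-keyword lists."""
    name: str
    keywords: set = field(default_factory=set)       # terms consistent with the category
    anti_keywords: set = field(default_factory=set)  # terms inconsistent with the category


def matches_category(description: str, category: FunctionCategory) -> bool:
    """Return True when the description contains a keyword and no anti-keyword."""
    text = description.lower()
    # Step 1: at least one keyword must appear in the descriptive information.
    if not any(keyword.lower() in text for keyword in category.keywords):
        return False
    # Step 2: any anti-keyword vetoes the classification.
    return not any(anti.lower() in text for anti in category.anti_keywords)


def categorize(description: str, categories: list) -> list:
    """Return the names of all categories into which the description is grouped."""
    return [c.name for c in categories if matches_category(description, c)]


# Example with an assumed category definition.
kinase = FunctionCategory(
    name="protein kinase",
    keywords={"kinase"},
    anti_keywords={"kinase inhibitor"},
)
print(categorize("Serine/threonine protein kinase", [kinase]))
print(categorize("Cyclin-dependent kinase inhibitor 1", [kinase]))
```

In this sketch the anti-keyword check runs only after a keyword has matched, mirroring the order of the two determinations in the text.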
The database can provide a variety of tools for analysis and manipulation of the data, including Northern analysis and Expression summaries. The present invention should permit more complete and accurate annotation of sequence data, as well as the study of relationships between genes of different tissues, systems or organisms, and ultimately detailed expression studies of full-length gene sequences. The invention provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belong. Each project groups together one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The computer system also has a user interface allowing a user to selectively view information regarding one or more projects. The biomolecular sequences may include nucleic acid or amino acid sequences. The user interface may allow users to view at least three levels of project information including a project information results level listing at least some of the projects in said database, a sequence information results level listing at least some of the sequences associated with a given project, and a sequence retrieval results level sequentially listing monomers which comprise a given sequence. A method of using a computer system and a computer program product to present information pertaining to a plurality of sequence records stored in a database are also provided by the present invention. The sequence records contain information identifying one or more projects to which each of the sequence records belong. Each of the projects groups one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. 
The method and program involve providing an interface for entering query information relating to one or more projects, locating data corresponding to the entered query information, and displaying the data corresponding to the entered query information. Additionally, the invention provides a method of using a computer system to present information pertaining to a plurality of sequence records stored in a database. The sequence records contain information identifying one or more projects to which each of the sequence records belong. Each of the projects groups one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The method involves displaying a list of one or more project identifiers, determining which project identifier or identifiers from the list is selected by a user, then displaying a second list of one or more biomolecular sequence identifiers associated with the selected project identifier or identifiers, determining which sequence identifier or identifiers from the second list has been selected by a user, and displaying a third list of one or more sequences corresponding to the selected sequence identifier or identifiers. Following the display of the third list, a determination may be made whether and which sequence from the third list has been selected by a user. If a sequence is selected, a sequence alignment search of the selected sequence against other data-based sequences may be initiated, and the results of the alignment search displayed. For Electronic Northern analysis, the invention further provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belong, each of said projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. 
The system also has a user interface capable of allowing a user to select one or more project identifiers or project member identifiers specifying one or more sequences to be compared with one or more cDNA sequence libraries, and displaying matches resulting from that comparison.
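The Electronic Northern comparison described above, in which sequences belonging to a project are compared with cDNA sequence libraries and the matches are displayed, might be sketched as below. The specification does not state the comparison algorithm; shared k-mers are used here purely as a stand-in for a real alignment search, and the function names, the `k` parameter, and the example library contents are hypothetical.

```python
def kmers(sequence: str, k: int) -> set:
    """Return the set of all length-k substrings of a nucleotide sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def electronic_northern(project_sequences: list, libraries: dict, k: int = 8) -> dict:
    """Count, for each cDNA library, the clones sharing at least one k-mer
    with any sequence of the selected project (a crude proxy for a match)."""
    query_kmers = set()
    for sequence in project_sequences:
        query_kmers |= kmers(sequence, k)

    hits = {}
    for library_name, clones in libraries.items():
        hits[library_name] = sum(1 for clone in clones if kmers(clone, k) & query_kmers)
    return hits


# Example with made-up sequences and library names.
project = ["ATGGCCATTGTAATGGGCCGC"]
libraries = {
    "liver": ["TTTATGGCCATTGTAAAA", "CCCCCCCCCCCC"],
    "brain": ["GGGGGGGGGGGG"],
}
print(electronic_northern(project, libraries))
```

The per-library counts give a rough picture of where a gene's transcripts were observed; a production system would replace the k-mer test with a proper alignment search and score threshold.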
A method of using a computer system to present comparative information pertaining to a plurality of sequence records stored in a database is also provided by the present invention. The sequence records contain information identifying one or more projects to which each of the sequence records belong, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The method involves providing an interface capable of allowing a user to select one or more project identifiers or project member identifiers specifying one or more sequences, comparing the one or more specified sequences with one or more cDNA sequence libraries, and displaying matches resulting from the comparison. In addition, for Expression analysis, the invention provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belong, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The system also has a user interface allowing a user to view expression information pertaining to the projects by selecting one or more expression categories for a query, and displaying the result of the query. A method of using a computer system to view expression information pertaining to one or more projects, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence, is also provided by the invention. The computer system includes a database storing a plurality of sequence records, the sequence records containing information identifying one or more projects to which each of the sequence records belong. 
The method involves providing an interface which allows a user to select one or more expression categories as a query, locating projects belonging to the selected one or more expression categories, and displaying a list of located projects. The present invention provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belong, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. This computer system has a user interface allowing a user to selectively view information regarding said one or more projects and which displays information to a user in a format common to one or more other sequence databases. Polymer sequences are assembled into bins. A first number of bins are populated with polymer sequences. The polymer sequences in each bin are assembled into one or more consensus sequences representative of the polymer sequences of the bin. The consensus sequences of the bins are compared to determine relationships, if any, between the consensus sequences. The bins are modified based on the relationships between the consensus sequences. The polymer sequences are reassembled in the modified bins to generate one or more modified consensus sequences for each bin representative of the modified bins. In another aspect of the invention, sequence similarities and dissimilarities are analyzed in a set of polymer sequences. Pairwise alignment data is generated for pairs of the polymer sequences. The pairwise alignment data defines regions of similarity between the pairs of polymer sequences with boundaries. 
Additional boundaries in particular polymer sequences are determined by applying at least one boundary from at least one pairwise alignment for one pair of polymer sequences to at least one other pairwise alignment for another pair of polymer sequences including one of the particular polymer sequences. Additional regions of similarity are generated based on the boundaries.

ANNOTATING - RELATIONAL DATABASES
The present invention provides an improved relational database for storing and manipulating genomic sequence information. While the invention is described in terms of a database optimized for microbial data, it is by no means so limited. The invention may be employed to investigate data from various sources. For example, the invention covers databases optimized for other sources of sequence data, such as animal sequences (e.g., human, primate, rodent, amphibian, insect, etc.), plant sequences and microbial sequences. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without limitation to some of the specific details presented herein. Generally, the present invention provides an improved relational database for storing sequence information. The invention may be employed to investigate data from various sources. For example, it may catalogue animal sequences (e.g., human, primate, rodent, amphibian, insect, etc.), plant sequences, and microbial sequences.
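As a rough illustration of how a relational database could hold project-grouped sequence records and support the three levels of viewing described earlier (projects, the sequences in a project, and the monomers of a sequence), the sketch below uses an in-memory SQLite database. The table names, column names, and sample rows are hypothetical and are not taken from the specification.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with assumed project and sequence tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE project (
            project_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE sequence (
            sequence_id INTEGER PRIMARY KEY,
            project_id  INTEGER NOT NULL REFERENCES project(project_id),
            label       TEXT NOT NULL,
            residues    TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO project VALUES (?, ?)",
        [(1, "gene-A extension"), (2, "gene-B extension")],
    )
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?, ?, ?)",
        [
            (10, 1, "EST seed", "ATGGCC"),
            (11, 1, "extended clone", "ATGGCCATTGTA"),
            (20, 2, "EST seed", "GGCTTA"),
        ],
    )
    conn.commit()
    return conn


def list_projects(conn):
    """Level 1: project information results."""
    return conn.execute(
        "SELECT project_id, name FROM project ORDER BY project_id"
    ).fetchall()


def list_sequences(conn, project_id):
    """Level 2: sequences associated with a given project."""
    return conn.execute(
        "SELECT sequence_id, label FROM sequence "
        "WHERE project_id = ? ORDER BY sequence_id",
        (project_id,),
    ).fetchall()


def get_residues(conn, sequence_id):
    """Level 3: the monomers that make up a given sequence."""
    row = conn.execute(
        "SELECT residues FROM sequence WHERE sequence_id = ?", (sequence_id,)
    ).fetchone()
    return row[0] if row else None


conn = build_demo_db()
print(list_projects(conn))
print(list_sequences(conn, 1))
print(get_residues(conn, 11))
```

Keeping the project identifier as a foreign key on each sequence record is one simple way to express that every record identifies the project to which it belongs.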
Transcriptome analysis or RNA profiling
The characterization of RNA expression and transcript populations (the transcriptome) can be referred to as RNA profiling and/or expression profiling, utilizing high throughput techniques such as RNA differential displays and DNA microarrays. One potential method to characterize gene expression, SAGE (Serial Analysis of Gene Expression) utilizes combinatorial chemistry technology and short sequence tags in the screening of compound libraries. For further information see references: Burge, C.B. 2001. Chipping away at the transcriptome. Nat Genet, 27(3): 232-4; Hughes, T.R. and Shoemaker, D.D. 2001. DNA microarrays for expression profiling. Curr. Opin Chem. Biol., 5(1): 21-5; Yamamoto, M. et al. 2001. Use of serial analysis of gene expression (SAGE) technology. J Immunol Methods 250(1-2):45-66.

Screening and selecting nucleotides for protein binding
One aspect of the invention provides for screening methods that include the use of recombinant and in vitro chemical synthesis methods. In these hybrid methods, cell-free enzymatic machinery is employed to accomplish the in vitro synthesis of the library members (i.e., peptides or polynucleotides). In one type of method, RNA molecules with the ability to bind a predetermined protein or a predetermined dye molecule were selected by alternate rounds of selection and PCR amplification (Tuerk and Gold, 1990; Ellington and Szostak, 1990). A similar technique was used to identify DNA sequences which bind a predetermined human transcription factor (Thiesen and Bach, 1990; Beaudry and Joyce, 1992; PCT patent publications WO 92/05258 and WO 92/14843).

Proteomics
In another aspect of this invention, this invention relates to the emerging field of proteomics. Proteomics involves the qualitative and quantitative measurement of gene activity by detecting and quantitating expression at the protein level, rather than at the messenger RNA level. 
Proteomics also involves the study of non-genome encoded events, including the post-translational modification of proteins (including glycosylation or other modifications), interactions between proteins, and the location of proteins within a cell. The structure, function, and/or level of activity of the proteins expressed by the cell are also of interest. Essentially, proteomics involves the study of part or all of the status of the total protein contained within or secreted by a cell. Proteomics requires means of separating proteins in complex mixtures and identifying both low- and high-abundance species. Examples of powerful methods currently used to resolve complex protein mixtures are 2D gel electrophoresis, reverse phase HPLC, capillary electrophoresis, isoelectric focusing and related hybrid techniques. Commonly used protein identification techniques include N-terminal Edman sequencing and mass spectrometry (electrospray [ESI] or matrix-assisted laser desorption ionization [MALDI] MS) and sophisticated database search programs, such as SEQUEST™ (see, e.g., U.S. Patent Nos. 6,017,693 and 5,538,897), to identify proteins in World Wide Web protein and nucleic acid databases from the MS-MS spectra of their peptides. SEQUEST™ correlates uninterpreted tandem mass spectra of peptides with amino acid sequences from protein and nucleotide databases. SEQUEST™ can determine the amino acid sequence and thus the protein(s) and organism(s) that correspond to the mass spectrum being analyzed. SEQUEST™ uses algorithms described in U.S. Patent Nos. 6,017,693 and 5,538,897. Using a computer, the output of the mass spectrometry can be analyzed so as to link a gene and the particular protein for which it codes. This overall process is sometimes referred to as "functional genomics". For general information on proteome research, see, for example, J.S. Fruton, 1999, Proteins, Enzymes, Genes: The Interplay of Chemistry and Biology, Yale Univ. 
Pr.; Wilkins et al., 1997, Proteome Research: New Frontiers in Functional Genomics (Principles and Practice), Springer Verlag; A.J. Link, 1999, 2-D Proteome Analysis Protocols (Methods in Molecular Biology, 112, Humana Pr.); and Kamp et al., 1999, Proteome and Protein Analysis, Springer Verlag.

Signal Transduction
See also, James, Peter, "Protein identification in the post-genome era: the rapid rise of proteomics", Q. Rev. Biophysics, Vol. 30, No. 4, pp. 279-331 (1997).

Screening peptides: Peptide Display Methods
The present invention is further directed to a method for generating a selected mutant polynucleotide sequence (or a population of selected polynucleotide sequences) typically in the form of amplified and/or cloned polynucleotides, whereby the selected polynucleotide sequence(s) possess at least one desired phenotypic characteristic (e.g., encodes a polypeptide, promotes transcription of linked polynucleotides, binds a protein, and the like) which can be selected for. One method for identifying hybrid polypeptides that possess a desired structure or functional property, such as binding to a predetermined biological macromolecule (e.g., a receptor), involves the screening of a large library of polypeptides for individual library members which possess the desired structure or functional property conferred by the amino acid sequence of the polypeptide. One method of screening peptides involves the display of a peptide sequence, antibody, or other protein on the surface of a bacteriophage particle or cell. Generally, in these methods each bacteriophage particle or cell serves as an individual library member displaying a single species of displayed peptide in addition to the natural bacteriophage or cell protein sequences. 
Each bacteriophage or cell contains the nucleotide sequence information encoding the particular displayed peptide sequence; thus, the displayed peptide sequence can be ascertained by nucleotide sequence determination of an isolated library member. A well-known peptide display method involves the presentation of a peptide sequence on the surface of a filamentous bacteriophage, typically as a fusion with a bacteriophage coat protein. The bacteriophage library can be incubated with an immobilized, predetermined macromolecule or small molecule (e.g., a receptor) so that bacteriophage particles which present a peptide sequence that binds to the immobilized macromolecule can be differentially partitioned from those that do not present peptide sequences that bind to the predetermined macromolecule. The bacteriophage particles (i.e., library members) which are bound to the immobilized macromolecule are then recovered and replicated to amplify the selected bacteriophage sub-population for a subsequent round of affinity enrichment and phage replication. After several rounds of affinity enrichment and phage replication, the bacteriophage library members that are thus selected are isolated and the nucleotide sequence encoding the displayed peptide sequence is determined, thereby identifying the sequence(s) of peptides that bind to the predetermined macromolecule (e.g., receptor). Such methods are further described in PCT patent publications WO 91/17271, WO 91/18980, WO 91/19818 and WO 93/08278. The latter PCT publication describes a recombinant DNA method for the display of peptide ligands that involves the production of a library of fusion proteins with each fusion protein composed of a first polypeptide portion, typically comprising a variable sequence, that is available for potential binding to a predetermined macromolecule, and a second polypeptide portion that binds to DNA, such as the DNA vector encoding the individual fusion protein. 
When transformed host cells are cultured under conditions that allow for expression of the fusion protein, the fusion protein binds to the DNA vector encoding it. Upon lysis of the host cell, the fusion protein/vector DNA complexes can be screened against a predetermined macromolecule in much the same way as bacteriophage particles are screened in the phage-based display system, with the replication and sequencing of the DNA vectors in the selected fusion protein/vector DNA complexes serving as the basis for identification of the selected library peptide sequence(s). The displayed peptide sequences can be of varying lengths, typically from 3-5000 amino acids long or longer, frequently from 5-100 amino acids long, and often from about 8-15 amino acids long. A library can comprise library members having varying lengths of displayed peptide sequence, or may comprise library members having a fixed length of displayed peptide sequence. Portions or all of the displayed peptide sequence(s) can be random, pseudorandom, defined set kernel, fixed, or the like. The present display methods include methods for in vitro and in vivo display of single-chain antibodies, such as nascent scFv on polysomes or scfv displayed on phage, which enable large-scale screening of scfv libraries having broad diversity of variable region sequences and binding specificities. The present invention also provides random, pseudorandom, and defined sequence framework peptide libraries and methods for generating and screening those libraries to identify useful compounds (e.g., peptides, including single-chain antibodies) that bind to receptor molecules or epitopes of interest or gene products that modify peptides or RNA in a desired fashion. 
The random, pseudorandom, and defined sequence framework peptides are produced from libraries of peptide library members that comprise displayed peptides or displayed single-chain antibodies attached to a polynucleotide template from which the displayed peptide was synthesized. The mode of attachment may vary according to the specific aspect of the invention selected, and can include encapsulation in a phage particle or incorporation in a cell.

Screening that utilizes in vitro translation systems
An aspect of this invention provides for the use of in vitro translation during the step of screening. In vitro translation has been used to synthesize proteins of interest and has been proposed as a method for generating large libraries of peptides. These methods, generally comprising stabilized polysome complexes, are described further in PCT patent publications WO 88/08453, WO 90/05785, WO 90/07003, WO 91/02076, WO 91/05058, and WO 92/02536. Applicants have described methods in which library members comprise a fusion protein having a first polypeptide portion with DNA binding activity and a second polypeptide portion having the library member unique peptide sequence; such methods are suitable for use in cell-free in vitro selection formats, among others.

Affinity enrichment
One aspect of this invention provides for the use of affinity enrichment which allows a very large library of peptides and single-chain antibodies to be screened and the polynucleotide sequence encoding the desired peptide(s) or single-chain antibodies to be selected. The polynucleotide can then be isolated and shuffled to recombine combinatorially the amino acid sequence of the selected peptide(s) (or predetermined portions thereof) or single-chain antibodies (or just VHI, VLI or CDR portions thereof). 
Using these methods, one can identify a peptide or single-chain antibody as having a desired binding affinity for a molecule and can exploit the process of shuffling to converge rapidly to a desired high-affinity peptide or scfv. The peptide or antibody can then be synthesized in bulk by conventional means for any suitable use (e.g., as a therapeutic or diagnostic agent). A significant advantage of the present invention is that no prior information regarding an expected ligand structure is required to isolate peptide ligands or antibodies of interest. The peptide identified can have biological activity, which is meant to include at least specific binding affinity for a selected receptor molecule and, in some instances, will further include the ability to block the binding of other compounds, to stimulate or inhibit metabolic pathways, to act as a signal or messenger, to stimulate or inhibit cellular activity, and the like. The present invention also provides a method for shuffling a pool of polynucleotide sequences selected by affinity screening a library of polysomes displaying nascent peptides (including single-chain antibodies) for library members which bind to a predetermined receptor (e.g., a mammalian proteinaceous receptor such as, for example, a peptidergic hormone receptor, a cell surface receptor, an intracellular protein which binds to other protein(s) to form intracellular protein complexes such as hetero-dimers and the like) or epitope (e.g., an immobilized protein, glycoprotein, oligosaccharide, and the like). 
The invention also provides peptide libraries comprising a plurality of individual library members of the invention, wherein (1) each individual library member of said plurality comprises a sequence produced by shuffling of a pool of selected sequences, and (2) each individual library member comprises a variable peptide segment sequence or single-chain antibody segment sequence which is distinct from the variable peptide segment sequences or single-chain antibody sequences of other individual library members in said plurality (although some library members may be present in more than one copy per library due to uneven amplification, stochastic probability, or the like). Antibody Display The present method can be used to shuffle, by in vitro and/or in vivo recombination by any of the disclosed methods, and in any combination, polynucleotide sequences selected by antibody display methods, wherein an associated polynucleotide encodes a displayed antibody which is screened for a phenotype (e.g., for affinity for binding a predetermined antigen (ligand). Various prokaryotic expression systems have been developed that can be manipulated to produce combinatorial antibody libraries which may be screened for high-affinity antibodies to specific antigens. Recent advances in the expression of antibodies in Escherichia coli and bacteriophage systems {see "alternative peptide display methods", infra) have raised the possibility that virtually any specificity can be obtained by either cloning antibody genes from characterized hybridomas or by de novo selection using antibody gene libraries (e.g., from Ig cDNA). Combinatorial libraries of antibodies have been generated in bacteriophage lambda expression systems which may be screened as bacteriophage plaques or as colonies of lysogens (Huse et al, 1989); Caton and Koprowski, 1990; Mullinax et al, 1990; Persson et al, 1991). 
Various aspects of bacteriophage antibody display libraries and lambda phage expression libraries have been described (Kang et al, 1991; Clackson et al, 1991; McCafferty et al, 1990; Burton et al, 1991; Hoogenboom et al, 1991 ; Chang et al, 1991 ; Breitling et al, 1991 ; Marks et al, 1991 , p. 581; Barbas et al, 1992; Hawkins and Winter, 1992; Marks et al, 1992, p. 779; Marks et al, 1992, p. 16007; and Lowman et al, 1991; Lerner et al, 1992; all incoφorated herein by reference). Typically, a bacteriophage antibody display library is screened with a receptor (e.g., polypeptide, carbohydrate, glycoprotein, nucleic acid) that is immobilized (e.g., by covalent linkage to a chromatography resin to enrich for reactive phage by affinity chromatography) and/or labeled (e.g., to screen plaque or colony lifts). One aspect of the invention uses the so-called single-chain fragment variable (scfv) libraries (Marks et al, 1992, p. 779; Winter and Milstein, 1991; Clackson et al, 1991; Marks et al, 1991, p. 581; Chaudhary et al, 1990; Chiswell et al, 1992; McCafferty et al, 1990; and Huston et al, 1988). Various aspects of scfv libraries displayed on bacteriophage coat proteins have been described. Bacteriophage display of scfv have already yielded a variety of useful antibodies and antibody fusion proteins. A bispecific single chain antibody has been shown to mediate efficient tumor cell lysis (Gruber et al, 1994). Intracellular expression of an anti-Rev scfv has been shown to inhibit HIV-1 virus replication in vitro (Duan et al, 1994), and intracellular expression of an anti-p21rar, scfv has been shown to inhibit meiotic maturation of Xenopus oocytes (Biocca et al, 1993). Recombinant scfv which can be used to diagnose HIV infection have also been reported, demonstrating the diagnostic utility of scfv (Lilley et al, 1994). 
Fusion proteins wherein an scFv is linked to a second polypeptide, such as a toxin or fibrinolytic activator protein, have also been reported (Holvost et al, 1992; Nicholls et al, 1993). Various methods have been reported for increasing the combinatorial diversity of a scfv library to broaden the repertoire of binding species (idiotype spectrum). Enzymatic inverse PCR mutagenesis has been shown to be a simple and reliable method for constructing relatively large libraries of scfv site-directed hybrids (Stemmer et al, 1993), as has enor-prone PCR and chemical mutagenesis (Deng et al, 1994). Riechmann (Riechmann et al, 1993) showed semi-rational design of an antibody scfv fragment using site-directed randomization by degenerate oligonucleotide PCR and subsequent phage display of the resultant scfv hybrids. Barbas (Barbas et al, 1992) attempted to circumvent the problem of limited repertoire sizes resulting from using biased variable region sequences by randomizing the sequence in a synthetic CDR region of a human tetanus toxoid-binding Fab. Displayed peptide/polynucleotide complexes (library members) which encode a variable segment peptide sequence of interest or a single-chain antibody of interest are selected from the library by an affinity enrichment technique. This is accomplished by means of a immobilized macromolecule or epitope specific for the peptide sequence of interest, such as a receptor, other macromolecule, or other epitope species. Repeating the affinity selection procedure provides an enrichment of library members encoding the desired sequences, which may then be isolated for pooling and shuffling, for sequencing, and/or for further propagation and affinity enrichment. The library members without the desired specificity are removed by washing. The degree and stringency of washing required will be determined for each peptide sequence or single-chain antibody of interest and the immobilized predetermined macromolecule or epitope. 
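The enrichment obtained by repeating the affinity selection procedure can be illustrated with a toy calculation. The model below is an assumption made only for illustration: it treats each round as recovering a fixed fraction of desired binders and a much smaller fixed fraction of non-specific library members that survive washing. The function name, parameter names, and example values are hypothetical and do not come from the specification.

```python
def binder_fraction(initial_fraction: float,
                    binder_recovery: float,
                    background_recovery: float,
                    rounds: int) -> list:
    """Return the fraction of desired binders in the pool after each round.

    initial_fraction    -- starting fraction of library members with the desired specificity
    binder_recovery     -- assumed fraction of binders recovered per round
    background_recovery -- assumed fraction of non-binders surviving the washes per round
    """
    fractions = [initial_fraction]
    fraction = initial_fraction
    for _ in range(rounds):
        kept_binders = fraction * binder_recovery
        kept_background = (1.0 - fraction) * background_recovery
        # Renormalize: amplification restores pool size without changing composition.
        fraction = kept_binders / (kept_binders + kept_background)
        fractions.append(fraction)
    return fractions


# Example: one binder per million members, 50% binder recovery,
# 0.1% non-specific carry-over, three rounds of selection.
for round_number, value in enumerate(binder_fraction(1e-6, 0.5, 0.001, 3)):
    print(round_number, f"{value:.6f}")
```

Under these assumed numbers each round multiplies the binder-to-background ratio by the ratio of the two recovery fractions, which is why more stringent washing (lower background recovery) reduces the number of rounds needed.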
A certain degree of control can be exerted over the binding characteristics of the nascent peptide/DNA complexes recovered by adjusting the conditions of the binding incubation and the subsequent washing. The temperature, pH, ionic strength, divalent cations concentration, and the volume and duration of the washing will select for nascent peptide/DNA complexes within particular ranges of affinity for the immobilized macromolecule. Selection based on slow dissociation rate, which is usually predictive of high affinity, is often the most practical route. This may be done either by continued incubation in the presence of a saturating amount of free predetermined macromolecule, or by increasing the volume, number, and length of the washes. In each case, the rebinding of dissociated nascent peptide/DNA or peptide RNA complex is prevented, and with increasing time, nascent peptide/DNA or peptide/RNA complexes of higher and higher affinity are recovered. Additional modifications of the binding and washing procedures may be applied to find peptides with special characteristics. The affinities of some peptides are dependent on ionic strength or cation concentration. This is a useful characteristic for peptides that will be used in affinity purification of various proteins when gentle conditions for removing the protein from the peptides are required. One variation involves the use of multiple binding targets (multiple epitope species, multiple receptor species), such that a scfv library can be simultaneously screened for a multiplicity of scfv which have different binding specificities. Given that the size of a scfv library often limits the diversity of potential scfv sequences, it is typically desirable to us scfv libraries of as large a size as possible. The time and economic considerations of generating a number of very large polysome scFv-display libraries can become prohibitive. 
To avoid this substantial problem, multiple predetermined epitope species (receptor species) can be concomitantly screened in a single library, or sequential screening against a number of epitope species can be used. In one variation, multiple target epitope species, each encoded on a separate bead (or subset of beads), can be mixed and incubated with a polysome-display scFv library under suitable binding conditions. The collection of beads, comprising multiple epitope species, can then be used to isolate, by affinity selection, scFv library members. Generally, subsequent affinity screening rounds can include the same mixture of beads, subsets thereof, or beads containing only one or two individual epitope species. This approach affords efficient screening, and is compatible with laboratory automation, batch processing, and high throughput screening methods.
Expression systems
The DNA expression constructs will typically include an expression control DNA sequence operably linked to the coding sequences, including naturally-associated or heterologous promoter regions. The expression control sequences can be eukaryotic promoter systems in vectors capable of transforming or transfecting eukaryotic host cells. Once the vector has been incorporated into the appropriate host, the host is maintained under conditions suitable for high level expression of the nucleotide sequences, and the collection and purification of the mutant "engineered" antibodies. The DNA sequences will be expressed in hosts after the sequences have been operably linked to an expression control sequence (i.e., positioned to ensure the transcription and translation of the structural gene). These expression vectors are typically replicable in the host organisms either as episomes or as an integral part of the host chromosomal DNA.
Commonly, expression vectors will contain selection markers, e.g., tetracycline or neomycin, to permit detection of those cells transformed with the desired DNA sequences (see, e.g., USPN 4,704,362). In addition to eukaryotic microorganisms such as yeast, mammalian tissue cell culture may also be used to produce the polypeptides of the present invention (see Winnacker, 1987, which is incorporated herein by reference). Eukaryotic cells can be used because a number of suitable host cell lines capable of secreting intact immunoglobulins have been developed in the art, and include the CHO cell lines, various COS cell lines, HeLa cells, and myeloma cell lines, or transformed B cells or hybridomas. Expression vectors for these cells can include expression control sequences, such as an origin of replication, a promoter, an enhancer (Queen et al, 1986), and necessary processing information sites, such as ribosome binding sites, RNA splice sites, polyadenylation sites, and transcriptional terminator sequences. Expression control sequences can be promoters derived from immunoglobulin genes, cytomegalovirus, SV40, Adenovirus, Bovine Papilloma Virus, and the like. Eukaryotic DNA transcription can be increased by inserting an enhancer sequence into the vector. Enhancers are cis-acting sequences of between 10 and 300 bp that increase transcription by a promoter. Enhancers can effectively increase transcription when either 5' or 3' to the transcription unit. They are also effective if located within an intron or within the coding sequence itself. Typically, viral enhancers are used, including SV40 enhancers, cytomegalovirus enhancers, polyoma enhancers, and adenovirus enhancers. Enhancer sequences from mammalian systems are also commonly used, such as the mouse immunoglobulin heavy chain enhancer. Mammalian expression vector systems will also typically include a selectable marker gene.
Examples of suitable markers include the dihydrofolate reductase gene (DHFR), the thymidine kinase gene (TK), or prokaryotic genes conferring drug resistance. The first two marker genes can use mutant cell lines that lack the ability to grow without the addition of thymidine to the growth medium. Transformed cells can then be identified by their ability to grow on non-supplemented media. Examples of prokaryotic drug resistance genes useful as markers include genes conferring resistance to G418, mycophenolic acid and hygromycin. The vectors containing the DNA segments of interest can be transferred into the host cell by well-known methods, depending on the type of cellular host. For example, calcium chloride transfection is commonly utilized for prokaryotic cells, whereas calcium phosphate treatment, lipofection, or electroporation may be used for other cellular hosts. Other methods used to transform mammalian cells include the use of Polybrene, protoplast fusion, liposomes, electroporation, and micro-injection (see, generally, Sambrook et al, 1982 and 1989). Once expressed, the antibodies, individual mutated immunoglobulin chains, mutated antibody fragments, and other immunoglobulin polypeptides of the invention can be purified according to standard procedures of the art, including ammonium sulfate precipitation, fraction column chromatography, gel electrophoresis and the like; see, e.g., Scopes, 1982. Once purified, partially or to homogeneity as desired, the polypeptides may then be used therapeutically or in developing and performing assay procedures, immunofluorescent stainings, and the like (see, generally, Lefkovits and Pernis, 1979 and 1981; Lefkovits, 1997).
Two-Hybrid Based Screening Assays
This invention provides a two-hybrid screening system to identify library members which bind a predetermined polypeptide sequence. The selected library members are pooled and shuffled by in vitro and/or in vivo recombination.
The shuffled pool can then be screened in a yeast two-hybrid system to select library members which bind said predetermined polypeptide sequence (e.g., an SH2 domain) or which bind an alternate predetermined polypeptide sequence (e.g., an SH2 domain from another protein species). An approach to identifying polypeptide sequences which bind to a predetermined polypeptide sequence has been to use a so-called "two-hybrid" system wherein the predetermined polypeptide sequence is present in a fusion protein (Chien et al, 1991). This approach identifies protein-protein interactions in vivo through reconstitution of a transcriptional activator (Fields and Song, 1989), the yeast Gal4 transcription protein. Typically, the method is based on the properties of the yeast Gal4 protein, which consists of separable domains responsible for DNA-binding and transcriptional activation. Polynucleotides encoding two hybrid proteins, one consisting of the yeast Gal4 DNA-binding domain fused to a polypeptide sequence of a known protein and the other consisting of the Gal4 activation domain fused to a polypeptide sequence of a second protein, are constructed and introduced into a yeast host cell. Intermolecular binding between the two fusion proteins reconstitutes the Gal4 DNA-binding domain with the Gal4 activation domain, which leads to the transcriptional activation of a reporter gene (e.g., lacZ, HIS3) which is operably linked to a Gal4 binding site. Typically, the two-hybrid method is used to identify novel polypeptide sequences which interact with a known protein (Silver and Hunt, 1993; Durfee et al, 1993; Yang et al, 1992; Luban et al, 1993; Hardy et al, 1992; Bartel et al, 1993; and Vojtek et al, 1993). However, variations of the two-hybrid method have been used to identify mutations of a known protein that affect its binding to a second known protein (Li and Fields, 1993; Lalo et al, 1993; Jackson et al, 1993; and Madura et al, 1993).
Two-hybrid systems have also been used to identify interacting structural domains of two known proteins (Bardwell et al, 1993; Chakrabarty et al, 1992; Staudinger et al, 1993; and Milne and Weaver 1993) or domains responsible for oligomerization of a single protein (Iwabuchi et al, 1993; Bogerd et al, 1993). Variations of two-hybrid systems have been used to study the in vivo activity of a proteolytic enzyme (Dasmahapatra et al, 1992). Alternatively, an E. coli BCCP interactive screening system (Germino et al, 1993; Guarente, 1993) can be used to identify interacting protein sequences (i.e., protein sequences which heterodimerize or form higher order heteromultimers). Sequences selected by a two-hybrid system can be pooled and shuffled and introduced into a two-hybrid system for one or more subsequent rounds of screening to identify polypeptide sequences which bind to the hybrid containing the predetermined binding sequence. The sequences thus identified can be compared to identify consensus sequence(s) and consensus sequence kernels.
Improved methods for cellular engineering, protein expression profiling, differential labeling of peptides, and novel reagents therefor
The invention relates to peptide chemistry, proteomics, and mass spectrometry technology. In particular, the invention provides novel methods for determining polypeptide profiles and protein expression variations, as with proteome analyses. The present invention provides methods of simultaneously identifying and quantifying individual proteins in complex protein mixtures by selective differential labeling of amino acid residues followed by chromatographic and mass spectrographic analysis. The diagnosis and treatment of, as well as the determination of predisposition to, a variety of diseases and disorders may often be accomplished through identification and quantitative measurement of polypeptide expression variations between different cell types and cell states.
Biochemical pathways and metabolic networks can also be analyzed by globally and quantitatively measuring protein expression in various cell types and biological states (see, e.g., Ideker (2001) Science 292:929-934). State-of-the-art techniques such as liquid-chromatography-electrospray-ionization tandem mass spectrometry have, in conjunction with database-searching computer algorithms, revolutionized the analysis of biochemical species from complex biological mixtures. With these techniques, it is now possible to perform high-throughput protein identification at picomolar to subpicomolar levels from complex mixtures of biological molecules (see, e.g., Dongre (1997) Trends Biotechnol. 15:418-425). One such method is based on a class of chemical reagents termed isotope-coded affinity tags (ICATs) and tandem mass spectrometry or ion trap mass spectrometry or a combination thereof. The method labels multiple cysteinyl residues and uses stable isotope dilution techniques. For example, Gygi (1999) Nat. Biotechnol. 10:994-999, compared protein expression in a yeast using ethanol or galactose as a carbon source. The measured differences in protein expression correlated with known yeast metabolic function under glucose-repressed conditions. In another technique, two different protein mixtures for quantitative comparison are digested to peptide mixtures, the peptide mixtures are separately methylated using either d0- or d3-methanol, and the mixtures of methylated peptides are combined and subjected to microcapillary HPLC-MS/MS (see, e.g., Goodlett, D. R., et al., (2000) "Differential stable isotope labeling of peptides for quantitation and de novo sequence derivation," 49th ASMS; Zhou, H; Watts, JD; Aebersold, R. A systematic approach to the analysis of protein phosphorylation.; Comment In: Nat Biotechnol. 2001 Apr;19(4):317-8; Nature Biotechnology 2001 Apr, 19(4):375-8).
Parent proteins of methylated peptides are identified by correlative database searching of fragment ion spectra using computer-program-assisted paradigms or automated de novo sequencing that compares all tandem mass spectra of d0- and d3-methylated peptide ion pairs. In Goodlett (2000) supra, ratios of proteins in two different mixtures were calculated for d0- to d3-methylated peptide pairs. However, there are several limitations to this approach, including: the differential labeling reagents rely on stable isotopes, which are expensive and cannot readily be extended to differential labeling of more than two mixtures of peptides; the labeling methods are limited to methylation of carboxy-termini; protein expression profiling is limited to duplex comparison; and one-dimensional capillary HPLC chromatography was employed to separate peptides, which does not have enough capacity and resolving power for complex mixtures of peptides. In one aspect this invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differential labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use
of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated. In one aspect, the sample of step (a) comprises a cell or a cell extract. The method can further comprise providing two or more samples comprising a polypeptide. One or more of the samples can be derived from a wild type cell and one sample can be derived from an abnormal or a modified cell. The abnormal cell can be a cancer cell. The modified cell can be a cell that is mutagenized and/or treated with a chemical, a physiological factor, or the presence of another organism (including, e.g., a eukaryotic organism, prokaryotic organism, virus, vector, prion, or part thereof), and/or exposed to an environmental factor or change or physical force (including, e.g., sound, light, heat, sonication, and radiation). The modification can be a genetic change (including, for example, a change in DNA or RNA sequence or content) or otherwise. In one aspect, the method further comprises purifying or fractionating the polypeptide before the fragmenting of step (c). The method can further comprise purifying or fractionating the polypeptide before the labeling of step (d). The method can further comprise purifying or fractionating the labeled peptide before the chromatography of step (e). In alternative aspects, the purifying or fractionating comprises a method selected from the group consisting of size exclusion chromatography, HPLC, reverse phase HPLC and affinity purification. In one aspect, the method further comprises contacting the polypeptide with a labeling reagent of step (b) before the fragmenting of step (c).
In one aspect, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: ZAOH and ZBOH, to esterify peptide C-terminals and/or Glu and Asp side chains; ZANH2 and ZBNH2, to form amide bond with peptide C-terminals and/or Glu and Asp side chains; and ZACO2H and ZBCO2H, to form amide bond with peptide N-terminals and/or Lys and Arg side chains; wherein ZA and ZB independently of one another comprise the general formula R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-, Z1, Z2, Z3, and Z4 independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O,
C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR1)O)n, SnRR1, Sn(RR1)O, BR(OR1), BRR1, B(OR)(OR1), OBR(OR1), OBRR1, and OB(OR)(OR1), and R and R1 is an alkyl group, A1, A2, A3, and A4 independently of one another, are selected from the group consisting of nothing or (CRR1)n, wherein R and R1, independently from other R and R1 in Z1 to Z4 and independently from other R and R1 in A1 to A4, are selected from the group consisting of a hydrogen atom, a halogen atom and an alkyl group; "n" in Z1 to Z4, independent of n in A1 to A4, is an integer having a value selected from the group consisting of 0 to about 51; 0 to about 41; 0 to about 31; 0 to about 21, 0 to about 11 and 0 to about 6. In one aspect, the alkyl group (see definition below) is selected from the group consisting of an alkenyl, an alkynyl and an aryl group. One or more C-C bonds from (CRR1)n can be replaced with a double or a triple bond; thus, in alternative aspects, an R or an R1 group is deleted. The (CRR1)n can be selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein each group has none or up to 6 substituents. The (CRR1)n can be selected from the group consisting of a carbocyclic, a bicyclic and a tricyclic fragment, wherein the fragment has up to 8 atoms in the cycle with or without a heteroatom selected from the group consisting of an O atom, a N atom and an S atom. In one aspect, two or more labeling reagents have the same structure but a different isotope composition. For example, in one aspect, ZA has the same structure as ZB, while ZA has a different isotope composition than ZB. In alternative aspects, the isotope is boron-10 and boron-11; carbon-12 and carbon-13; nitrogen-14 and nitrogen-15; and, sulfur-32 and sulfur-34. In one aspect, where the isotope with the lower mass is x and the isotope with the higher mass is y, and x and y are integers, x is greater than y.
In alternative aspects, x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51. In one aspect, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: CD3(CD2)nOH / CH3(CH2)nOH, to esterify peptide C-terminals, where n = 0, 1, 2 or y; CD3(CD2)nNH2 / CH3(CH2)nNH2, to form amide bond with peptide C-terminals, where n = 0, 1, 2 or y; and, D(CD2)nCO2H / H(CH2)nCO2H, to form amide bond with peptide N-terminals, where n = 0, 1, 2 or y; wherein D is a deuteron atom, and y is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51. In one aspect, the labeling reagent of step (b) can comprise the general formulae selected from the group consisting of: ZAOH and ZBOH to esterify peptide C-terminals; ZANH2 / ZBNH2 to form an amide bond with peptide C-terminals; and, ZACO2H / ZBCO2H to form an amide bond with peptide N-terminals; wherein ZA and ZB have the general formula R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-; Z1, Z2, Z3, and Z4, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR1)O)n, SnRR1, Sn(RR1)O, BR(OR1), BRR1, B(OR)(OR1), OBR(OR1), OBRR1, and OB(OR)(OR1); A1, A2, A3, and A4, independently of one another, are selected from the group consisting of nothing and the general formulae (CRR1)n, and, R and R1 is an alkyl group. In one aspect, a single C-C bond in a (CRR1)n group is replaced with a double or a triple bond; thus, the R and R1 can be absent. The (CRR1)n can comprise a moiety selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein the group has none or up to 6 substituents.
The group can comprise a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom. In one aspect, R, R1, independently from other R and R1 in Z1 - Z4 and independently from other R and R1 in A1 - A4, are selected from the group consisting of a hydrogen atom, a halogen and an alkyl group. The alkyl group (see definition below) can be an alkenyl, an alkynyl or an aryl group. In one aspect, the "n" in Z1 - Z4 is independent of n in A1 - A4 and is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11 and about 6. In one aspect, ZA has the same structure as ZB but ZA further comprises x number of -CH2- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA has the same structure as ZB but ZA further comprises x number of -CF2- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA comprises x number of protons and ZB comprises y number of halogens in the place of protons, wherein x and y are integers. In one aspect, ZA contains x number of protons and ZB contains y number of halogens, and there are x - y number of protons remaining in one or more A1 - A4 fragments, wherein x and y are integers. In one aspect, ZA further comprises x number of -O- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA further comprises x number of -S- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer. In one aspect, ZA further comprises x number of -O- fragment(s) and ZB further comprises y number of -S- fragment(s) in the place of -O- fragment(s), wherein x and y are integers. In one aspect, ZA further comprises x - y number of -O- fragment(s) in one or more A1 - A4 fragments, wherein x and y are integers.
In alternative aspects, x and y are integers selected from the group consisting of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21, between 1 and about 11 and between 1 and about 6, wherein x is greater than y. In one aspect, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: CH3(CH2)nOH / CH3(CH2)n+mOH, to esterify peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; CH3(CH2)nNH2 / CH3(CH2)n+mNH2, to form amide bond with peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; and, H(CH2)nCO2H / H(CH2)n+mCO2H, to form amide bond with peptide N-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; wherein n, m and y are integers. In one aspect, n, m and y are integers selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51. In one aspect, the separating of step (e) comprises a liquid chromatography system, such as a multidimensional liquid chromatography or a capillary chromatography system. In one aspect, the mass spectrometer comprises a tandem mass spectrometry device or an ion trap mass spectrometer or a combination thereof. In one aspect, the method further comprises quantifying the amount of each polypeptide or each peptide.
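The mass difference that distinguishes a differentially labeled peptide pair follows directly from the alcohol formulae given above. The following is a minimal illustrative sketch, not part of the claimed method: it computes the monoisotopic mass of a labeling alcohol of the form CH3(CH2)nOH (or its alkyl-deuterated form) and the resulting per-site mass shift for an isotope pair and for a homolog pair. The function names and the rounded monoisotopic atomic masses are assumptions introduced here for illustration.

```python
# Monoisotopic atomic masses in daltons (rounded standard values; an assumption
# of this sketch, not taken from the specification).
H = 1.00782503
D = 2.01410178
C = 12.0
O = 15.99491462


def label_alcohol_mass(n, deuterated=False):
    """Monoisotopic mass of CH3(CH2)nOH, or CD3(CD2)nOH when deuterated.

    Only the alkyl hydrogens are exchanged for deuterium; the hydroxyl
    hydrogen stays as H, matching the written formula.
    """
    carbons = n + 1
    alkyl_hydrogens = 2 * n + 3
    hydrogen_mass = D if deuterated else H
    return carbons * C + alkyl_hydrogens * hydrogen_mass + O + H


def isotope_pair_shift(n):
    """Per-site mass difference between CD3(CD2)nOH and CH3(CH2)nOH labels."""
    return label_alcohol_mass(n, deuterated=True) - label_alcohol_mass(n)


def homolog_pair_shift(n, m):
    """Per-site mass difference between CH3(CH2)n+mOH and CH3(CH2)nOH labels."""
    return label_alcohol_mass(n + m) - label_alcohol_mass(n)


def peptide_pair_shift(per_site_shift, labeled_sites):
    """Total shift for a peptide carrying the label at several carboxyl sites."""
    return per_site_shift * labeled_sites


if __name__ == "__main__":
    print(round(isotope_pair_shift(0), 5))     # d3- vs d0-methanol
    print(round(homolog_pair_shift(0, 1), 5))  # ethanol vs methanol
```

Because esterification releases the same water molecule regardless of which alcohol is used, the difference between the two alcohol masses equals the difference between the two labeled peptides at each labeled site.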
The invention provides a method for defining the expressed proteins associated with a given cellular state, the method comprising the following steps: (a) providing a sample comprising a cell in the desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differential labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cell into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated, thereby defining the expressed proteins associated with the cellular state. 
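Step (g) above, comparing a sequenced peptide against a database of polypeptide sequences, can be illustrated with a deliberately simplified sketch. It assumes the peptide sequence has already been derived from the mass spectra and performs only an exact substring lookup; the function name, the dictionary-based database, and the toy sequences are hypothetical, and real search programs score fragment ion spectra rather than matching strings.

```python
def find_parent_proteins(peptide, database):
    """Return the names of database entries whose sequence contains the peptide.

    `database` maps a protein name to its amino acid sequence (one-letter code).
    Exact substring matching is an illustrative simplification.
    """
    return sorted(name for name, sequence in database.items() if peptide in sequence)


if __name__ == "__main__":
    # Toy database with made-up sequences, for illustration only.
    toy_database = {
        "protein_A": "MKTAYIAKQRQISFVKSHFSRQ",
        "protein_B": "MSHFSRQLEERLGLIEVQ",
        "protein_C": "MGGGAAAVVV",
    }
    print(find_parent_proteins("SHFSRQ", toy_database))
    print(find_parent_proteins("WWWW", toy_database))
```

A peptide that maps to more than one entry, as in the first call, cannot by itself identify a unique parent protein; additional peptides from the same protein are needed to resolve the ambiguity.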
The invention provides a method for quantifying changes in protein expression between at least two cellular states, the method comprising the following steps: (a) providing at least two samples comprising cells in a desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differential labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cells into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation;
(d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents, wherein the labels used in one sample are different from the labels used in other samples; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step
(e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which identifies from which sample each peptide was derived, compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated, and compares the amount of each polypeptide in each sample, thereby quantifying changes in protein expression between at least two cellular states. The invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by multidimensional liquid chromatography to generate an eluate; (f) feeding the eluate of step (e) into a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer(s); (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated. 
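The quantification described above, in which each peptide is assigned to a sample by its label and the amounts are then compared across samples, can be sketched as follows. This is an illustrative assumption about how such a comparison might be coded, not the software of the invention; the function name, the tuple layout of the observations, and the example label names "d0" and "d3" are hypothetical.

```python
from collections import defaultdict


def peptide_ratios(observations, reference_label):
    """Compute per-peptide abundance ratios relative to a reference sample.

    `observations` is an iterable of (peptide_sequence, label, intensity)
    tuples, where the label identifies the sample a peptide came from.
    Peptides with no signal in the reference sample are skipped.
    """
    summed = defaultdict(dict)
    for sequence, label, intensity in observations:
        summed[sequence][label] = summed[sequence].get(label, 0.0) + intensity

    ratios = {}
    for sequence, per_label in summed.items():
        reference = per_label.get(reference_label)
        if not reference:
            continue  # ratio undefined without a reference measurement
        ratios[sequence] = {
            label: value / reference
            for label, value in per_label.items()
            if label != reference_label
        }
    return ratios


if __name__ == "__main__":
    observed = [
        ("LVNELTEFAK", "d0", 100.0),  # reference (standard) sample
        ("LVNELTEFAK", "d3", 250.0),  # investigated sample
        ("AAAGK", "d3", 5.0),         # seen only in the investigated sample
    ]
    print(peptide_ratios(observed, reference_label="d0"))
```

Because the labeled forms of a peptide co-elute and ionize alike, the intensity ratio within a pair can serve as an estimate of the relative abundance of the parent protein in the two samples.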
The invention provides a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope. The isotope(s) can be in the first domain or the second domain. For example, the isotope(s) can be in the biotin. In alternative aspects, the isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope. The chimeric labeling reagent can comprise two or more isotopes. The reactive group of the chimeric labeling reagent capable of covalently binding to an amino acid can be a succinimide group, an isothiocyanate group or an isocyanate group. The reactive group capable of covalently binding to an amino acid can bind to a lysine or a cysteine. The chimeric labeling reagent can further comprise a linker moiety linking the biotin group and the reactive group. The linker moiety can comprise at least one isotope. In one aspect, the linker is a cleavable moiety that can be cleaved by, e.g., enzymatic digest or by reduction. The invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the small molecule tags are structurally identical but differ in their isotope composition, and the small molecules comprise reactive groups that covalently bind to cysteine or lysine residues or both; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) determining the protein concentrations of each sample in a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof; and, (e) comparing relative protein concentrations of each sample.
In one aspect, the sample comprises a complete or a fractionated cellular sample. In one aspect of the method, the differential small molecule tags comprise a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and, (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope. The isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope. The chimeric labeling reagent can comprise two or more isotopes. The reactive group capable of covalently binding to an amino acid can be selected from the group consisting of a succinimide group, an isothiocyanate group and an isocyanate group. The invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the differential small molecule tags comprise a chimeric labeling reagent comprising (i) a first domain comprising a biotin; and, (ii) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) isolating the tagged polypeptides on a biotin-binding column by binding tagged polypeptides to the column, washing non-bound materials off the column, and eluting tagged polypeptides off the column; (e) determining the protein concentrations of each sample in a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof; and, (f) comparing relative protein concentrations of each sample.
The invention provides methods for simultaneously identifying individual proteins in complex mixtures of biological molecules and quantifying the expression levels of those proteins, e.g., proteome analyses. The methods compare two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation. The proteins in the standard and investigated samples are subjected separately to a series of chemical modifications, i.e., differential chemical labeling, and fragmentation, e.g., by proteolytic digestion and/or other enzymatic reactions or physical fragmenting methodologies. The chemical modifications can be done before, or after, or before and after fragmentation/digestion of the polypeptide into peptides. Peptides derived from the standard and the investigated samples are labeled with chemical residues of different mass, but of similar properties, such that peptides with the same sequence from both samples are eluted together in the separation procedure and their ionization and detection properties regarding the mass spectrometry are very similar. Differential chemical labeling can be performed on reactive functional groups on some or all of the carboxy- and/or amino-termini of proteins and peptides and/or on selected amino acid side chains. A combination of chemical labeling, proteolytic digestion and other enzymatic reaction steps, physical fragmentation and/or fractionation can provide access to a variety of residues to generate different specifically labeled peptides to enhance the overall selectivity of the procedure. The standard and the investigated samples are combined, subjected to multidimensional chromatographic separation, and analyzed by mass spectrometry methods. Mass spectrometry data is processed by special software, which allows for identification and quantification of peptides and proteins.
Depending on the complexity and composition of the protein samples, it may be desirable, or necessary, to perform protein fractionation using such methods as size exclusion, ion exchange, reverse phase, or other methods of affinity purification prior to one or more chemical modification steps, proteolytic digestion or other enzymatic reaction steps, or physical fragmentation steps. In one aspect of the invention, combined mixtures of peptides are first separated by a chromatography method, such as a multidimensional liquid chromatography system of the invention, before being fed into a coupled mass spectrometry device, such as a tandem mass spectrometry device or an ion trap mass spectrometer or a combination thereof. The combination of multidimensional liquid chromatography and tandem mass spectrometry can be called "LC-LC-MS/MS." LC-LC-MS/MS was first developed by Link A. and Yates J. R., as described, e.g., by Link (1999) Nature Biotechnology 17:676-682; Link (1999) Electrophoresis 18:1314-1334; Washburn MP, Wolters D, Yates JR (2001) Nature Biotechnology 19(3):242-7. Another exemplary system of the invention comprises the combination of multidimensional liquid chromatography, tandem mass spectrometry and ion trap mass spectrometry, designated 3D LC LCQ MS/MS or 3D LC LTQ MS/MS, as described herein (e.g., comprising Finnigan MDLC LTQ™ or LTQ FT™, Thermo Electron Corporation, San Jose, CA, or Agilent's LC/MSD Trap (Agilent Technologies, Palo Alto, CA), or an equivalent mass spectrometer). In practicing the methods of the invention, proteins can be first substantially or partially isolated from the biological samples of interest. The polypeptides can be treated before selective differential labeling; for example, they can be denatured, reduced, preparations can be desalted, and the like.
Conversion of samples of proteins into mixtures of differentially labeled peptides can include preliminary chemical and/or enzymatic modification of side groups and/or termini; proteolytic digestion or fragmentation; and post-digestion or post-fragmentation chemical and/or enzymatic modification of side groups and/or termini. The differentially modified polypeptides and peptides are then combined into one or more peptide mixtures. Solvent or other reagents can be removed, neutralized or diluted, if desired or necessary. The buffer can be modified, or the peptides can be re-dissolved in one or more different buffers, such as a "MudPIT" (see below) loading buffer. The peptide mixture is then loaded onto a chromatography column, such as a liquid chromatography column, a 2D capillary column or a multidimensional chromatography column, to generate an eluate. The eluate is fed into a mass spectrometer, such as a tandem mass spectrometer, an ion trap mass spectrometer (LCQ or LTQ) or a combination thereof. In one aspect, an LC ESI MS and MS/MS analysis is completed. Finally, data output is processed by appropriate software using database searching and data analysis. In practicing the methods of the invention, high yields of peptides can be generated for mass spectrographic analysis. Two or more samples can be differentially labeled by selective labeling of each sample. Peptide modifications, i.e., labeling, are stable. Reagents having differing masses or reactive groups can be chosen to maximize the number of reactive groups and differentially labeled samples, thus allowing for a multiplex analysis of samples, polypeptides and peptides. In one aspect, a "MudPIT" protocol is used for peptide analysis, as described herein. The methods of the invention can be fully automated and can essentially analyze every protein in a sample.
Definitions
Unless defined otherwise, all technical and scientific terms used herein have the meaning commonly understood by a person skilled in the art to which this invention belongs. As used herein, the following terms have the meanings ascribed to them unless specified otherwise. As used herein, the term "alkyl" is used to refer to a genus of compounds including branched or unbranched, saturated or unsaturated, monovalent hydrocarbon radicals, including substituted derivatives and equivalents thereof. In one aspect, the hydrocarbons have from about 1 to about 100 carbons, about 1 to about 50 carbons, about 1 to about 30 carbons, about 1 to about 20 carbons, or about 1 to about 10 carbons. When the alkyl group has from about 1 to 6 carbon atoms, it is referred to as a "lower alkyl." Suitable alkyl radicals include, e.g., structures containing one or more methylene, methine and/or methyne groups arranged in acyclic and/or cyclic forms. Branched structures have a branching motif similar to isopropyl, tert-butyl, isobutyl, 2-ethylpropyl, etc. As used herein, the term encompasses "substituted alkyls." "Substituted alkyl" refers to alkyl as just described including one or more functional groups such as lower alkyl, aryl, acyl, halogen (i.e., alkylhalos, e.g., CF3), hydroxy, amino, alkoxy, alkylamino, acylamino, thioamido, acyloxy, aryloxy, arylamino, aryloxyalkyl, mercapto, thia, aza, oxo, both saturated and unsaturated cyclic hydrocarbons, heterocycles and the like. These groups may be attached to any carbon of the alkyl moiety. Additionally, these groups may be pendent from, or integral to, the alkyl chain. The term "alkoxy" is used herein to refer to the -OR group, where
R is a lower alkyl, substituted lower alkyl, aryl, substituted aryl, arylalkyl or substituted arylalkyl wherein the alkyl, aryl, substituted aryl, arylalkyl and substituted arylalkyl groups are as described herein. Suitable alkoxy radicals include, for example, methoxy, ethoxy, phenoxy, substituted phenoxy, benzyloxy, phenethyloxy, tert-butoxy, etc. The term "aryl" is used herein to refer to an aromatic substituent that may be a single aromatic ring or multiple aromatic rings which are fused together, linked covalently, or linked to a common group such as a methylene or ethylene moiety. The common linking group may also be a carbonyl as in benzophenone. The aromatic ring(s) may include phenyl, naphthyl, biphenyl, diphenylmethyl and benzophenone among others. The term "aryl" encompasses "arylalkyl." "Substituted aryl" refers to aryl as just described including one or more functional groups such as lower alkyl, acyl, halogen, alkylhalos (e.g., CF3), hydroxy, amino, alkoxy, alkylamino, acylamino, acyloxy, phenoxy, mercapto and both saturated and unsaturated cyclic hydrocarbons which are fused to the aromatic ring(s), linked covalently or linked to a common group such as a methylene or ethylene moiety. The linking group may also be a carbonyl such as in cyclohexyl phenyl ketone. The term "substituted aryl" encompasses "substituted arylalkyl." The term "arylalkyl" is used herein to refer to a subset of "aryl" in which the aryl group is further attached to an alkyl group, as defined herein. The term "biotin" as used herein refers to any natural or synthetic biotin or variant thereof, which are well known in the art; ligands for biotin, and ways to modify the affinity of biotin for a ligand, are also well known in the art; see, e.g., U.S. Patent Nos. 6,242,610; 6,150,123; 6,096,508; 6,083,712; 6,022,688; 5,998,155; 5,487,975. The phrase "labeling reagents which ... do not differ in ionization and detection properties in mass spectrographic analysis" means that the amount and/or mass sequence of the labeling reagents can be detected using the same mass spectrographic conditions and detection devices. The term "polypeptide" includes natural and synthetic polypeptides, or mimetics, which can be either entirely composed of synthetic, non-natural analogues of amino acids, or chimeric molecules of partly natural peptide amino acids and partly non-natural analogs of amino acids. The term "polypeptide" as used herein includes proteins and peptides of all sizes. The term "sample" as used herein includes any polypeptide-containing sample, including samples from natural sources, or entirely synthetic samples. The term "column" as used herein means any substrate surface, including beads, filaments, arrays, tubes and the like.
The phrase "do not differ in chromatographic retention properties" as used herein means that two compositions have substantially, but not necessarily exactly, the same retention properties in a chromatograph, such as a liquid chromatograph. For example, two compositions do not differ in chromatographic retention properties if they elute together, i.e., they elute in what a skilled artisan would consider the same elution fraction.
Differential labeling of peptides and polypeptides
In practicing the methods of the invention, proteins and peptides are subjected to a series of chemical modifications, i.e., differential chemical labeling. The chemical modifications can be done before, or after, or before and after fragmentation/digestion of the polypeptide into peptides. Differential labeling reagents can differ in their isotope composition (i.e., isotopical reagents) or in their structural composition (i.e., homologous reagents), but only by a rather small fragment whose change does not alter the properties stated above, i.e., the labeling reagents differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, and the differences in molecular mass are distinguishable by mass spectrographic analysis. In one aspect of the invention, mixtures of polypeptides and/or peptides coming from the "standard" protein sample and the "investigated" protein sample(s) are labeled separately with differential reagents, or one sample is labeled and the other sample remains unlabeled. As noted above, these differential reagents differ in molecular mass, but do not differ in retention properties regarding the separation method used (e.g., chromatography), and the mass spectrometry methods used will not detect different ionization and detection properties.
Thus, these differential reagents differ either in their isotope composition (i.e., they are isotopical reagents) or they differ structurally by a rather small fragment whose change does not alter the properties stated above (i.e., they are homologous reagents). Differential chemical labeling can include esterification of C-termini, amidation of C-termini and/or acylation of N-termini. Esterification targets C-termini of peptides and carboxylic acid groups in amino acid side chains. Amidation targets C-termini of peptides and carboxylic acid groups in amino acid side chains. Amidation may require protection of amine groups first. Acylation targets N-termini of peptides and amino and hydroxy groups in amino acid side chains. Acylation may require protection of carboxylic groups first. The skilled artisan will recognize that the chemical syntheses and differential chemical labeling of peptides and polypeptides (e.g., esterification, amidation, and acylation) used to practice the methods of the invention can be carried out by a variety of procedures and methodologies, which are well described in the scientific and patent literature, e.g., Organic Syntheses Collective Volumes, Gilman et al. (Eds), John Wiley & Sons, Inc., NY; Venuti (1989) Pharm. Res. 6:867-873; the Beilstein Handbook of Organic Chemistry (Beilstein Institut fuer Literatur der Organischen Chemie, Frankfurt, Germany); Beilstein online database and references obtainable therein; "Organic Chemistry," Morrison & Boyd, 7th edition, 1999, Prentice-Hall, Upper Saddle River, NJ. The invention can be practiced in conjunction with any method or protocol known in the art, which are well described in the scientific and patent literature. For example, the esterification, amidation, and acylation reactions may be performed on the mixtures of peptides in a fashion similar to other reactions of these types already described in the prior art, such as:
[Figure imgf000110_0001: image not reproduced in this text]
In alternative aspects, reagents comprise the general formulae: ZAOH and ZBOH to esterify peptide C-terminals and/or Glu and Asp side chains; ZANH2 / ZBNH2 to form an amide bond with peptide C-terminals and/or Glu and Asp side chains; or ZACO2H / ZBCO2H to form an amide bond with peptide N-terminals and/or Lys and Arg side chains; wherein ZA and ZB independently of one another can be R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-, and Z1, Z2, Z3, and Z4 independently of one another can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR1)O)n, SnRR1, Sn(RR1)O, BR(OR1), BRR1, B(OR)(OR1), OBR(OR1), OBRR1, OB(OR)(OR1), or, Z1, Z2, Z3, and Z4 independently of one another may be absent, and R is an alkyl group; and, A1, A2, A3, and A4 independently of one another can be selected from (CRR1)n, and R is an alkyl group. In alternative aspects, some single C-C bonds from (CRR1)n may be replaced with double or triple bonds, in which case some groups R and R1 will be absent; (CRR1)n can be an o-arylene, an m-arylene, or a p-arylene with up to 6 substituents, or carbocyclic, bicyclic, or tricyclic fragments with up to 8 atoms in the cycle with or without heteroatoms (O, N, S) and with or without substituents; or A1, A2, A3, and A4 independently of one another can be absent; R and R1, independently from other R and R1 in Z1 - Z4 and independently from other R and R1 in A1 - A4, can be hydrogen, halogen or an alkyl group, such as an alkenyl, an alkynyl or an aryl group; n in Z1 - Z4, independent of n in A1 - A4, is an integer that can have a value from 0 to about 51; 0 to about 41; 0 to about 31; 0 to about 21; 0 to about 11; or 0 to about 6. In alternative aspects, ZA has the same structure as ZB, but they have different isotope compositions. Any isotope may be used.
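As an illustration of how two reagents with the same structure but different isotope compositions differ in mass, the following Python sketch sums the mass shift produced by replacing light isotopes with heavy ones. It is not part of the described methods. The function name `isotope_mass_shift` and the table layout are hypothetical, and the monoisotopic masses are standard reference values rounded to seven decimal places rather than figures taken from this document.

```python
# Monoisotopic masses (in unified atomic mass units) for the light/heavy
# isotope pairs named in the text. These are rounded reference values and
# are an assumption of this sketch, not data from the specification.
ISOTOPE_PAIRS = {
    "H": (1.0078250, 2.0141018),    # protium -> deuterium
    "B": (10.0129370, 11.0093054),  # boron-10 -> boron-11
    "C": (12.0000000, 13.0033548),  # carbon-12 -> carbon-13
    "N": (14.0030740, 15.0001089),  # nitrogen-14 -> nitrogen-15
    "S": (31.9720710, 33.9678669),  # sulfur-32 -> sulfur-34
}


def isotope_mass_shift(substitutions):
    """Return the mass difference (heavy reagent minus light reagent).

    `substitutions` maps an element symbol to y, the number of atoms of
    that element carried as the heavy isotope in one reagent and as the
    light isotope in the other.
    """
    shift = 0.0
    for element, y in substitutions.items():
        light, heavy = ISOTOPE_PAIRS[element]
        shift += y * (heavy - light)
    return shift


if __name__ == "__main__":
    # Example: a tag carrying three deuterons in place of three protons.
    print(round(isotope_mass_shift({"H": 3}), 4))
    # Example: six carbon-13 atoms plus one nitrogen-15 atom.
    print(round(isotope_mass_shift({"C": 6, "N": 1}), 4))
```

Each deuterium substitution adds slightly more than one mass unit, so the exact shift is a little larger than the nominal integer difference.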
In alternative aspects, if ZA contains x number of protons, ZB may contain y number of deuterons in the place of protons, and, correspondingly, x - y number of protons remaining; and/or if ZA contains x number of borons-10, ZB may contain y number of borons-11 in the place of borons-10, and, correspondingly, x - y number of borons-10 remaining; and/or if ZA contains x number of carbons-12, ZB may contain y number of carbons-13 in the place of carbons-12, and, correspondingly, x - y number of carbons-12 remaining; and/or if ZA contains x number of nitrogens-14, ZB may contain y number of nitrogens-15 in the place of nitrogens-14, and, correspondingly, x - y number of nitrogens-14 remaining; and/or if ZA contains x number of sulfurs-32, ZB may contain y number of sulfurs-34 in the place of sulfurs-32, and, correspondingly, x - y number of sulfurs-32 remaining; and so on for all elements which may be present and have different stable isotopes; x and y are whole numbers such that x is greater than y. In one aspect, x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51. In alternative aspects, reagent pairs/series comprise the general formulae: CD3(CD2)nOH / CH3(CH2)nOH to esterify peptide C-terminals, where n = 0, 1, 2, ..., y (delta mass = 3 + 2n); CD3(CD2)nNH2 / CH3(CH2)nNH2 to form an amide bond with peptide C-terminals, where n = 0, 1, 2, ..., y (delta mass = 3 + 2n); D(CD2)nCO2H / H(CH2)nCO2H to form an amide bond with peptide N-terminals, where n = 0, 1, 2, ..., y (delta mass = 1 + 2n); wherein y is an integer that can have a value of about 51; about 41; about 31; about 21; about 11; about 6; or between about 5 and 51. Other exemplary reagents can be presented by general formulae: i.
ZAOH and ZBOH to esterify peptide C-terminals; ZANH2 / ZBNH2 to form an amide bond with peptide C-terminals; ZACO2H / ZBCO2H to form an amide bond with peptide N-terminals; wherein ZA and ZB can be R-Z1-A1-Z2-A2-Z3-A3-Z4-A4- and Z1, Z2, Z3, and Z4, independently of one another, can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR1)O)n, SnRR1, Sn(RR1)O, BR(OR1), BRR1, B(OR)(OR1), OBR(OR1), OBRR1, or OB(OR)(OR1); or, Z1, Z2, Z3, and Z4, independently of one another, can be absent, and R is an alkyl group; A1, A2, A3, and A4, independently of one another, can be a moiety comprising the general formula (CRR1)n. In alternative aspects, single C-C bonds in some (CRR1)n groups may be replaced with double or triple bonds, in which case some groups R and R1 will be absent, or (CRR1)n can be an o-arylene, an m-arylene, or a p-arylene with up to 6 substituents, or a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without heteroatoms (e.g., O, N or S atoms), and with or without substituents, or A1 - A4 independently of one another may be absent. In alternative aspects, R and R1, independently from other R and R1 in Z1 - Z4 and independently from other R and R1 in A1 - A4, can be a hydrogen atom, a halogen or an alkyl group, such as an alkenyl, an alkynyl or an aryl group. In alternative aspects, n in Z1 - Z4 is independent of n in A1 - A4 and is an integer that can have a value of about 51; about 41; about 31; about 21; about 11; or about 6. In alternative aspects, ZA has a similar structure to that of ZB, but ZA has x extra -CH2- fragment(s) in one or more A1 - A4 fragments, and/or ZA has x extra -CF2- fragment(s) in one or more A1 - A4 fragments. Alternatively, ZA can contain x number of protons and ZB may contain y number of halogens in the place of protons.
Alternatively, where ZA contains x number of protons and ZB contains y number of halogens, there are x - y number of protons remaining in one or more A1 - A4 fragments; and/or ZA has x extra -O- fragment(s) in one or more A1 - A4 fragments; and/or ZA has x extra -S- fragment(s) in one or more A1 - A4 fragments; and/or if ZA contains x number of -O- fragment(s), ZB may contain y number of -S- fragment(s) in the place of -O- fragment(s), and, correspondingly, x - y number of -O- fragment(s) remaining in one or more A1 - A4 fragments; and the like. In alternative aspects, x and y are integers that can have a value of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21; between 1 and about 11; or between 1 and about 6, such that x is greater than y. Exemplary homologous reagent pairs/series are: CH3(CH2)nOH / CH3(CH2)n+mOH to esterify peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y (delta mass = 14m); CH3(CH2)nNH2 / CH3(CH2)n+mNH2 to form an amide bond with peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y (delta mass = 14m); H(CH2)nCO2H / H(CH2)n+mCO2H to form an amide bond with peptide N-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y (delta mass = 14m); wherein y is an integer that can have a value of about 51; about 41; about 31; about 21; about 11; about 6; or between about 5 and 51.
Methods for peptide/protein separation and detection
The methods of the invention use chromatographic techniques to separate tagged polypeptides and peptides. In one aspect, liquid chromatography is used, e.g., multidimensional liquid chromatography, such as the mixed bed multidimensional liquid chromatograph of the invention.
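Returning briefly to the reagent series listed above, the quoted nominal mass offsets (3 + 2n for the deuterated alcohols and amines, 1 + 2n for the deuterated acids, and 14m for the homologous series) can be checked by counting atoms. The Python sketch below does this for the alcohol and acid series only. The helper names `alcohol`, `acid` and `nominal_mass` are hypothetical, and integer nominal masses (H = 1, D = 2, C = 12, O = 16) are assumed, consistent with the integer offsets given in the text.

```python
# Integer nominal masses; an assumption chosen to match the integer
# "delta mass" values quoted in the text.
NOMINAL = {"C": 12, "H": 1, "D": 2, "O": 16}


def nominal_mass(counts):
    """Sum nominal masses for a composition given as {element: count}."""
    return sum(NOMINAL[element] * k for element, k in counts.items())


def alcohol(n, deuterated=False):
    """Composition of CH3(CH2)nOH, or CD3(CD2)nOH when deuterated."""
    chain_h = 3 + 2 * n  # hydrogens bound to carbon
    if deuterated:
        return {"C": n + 1, "D": chain_h, "H": 1, "O": 1}
    return {"C": n + 1, "H": chain_h + 1, "O": 1}


def acid(n, deuterated=False):
    """Composition of H(CH2)nCO2H, or D(CD2)nCO2H when deuterated."""
    chain_h = 1 + 2 * n  # hydrogens bound to carbon
    if deuterated:
        return {"C": n + 1, "D": chain_h, "H": 1, "O": 2}
    return {"C": n + 1, "H": chain_h + 1, "O": 2}


if __name__ == "__main__":
    for n in range(4):
        d_alcohol = nominal_mass(alcohol(n, True)) - nominal_mass(alcohol(n))
        d_acid = nominal_mass(acid(n, True)) - nominal_mass(acid(n))
        d_homolog = nominal_mass(alcohol(n + 2)) - nominal_mass(alcohol(n))
        print(n, d_alcohol, d_acid, d_homolog)
```

Only carbon-bound hydrogens are swapped for deuterium here, because the hydroxyl and carboxyl hydrogens are lost or exchanged when the ester or amide bond forms. That choice is what reproduces the 3 + 2n and 1 + 2n offsets.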
In one exemplary system of the invention, a chromatogram eluate is coupled to a mass spectrometer, such as a tandem mass spectrometry device (e.g., a "3D LC-LC-MS/MS" system of the invention, as described herein), or an ion trap mass spectrometer (e.g., 3D LC LCQ MS/MS or 3D LC LTQ MS/MS systems of the invention, as described herein), or a combination of LC-LTQ-MS/MS or LC-LCQ-MS/MS and LC-LC-MS/MS. Any variation and equivalent thereof can be used to separate and detect peptides. LC-LC-MS/MS was first developed by Link A. and Yates J. R., as described, e.g., in Link (1999) Nature Biotechnology 17:676-682; Link (2000) Electrophoresis 18:1314-1334. In one aspect, the LC-LC-MS/MS technique is used; it is effective for separating complex peptide mixtures and it is easily automated. LC-LC-MS/MS is commonly known by the acronym "MudPIT," for "Multidimensional Protein Identification Technology." Variations and equivalents of LC-LC-MS/MS and LC-LCQ-MS/MS or LC-LTQ-MS/MS systems of the invention used in the methods of the invention include methodologies involving reverse phase columns coupled to either cation exchange columns (as described, e.g., by Opiteck (1997) Anal. Chem. 69:1518-1524) or size exclusion columns (as described, e.g., by Opiteck (1997) Anal. Biochem. 258:349-361). In one aspect, an LC-LC-MS/MS or LC-LCQ-MS/MS or LC-LTQ-MS/MS technique uses a mixed bed microcapillary column containing strong cation exchange (SCX) and reverse phase (RPC) resins. Other exemplary alternatives include protein fractionation combined with one-dimensional LC-ESI MS/MS or peptide fractionation combined with MALDI MS/MS. Depending on the complexity or the properties of the protein samples, any protein fractionation method, including size exclusion chromatography, ion exchange chromatography, reverse phase chromatography, or any of the possible affinity purifications, can be introduced prior to labeling and proteolysis.
In some circumstances, use of several different methods may be necessary to identify all proteins or specific proteins in a sample.
Sequence analysis and quantification
In one aspect of the systems of the invention, both the quantity and the sequence identity of the protein from which the modified peptide originated are determined by a mass spectrometry device, such as a "multistage mass spectrometry" (MS), including 3D LC-LC-MS/MS or LC-LCQ-MS/MS or LC-LTQ-MS/MS systems of the invention, as described herein. This can be achieved by operating the mass spectrometer in a dual mode in which it alternates in successive scans between measuring the relative quantities of peptides eluting from the capillary column and recording the sequence information of selected peptides. Peptides are quantified by measuring in the MS mode the relative signal intensities for pairs or series of peptide ions of identical sequence that are tagged differentially, which therefore differ in mass by the mass differential encoded within the differential labeling reagents. Peptide sequence information can be automatically generated by selecting peptide ions of a particular mass-to-charge (m/z) ratio for collision-induced dissociation (CID) in the mass spectrometer operating in the tandem MS mode, as described, e.g., by Link (1997) Electrophoresis 18:1314-1334; Gygi (1999) Nature Biotechnol. 17:994-999; Gygi (1999) Cell Biol. 19:1720-1730. The resulting tandem mass spectra can be correlated to sequence databases to identify the protein from which the sequenced peptide originated. Exemplary commercially available software packages include TURBO SEQUEST™ by Thermo Finnigan, San Jose, CA; MASCOT™ by Matrix Science; and SONAR MS/MS™ by Proteometrics. Routine software modifications may be necessary for automated relative quantification.
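The relative quantification step described above, in which differentially tagged peptide ions of identical sequence are matched by their encoded mass difference and compared by signal intensity, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software referred to in the text: the `Peak` class, the function `find_labeled_pairs`, the 0.01 m/z tolerance and the example peak values are all hypothetical, and each peak is assumed to have a known charge state and a positive intensity.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    mz: float         # mass-to-charge ratio from an MS-mode scan
    intensity: float  # signal intensity (arbitrary units)
    charge: int       # assumed to be assigned already


def find_labeled_pairs(peaks, delta_mass, tol=0.01):
    """Pair light and heavy peaks and report heavy/light intensity ratios.

    Two peaks are paired when they share a charge state z and their m/z
    values differ by delta_mass / z within `tol`. `delta_mass` is the
    mass difference encoded by the two labeling reagents.
    """
    pairs = []
    ordered = sorted(peaks, key=lambda p: p.mz)
    for i, light in enumerate(ordered):
        if light.intensity <= 0:
            continue  # avoid dividing by zero for empty signals
        expected_mz = light.mz + delta_mass / light.charge
        for heavy in ordered[i + 1:]:
            same_charge = heavy.charge == light.charge
            if same_charge and abs(heavy.mz - expected_mz) <= tol:
                ratio = heavy.intensity / light.intensity
                pairs.append((light, heavy, ratio))
    return pairs


if __name__ == "__main__":
    # Hypothetical scan: one doubly charged peptide carrying a tag pair
    # that differs by three deuterons, plus an unrelated peak.
    scan = [
        Peak(mz=500.25, intensity=1000.0, charge=2),
        Peak(mz=501.76, intensity=2000.0, charge=2),
        Peak(mz=700.40, intensity=500.0, charge=1),
    ]
    for light, heavy, ratio in find_labeled_pairs(scan, delta_mass=3.0188):
        print(light.mz, heavy.mz, ratio)
```

Dividing the mass difference by the charge state matters because a doubly charged pair is separated by only half the label mass on the m/z axis. A production tool would also need to handle isotope envelopes and peptides carrying more than one label.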
Mass spectrometry devices
The methods of the invention can use mass spectrometry to identify and quantify differentially labeled peptides and polypeptides. Any mass spectrometry system can be used. In one aspect of the invention, combined mixtures of peptides are separated by a chromatography system of the invention comprising multidimensional liquid chromatography coupled to tandem mass spectrometry, or "LC-LC-MS/MS," see, e.g., Link (1999) Biotechnology 17:676-682; Link (1999) Electrophoresis 18:1314-1334. In one aspect of the invention, combined mixtures of peptides are separated by a chromatography method comprising a multidimensional liquid chromatography system of the invention coupled to a combination tandem mass spectrometry and ion trap mass spectrometry device of the invention, or LC-LCQ-MS/MS or LC-LTQ-MS/MS, as described herein. Exemplary ion trap mass spectrometry devices that can be used in the systems and methods of the invention include, for example, the LCQ Deca XP™ electrospray ionization/ion trap mass spectrometer, including a Finnigan LCQ Deca XP™ or LCQ Deca XP MAX™, or MDLC LTQ™, from Thermo Electron Corporation, San Jose, CA, or Agilent's LC/MSD Trap (Agilent Technologies, Palo Alto, CA), or an equivalent mass spectrometer. In these systems, in one aspect, a sample can be introduced by direct infusion using a syringe pump, by flow injection using an injection valve and an LC pump, or by LC fitted with a column (LC/MS). Exemplary mass spectrometry devices also include those incorporating matrix-assisted laser desorption-ionization-time-of-flight (MALDI-TOF) mass spectrometry (see, e.g., Isola (2001) Anal. Chem. 73:2126-2131; Van de Water (2000) Methods Mol. Biol. 146:453-459; Griffin (2000) Trends Biotechnol. 18:77-84; Ross (2000) Biotechniques 29:620-626, 628-629). The inherent high molecular weight resolution of MALDI-TOF MS conveys high specificity and good signal-to-noise ratio for performing accurate quantitation. Use of mass spectrometry, including MALDI-TOF MS, and its use in detecting nucleic acid hybridization and in nucleic acid sequencing, is well known in the art, see, e.g., U.S. Patent Nos. 6,258,538; 6,238,871; 6,238,869; 6,235,478; 6,232,066; 6,228,654; 6,225,450; 6,051,378; 6,043,031.
Fragmentation and proteolytic digestion
In practicing the methods of the invention, polypeptides can be fragmented, e.g., by proteolytic, i.e., enzymatic, digestion and/or other enzymatic reactions or physical fragmenting methodologies. The fragmentation can be done before and/or after reacting the peptides/polypeptides with the labeling reagents used in the methods of the invention. Methods for proteolytic cleavage of polypeptides are well known in the art, e.g., enzymes include trypsin (see, e.g., U.S. Patent No. 6,177,268; 4,973,554), chymotrypsin (see, e.g., U.S. Patent No. 4,695,458; 5,252,463), elastase (see, e.g., U.S. Patent No. 4,071,410), subtilisin (see, e.g., U.S. Patent No. 5,837,516) and the like. In one aspect, a chimeric labeling reagent of the invention includes a cleavable linker. Exemplary cleavable linker sequences include, e.g., Factor Xa or enterokinase (Invitrogen, San Diego CA). Other purification facilitating domains can be used, such as metal chelating peptides, e.g., polyhistidine tracts and histidine-tryptophan modules that allow purification on immobilized metals, protein A domains that allow purification on immobilized immunoglobulin, and the domain utilized in the FLAGS extension/affinity purification system (Immunex Corp, Seattle WA).
Biological Samples
The methods are based on comparison of two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation. For example, in one aspect, the invention provides a method for quantifying changes in protein expression between at least two cellular states, such as an activated cell versus a resting cell, a normal cell versus a cancerous cell, a stem cell versus a differentiated cell, or an injured or infected cell versus an uninjured or uninfected cell; or for defining the expressed proteins associated with a given cellular state. Samples can be derived from any biological source, including cells from, e.g., bacteria, insects, yeast, mammals and the like. In one aspect, the proteome of the Bacillus anthracis microbe is analyzed using the mixed bed multi-dimensional liquid chromatographs and methods of the invention. Cells can be harvested from any body fluid or tissue source, or they can be in vitro cell lines or cell cultures.
Detection Devices and Methods
The devices and methods of the invention can also incorporate in whole or in part designs of detection devices as described, e.g., in U.S. Patent Nos. 6,197,503; 6,197,498; 6,150,147; 6,083,763; 6,066,448; 6,045,996; 6,025,601; 5,599,695; 5,981,956; 5,698,089; 5,578,832; 5,632,957.
A number of aspects of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention.
References
Unless otherwise indicated, all references cited herein (supra and infra) are incorporated by reference in their entirety.
Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R.: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17(10):994-9 (Oct) 1999.
Hopkins MJ, Sharp R, Macfarlane GT.: Age and disease related changes in intestinal bacterial populations assessed by cell culture, 16S rRNA abundance, and community cellular fatty acid profiles. Gut 48(2):198-205 (Feb) 2001.
Ritchie NJ, Schutter ME, Dick RP, Myrold DD.: Use of length heterogeneity PCR and fatty acid methyl ester profiles to characterize microbial communities in soil. Appl Environ Microbiol 66(4):1668-75 (Apr) 2000.
Khan AA, Wang RF, Cao WW, Franklin W, Cerniglia CE.: Reclassification of a polycyclic aromatic hydrocarbon-metabolizing bacterium, Beijerinckia sp. strain B1, as Sphingomonas yanoikuyae by fatty acid analysis, protein pattern analysis, DNA-DNA hybridization, and 16S ribosomal DNA sequencing. Int J Syst Bacteriol 46(2):466-9 (Apr) 1996.
Peltroche-Llacsahuanga H, Schmidt S, Lutticken R, Haase G.: Discriminative power of fatty acid methyl ester (FAME) analysis using the microbial identification system (MIS) for Candida (Torulopsis) glabrata and Saccharomyces cerevisiae. Diagn Microbiol Infect Dis 38(4):213-21 (Dec) 2000.
SA Gerber et al.: Analysis of rates of multiple enzymes in cell lysates by electrospray ionization mass spectrometry. J. Am. Chem. Soc. 121:1102-3 1999.
David Goodlett discusses the latest in genomics - ICAT reagents. Written by: Marian Moser Jones, Dec 20, 2000.
WO0011208; Filed Aug 25, 1999, Published March 2, 2000. Aebersold RH, Gelb MH, Gygi SP, Scott CR, Turecek F, Gerber SA, Rist B: Rapid quantitative analysis of proteins or protein function in complex mixtures.
WO9905221; Filed July 27, 1998, Published Feb. 4, 1999. Cummins WJ, West RM, Smith JA: Cyanine Dyes.
US4876350; Filed Dec 16, 1987, Issued Oct 24, 1989. McGarrity J, Tenud L: Process for the production of (+) biotin.
US5776723; Filed Feb 8, 1996, Issued July 7, 1998. Herold CD, O'Hagan M: Rapid detection of mycobacterium tuberculosis.
US6136173; Filed June 24, 1996, Issued Oct. 24, 2000. Anderson NL, Anderson NG, Goodman J: Automated system for two-dimensional electrophoresis.
US6127134; Filed April 20, 1995, Issued Oct. 3, 2000. Minden J, Waggoner A: Difference gel electrophoresis using matched multiple dyes.
US6064754; Filed Dec 1, 1997, Issued May 16, 2000. Parekh RB, Amess R, Bruce JA, Prime SB, Plait AE, Stoney RM: Computer-assisted methods and apparatus for identification and characterization of biomolecules in a biological sample.
US6013165; Filed May 22, 1998, Issued Jan 11, 2000. Wiktorowicz JE, Raysberg Y: Electrophoresis apparatus and method.
Ausubel FM, Brent R, Kingston RE, Moore DD, Seidman JG, Smith JA, Struhl K Editors. Current Protocols In Molecular Biology. Vol 2. John Wiley & Sons, Inc., © 2001, 10.21.4-10.21.6, 10.22.5-10.22.10, 10.22.14, 10.22.15-10.22.20.
Sambrook J, Russell DW Editors. Molecular Cloning A Laboratory Manual 3rd ed. Cold Spring Harbor Laboratory Press, New York, © 2001, 18.3, 18.62, 18.66.
Alting-Mecs MA and Short JM: Polycos vectors: a system for packaging filamentous phage and phagemid vectors using lambda phage packaging extracts. Gene 137:1, 93-100, 1993.
Arkin AP and Youvan DC: An algorithm for protein engineering: simulations of recursive ensemble mutagenesis. Proc Natl Acad Sci USA 89(16):7811-7815, (Aug 15) 1992.
Arnold FH: Protein engineering for unusual environments. Current Opinion in Biotechnology 4(4):450-455, 1993.
Ausubel FM, et al Editors. Current Protocols in Molecular Biology. Vols. 1 and 2 and supplements. (a.k.a. "The Red Book") Greene Publishing Assoc, Brooklyn, NY, ©1987.
Ausubel FM, et al Editors. Current Protocols in Molecular Biology. Vols. 1 and 2 and supplements. (a.k.a. "The Red Book") Greene Publishing Assoc, Brooklyn, NY, ©1989.
Ausubel FM, et al Editors. Short Protocols in Molecular Biology: A Compendium of Methods from Current Protocols in Molecular Biology. Greene Publishing Assoc, Brooklyn, NY, ©1989.
Ausubel FM, et al Editors. Short Protocols in Molecular Biology: A Compendium of Methods from Current Protocols in Molecular Biology, 2nd Edition. Greene Publishing Assoc, Brooklyn, NY, ©1992.
Barbas CF 3d, Bain JD, Hoekstra DM, Lerner RA: Semisynthetic combinatorial antibody libraries: a chemical solution to the diversity problem. Proc Natl Acad Sci USA 89(10):4457-4461, 1992.
Bardwell AJ, Bardwell L, Johnson DK, Friedberg EC: Yeast DNA recombination and repair proteins Rad1 and Rad10 constitute a complex in vivo mediated by localized hydrophobic domains. Mol Microbiol 8(6):1177-1188, 1993.
Barrett AJ, et al., eds.: Enzyme Nomenclature: Recommendations of the Nomenclature
Committee of the International Union of Biochemistry and Molecular Biology. San
Diego: Academic Press, Inc., 1992.
Bartel P, Chien CT, Sternglanz R, Fields S: Elimination of false positives that arise in using the two-hybrid system. Biotechniques 14(6):920-924, 1993.
Beaudry AA and Joyce GF: Directed evolution of an RNA enzyme. Science
257(5070):635-641, 1992.
Berger and Kimmel, Methods in Enzymology. Volume 152, Guide to Molecular Cloning
Techniques. Academic Press, Inc., San Diego, CA, ©1987. (Cumulative Subject Index:
Volumes 135-139, 141-167, 1990, 272 pp.)
Bevan M: Binary Agrobacterium vectors for plant transformation. Nucleic Acids
Research 12(22):8711-21, 1984.
Biocca S, Pierandrei-Amaldi P, Cattaneo A: Intracellular expression of anti-p21ras single chain Fv fragments inhibits meiotic maturation of xenopus oocytes. Biochem
Biophys Res Commun 197(2):422-427, 1993.
Bird et al. Plant Mol Biol 11:651, 1988.
Bogerd HP, Fridell RA, Blair WS, Cullen BR: Genetic evidence that the Tat proteins of human immunodeficiency virus types 1 and 2 can multimerize in the eukaryotic cell nucleus. J Virol 67(8):5030-5034, 1993.
Boyce COL, ed.: Novo's Handbook of Practical Biotechnology. 2nd ed. Bagsvaerd,
Denmark, 1986.
Brederode FT, Koper-Zwarthoff EC, Bol JF: Complete nucleotide sequence of alfalfa mosaic virus RNA 4. Nucleic Acids Research 8(10):2213-23, 1980.
Breitling F, Dubel S, Seehaus T, Klewinghaus I, Little M: A surface expression vector for antibody screening. Gene 104(2):147-153, 1991.
Brown NL, Smith M: Cleavage specificity of the restriction endonuclease isolated from
Haemophilus gallinarum (Hga I). Proc Natl Acad Sci USA 74(8):3213-6, (Aug) 1977.
Burton DR, Barbas CF 3d, Persson MA, Koenig S, Chanock RM, Lerner RA: A large array of human monoclonal antibodies to type 1 human immunodeficiency virus from combinatorial libraries of asymptomatic seropositive individuals. Proc Natl Acad Sci USA 88(22):10134-7, (Nov 15) 1991.
Caldwell RC and Joyce GF: Randomization of genes by PCR mutagenesis. PCR
Methods Appl 2(1):28-33, 1992.
Caton AJ and Koprowski H: Influenza virus hemagglutinin-specific antibodies isolated from a combinatorial expression library are closely related to the immune response of the donor. Proc Natl Acad Sci USA 87(16):6450-6454, 1990.
Chakraborty T, Martin JF, Olson EN: Analysis of the oligomerization of myogenin and
E2A products in vivo using a two-hybrid assay system. J Biol Chem 267(25):17498-501, 1992.
Chang CN, Landolfi NF, Queen C: Expression of antibody Fab domains on bacteriophage surfaces. Potential use for antibody selection. J Immunol 147(10):3610-4,
(Nov 15) 1991.
Chaudhary VK, Batra JK, Gallo MG, Willingham MC, FitzGerald DJ, Pastan I: A rapid method of cloning functional variable-region antibody genes in Escherichia coli as single-chain immunotoxins. Proc Natl Acad Sci USA 87(3):1066-1070, 1990.
Chien CT, Bartel PL, Sternglanz R, Fields S: The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proc Natl
Acad Sci USA 88(21):9578-9582, 1991.
Chiswell DJ, McCafferty J: Phage antibodies: will new 'coliclonal' antibodies replace monoclonal antibodies? Trends Biotechnol 10(3):80-84, 1992.
Chothia C and Lesk AM: Canonical structures for the hypervariable regions of immunoglobulins. J Mol Biol 196(4):901-917, 1987.
Chothia C, Lesk AM, Tramontano A, Levitt M, Smith-Gill SJ, Air G, Sheriff S, Padlan
EA, Davies D, Tulip WR, et al: Conformations of immunoglobulin hypervariable regions. Nature 342(6252): 877-883, 1989.
Clackson T, Hoogenboom HR, Griffiths AD, Winter G: Making antibody fragments using phage display libraries. Nature 352(6336):624-628, 1991.
Conrad M, Topal MD: DNA and spermidine provide a switch mechanism to regulate the activity of restriction enzyme Nae I. Proc Natl Acad Sci USA 86(24):9707-11, (Dec)
1989.
Coruzzi G, Broglie R, Edwards C, Chua NH: Tissue-specific and light-regulated expression of a pea nuclear gene encoding the small subunit of ribulose-1,5-bisphosphate carboxylase. EMBO J 3(8):1671-9, 1984.
Dasmahapatra B, DiDomenico B, Dwyer S, Ma J, Sadowski I, Schwartz J: A genetic system for studying the activity of a proteolytic enzyme. Proc Natl Acad Sci USA
89(9):4159-4162, 1992.
Davis LG, Dibner MD, Battey JF. Basic Methods in Molecular Biology. Elsevier, New
York, NY, ©1986.
Delagrave S and Youvan DC. Biotechnology Research 11:1548-1552, 1993.
DeLong EF, Wu KY, Prezelin BB, Jovine RV: High abundance of Archaea in Antarctic marine picoplankton. Nature 371(6499):695-697, 1994.
Deng SJ, MacKenzie CR, Sadowska J, Michniewicz J, Young NM, Bundle DR, Narang
SA: Selection of antibody single-chain variable fragments with improved carbohydrate binding by phage display. J Biol Chem 269(13):9533-9538, 1994.
Drauz K, Waldman H, eds.: Enzyme Catalysis in Organic Synthesis: A Comprehensive
Handbook. Vol. 1. New York: VCH Publishers, 1995.
Drauz K, Waldman H, eds.: Enzyme Catalysis in Organic Synthesis: A Comprehensive
Handbook. Vol. 2. New York: VCH Publishers, 1995.
Duan L, Bagasra O, Laughlin MA, Oakes JW, Pomerantz RJ: Potent inhibition of human immunodeficiency virus type 1 replication by an intracellular anti-Rev single-chain antibody. Proc Natl Acad Sci USA 91(11):5075-5079, 1994.
Durfee T, Becherer K, Chen PL, Yeh SH, Yang Y, Kilburn AE, Lee WH, Elledge SJ:
The retinoblastoma protein associates with the protein phosphatase type 1 catalytic subunit. Genes Dev 7(4):555-569, 1993.
Ellington AD and Szostak JW: In vitro selection of RNA molecules that bind specific ligands. Nature 346(6287):818-822, 1990.
Fields S and Song O: A novel genetic system to detect protein-protein interactions.
Nature 340(6230):245-246, 1989.
Firek S, Draper J, Owen MR, Gandecha A, Cockburn B, Whitelam GC: Secretion of a functional single-chain Fv protein in transgenic tobacco plants and cell suspension cultures. Plant Mol Biol 23(4):861-870, 1993.
Forsblom S, Rigler R, Ehrenberg M, Philipson L: Kinetic studies on the cleavage of adenovirus DNA by restriction endonuclease Eco RI. Nucleic Acids Res 3(12):3255-69,
(Dec) 1976.
Foster GD, Taylor SC, eds.: Plant Virology Protocols: From Virus Isolation to
Transgenic Resistance. Methods in Molecular Biology, Vol. 81. New Jersey: Humana
Press Inc., 1998.
Franks F, ed.: Protein Biotechnology: Isolation, Characterization, and Stabilization. New
Jersey: Humana Press Inc., 1993.
Germino FJ, Wang ZX, Weissman SM: Screening for in vivo protein-protein interactions. Proc Natl Acad Sci USA 90(3):933-937, 1993.
Gingeras TR, Brooks JE: Cloned restriction/modification system from Pseudomonas aeruginosa. Proc Natl Acad Sci USA 80(2):402-6, 1983 (Jan).
Gluzman Y: SV40-transformed simian cells support the replication of early SV40 mutants. Cell 23(1):175-182, 1981.
Godfrey T, West S, eds.: Industrial Enzymology. 2nd ed. London: Macmillan Press Ltd,
1996.
Gottschalk G: Bacterial Metabolism. 2nd ed. New York: Springer-Verlag Inc., 1986.
Gresshoff PM, ed.: Technology Transfer of Plant Biotechnology. Current Topics in Plant
Molecular Biology. Boca Raton: CRC Press, 1997.
Griffin HG, Griffin AM, eds.: PCR Technology: Current Innovations. Boca Raton: CRC Press, Inc., 1994.
Gruber M, Schodin BA, Wilson ER, Kranz DM: Efficient tumor cell lysis mediated by a bispecific single chain antibody expressed in Escherichia coli. J Immunol 152(11):5368-
5374, 1994.
Guarente L: Strategies for the identification of interacting proteins. Proc Natl Acad Sci
USA 90(5):1639-1641, 1993.
Guilley H, Dudley RK, Jonard G, Balazs E, Richards KE: Transcription of Cauliflower mosaic virus DNA: detection of promoter sequences, and characterization of transcripts.
Cell 30(3):763-73, 1982.
Hansen G, Chilton MD: Lessons in gene transfer to plants by a gifted microbe. Curr Top
Microbiol Immunol 240:21-57, 1999.
Hardy CF, Sussel L, Shore D: A RAP1-interacting protein involved in transcriptional silencing and telomere length regulation. Genes Dev 6(5):801-814, 1992.
Hartmann HT, et al.: Plant Propagation: Principles and Practices. 6th ed. New Jersey:
Prentice Hall, Inc., 1997.
Hawkins RE and Winter G: Cell selection strategies for making antibodies from variable gene libraries: trapping the memory pool. Eur J Immunol 22(3):867-870, 1992.
Holvoet P, Laroche Y, Lijnen HR, Van Hoef B, Brouwers E, De Cock F, Lauwereys M,
Gansemans Y, Collen D: Biochemical characterization of single-chain chimeric plasminogen activators consisting of a single-chain Fv fragment of a fibrin-specific antibody and single-chain urokinase. Eur J Biochem 210(3):945-952, 1992.
Honjo T, Alt FW, Rabbitts TH (eds): Immunoglobulin genes. Academic Press: San
Diego, CA, pp. 361-368, ©1989.
Hoogenboom HR, Griffiths AD, Johnson KS, Chiswell DJ, Judson P, Winter G: Multi-subunit proteins on the surface of filamentous phage: methodologies for displaying antibody (Fab) heavy and light chains. Nucleic Acids Res 19(15):4133-4137, 1991.
Huse WD, Sastry L, Iverson SA, Kang AS, Alting-Mees M, Burton DR, Benkovic SJ,
Lerner RA: Generation of a large combinatorial library of the immunoglobulin repertoire in phage lambda. Science 246(4935):1275-1281, 1989.
Huston JS, Levinson D, Mudgett-Hunter M, Tai MS, Novotney J, Margolies MN, Ridge
RJ, Bruccoleri RE, Haber E, Crea R, et al: Protein engineering of antibody binding sites: recovery of specific activity in an anti-digoxin single-chain Fv analogue produced in
Escherichia coli. Proc Natl Acad Sci USA 85(16):5879-5883, 1988.
Ivan Lefkovits, Editor. Immunology methods manual: the comprehensive sourcebook of techniques. Academic Press, San Diego, ©1997.
Iwabuchi K, Li B, Bartel P, Fields S: Use of the two-hybrid system to identify the domain of p53 involved in oligomerization. Oncogene 8(6):1693-1696, 1993.
Jackson AL, Pahl PM, Harrison K, Rosamond J, Sclafani RA: Cell cycle regulation of the yeast Cdc7 protein kinase by association with the Dbf4 protein. Mol Cell Biol
13(5):2899-2908, 1993.
Johnson S and Bird RE: Methods Enzymol 203:88, 1991.
Kabat et al: Sequences of Proteins of Immunological Interest. 4th Ed. U.S. Department of Health and Human Services, Bethesda, MD (1987)
Kang AS, Barbas CF, Janda KD, Benkovic SJ, Lerner RA: Linkage of recognition and replication functions by assembling combinatorial antibody Fab libraries along phage surfaces. Proc Natl Acad Sci USA 88(10):4363-4366, 1991.
Kettleborough CA, Ansell KH, Allen RW, Rosell-Vives E, Gussow DH, Bendig MM:
Isolation of tumor cell-specific single-chain Fv from immunized mice using phage-antibody libraries and the re-construction of whole antibodies from these antibody fragments. Eur J Immunol 24(4):952-958, 1994.
Kruger DH, Barcak GJ, Reuter M, Smith HO: EcoRII can be activated to cleave refractory DNA recognition sites. Nucleic Acids Res 16(9):3997-4008, (May 11) 1988.
Lalo D, Carles C, Sentenac A, Thuriaux P: Interactions between three common subunits of yeast RNA polymerases I and III. Proc Natl Acad Sci USA 90(12):5524-5528, 1993.
Laskowski M Sr: Purification and properties of venom phosphodiesterase. Methods
Enzymol 65(1):276-84, 1980.
Lefkovits I and Pernis B, Editors. Immunological Methods. Vols. I and II. Academic
Press, New York, NY. Also Vol. III published in Orlando and Vol. IV published in San
Diego. ©1979-.
Lerner RA, Kang AS, Bain JD, Burton DR, Barbas CF 3d: Antibodies without immunization. Science 258(5086):1313-1314, 1992.
Leung, D.W., et al, Technique, 1:11-15, 1989.
Li B and Fields S: Identification of mutations in p53 that affect its binding to SV40 large
T antigen by using the yeast two-hybrid system. FASEB J 7(10):957-963, 1993.
Lilley GG, Dolezal O, Hillyard CJ, Bernard C, Hudson PJ: Recombinant single-chain antibody peptide conjugates expressed in Escherichia coli for the rapid diagnosis of HIV. J Immunol Methods 171(2):211-226, 1994.
Lowman HB, Bass SH, Simpson N, Wells JA: Selecting high-affinity binding proteins by monovalent phage display. Biochemistry 30(45):10832-10838, 1991.
Luban J, Bossolt KL, Franke EK, Kalpana GV, Goff SP: Human immunodeficiency virus type 1 Gag protein binds to cyclophilins A and B. Cell 73(6):1067-1078, 1993.
Madura K, Dohmen RJ, Varshavsky A: N-recognin/Ubc2 interactions in the N-end rule pathway. J Biol Chem 268(16):12046-54, (Jun 5) 1993.
Marks JD, Griffiths AD, Malmqvist M, Clackson TP, Bye JM, Winter G: By-passing immunization: building high affinity human antibodies by chain shuffling. Biotechnology (NY) 10(7):779-783, 1992.
Marks JD, Hoogenboom HR, Bonnert TP, McCafferty J, Griffiths AD, Winter G: Bypassing immunization. Human antibodies from V-gene libraries displayed on phage. J Mol Biol 222(3): 581-597, 1991.
Marks JD, Hoogenboom HR, Griffiths AD, Winter G: Molecular evolution of proteins on filamentous phage. Mimicking the strategy of the immune system. J Biol Chem 267(23):16007-16010, 1992.
Maxam AM, Gilbert W: Sequencing end-labeled DNA with base-specific chemical cleavages. Methods Enzymol 65(1):499-560, 1980.
McCafferty J, Griffiths AD, Winter G, Chiswell DJ: Phage antibodies: filamentous phage displaying antibody variable domains. Nature 348(6301):552-554, 1990. Method of DNA sequencing.
Miller JH. A Short Course in Bacterial Genetics: A Laboratory Manual and Handbook for Escherichia coli and Related Bacteria (see inclusively p. 445). Cold Spring Harbor Laboratory Press, Plainview, NY, ©1992.
Milne GT and Weaver DT: Dominant negative alleles of RAD52 reveal a DNA repair/recombination complex including Rad51 and Rad52. Genes Dev 7(9):1755-1765, 1993.
Mullinax RL, Gross EA, Amberg JR, Hay BN, Hogrefe HH, Kubitz MM, Greener A, Alting-Mees M, Ardourel D, Short JM, et al: Identification of human antibody fragment clones specific for tetanus toxoid in a bacteriophage lambda immunoexpression library. Proc Natl Acad Sci USA 87(20):8095-8099, 1990.
Nath K, Azzolina BA: in Gene Amplification and Analysis (ed. Chirikjian JG), vol. 1, p. 113, Elsevier North Holland, Inc., New York, New York, ©1981.
Needleman SB and Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443-453, 1970.
Nelson M, Christ C, Schildkraut I: Alteration of apparent restriction endonuclease recognition specificities by DNA methylases. Nucleic Acids Res 12(13):5165-73, 1984
(Jul 11).
Nicholls PJ, Johnson VG, Andrew SM, Hoogenboom HR, Raus JC, Youle RJ:
Characterization of single-chain antibody (sFv)-toxin fusion proteins produced in vitro in rabbit reticulocyte lysate. J Biol Chem 268(7):5302-5308, 1993.
Oller AR, Vanden Broek W, Conrad M, Topal MD: Ability of DNA and spermidine to affect the activity of restriction endonucleases from several bacterial species.
Biochemistry 30(9):2543-9, (Mar 5) 1991.
Owen MRL, Pen J: Transgenic Plants: A Production System for Industrial and
Pharmaceutical Proteins. Chichester: John Wiley & Sons, 1996.
Owens RJ and Young RJ: The genetic engineering of monoclonal antibodies. J Immunol
Methods 168(2):149-165, 1994.
Pearson WR and Lipman DJ: Improved tools for biological sequence comparison. Proc
Natl Acad Sci USA 85(8):2444-2448, 1988.
Pein CD, Reuter M, Meisel A, Cech D, Kruger DH: Activation of restriction endonuclease EcoRII does not depend on the cleavage of stimulator DNA. Nucleic Acids
Res 19(19):5139-42, (Oct 11) 1991.
Persson MA, Caothien RH, Burton DR: Generation of diverse high-affinity human monoclonal antibodies by repertoire cloning. Proc Natl Acad Sci USA 88(6):2432-2436,
1991.
Perun TJ, Propst CL, eds.: Computer-Aided Drug Design: Methods and Applications.
New York: Marcel Dekker, Inc., 1989.
Qiang BQ, McClelland M, Poddar S, Spokauskas A, Nelson M: The apparent specificity of NotI (5'-GCGGCCGC-3') is enhanced by M.FnuDII or M.BepI methyltransferases (5'-mCGCG-3'): cutting bacterial chromosomes into a few large pieces. Gene 88(1):101-5,
(Mar 30) 1990.
Queen C, Foster J, Stauber C, Stafford J: Cell-type specific regulation of a kappa immunoglobulin gene by promoter and enhancer elements. Immunol Rev 89:49-68, 1986.
Raleigh EA, Wilson G: Escherichia coli K-12 restricts DNA containing 5-methylcytosine. Proc Natl Acad Sci USA 83(23):9070-4, (Dec) 1986.
Reidhaar-Olson JF and Sauer RT: Combinatorial cassette mutagenesis as a probe of the informational content of protein sequences. Science 241(4861):53-57, 1988.
Riechmann L and Weill M: Phage display and selection of a site-directed randomized single-chain antibody Fv fragment for its affinity improvement. Biochemistry 32(34):8848-8855, 1993.
Roberts RJ, Macelis D: REBASE—restriction enzymes and methylases. Nucleic Acids Res 24(1):223-35, (Jan 1) 1996.
Ryan AJ, Royal CL, Hutchinson J, Shaw CH: Genomic sequence of a 12S seed storage protein from oilseed rape (Brassica napus c.v. jet neuf). Nucl Acids Res 17(9):3584, 1989.
Sambrook J, Fritsch EF, Maniatis T. Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, ©1982.
Sambrook J, Fritsch EF, Maniatis T. Molecular Cloning: A Laboratory Manual. Second Edition. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, ©1989.
Scopes RK. Protein Purification: Principles and Practice. Springer-Verlag, New York, NY, ©1982.
Segel IH: Enzyme Kinetics: Behavior and Analysis of Rapid Equilibrium and Steady-State Enzyme Systems. New York: John Wiley & Sons, Inc., 1993.
Silver SC and Hunt SW 3d: Techniques for cloning cDNAs encoding interactive transcriptional regulatory proteins. Mol Biol Rep 17(3):155-165, 1993.
Smith TF, Waterman MS, Fitch WM: Comparative biosequence metrics. J Mol Evol 18(1):38-46, 1981.
Smith TF, Waterman MS. Adv Appl Math 2:482-end of article, 1981.
Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 147(1):195-7, (Mar 25) 1981.
Smith TF, Waterman MS: Overlapping genes and information theory. J Theor Biol 91(2):379-80, (Jul 21) 1981.
Staudinger J, Perry M, Elledge SJ, Olson EN: Interactions among vertebrate helix-loop-helix proteins in yeast using the two-hybrid system. J Biol Chem 268(7):4608-4611, 1993.
Stemmer WP, Morris SK, Wilson BS: Selection of an active single chain Fv antibody from a protein linker library prepared by enzymatic inverse PCR. Biotechniques 14(2):256-265, 1993.
Stemmer WP: DNA shuffling by random fragmentation and reassembly: in vitro recombination for molecular evolution. Proc Natl Acad Sci USA 91(22):10747-10751, 1994.
Sun D, Hurley LH: Effect of the (+)-CC-1065-(N3-adenine)DNA adduct on in vitro
DNA synthesis mediated by Escherichia coli DNA polymerase. Biochemistry 31(10):2822-9, (Mar 17) 1992.
Tague BW, Dickinson CD, Chrispeels MJ: A short domain of the plant vacuolar protein phytohemagglutinin targets invertase to the yeast vacuole. Plant Cell 2(6):533-46, (June)
1990.
Takahashi N, Kobayashi I: Evidence for the double-strand break repair model of bacteriophage lambda recombination. Proc Natl Acad Sci USA 87(7):2790-4, (Apr)
1990.
Thiesen HJ and Bach C: Target Detection Assay (TDA): a versatile procedure to determine DNA binding sites as demonstrated on SP1 protein. Nucleic Acids Res
18(11):3203-3209, 1990.
Thomas M, Davis RW: Studies on the cleavage of bacteriophage lambda DNA with
EcoRI Restriction endonuclease. J Mol Biol 91(3):315-28, (Jan 25) 1975.
Tingey SV, Walker EL, Corruzzi GM: Glutamine synthetase genes of pea encode distinct polypeptides which are differentially expressed in leaves, roots and nodules.
EMBO J 6(1):1-9, 1987.
Topal MD, Thresher RJ, Conrad M, Griffith J: NaeI endonuclease binding to pBR322
DNA induces looping. Biochemistry 30(7):2006-10, (Feb. 19) 1991.
Tramontano A, Chothia C, Lesk AM: Framework residue 71 is a major determinant of the position and conformation of the second hypervariable region in the VH domains of immunoglobulins. J Mol Biol 215(1):175-182, 1990.
Tuerk C and Gold L: Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science 249(4968):505-510, 1990.
USPN 4,683,195; Filed Feb. 7, 1986, Issued Jul. 28, 1987. Mullis KB, Erlich HA,
Arnheim N, Horn GT, Saiki RK, Scharf SJ: Process for Amplifying, Detecting, and/or
Cloning Nucleic Acid Sequences.
USPN 4,683,202; Filed Oct. 25, 1985, Issued Jul. 28, 1987. Mullis KB: Process for
Amplifying Nucleic Acid Sequences.
USPN 4,704,362; Filed Nov. 5, 1979, Issued Nov. 3, 1987. Itakura K, Riggs AD:
Recombinant Cloning Vehicle Microbial Polypeptide Expression.
USPN 4,713,337; Filed Jan. 3, 1985, Issued Dec. 15, 1987. Jasin M, Schimmel PR:
Method for deletion of a gene from a bacteria.
USPN 4,732,856; Filed April 3, 1984, Issued March 22, 1988. Federoff NV:
Transposable elements and process for using same.
USPN 4,963,487; Filed Sept. 14, 1987, Issued Jan. 16, 1990. Schimmel PR: Method for deletion of a gene from a bacteria.
USPN 5,354,656; Filed Oct. 2, 1989, Issued Oct. 11, 1994. Sorge, Joseph A. ; Huse,
William D.:
USPN 5,385,835; Filed May 19, 1994, Issued Jan. 31, 1995. Helentjaris, Timothy ;
Nienhuis, James: Identification and localization and introgression into plants of desired multigenic traits.
USPN 5,453,247; Filed Nov. 23, 1993, Issued Sept. 26, 1995. Beavis, Ronald C. ; Chait,
Brian T.: Instrument and method for the sequencing of genome.
USPN 5,604,100; Filed July 19, 1995, Issued Feb. 18, 1997. Perlin, Mark W.: Method and system for sequencing genomes.
USPN 5,670,321; Filed May 10, 1995, Issued Sept. 23, 1997. Kimmel, Bruce E. ; Ellis,
Michael ; Ruddy, David: Efficient method to conduct large-scale genome sequencing.
USPN 5,925,808; Filed Dec. 19, 1997, Issued July 20, 1999. Oliver, Melvin John ;
Quisenberry, Jerry Edwin ; Trolinder, Norma Lee Glover ; Keim, Don Lee: Control Of
Plant Gene Expression.
USPN 5,953,727; Filed March 6, 1997, Issued Sept. 14, 1999. Maslyn, Timothy J. ; Au-Young,
Janice ; Hillman, Jennifer L. ; Hibbert, Harold ; Akerblom, Ingrid E. ; Cheng,
Rachel J. ; Tang, Yuanhua T.: Project-based full-length biomolecular sequence database.
USPN 5,965,443; Filed Sept. 9, 1996, Issued Oct. 12, 1999. Reznikoff WS, Goryshin IY:
System for in vitro transposition.
USPN 5,981,177; Filed Jan. 25, 1995, Issued Nov. 9, 1999. Demirjian DC, Casadaban
MJ, Weber M, Gaines GL: Protein fusion method and constructs.
USPN 5,994,058; Filed March 20, 1995, Issued Nov. 30, 1999. Senapathy,
Periannan: Method For Contiguous Genome Sequencing.
USPN 6,023,659; Filed March 6, 1997, Issued Feb. 8, 2000. Seilhamer, Jeffrey J. ;
Akerblom, Ingrid E. ; Altus, Christina M. ; Klingler, Tod M. ; Russo, Frank ; Au-Young,
Janice ; Hillman, Jennifer L. ; Maslyn, Timothy J.: Database System Employing Protein
Function Hierarchies For Viewing Biomolecular Sequence Data.
van de Poll ML, Lafleur MV, van Gog F, Vrieling H, Meerman JH: N-acetylated and deacetylated 4'-fluoro-4-aminobiphenyl and 4-aminobiphenyl adducts differ in their ability to inhibit DNA replication of single-stranded M13 in vitro and of single-stranded phi X174 in Escherichia coli. Carcinogenesis 13(5):751-8, (May) 1992.
Vojtek AB, Hollenberg SM, Cooper JA: Mammalian Ras interacts directly with the serine/threonine kinase Raf. Cell 74(1):205-214, 1993.
Wenzler H, Mignery G, Fisher L, Park W: Sucrose-regulated expression of a chimeric potato tuber gene in leaves of transgenic tobacco plants. Plant Mol Biol 13(4):347-54,
1989.
White JS, White DC: Source Book of Enzymes. Boca Raton: CRC Press, 1997.
Williams and Barclay, in Immunoglobulin Genes. The Immunoglobulin Gene
Superfamily
Winnacker EL. From Genes to Clones: Introduction to Gene Technology. VCH
Publishers, New York, NY, ©1987.
Winter G and Milstein C: Man-made antibodies. Nature 349(6307):293-299, 1991.
WO 00/04190; Filed July 15, 1999, Published Jan. 27, 2000. Del Cardayre S, Tobin M,
Stemmer WP, Ness JE, Minshull J, Patten PA, Subramanian V, Castle LA, Krebber CM,
Bass S, Zhang Y, Cox T, Huisman G, Yuan L, Affholter JA: Evolution of whole cells and organisms by recursive sequence recombination.
WO 00/09755; Filed Aug. 12, 1999, Published Feb. 24, 2000. Zarling D, Reddy G, Pati
S: Domain specific gene evolution.
WO 88/08453; Filed Apr. 14, 1988, Published Nov. 3, 1988. Alakhov JB, Baranov, VI,
Ovodov SJ, Ryabova LA, Spirin AS: Method of Obtaining Polypeptides in Cell-Free
Translation System.
WO 90/05785; Filed Nov. 15, 1989, Published May 31, 1990. Schultz P: Method for
Site-Specifically Incorporating Unnatural Amino Acids into Proteins.
WO 90/07003; Filed Jan. 27, 1989, Published June 28, 1990. Baranov VI, Morozov IJ,
Spirin AS: Method for Preparative Expression of Genes in a Cell-free System of
Conjugated Transcription/translation.
WO 91/02076; Filed June 14, 1990, Published Feb. 21, 1991. Baranov VI, Ryabova LA,
Yarchuk OB, Spirin AS: Method for Obtaining Polypeptides in a Cell-free System.
WO 91/05058; Filed Oct. 5, 1989, Published Apr. 18, 1991. Kawasaki G: Cell-free
Synthesis and Isolation of Novel Genes and Polypeptides.
WO 91/17271; Filed May 1, 1990, Published Nov. 14, 1991. Dower WJ, Cwirla SE:
Recombinant Library Screening Methods.
WO 91/18980; Filed May 13, 1991, Published Dec. 12, 1991. Devlin JJ: Compositions and Methods for Identifying Biologically Active Molecules.
WO 91/19818; Filed June 20, 1990, Published Dec. 26, 1991. Dower WJ, Cwirla SE,
Barrett RW: Peptide Library and Screening Systems.
WO 92/02536; Filed Aug. 1, 1991, Published Feb. 20, 1992. Gold L, Tuerk C:
Systematic Polypeptide Evolution by Reverse Translation.
WO 92/03918; Filed Aug. 28, 1991, Published Mar. 19, 1992. Lonberg N, Kay RM:
Transgenic Non-human Animals Capable of Producing Heterologous Antibodies.
WO 92/05258; Filed Sept. 17, 1991, Published Apr. 2, 1992. Fincher GB: Gene
Encoding Barley Enzyme.
WO 92/14843; Filed Feb. 21, 1992, Published Sept. 3, 1992. Toole JJ, Griffin LC, Bock
LC, Latham JA, Muenchau DD, Krawczyk S: Aptamers Specific for Biomolecules and
Method of Making.
WO 93/08278; Filed Oct. 15, 1992, Published Apr. 29, 1993. Schatz PJ, Cull MG, Miller
JF, Stemmer WP: Peptide Library and Screening Method.
WO 93/12227; Filed Dec. 17, 1992, Published June 24, 1993. Lonberg N, Kay RM:
Transgenic Non-human Animals Capable of Producing Heterologous Antibodies.
WO 94/25585; Filed Apr. 25, 1994, Published Nov. 10, 1994. Lonberg N, Kay RM:
Transgenic Non-human Animals Capable of Producing Heterologous Antibodies.
WO 95/00530; Filed June 6, 1994, Published Jan. 1, 1995. Fodor, Stephen, P., A. ;
Lipshutz, Robert, J. ; Huang, Xiaohua ; Jevons, Luis, Carlos: Hybridization and
Sequencing of Nucleic Acids.
WO 96/21031; Filed June 7, 1995, Published July 11, 1996. Tricoli, David, M. ; Carney,
Kim, J. ; Russell, Paul, F. ; Quemada, Hector, D. ; Mcmaster, J., Russell ; Reynolds,
John, F. ; Deng, Rosaline, Z.: Transgenic Plants Expressing DNA Constructs Containing
A Plurality Of Genes To Impart Virus Resistance.
WO 96/27025; Filed Feb. 21, 1996, Published Sept. 6, 1996. Rabani, Ely,
Michael: Device, Compounds, Algorithms, And Methods Of Molecular Characterization
And Manipulation With Molecular Parallelism.
WO 97/17429; Filed Nov. 8, 1996, Published May 15, 1997. Oglevee-O'donovan, Wendy
; Arteca, Richard, N. ; Arteca, Jeannette ; Stoots, Eleanor: Method For The Commercial Production Of Transgenic Plants.
WO 97/35966; Filed March 20, 1997, Published Oct. 2, 1997. Minshull J, Stemmer WP:
Methods and compositions for cellular and metabolic engineering.
WO 97/37041; Filed March 18, 1997, Published Oct. 9, 1997. Köster, Hubert: DNA
Sequencing By Mass Spectrometry.
WO 97/42348; Filed May 5, 1997, Published Nov. 13, 1997. Köster, Hubert ; Van Den
Boom, Dirk ; Ruppert, Andreas: Process For Direct Sequencing During Template
Amplification.
WO 98/26407; Filed Dec. 11, 1997, Published June 18, 1998. Sabatini, Cathryn, E. ;
Heath, Joe, Don ; Covitz, Peter, A. ; Klingler, Tod, M. ; Russo, Frank, D. ; Berry,
Stephanie, F.: Database And System For Storing, Comparing And Displaying Genomic
Information.
WO 98/26408; Filed Dec. 11, 1997, Published June 18, 1998. Sabatini, Cathryn, E. ;
Heath, Joe, Don ; Covitz, Peter, A. ; Klingler, Tod, M. ; Russo, Frank, D. ; Berry,
Stephanie, F.: Database And System For Determining, Storing And Displaying Gene
Locus Information.
WO 98/31833; Filed Dec. 12, 1997, Published July 23, 1998. Ju, Jingyue: Nucleic Acid
Sequencing With Solid Phase Capturable Terminators.
WO 98/31834; Filed Dec 12, 1997, Published July 23, 1998. Ju, Jingyue: Sets Of
Labeled Energy Transfer Fluorescent Primers And Their Use In Multi Component
Analysis.
WO 98/31837; Filed Jan. 16, 1998, Published July 23, 1998. Delcardayre SB, Tobin MB,
Stemmer WP, Ness JE, Minshull J, Patten P: Evolution of whole cells and organisms by recursive sequence recombination.
WO 98/36085; Filed Feb. 13, 1998, Published Aug. 20, 1998. Sutliff, Thomas, D. ;
Rodriguez, Raymond, L.: Production Of Mature Proteins In Plants.
WO 98/37223; Filed Feb. 18, 1998, Published Aug. 27, 1998. Pang, Sheng-Zhi ;
Gonsalves, Dennis ; Jan, Fuh-Jyh: DNA Construct To Confer Multiple Traits On Plants.
WO 99/35494; Filed Jan. 8, 1999, Published July 15, 1999. Tally FP, Tao J, Wendler PA,
Connelly G, Gallant PL: Method for identifying validated target and assay combinations.
WO 99/37755; Filed Dec. 11, 1998, Published July 29, 1999. Pati S, Zarling David,
Lehman CW, Zeng H: The use of consensus sequences for targeted homologous gene isolation and recombination in gene families.
WO 99/49403; Filed March 25, 1999, Published Sept. 30, 1999. Lincoln, Stephen, E. ;
Hodgson, David, M. ; Spiro, Peter, A. ; Russo, Frank, D. ; Akerblom, Ingrid, E. ;
Hillman, Jennifer, L. ; Jones, Anissa, Lee ; Bratcher, Shawn, Robert ; Cohen, Howard,
Jerome ; Dufour, Gerard ; Wood, Michael, Peter ; Koleszar, Alexander, George ;
Banville, Steven, C: System And Methods For Analyzing Biomolecular Sequences.
WO95/11995; Filed Oct. 26, 1994, Published May 4, 1995. Chee M, Cronin MT, Fodor
SP, Gingeras TR, Huang XC, Hubbell EA, Lipshutz RJ, Lobban PE, Miyada CG, Morris
MS, Shah N, Sheldon EL: Arrays Of Nucleic Acid Probes On Biological Chips.
Wong CH, Whitesides GM: Enzymes in Synthetic Organic Chemistry. Vol. 12. New
York: Elsevier Science Publications, 1995.
Yang X, Hubbard EJ, Carlson M: A protein kinase substrate identified by the two-hybrid system. Science 257(5070):680-2, (Jul 31) 1992.
Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R.: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17(10):994-9
(Oct) 1999.
Hopkins MJ, Sharp R, Macfarlane GT.: Age and disease related changes in intestinal bacterial populations assessed by cell culture, 16S rRNA abundance, and community cellular fatty acid profiles. Gut 48(2):198-205 (Feb) 2001.
Ritchie NJ, Schutter ME, Dick RP, Myrold DD.: Use of length heterogeneity PCR and fatty acid methyl ester profiles to characterize microbial communities in soil. Appl
Environ Microbiol 66(4):1668-75 (Apr) 2000.
Khan AA, Wang RF, Cao WW, Franklin W, Cerniglia CE.: Reclassification of a polycyclic aromatic hydrocarbon-metabolizing bacterium, Beijerinckia sp. strain B1, as
Sphingomonas yanoikuyae by fatty acid analysis, protein pattern analysis, DNA-DNA hybridization, and 16S ribosomal DNA sequencing. Int J Syst Bacteriol 46(2):466-9
(Apr) 1996.
Peltroche-Llacsahuanga H, Schmidt S, Lutticken R, Haase G.: Discriminative power of fatty acid methyl ester (FAME) analysis using the microbial identification system (MIS) for Candida (Torulopsis) glabrata and Saccharomyces cerevisiae. Diagn Microbiol Infect
Dis 38(4):213-21 (Dec) 2000.
SA Gerber et al.: Analysis of rates of multiple enzymes in cell lysates by electrospray ionization mass spectrometry. J. Am. Chem. Soc. 121:1102-3, 1999.
WO0011208; Filed Aug 25, 1999, Published March 2, 2000. Aebersold RH, Gelb MH, Gygi, SP, Scott CR, Turecek F, Gerber SA, Rist B: Rapid quantitative analysis of proteins or protein function in complex mixtures.
WO9905221; Filed July 27 1998, Published Feb. 4,1999. Cummins WJ, West RM, Smith
JA: Cyanine Dyes.
US4876350; Filed Dec 16, 1987, Issued Oct 24, 1989. McGarrity J, Tenud L: Process for the production of (+) biotin.
US5776723; Filed Feb 8, 1996, Issued July 7, 1998. Herold CD, O'Hagan M: Rapid detection of mycobacterium tuberculosis.
US6136173; Filed June 24, 1996, Issued Oct. 24, 2000. Anderson NL, Anderson NG,
Goodman J: Automated system for two-dimensional electrophoresis.
US6127134; Filed April 20, 1995, Issued Oct. 3, 2000. Minden J, Waggoner A:
Difference gel electrophoresis using matched multiple dyes.
US6064754; Filed Dec 1, 1997, Issued May 16, 2000. Parekh RB, Amess R, Bruce JA,
Prime SB, Platt AE, Stoney RM: Computer-assisted methods and apparatus for identification and characterization of biomolecules in a biological sample.
US6013165; Filed May 22, 1998, Issued Jan 11,2000. Wiktorowicz JE, Raysberg Y:
Electrophoresis apparatus and method.
Ausubel FM, Brent R, Kingston RE, Moore DD, Seidman JG, Smith JA, Struhl K
Editors. Current Protocols In Molecular Biology. Vol 2. John Wiley & Sons, Inc., © 2001,
10.21.4-10.21.6, 10.22.5-10.22.10, 10.22.14, 10.22.15-10.22.20.
Sambrook J, Russell DW Editors. Molecular Cloning A Laboratory Manual 3rd ed. Cold
Spring Harbor Laboratory Press, New York, © 2001, 18.3, 18.62, 18.66.
Additional methods for differential analysis
Protein expression profiling using selective differential labeling
The use of mass spectrometry to identify proteins whose sequences are present in either DNA or protein databases is well established and integral to the field of Proteomics. Protein and peptide masses can be determined with high accuracy by several mass spectrometric techniques. Peptides can be further fragmented in a tandem or ion trap mass spectrometer, yielding sequence information for the peptide. Both types of mass information can be used to identify a protein in a sequence database. One goal of Proteomics is to define the expressed proteins associated with a given cellular state, and another is to quantify changes in protein expression between cellular states. One of the new methodologies that has had a great impact on proteome research is known as isotope-coded affinity tag (ICAT) peptide labeling (17). The method is based on a newly synthesized class of chemical reagents (ICATs) used in combination with tandem mass spectrometry. The ICAT reagent contains a biotin affinity tag and a thiol-specific reactive group, which are joined by a spacer domain that is available in two forms: regular and isotopically heavy, which includes eight deuterium atoms. First, a reduced protein mixture representing one cell state is derivatized with the isotopically light version of the ICAT reagent, while the corresponding reduced protein mixture representing a second cell state is derivatized with the isotopically heavy version of the ICAT reagent. Second, the labeled samples are combined and proteolytically digested to produce peptide fragments. Third, the tagged cysteine-containing peptide fragments are isolated by avidin affinity chromatography. Finally, the isolated tagged peptides are separated and analyzed by microcapillary tandem mass spectrometry.
There are, however, limitations associated with this approach: (i) the differential labeling reagents rely on stable isotopes, which are expensive and not readily adapted to multiplex differential labeling; (ii) the moieties attached to the original peptides have a mass of approximately 500 Daltons, which is heavier than some peptides and is likely to affect the peptide ionization and fragmentation processes; (iii) some bonds in the labeling reagent are weak compared to the amide bond, which might complicate the MS/MS spectrum; (iv) protein expression profiling is limited to duplex comparison; (v) the affinity interaction between biotin and avidin is too strong to release the immobilized peptide efficiently. In one aspect, the present invention provides a method for simultaneous identification and quantification of expression levels of individual proteins carrying certain functional groups in their side chains. The proteins may be analyzed in complex mixtures. The method is based on comparison of two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation. The samples of proteins are subjected to a sequence of manipulations including (i) proteolytic digestion into mixtures of peptides, (ii) treatment of the mixtures of peptides with chemical probes, (iii) washing away and discarding the unbound peptides from the mixtures, and (iv) cleaving the chemical probes, which releases into solution the peptides still carrying parts of the chemical probes. This sequence of manipulations may also include one or more auxiliary chemical and/or enzymatic modifications of functional groups in side chains and/or in the free termini of the proteins and/or peptides in order to achieve selective and the most favorable modification for the next steps in the protocol. The auxiliary modifications may be performed between any steps of the main sequence.
The core structure of the chemical probe consists of (i) a solid support, (ii) a spacer, (iii) a cleavable moiety, (iv) a differential mass labeling unit, and (v) a reactive group. The chemical probes perform three functions: (i) they attach peptides carrying specific functional groups in their side chains and/or termini to a solid support by forming covalent chemical bonds to the reactive group of the probe, (ii) they provide a means for selective cleavage of the attached peptide from the solid support such that a part of the probe still remains attached to the peptide, and (iii) they serve as differential labeling reagents. Differential labeling results from attaching chemical moieties of different mass but of similar properties to a protein or a peptide, such that peptides with the same sequence but with different labels are eluted together in the separation procedure and their ionization and detection properties in mass spectrometric analysis are very similar. The differential mass labeling unit remains covalently bound to the peptide after it is cleaved from the solid support part of the probe. Signals corresponding to peptides with the same sequence but marked with differential mass labels are assigned to different original protein samples. The auxiliary chemical and/or enzymatic modification can be used to introduce additional differential mass labels into the peptides. The reactive group on the chemical probe may be activated or modified by a bridging reagent prior to a reaction with mixtures of peptides. Such activation or modification provides greater flexibility in the design of the chemical probe, since the same core structure of a chemical probe may be tuned to increase reactivity and/or selectivity towards different functional groups in side chains and/or in termini of the peptides. After being cleaved from the solid support part of the chemical probe, the differentially labeled peptide mixtures are combined, subjected to multidimensional chromatographic separation, and analyzed by mass spectrometry methods.
Mass spectrometry data are processed by special software, which allows the composition and sequence of peptides in the mixture to be determined and traced back to the original proteins for their identification and quantification. This approach can be used for duplex or potentially multiplex protein expression profiling. The complexity of the sample is reduced by targeting peptides containing particular amino acids, which are selected by reaction with the chemical probes. Alternative aspects of this invention include: (i) design of solid phase-based differential mass labeling reagents for selective peptide modification; (ii) design of various kinds of differential mass units; (iii) combination of differential mass probes with various bridging reagents to target certain amino acids specifically; (iv) multiplex analysis; (v) combination of proteolytic digestion and chemical and/or enzymatic modifications in side chains and/or in termini of proteins and peptides in order to achieve selective and the most favorable modifications for the next steps in the protocol; (vi) combination of differential chemical labeling with MudPIT and, if necessary, possibly any other protein or peptide separation or purification technology. One aspect of this invention provides reagents and procedures for quantification of protein expression using a combination of selective differential peptide labeling and the mixed bed multi-dimensional liquid chromatographs of the invention, e.g., 3D LC MS/MS, 3D LC-LC MS/MS or LC-LCQ-MS/MS or LC-LTQ-MS/MS systems of the invention, as described herein. This invention overcomes the limitations inherent in traditional techniques. The basic approach described can be employed for quantitative analysis of protein expression in complex samples (such as cells, tissues, and fractions), the detection and quantitation of specific proteins in complex samples, and quantitative measurement of specific enzymatic activities in complex samples.
Technical description
1. Probe design: The solid support part of the chemical probe may consist of any of the following materials or any combination of them: gel, glass beads, magnetic beads, polymers, silicon wafer, membrane, or resin. The spacer between the solid phase part and the cleavable unit of the chemical probe may be included for convenience and improved yields in synthetic preparation of the chemical probe. The spacer may consist of a chain of 2 to 8 atoms, which can be C, O, N, B, Si, S, P, Se ..., covalently bound to each other. In order to satisfy the valence requirements, the atoms may carry hydrogen atoms, halogens, or one of the following groups containing up to 25 atoms: alkyl, hydroxy, alkoxy, amino, alkylamino... The spacer may contain cyclic moieties with or without heteroatoms and with or without substituents. The cleavable moiety provides a means for selective detachment of the solid phase part of the chemical probe from the differential mass label attached to the peptide. It is designed such that it can be cleaved by treating the probe with a chemical reagent or any kind of electromagnetic irradiation, photochemically, enzymatically, or thermally. Differential mass labeling units differ in molecular mass, but do not differ in retention properties with respect to the separation method used or in ionization and detection properties with respect to the mass spectrometry methods used. These moieties differ either in their isotope composition (isotopic labels) or structurally by a rather small fragment whose change does not alter the properties stated above (homologous labels).
The isotopic labels can be represented by the general formulae ZA and ZB, where ZA and ZB = R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-; Z1, Z2, Z3, and Z4 independently of one another can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR', S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR'+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR', (Si(RR')O)n, SnRR', Sn(RR')O, BR(OR'), BRR', B(OR)(OR'), OBR(OR'), OBRR', OB(OR)(OR'), or Z1 - Z4 may be absent; A1, A2, A3, and A4 independently of one another can be selected from (CRR')n, in which some single C-C bonds may be replaced with double or triple bonds, in which case some groups R and R' will be absent, o-arylene, m-arylene, p-arylene with up to 6 substituents, carbocyclic, bicyclic, or tricyclic fragments with up to 8 atoms in the cycle with or without heteroatoms (O, N, S) and with or without substituents, or A1 - A4 may be absent; R, R' independently from other R and R' in Z1 - Z4 and independently from other R and R' in A1 - A4 is hydrogen, halogen, an alkyl, alkenyl, alkynyl, or aryl group; n in Z1 - Z4 is independent of n in A1 - A4 and is a whole number that can have a value from 0 to 21. ZA can have the same structure as ZB, but they have different isotope composition.
For instance, if ZA contains x number of protons, ZB may contain y number of deuterons in the place of protons, and, correspondingly, x - y number of protons remaining; and/or if ZA contains x number of borons-10, ZB may contain y number of borons-11 in the place of borons-10, and, correspondingly, x - y number of borons-10 remaining; and/or if ZA contains x number of carbons-12, ZB may contain y number of carbons-13 in the place of carbons-12, and, correspondingly, x - y number of carbons-12 remaining; and/or if ZA contains x number of nitrogens-14, ZB may contain y number of nitrogens-15 in the place of nitrogens-14, and, correspondingly, x - y number of nitrogens-14 remaining; and/or if ZA contains x number of sulfurs-32, ZB may contain y number of sulfurs-34 in the place of sulfurs-32, and, correspondingly, x - y number of sulfurs-32 remaining; and so on for all elements which may be present and have different stable isotopes; x and y are whole numbers between 1 and 21 such that x is greater than y. An example of isotopic label pairs/series: (CD2)n / (CH2)n, where n = 0, 1, 2, ..., 21 (delta mass = 2n).
The homologous reagents can be represented by the general formulae ZA and ZB, where ZA and ZB = R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-; Z1, Z2, Z3, and Z4 independently of one another can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR', S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR'+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR', (Si(RR')O)n, SnRR', Sn(RR')O, BR(OR'), BRR', B(OR)(OR'), OBR(OR'), OBRR', OB(OR)(OR'), or Z1 - Z4 may be absent; A1, A2, A3, and A4 independently of one another can be selected from (CRR')n, in which some single C-C bonds may be replaced with double or triple bonds, in which case some groups R and R' will be absent, o-arylene, m-arylene, p-arylene with up to 6 substituents, carbocyclic, bicyclic, or tricyclic fragments with up to 8 atoms in the cycle with or without heteroatoms (O, N, S) and with or without substituents, or A1 - A4 may be absent; R, R' independently from other R and R' in Z1 - Z4 and independently from other R and R' in A1 - A4 is hydrogen, halogen, an alkyl, alkenyl, alkynyl, or aryl group; n in Z1 - Z4 is independent of n in A1 - A4 and is a whole number that can have a value from 0 to 21. ZA can have a similar structure to that of ZB, but ZA has x extra -CH2- fragment(s) in one or more A1 - A4 fragments; and/or ZA has x extra -CF2- fragment(s) in one or more A1 - A4 fragments; and/or if ZA contains x number of protons, ZB may contain y number of halogens in the place of protons, and, correspondingly, x - y number of protons remaining in one or more A1 - A4 fragments; and/or ZA has x extra -O- fragment(s) in one or more A1 - A4 fragments; and/or ZA has x extra -S- fragment(s) in one or more A1 - A4 fragments; and/or if ZA contains x number of -O- fragment(s), ZB may contain y number of -S- fragment(s) in the place of -O- fragment(s), and, correspondingly, x - y number of -O- fragment(s) remaining in one or more A1 - A4 fragments; and so on; x and y are whole numbers between 1 and 21 such that x is greater than y. An example of homologous label pairs/series: (CH2)n / (CH2)n+m, where n = 0, 1, 2, ..., 21; m = 1, 2, ..., 21 (delta mass = 14m).
Bridging and activating reagents
In alternative aspects, commercially available cross-linkers or custom designed cross-linkers are used.
a. Reactive site 1: probe specific
b. Reactive site 2: amino acid specific
Methods for peptide/protein separation and detection
On-line two-dimensional capillary LC ESI MS/MS (MudPIT), as described in the global differential profiling disclosure, or 1D LC ESI MS/MS, or MALDI MS.
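The label arithmetic given above (delta mass = 2n for the (CD2)n / (CH2)n series and 14m for the (CH2)n / (CH2)n+m series) can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of the disclosure; the monoisotopic atomic masses are standard values, and the nominal deltas of 2n and 14m are recovered by rounding.

```python
# Monoisotopic atomic masses in Daltons (standard values).
MASS_H = 1.00782503   # protium
MASS_D = 2.01410178   # deuterium
MASS_C = 12.0         # carbon-12


def isotopic_delta(n: int) -> float:
    """Mass difference between (CD2)n and (CH2)n labels: 2n H atoms become D."""
    return n * 2 * (MASS_D - MASS_H)


def homologous_delta(m: int) -> float:
    """Mass difference between (CH2)n+m and (CH2)n labels: m extra CH2 units."""
    return m * (MASS_C + 2 * MASS_H)


def mz_spacing(delta_mass: float, charge: int) -> float:
    """Spacing on the m/z axis between differentially labeled ions of one charge."""
    return delta_mass / charge


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"(CD2){n}/(CH2){n}: delta = {isotopic_delta(n):.4f} Da")
    for m in (1, 2):
        print(f"(CH2)n/(CH2)n+{m}: delta = {homologous_delta(m):.4f} Da")
    print(f"m/z spacing for n = 4 at charge 2+: {mz_spacing(isotopic_delta(4), 2):.4f}")
```

Because the spacing on the m/z axis shrinks with charge state, a pair-matching step would need the charge of each ion as well as the label delta.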
Sequence analysis and quantification
Peptides are quantified by measuring, in the MS mode, the relative signal intensities for pairs or series of peptide ions of identical sequence that are tagged differentially and therefore differ in mass by the mass differential encoded within the differential labeling reagents. Peptide sequence information is automatically generated by selecting peptide ions of a particular mass-to-charge (m/z) ratio for collision-induced dissociation (CID) in the mass spectrometer operating in the tandem MS mode (Link et al., Electrophoresis 18:1314-34 (1997); Gygi et al., Nature Biotechnol 17:994-9 (1999); Gygi et al., Cell Biol 19:1720-30 (1999)). The resulting tandem mass spectra can be correlated to sequence databases to identify the protein from which the sequenced peptide originated. Currently available commercial software packages are Turbo SEQUEST™ by ThermoFinnigan, MASCOT™ by Matrix Science, and SONAR™ MS/MS by ProteoMetrics. Special software development will be necessary for automated relative quantification.
Exemplary approaches for practicing the invention:
1. Protein sample preparation, which may include protein denaturation, reduction, and proteolytic digestion
2. Treatment of the probe with a desired activating or bridging reagent
3. Treatment of the activated probe with a mixture of peptides
4. Washing off unbound peptides, which do not have the targeted amino acid
5. Combining the modified, differentially labeled peptide mixtures
6. Releasing peptides by cleaving the probe (steps 5 and 6 can be switched)
7. Removing solvent or desalting if necessary
8. Redissolving peptides in LC loading buffer
9. LC ESI MS and MS/MS analysis or MALDI MS and MS/MS analysis
10. Database searching and data analysis
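The relative quantification described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the special software the text refers to: it assumes centroided MS peaks, a known charge state, and a known label mass difference, and the names `Peak` and `find_label_pairs` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Peak:
    mz: float         # mass-to-charge ratio of the peptide ion
    intensity: float  # signal intensity in MS mode


def find_label_pairs(peaks: List[Peak], delta_mass: float, charge: int,
                     tol: float = 0.01) -> List[Tuple[Peak, Peak, float]]:
    """Pair ions separated by delta_mass / charge and report heavy/light ratios.

    Assumption: both members of a pair carry the same charge and exactly one
    label, so their m/z values differ by delta_mass / charge within tol.
    """
    spacing = delta_mass / charge
    ordered = sorted(peaks, key=lambda p: p.mz)
    pairs = []
    for i, light in enumerate(ordered):
        if light.intensity <= 0:
            continue  # a ratio is undefined without a light signal
        for heavy in ordered[i + 1:]:
            diff = heavy.mz - light.mz
            if diff > spacing + tol:
                break  # peaks are sorted, so no later peak can match
            if abs(diff - spacing) <= tol:
                pairs.append((light, heavy, heavy.intensity / light.intensity))
    return pairs


if __name__ == "__main__":
    # Illustrative values only: an 8.0502 Da label difference at charge 2+.
    spectrum = [Peak(500.250, 1000.0), Peak(504.275, 2000.0), Peak(650.000, 300.0)]
    for light, heavy, ratio in find_label_pairs(spectrum, 8.0502, 2):
        print(f"{light.mz:.3f} / {heavy.mz:.3f}: heavy/light = {ratio:.2f}")
```

A series of more than two labels (multiplex profiling) could be handled by calling the same function once per label delta; that extension is left out to keep the sketch short.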
Metabolomics and lipidomics
The invention also incorporates holistic monitoring approaches, metabolomics and lipidomics, including profiling metabolite pools, carbohydrates, lipids, glycoproteins, and glycolipids. Various chromatographic methods and other qualitative and/or quantitative methods could be utilized to characterize lipid profiles.
In the area of metabolomics, methods that compare concentrations of metabolites/small molecules using a variety of chemical analysis tools, e.g., mass spectrometry, NMR, other spectroscopic techniques, or biosensors, could be utilized. For some specific method examples, see the following references: J. C. Lindon et al., Prog. NMR Spectr., 29, 1 (1996); J. C. Lindon et al., Drug. Met. Rev., 29, 705 (1997); B. Vogler et al., J. Nat. Prod., 61, 175 (1998); J. L. Wolfender et al., Curr. Org. Chem., 2, 575 (1998); and J. K. Nicholson et al., Xenobiotica, 29, 1181 (1999).
Screening tools
FACS
In one aspect, fluorescence activated cell sorting (FACS) methods are used for selection screening. In some instances a fluorescent molecule is made within a cell (e.g., green fluorescent protein). The cells producing the protein can simply be sorted by FACS. Gel microdrop technology allows screening of cells encapsulated in agarose microdrops (Weaver et al. Methods 2:234-247 (1991)). In this technique products secreted by the cell (such as antibodies or antigens) are immobilized with the cell that generated them. Sorting and collection of the drops containing the desired product thus also collects the cells that made the product, and provides a ready source for the cloning of the genes encoding the desired functions. Desired products can be detected by incubating the encapsulated cells with fluorescent antibodies (Powell et al. Bio/Technology 8:333-337 (1990)). FACS sorting can also be used with this technique to assay resistance to toxic compounds and antibiotics by selecting droplets that contain multiple cells (i.e., the product of continued division in the presence of a cytotoxic compound; Goguen et al. Nature 363:189-190 (1995)). This method can select for any enzyme that can change the fluorescence of a substrate that can be immobilized in the agarose droplet.
Reporter molecule
In some aspects of the invention, screening can be accomplished by assaying reactivity with a reporter molecule reactive with a desired feature of, for example, a gene product. Thus, specific functionalities such as antigenic domains can be screened with antibodies specific for those determinants.
Cell-cell indicator
In other aspects of the invention, screening is done with a cell-cell indicator assay. In this assay format, separate library cells (Cell A, the cell being assayed) and reporter cells (Cell B, the assay cell) are used. Only one component of the system, the library cells, is allowed to evolve.
The screening is generally carried out in a two-dimensional immobilized format, such as on plates. The products of the metabolic pathways encoded by these genes (in this case, usually secondary metabolites such as antibiotics, polyketides, carotenoids, etc.) diffuse out of the library cell to the reporter cell. The product of the library cell may affect the reporter cell in one of a number of ways. The assay system (indicator cell) can have a simple readout (e.g., green fluorescent protein, luciferase, beta-galactosidase) which is induced by the library cell product but which does not affect the library cell. In these examples the desired product can be detected by colorimetric changes in the reporter cells adjacent to the library cell.
Feedback mechanism In other aspects, indicator cells can in turn produce something that modifies the growth rate of the library cells via a feedback mechanism. Growth rate feedback can detect and accumulate very small differences. For example, if the library and reporter cells are competing for nutrients, library cells producing compounds to inhibit the growth of the reporter cells will have more available nutrients, and thus will have more opportunity for growth. This is a useful screen for antibiotics or a library of polyketide synthesis gene clusters where each of the library cells is expressing and exporting a different polyketide gene product.
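The statement that growth rate feedback can detect and accumulate very small differences can be made concrete with a simple worked model. The sketch below is an assumption-laden illustration and is not taken from the disclosure: it assumes discrete generations, a constant relative growth advantage for the favored library clone, and no other limits on growth. The function name `fraction_after` is hypothetical.

```python
def fraction_after(initial_fraction: float, advantage: float,
                   generations: int) -> float:
    """Fraction of the favored clone after a number of generations.

    Assumed model: the favored clone multiplies by (1 + advantage) per
    generation relative to all other cells, whose relative growth is 1.
    """
    gain = (1.0 + advantage) ** generations
    favored = initial_fraction * gain
    others = 1.0 - initial_fraction
    return favored / (favored + others)


if __name__ == "__main__":
    # A clone starting at 0.1% of the pool with a 5% growth advantage.
    for g in (0, 50, 100, 200):
        print(f"generation {g:3d}: fraction = {fraction_after(0.001, 0.05, g):.3f}")
```

Under these assumptions a small advantage compounds geometrically, which is why a growth-based screen can enrich clones whose benefit would be hard to measure in a single-generation assay.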
Screening secreted molecules
Another variation of this theme is that the reporter cell for an antibiotic selection can itself secrete a toxin or antibiotic that inhibits growth of the library cell. Production by the library cell of an antibiotic that is able to suppress growth of the reporter cell will thus allow uninhibited growth of the library cell. Conversely, if the library is being screened for production of a compound that stimulates the growth of the reporter cell (for example, in improving chemical syntheses, the library cell may supply nutrients such as amino acids to an auxotrophic reporter, or growth factors to a growth-factor-dependent reporter), the reporter cell in turn should produce a compound that stimulates the growth of the library cell. Interleukins, growth factors, and nutrients are possibilities. Further possibilities include competition based on ability to kill surrounding cells, and positive feedback loops in which the desired product made by the evolved cell stimulates the indicator cell to produce a positive growth factor for cell A, thus indirectly selecting for increased product formation. In some aspects of the invention it can be advantageous to use a different organism (or genetic background) for screening than the one that will be used in the final product. For example, markers can be added to DNA constructs used for recursive sequence recombination to make the microorganism dependent on the constructs during the improvement process, even though those markers may be undesirable in the final recombinant microorganism. Likewise, in some aspects it is advantageous to use a different substrate for screening an evolved enzyme than the one that will be used in the final product. For example, Evnin et al. (Proc. Natl. Acad. Sci. U.S.A.
87:6659-6663 (1990)) selected trypsin variants with altered substrate specificity by requiring that variant trypsin generate an essential amino acid for an arginine auxotroph by cleaving arginine beta-naphthylamide. This is thus a selection for arginine-specific trypsin, with the growth rate of the host being proportional to that of the enzyme activity. The pool of cells surviving screening and/or selection is enriched for recombinant genes conferring the desired phenotype (e.g. altered substrate specificity, altered biosynthetic ability, etc.). Further enrichment can be obtained, if desired, by performing a second round of screening and/or selection without generating additional diversity. The recombinant gene or pool of such genes surviving one round of screening/selection forms one or more of the substrates for a second round of recombination. Again, recombination can be performed in vivo or in vitro by any of the recursive sequence recombination formats described above. If recursive sequence recombination is performed in vitro, the recombinant gene or genes to form the substrate for recombination should be extracted from the cells in which screening/selection was performed. Optionally, a subsequence of such gene or genes can be excised for more targeted subsequent recombination. If the recombinant gene(s) are contained within episomes, their isolation presents no difficulties. If the recombinant genes are chromosomally integrated, they can be isolated by amplification primed from known sequences flanking the regions in which recombination has occurred. Alternatively, whole genomic DNA can be isolated, optionally amplified, and used as the substrate for recombination. Small samples of genomic DNA can be amplified by whole genome amplification with degenerate primers (Barrett et al. Nucleic Acids Research 23:3488-3492 (1995)). These primers result in a large amount of random 3' ends, which can undergo homologous recombination when reintroduced into cells.
If the second round of recombination is to be performed in vivo, as is often the case, it can be performed in the cell surviving screening/selection, or the recombinant genes can be transferred to another cell type (e.g., a cell type having a high frequency of mutation and/or recombination). In this situation, recombination can be effected by introducing additional DNA segment(s) into cells bearing the recombinant genes. In other methods, the cells can be induced to exchange genetic information with each other by, for example, electroporation. In some methods, the second round of recombination is performed by dividing a pool of cells surviving screening/selection in the first round into two subpopulations. DNA from one subpopulation is isolated and transfected into the other population, where the recombinant gene(s) from the two subpopulations recombine to form a further library of recombinant genes. In these methods, it is not necessary to isolate particular genes from the first subpopulation or to take steps to avoid random shearing of DNA during extraction. Rather, the whole genome of DNA sheared or otherwise cleaved into manageable sized fragments is transfected into the second subpopulation. This approach is particularly useful when several genes are being evolved simultaneously and/or the location and identity of such genes within the chromosome are not known. The second round of recombination is sometimes performed exclusively among the recombinant molecules surviving selection. However, in other aspects, additional substrates can be introduced. The additional substrates can be of the same form as the substrates used in the first round of recombination, i.e., additional natural or induced mutants of the gene or cluster of genes, forming the substrates for the first round. Alternatively, the additional substrate(s) in the second round of recombination can be exactly the same as the substrate(s) in the first round of replication.
After the second round of recombination, recombinant genes conferring the desired phenotype are again selected. The selection process proceeds essentially as before. If a suicide vector bearing a selective marker was used in the first round of selection, the same vector can be used again. Again, a cell or pool of cells surviving selection is selected. If a pool of cells, the cells can be subject to further enrichment.
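The repeated cycle of recombination followed by screening/selection described above can be sketched as a toy simulation. Everything here is an illustrative assumption rather than the disclosed method: genes are modeled as lists of 0/1 values, the phenotype score is simply the number of 1 values, recombination is a single crossover between two survivors, and the names `fitness`, `recombine`, and `evolve` are hypothetical.

```python
import random
from typing import List, Tuple

Gene = List[int]


def fitness(gene: Gene) -> int:
    """Toy phenotype score: count of favorable positions (assumption)."""
    return sum(gene)


def recombine(a: Gene, b: Gene, rng: random.Random) -> Gene:
    """Single-crossover recombinant of two parent genes of equal length."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(pool: List[Gene], rounds: int, keep: int,
           rng: random.Random) -> Tuple[List[Gene], List[int]]:
    """Run rounds of selection followed by recombination among the survivors.

    keep must be at least 2 so that two distinct survivors can recombine.
    Returns the final pool and the best score seen at each selection step.
    """
    best_per_round = []
    for _ in range(rounds):
        # Screening/selection: keep the highest-scoring genes.
        survivors = sorted(pool, key=fitness, reverse=True)[:keep]
        best_per_round.append(fitness(survivors[0]))
        # Recombination: survivors form the substrates for the next library.
        children = [recombine(*rng.sample(survivors, 2), rng)
                    for _ in range(len(pool) - keep)]
        pool = survivors + children
    return pool, best_per_round


if __name__ == "__main__":
    rng = random.Random(0)
    start = [[rng.randint(0, 1) for _ in range(20)] for _ in range(30)]
    _, history = evolve(start, rounds=8, keep=6, rng=rng)
    print("best score per round:", history)
```

Because survivors are carried into the next pool unchanged, the best score in this sketch can never decrease from one round to the next; introducing additional substrates in later rounds, as the text allows, would correspond to adding new genes to `pool` between rounds.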
Screening for various potential applications
Novel drugs: identifying targets
The invention relates to procedures that can be applied to identifying compounds that bind to and modulate the function of target components of a cell whose function is known or unknown, and cell components that are not amenable to other screening methods. The invention relates to generating and/or identifying a compound that binds to and modulates (inhibits or enhances) the function of a component of a cell, thereby producing a phenotypic effect in the cell. Such a screen may involve identifying a biomolecule that 1) binds to, in vitro, a component of a cell that has been isolated from other constituents of the cell and that 2) causes, in vivo, as seen in an assay upon intracellular expression of the biomolecule, a phenotypic effect in the cell which is the usual producer and host of the target cell component. In an assay demonstrating characteristic 2) above, intracellular production of the biomolecule can be in cells grown in culture or in cells introduced into an animal. Further methods within these procedures are those methods comprising an assay for a phenotypic effect in the cell upon intracellular production of the biomolecule, either in cells in culture or in cells that have been introduced into one or more animals, and an assay to identify one or more compounds that behave as competitors of the biomolecule in an assay of binding to the target cell component. The target cell component in this aspect and in other aspects not limited to pathogens can be one that is found in mammalian cells, especially cells of a type found to cause or contribute to disease or the symptoms of disease (e.g., cells of tumors or cells of other types of hyperproliferative disorders).
Process for identifying one or more compounds that produce a phenotypic effect on a cell In one aspect, the invention provides a process for identifying one or more compounds that produce a phenotypic effect on a cell. The process is at the same time a method for target validation. The process is characterized by identifying a biomolecule which binds an isolated target cell component, constructing cells comprising the target cell component and further comprising a gene encoding the biomolecular binder which can be expressed to produce the biomolecular binder, testing the constructed cells for their ability to produce, upon expression of the gene encoding the biomolecular binder, a phenotypic effect in the cells (e.g., inhibition of growth), wherein the test of the constructed cells can be a test of the cells in culture or a test of the cells after introducing them into host animals, or both, and further, identifying, for a biomolecular binder that caused the phenotypic effect, one or more compounds that compete with the biomolecular binder for binding to the target cell component. A test of the constructed cells after introducing them into host animals is especially well-suited to assessing whether a biomolecular binder can produce a particular phenotype by the expression (regulatable by the researcher) of a gene encoding the biomolecular binder. In this method, cells are constructed which have a gene encoding the biomolecular binder, and wherein the biomolecular binder can be produced by regulation of expression of the gene. The constructed cells are introduced into a set of animals. Expression of the gene encoding the biomolecular binder is regulated in one group of the animals (test animals) such that the biomolecular binder is produced. In another group of animals, the gene encoding the biomolecular binder is regulated such that the biomolecular binder is not produced (control animals). 
The cells in the two groups of animals are monitored for a phenotypic change (for example, a change in growth rate). If the phenotypic change is observed in cells in the test animals and not in the cells in the control animals, or to a lesser extent in the control animals, then the biomolecular binder has been proven to be effective in binding to its target cell component under in vivo conditions. One aspect of the invention is a method for determining whether a target cell component of a particular cell type (a "first cell") is essential to producing a phenotypic effect on the first cell, the method having the steps: isolating the target component of the first cell; identifying a biomolecular binder of the isolated target component of the first cell; constructing a second type of cells ("second cell") comprising the target component and a regulable, exogenous gene encoding the biomolecular binder; and testing the second cell in culture for an altered phenotypic effect, upon production of the biomolecular binder in the second cell; whereby, if the second cell shows the altered phenotypic effect upon production of the biomolecular binder, then the target component of the first cell is essential to producing the phenotypic effect on the first cell. The target cell component in this aspect and in other aspects not limited to pathogens can be one that is found in mammalian cells, especially cells of a type found to cause or contribute to disease or the symptoms of disease (e.g., cells of tumors or cells of other types of hyperproliferative disorders).
Identifying a biomolecular inhibitor of growth of pathogen cells
One aspect of the invention is a method for identifying a biomolecular inhibitor of growth of pathogen cells by using cell culture techniques, comprising contacting one or more types of biomolecules with isolated target cell component of the pathogen, applying a means of detecting bound complexes of biomolecules and target cell component, whereby, if the bound complexes are detected, one or more types of biomolecules have been identified as a biomolecular binder of the target cell component, constructing a pathogen strain having a regulatable gene encoding the biomolecular binder, regulating expression of the gene encoding the biomolecular binder to express the gene; and monitoring growth of the pathogen cells in culture relative to suitable control cells, whereby, if growth of the pathogen cells is decreased compared to growth of suitable control cells, then the biomolecule is a biomolecular inhibitor of growth of the pathogen cells.
Identifying compounds that inhibit infection of a mammal by a pathogen
Another aspect of the invention is a method, employing an animal test, for identifying one or more compounds that inhibit infection of a mammal by a pathogen by binding to a target cell component, comprising constructing a pathogen comprising a regulatable gene encoding a biomolecule which binds to the target cell component, infecting test animals with the pathogen, regulating expression of the regulatable gene to produce the biomolecule, monitoring the test animals and suitable control animals for signs of infection, wherein observing fewer or less severe signs of infection in the test animals than in suitable control animals indicates that the biomolecule is a biomolecular inhibitor of infection, and identifying one or more compounds that compete with the biomolecular inhibitor of growth for binding to the target cell component (as by employing a competitive binding assay), whereby a compound that so competes is identified as one that inhibits infection of a mammal by the pathogen by binding to the target cell component. The competitive binding assay to identify binding analogs of biomolecular binders, which have been proven to bind to their targets in an intracellular test of binding, can be applied to any target for which a biomolecular binder has been identified, including targets whose function is unknown or targets for which other types of assays are not easily developed and performed. Therefore, the method of the invention offers the advantage of decreasing assay development time when using a gene product of known function as a target cell component and the advantage of bypassing the major hurdle of gene function identification when using a gene product of unknown function as a target cell component.
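As an illustration only, the readout of a competitive binding assay of the kind referred to above could be scored as in the following minimal sketch. It assumes that the biomolecular binder carries a detectable label, that one signal value is recorded per candidate compound, and that a reference signal without any compound and a background signal are available. The function names, the 50% cutoff, and the example values are hypothetical and are not taken from the specification.

```python
def percent_displacement(signal_with_compound, signal_no_compound, background=0.0):
    """Percent of labeled binder displaced from the target by a compound.

    Signals are in arbitrary units; background is the signal measured
    without target (an assumed control, not specified in the text).
    """
    bound_reference = signal_no_compound - background
    if bound_reference <= 0:
        raise ValueError("reference signal must exceed background")
    bound = signal_with_compound - background
    return 100.0 * (1.0 - bound / bound_reference)


def find_competitors(readings, signal_no_compound, background=0.0, threshold=50.0):
    """Return names of compounds displacing at least `threshold` percent.

    `readings` maps a compound name to its measured signal. The
    threshold is a hypothetical cutoff chosen for illustration.
    """
    return [
        name
        for name, signal in readings.items()
        if percent_displacement(signal, signal_no_compound, background) >= threshold
    ]


# Hypothetical example values, not experimental data.
example_readings = {"compound_A": 300.0, "compound_B": 950.0}
hits = find_competitors(example_readings, signal_no_compound=1000.0, background=100.0)
```

In this sketch a compound is treated as a competitor when it reduces the specifically bound signal by at least the chosen cutoff; in practice the cutoff and controls would depend on the assay format.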
Other aspects of the invention are cells comprising a biomolecule and a target cell component, wherein the biomolecule is produced by expression of a regulable gene, and wherein the biomolecule modulates function of the target cell component, thereby causing a phenotypic change in the cells. Yet other aspects are cells comprising a biomolecule and a target cell component, wherein the biomolecule is a biomolecular binder of the target cell component, and is encoded by a regulatable gene. The cells can include mammalian cells or cells of a pathogen, for instance, and the phenotypic change can be a change in growth rate. The pathogen can be a species of bacteria, yeast, fungus, or parasite, for example.
Intracellular validation of a biomolecule
The invention provides methods that result in the identification of compounds that cause a phenotypic effect on a cell. The general steps described herein to find a compound for drug development can be thought of as these: (1) identifying a biomolecule that can bind to an isolated target cell component in vitro, (2) confirming that the biomolecule, when produced in cells with the target cell component, can cause a desired phenotypic effect and (3) identifying, by an in vitro screening method, for example, compounds that compete with the biomolecule for binding to the target cell component. Central to these methods is general step (2) above, intracellular validation of a biomolecule comprising one or more steps that determine whether a biomolecule can cause a phenotypic effect on a cell, when the biomolecule is produced by the expression (which can be regulatable) of a gene in the cell. As used in general step (2), a biomolecule is a gene product (e.g., polypeptide, RNA, peptide or RNA oligonucleotide) of an exogenous gene, that is, a gene which has been introduced in the course of construction of the cell. Biomolecules that bind to and alter the function of a candidate target are identified by various in vitro methods. Upon production of the biomolecule within a cell either in vitro or within an animal model system, the biomolecule binds to a specific site on the target, alters its intracellular function, and hence produces a phenotypic change (e.g., cessation of growth, cell death). When the biomolecule is produced in engineered pathogen cells in an animal model of infection, cessation of growth or death of the engineered pathogen cells leads to the clearing of infection and animal survival, demonstrating the importance of the target in infection and thereby validating the target.
A further aspect of this invention provides for (1) identifying a biomolecule that produces a phenotypic effect on a cell (wherein the cell can be, for instance, a pathogen cell or a mammalian cell) and (2) simultaneous intracellular target validation.
Methods for identifying compounds that inhibit the growth of cells having a target cell component
The invention includes methods for identifying compounds that inhibit the growth of cells having a target cell component. The target cell component can first be identified as essential to the growth of the cells in culture and/or under conditions in which it is desired that the growth of the cells be inhibited. These methods can be applied, for example, to various types of cells that undergo abnormal or undesirable proliferation, including cells of neoplasms (tumors or growths, either benign or malignant) which, as known in the art, can originate from a variety of different cell types. Such cells can be referred to, for example, as being from adenomas, carcinomas, lymphomas or leukemias. The method can also be applied to cells that proliferate abnormally in certain other diseases, such as arthritis, psoriasis or autoimmune diseases. If intracellular expression of the biomolecular binder inhibits the function of a target essential for growth (presumably by binding to the target at a biologically relevant site), cells monitored in step (2) will exhibit a slow growth or no growth phenotype. Targets found to be essential for growth by these methods are validated starting points for drug discovery, and can be incorporated into assays to identify more stable compounds that bind to the same site on the target as the biomolecule. Where the cells are pathogen cells and the desired phenotypic change to be monitored is inhibition of growth, the invention provides a procedure to examine the activity of target (pathogen) cell components in an animal infection model.
Study as a target cell component a gene product of a particular cell type
In the course of this method, it may be decided to study as a target cell component a gene product of a particular cell type (e.g., a type of pathogenic bacteria), wherein the target cell component is already known as being encoded by a characterized gene, as a potential target for a modulator to be identified. In this case, the target cell component can be isolated directly from the cell type of interest, assuming suitable culture methods are available to grow a sufficient number of cells, using methods appropriate to the type of cell component to be isolated (e.g., protein purification methods such as differential precipitation, ion exchange chromatography, gel chromatography, affinity chromatography, HPLC).
Target cell component can be produced recombinantly
Alternatively, the target cell component can be produced recombinantly, which requires that the gene encoding the target cell component be isolated from the cell type of interest. This can be done by any number of methods, for example known methods such as PCR, using template DNA isolated from the pathogen or a DNA library produced from the pathogen DNA, and using primers based on known sequences or combinations of known and unknown sequences within or external to the chosen gene. See, for example, methods described in "The Polymerase Chain Reaction," Chapter 15 of Current Protocols in Molecular Biology, (Ausubel, F.M. et al., eds), John Wiley & Sons, New York, 1998. Other methods include cloning a gene from a DNA library (e.g., a cDNA library from a eukaryotic pathogen) into a vector (e.g., plasmid, phage, phagemid, virus, etc.) and applying a means of selection or screening to clones resulting from a transformation of vectors (including a population of vectors now having inserted genes) into appropriate host cells.
The screening method can take advantage of properties given to the host cells by the expression of the inserted chosen gene (e.g., detection of the gene product by antibodies directed against it, detection of an enzymatic activity of the gene product), or can detect the presence of the gene itself (for instance, by methods employing nucleic acid hybridization). For methods of cloning genes in E. coli, which also may be applicable to cloning in other bacterial species, see, for example, "Escherichia coli, Plasmids and Bacteriophages," Chapter 1 of Current Protocols in Molecular Biology, (Ausubel, F.M. et al., eds), John Wiley & Sons, New York, 1998. For methods applicable to cloning genes of eukaryotic origin, see Chapter 5 ("Construction of Recombinant DNA Libraries"), Chapter 9 ("Introduction of DNA Into Mammalian Cells") and Chapter 6 ("Screening of Recombinant DNA Libraries") of Current Protocols in Molecular Biology, (Ausubel, F.M. et al., eds), John Wiley & Sons, New York, 1998.
Target proteins can be expressed with E. coli or other prokaryotic gene expression systems, or in eukaryotic gene expression systems. Since many eukaryotic proteins carry unique modifications that are required for their activities, e.g., glycosylation and methylation, protein expression can in some cases be better carried out in eukaryotic systems, such as yeast, insect, or mammalian cells that can perform these modifications. Examples of these expression systems have been reviewed in the following literature: Methods in Enzymology, Volume 185, eds D.V. Goeddel, Academic Press, San Diego, 1990; Geisse et al., Protein Expression and Purification 8:271-282, 1996; Simonsen and McGrogan, Biologicals 22:85-94; Jones and Morikawa, Current Opinions in Biotechnologies 7:512-516, 1996; Possee, Current Opinions in Biotechnologies 8:569-572. Where a gene encoding a chosen target cell component has not been isolated previously, but is thought to exist because homologs of the gene product are known in other species, the gene can be identified and cloned by a method such as that used in Shiba et al., US 5,759,833, Shiba et al., US 5,629,188, Martinis et al., US 5,656,470 and Sassanfar et al., US 5,756,327.
Method should be used with target cell components which have not been previously isolated or characterized and whose functions are unknown
It is an advantage of the target validation method that it can be used with target cell components which have not been previously isolated or characterized and whose functions are unknown.
In this case, a segment of DNA containing an open reading frame (ORF; a cDNA can also be used, as appropriate to a eukaryotic cell) which has been isolated from a cell of a type that is to be an object of drug action (e.g., tumor cell, pathogen cell) can be cloned into a vector, and the target gene product of the ORF can be produced in host cells harboring the vector. The gene product can be purified and further studied in a manner similar to that of a gene product that has been previously isolated and characterized. In some cases, the open reading frame (in some cases, cDNA) can be isolated from a source of DNA of the cells of interest (genomic DNA or a library, as appropriate), and inserted into a fusion protein or fusion polypeptide construct. This construct can be a vector comprising a nucleic acid sequence which provides a control region (e.g., promoter, ribosome binding site) and a region which encodes a peptide or polypeptide portion of the fusion polypeptide, wherein the polypeptide encoded by the fusion vector endows the fusion polypeptide with one or more properties that allow for the purification of the fusion polypeptide. For example, the vector can be one from the pGEX series of plasmids (Pharmacia) designed to produce fusions with glutathione S-transferase.
Host cells
The isolated DNA having an open reading frame, whether encoding a known or an as yet unidentified gene product, when inserted into an expression construct, can be expressed to produce the target cell component in host cells. Host cells can be, for example, Gram-negative bacterial cells such as Escherichia coli, Gram-positive bacterial cells such as Bacillus subtilis or Bacillus anthracis, or yeast cells such as Saccharomyces cerevisiae, Schizosaccharomyces pombe or Pichia pastoris. In one aspect, the target cell component to be used in target validation studies can be produced in a host that is genetically related to the pathogen from which the gene encoding it was isolated.
For example, for a Gram-negative bacterial pathogen, an E. coli host is preferred over a Pichia pastoris host. The target cell component so produced can then be isolated from the host cells. Many protein purification methods are known that separate proteins on the basis of, for instance, size, charge, or affinity for a binding partner (e.g., for an enzyme, a binding partner can be a substrate or substrate analog), and these methods can be combined in a sequence of steps by persons of skill in the art to produce an effective purification scheme. For methods to manipulate RNA, see, for example, Chapter 4 in Current Protocols in Molecular Biology (Ausubel, F.M. et al., eds), John Wiley & Sons, New York, 1998. An isolated cell component or a fusion protein comprising the cell component can be used in a test to identify one or more biomolecular binders of the isolated product (general step (1)). A biomolecular binder of a target cell component can be identified by in vitro assays that test for the formation of complexes of target and biomolecular binder noncovalently bound to each other. For example, the isolated target can be contacted with one or more types of biomolecules under conditions conducive to binding, the unbound biomolecules can be removed from the targets, and a means of detecting bound complexes of biomolecules and targets can be applied. The detection of the bound complexes can be facilitated by having either the potential biomolecular binders or the target labeled or tagged with an adduct that allows detection or separation (e.g., radioactive isotope or fluorescent label; streptavidin, avidin or biotin affinity label). Alternatively, both the potential biomolecular binders and the target can be differentially labeled. For examples of such methods see, e.g., WO 98/19162.
Biomolecules to be tested and means for detection
The biomolecules to be tested for binding to a target can be from a library of candidate biomolecular binders (e.g., a peptide or oligonucleotide library). For example, a peptide library can be displayed on the coat protein of a phage (see, for examples of the use of genetic packages such as phage display libraries, Koivunen, E. et al., J. Biol. Chem. 268:20205-20210 (1993)). The biomolecules can be detected by means of a chemical tag or label attached to or integrated into the biomolecules before they are screened for binding properties. For example, the label can be a radioisotope, a biotin tag, or a fluorescent label. Those molecules that are found to bind to the target molecule can be called biomolecular binders.
Fusion proteins
An isolated target cell component, an antigenically similar portion thereof, or a suitable fusion protein comprising all or a portion of the target can be used in a method to select and identify biomolecules which bind specifically to the target. Where the target cell component comprises a protein, fusion proteins comprising all of, or a portion of, the target linked to a second moiety not occurring in the target as found in nature can be prepared for use in another aspect of the method. Suitable fusion proteins for this purpose include those in which the second moiety comprises an affinity ligand (e.g., an enzyme, antigen, epitope). The fusion proteins can be produced by the insertion of a gene encoding a target or a suitable portion of such gene into a suitable expression vector, which encodes an affinity ligand (e.g., pGEX-4T-2 and pET-15b, encoding glutathione S-transferase and His-Tag affinity ligands, respectively). The expression vector can be introduced into a suitable host cell for expression.
Host cells are lysed and the lysate, containing fusion protein, can be bound to a suitable affinity matrix by contacting the lysate with an affinity matrix under conditions sufficient for binding of the affinity ligand portion of the fusion protein to the affinity matrix.
Fusion protein can be immobilized
In one aspect, the fusion protein can be immobilized on a suitable affinity matrix under conditions sufficient to bind the affinity ligand portion of the fusion protein to the matrix, and is contacted with one or more candidate biomolecules (e.g., a mixture of peptides) to be tested as biomolecular binders, under conditions suitable for binding of the biomolecules to the target portion of the bound fusion protein. Next, the affinity matrix with bound fusion protein can be washed with a suitable wash buffer to remove unbound biomolecules and non-specifically bound biomolecules. Biomolecules which remain bound can be released by contacting the affinity matrix with fusion protein bound thereto with a suitable elution buffer. Wash buffer can be formulated to permit binding of the fusion protein to the affinity matrix, without significantly disrupting binding of specifically bound biomolecules. In this aspect, elution buffer can be formulated to permit retention of the fusion protein by the affinity matrix, but can be formulated to interfere with binding of the test biomolecule(s) to the target portion of the fusion protein. For example, a change in the ionic strength or pH of the elution buffer can lead to release of biomolecules, or the elution buffer can comprise a release component or components designed to disrupt binding of biomolecules to the target portion of the fusion protein. Immobilization can be performed prior to, simultaneous with, or after contacting the fusion protein with biomolecule, as appropriate.
Various permutations of the method are possible, depending upon factors such as the biomolecules tested, the affinity matrix-ligand pair selected, and elution buffer formulation. For example, after the wash step, fusion protein with biomolecules bound thereto can be eluted from the affinity matrix with a suitable elution buffer (a matrix elution buffer, such as glutathione for a GST fusion). Where the fusion protein comprises a cleavable linker, such as a thrombin cleavage site, cleavage from the affinity ligand can release a portion of the fusion with the biomolecules bound thereto. Bound biomolecule can then be released from the fusion protein or its cleavage product by an appropriate method, such as extraction.
Various methods to identify biomolecular binders
In one aspect, one or more candidate biomolecular binders can be tested simultaneously. Where a mixture of biomolecules is tested, the biomolecules selected by the foregoing processes can be separated (as appropriate) and identified by suitable methods (e.g., PCR, sequencing, chromatography). Large libraries of biomolecules (e.g., peptides, RNA oligonucleotides) produced by combinatorial chemical synthesis or other methods can be tested (see, e.g., Ohlmeyer, M.H.J. et al., Proc. Natl. Acad. Sci. USA 90:10922-10926 (1993) and DeWitt, S.H. et al., Proc. Natl. Acad. Sci. USA 90:6909-6913 (1993), relating to tagged compounds; see also Rutter, W.J. et al., U.S. Patent No. 5,010,175; Huebner, V.D. et al., U.S. Patent No. 5,182,366; and Geysen, H.M., U.S. Patent No. 4,833,092). Random sequence RNA libraries (see Ellington, A.D. et al., Nature 346:818-822 (1990); Bock, L.C. et al., Nature 355:564-566 (1992); and Szostak, J.W., Trends in Biochem. Sci. 17:89-93 (March, 1992)) can also be screened according to the present method to select RNA molecules which bind to a target.
Where biomolecules selected from a combinatorial library by the present method carry unique tags, identification of individual biomolecules by chromatographic methods is possible. Where biomolecules do not carry tags, chromatographic separation, followed by mass spectrometry to ascertain structure, can be used to identify individual biomolecules selected by the method, for example. Other methods to identify biomolecular binders of a target cell component can be used. For example, the two-hybrid system or interaction trap is an in vivo system that can be used to identify polypeptides, peptides or proteins (candidate biomolecular binders) that bind to a target protein. In this system, both candidate biomolecular binders and target cell component proteins are produced as fusion proteins. The two-hybrid system and variations on it have been described (US 5,283,173 and US 5,468,614; Golemis, E.A. et al., pages 20.1.1-20.1.35 in Current Protocols in Molecular Biology, F.M. Ausubel et al., eds., John Wiley and Sons, containing supplements up through Supplement 40, 1997; two-hybrid systems available from Clontech, Palo Alto, CA). Once one or more biomolecular binders of a cell component have been identified, further steps can be combined with those taken to identify the biomolecular binder, to identify those biomolecular binders that produce a phenotypic effect on a cell (where "a cell" can mean cells of a cell strain or cell line). Thus, a method for identifying a biomolecule that produces a phenotypic effect on a first cell can comprise the steps of identifying a biomolecular binder of an isolated target cell component of the first cell, constructing a second cell comprising the target cell component and a regulable exogenous gene encoding the biomolecular binder, and testing the second cell for the phenotypic effect, upon production of the biomolecular binder in the second cell, where the second cell can be maintained in culture or introduced into an experimental animal.
If the second cell shows the phenotypic effect upon intracellular production of the biomolecular binder, then a biomolecule that produces a phenotypic effect on the first cell has been identified. Testing the second cell is general step (2) of the invention, as the three general steps were outlined above.
Host cells: Engineered to control expression
Host cells (also, "second cells" in the terminology used above) of the cell type (e.g., species of pathogenic bacteria) the target was isolated from (or the gene encoding the target was originally isolated from, if the target is produced by recombinant methods), can be engineered to harbor a gene that can regulatably express the biomolecular binder (e.g., under an inducible or repressible promoter). The ability to regulate the expression of the biomolecular binder is desirable because constitutive expression of the biomolecular binder could be lethal to the cell. Therefore, inducible or regulated expression gives the researcher the ability to control if and when the biomolecular binder is expressed. The gene expressing the biomolecular binder can be present in one or more copies, either on an extrachromosomal structure, such as on a single or multicopy plasmid, or integrated into the host cell genome. Plasmids that provide an inducible gene expression system in pathogenic organisms can be used. For example, plasmids allowing tetracycline-inducible expression of a gene in Staphylococcus aureus have been developed.
Genes for expression
For intracellular expression of a biomolecule to be tested for its phenotypic effect in a eukaryotic cell (e.g., mammalian cell), the genes for expression can be carried on plasmid-based or virus-based vectors, or on a linear piece of DNA or RNA.
For examples of expression vectors, see Hosfield and Lu, Biotechniques: 306-309, 1998; Stephens and Cockett, Nucleic Acid Research 17:7110, 1989; Wohlgemuth et al., Gene Therapy 3:503-512, 1996; Ramirez-Solis et al., Gene 87:291-294, 1990; Dirks et al., Gene 149:387-388, 1994; Chenaalvala et al., Current Opinion in Biotechnologies 2:718-722, 1991; Methods in Enzymology, Volume 185, (D.V. Goeddel, ed.) Academic Press, San Diego, 1990. The genetic material can be introduced into cells using a variety of techniques, including whole cell or protoplast transformation, electroporation, calcium phosphate-DNA precipitation or DEAE-dextran transfection, liposome mediated DNA or RNA transfer, or transduction with recombinant viral or retroviral vectors. Expression of the gene can be constitutive (e.g., ADH1 promoter for expression in S. cerevisiae (Bennetzen, J.L. and Hall, B.D., J. Biol. Chem. 257:3026-3031 (1982)), or CMV immediate early promoter and RSV LTR for mammalian expression) or inducible, as with the inducible GAL1 promoter in yeast (Davis, L.I. and Fink, G.R., Cell 61:965-978 (1990)). A variety of inducible systems can be utilized; for example, the E. coli Lac repressor/operator system and Tn10 Tet repressor/operator systems have been engineered to govern regulated expression in organisms from bacterial to mammalian cells. Regulated gene expression can also be achieved by activation. For example, gene expression governed by HIV LTR can be activated by HIV or SIV Tat proteins in human cells; GAL4 promoter can be activated by galactose in a nonglucose-containing medium. The location of the biomolecule binder genes can be extrachromosomal or chromosomally integrated. The chromosome integration can be mediated through homologous or nonhomologous recombination.
For proper localization in the cells, it may be desirable to tag the biomolecule binders with certain peptide signal sequences (for example, nuclear localization signal (NLS) sequences, mitochondria localization sequences). Secretion sequences have been well documented in the art.
Fused biomolecular binders
For presentation of the biomolecular binders in the intracellular system, they can be fused N-terminally, C-terminally, or internally in a carrier protein (if the biomolecular binder is a peptide), and can be fused (5', 3' or internally) in a carrier RNA or DNA molecule (if the biomolecular binder is a nucleic acid). The biomolecular binder can be presented with a protein or nucleic acid structural scaffold. Certain linkages (e.g., a 4-glycine linker for a peptide or a stretch of A's for an RNA) can be inserted between the biomolecular binder and the carrier proteins or nucleic acids. In such engineered cells, the effect of this biomolecular binder on the phenotype of the cells can be tested, as a manifestation of the binding effect (implying binding to a functionally relevant site and thus an activating or, more likely, an inhibitory effect) of the biomolecular binder on the target used in an in vitro binding assay as described above. An intracellular test can not only determine which biomolecular binders have a phenotypic effect on the cells, but at the same time can assess whether the target in the cells is essential for maintaining the normal phenotype of the cells. For example, a culture of the engineered cells expressing a biomolecular binder can be divided into two aliquots. The first aliquot ("test" cells) can be treated in a suitable manner to regulate (e.g., induce or release repression of, as appropriate) the gene encoding the biomolecular binder, such that the biomolecular binder is produced in the cells. The second aliquot ("control" cells) can be left untreated so that the biomolecular binder is not produced in the cells.
In a variation of this method of testing the effect of a biomolecular binder on the phenotype of the cells, a different strain of cells, not having a gene that can express the biomolecular binder, can be used as control cells. The phenotype of the cells in each culture ("test" and "control" cells grown under the same conditions, other than the expression of the biomolecular binder) can then be monitored by a suitable means (e.g., assaying an enzymatic activity, monitoring a product of a biosynthetic pathway, using an antibody to test for the presence of a cell surface antigen, etc.). Where the change in phenotype is a change in growth rate, the growth of the cells in each culture ("test" and "control" cells grown under the same conditions, other than the expression of the biomolecular binder) can be monitored by a suitable means (e.g., turbidity of liquid cultures, cell count, etc.). If the extent of growth or rate of growth of the test cells is less than the extent of growth or rate of growth of the control cells, then the biomolecular binder can be concluded to be an inhibitor of the growth of the cells, or a biomolecular inhibitor. If the phenotype of the test cells is altered relative to that of the control cells, then the biomolecular binder can be concluded to be one that causes a phenotypic effect. In an optional additional test, isolated target cell component having a known function (e.g., an enzyme activity) can be tested for modulation of this known function in the presence of biomolecular binder under conditions conducive to binding of the biomolecular binder to the target cell component. Positive results in these tests should encourage the investigator to continue in the drug discovery process with efforts to find a more stable compound (than a peptide, polypeptide or RNA biomolecule) that mimics the binding properties of the biomolecular binder on the tested target cell component.
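By way of illustration, the comparison of growth in "test" and "control" cultures described above could be carried out as in the following minimal sketch. It assumes that growth is followed by turbidity (optical density) readings taken at the same time points for both cultures during exponential growth. The function names, the 0.8 ratio cutoff, and the example readings are hypothetical and are not part of the specification.

```python
import math


def growth_rate(times_h, od_values):
    """Estimate the exponential growth rate (per hour).

    Computed as the least-squares slope of ln(OD) versus time.
    """
    logs = [math.log(v) for v in od_values]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    den = sum((t - mean_t) ** 2 for t in times_h)
    return num / den


def is_growth_inhibitor(times_h, test_od, control_od, max_ratio=0.8):
    """Return True if test cells grow slower than control cells.

    `max_ratio` is a hypothetical cutoff: the test growth rate must be
    below this fraction of the control growth rate.
    """
    return growth_rate(times_h, test_od) < max_ratio * growth_rate(times_h, control_od)


# Hypothetical readings: control doubles every hour, test grows 20% per hour.
times = [0, 1, 2, 3]
control = [0.05 * 2.0 ** t for t in times]
test = [0.05 * 1.2 ** t for t in times]
inhibits = is_growth_inhibitor(times, test, control)
```

The cutoff guards against calling small differences an effect; a real experiment would use replicate cultures and a statistical comparison rather than a fixed ratio.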
Engineering strain of cells
A further test can, again, employ an engineered strain of cells that comprise both the target cell component and one or more genes encoding a biomolecule tested to be a biomolecular binder of the target cell component. The cells of the cell strain can be tested in animals to see if regulable expression of the biomolecular binder in the engineered cells produces an observable or testable change in phenotype of the cells. Both the "in culture" test for the effect of intracellular expression of the biomolecular binder and the "in animal" test (described below) for the effect of intracellular expression of the biomolecular binder can be applied not only towards drug discovery in the categories of antimicrobials and anticancer agents, but also towards the discovery of therapeutic agents to treat inflammatory diseases, cardiovascular diseases, diseases associated with metabolic pathways, and diseases associated with the central nervous system, for example. Where the engineered strain of cells is a strain of pathogen cells or tumor cells, the object of the test is to see whether production of the biomolecular binder in the engineered strain inhibits growth of these cells after their introduction into an animal (infection of the animal, in the case of an engineered pathogen). Such a test can not only determine which biomolecular binders are inhibitors of growth of the cells, but at the same time can assess whether the target in the cells is essential for maintaining growth of the cells (infection, for a pathogenic organism) in a host mammal. Suitable animals for such an experiment are, for example, mammals such as mice, rats, rabbits, guinea pigs, dogs, pigs, and the like. Small mammals can be used for reasons of convenience. The engineered cells are introduced into one or more animals ("test" animals) and into one or more animals in a separate group ("control" animals) by a route appropriate to cause symptoms of systemic or local growth of the engineered cells.
The route of introduction may be, for example, by oral feeding, by inhalation, or by subdermal, intramuscular, intravenous, or intraperitoneal injection, as appropriate to the desired result. After the cell strain has been introduced into the test and control animals, expression of the gene encoding the biomolecular binder is regulated to allow production of the biomolecular binder in the engineered pathogen cells. This can be achieved, for instance, by administering to the test animals a treatment appropriate to the regulation system built into the cells, to cause the gene encoding the biomolecular binder to be expressed. The same treatment is not administered to the control animals, but the conditions under which they are maintained are otherwise identical to those of the test animals. The treatment to express the gene encoding the biomolecular binder can be the administration of an inducer substance (where expression of the biomolecular binder gene is under the control of an inducible promoter) or the functional removal of a repressor substance (where expression of the biomolecular binder gene is under the control of a repressible promoter). After such treatment, the test and control animals can be monitored for a phenotypic effect in the introduced cells. Where the introduced cells are constructed pathogen cells, the animals can be monitored for signs of infection (as the simplest endpoint, death of the animal, but also, e.g., lethargy, lack of grooming behavior, hunched posture, not eating, diarrhea or other discharges; bacterial titer in samples of blood or other cultured fluids or tissues). In the case of testing engineered tumor cells, the test and control animals can be monitored for the development of tumors or for other indicators of the proliferation of the introduced engineered cells.
If the test animals are observed to exhibit less growth of the introduced cells than the control animals, then the biomolecule can be also called a biomolecular inhibitor of growth, or biomolecular inhibitor of infection, as appropriate, as it can be concluded that the expression in vivo of the biomolecular inhibitor is the cause of the relative reduction in growth of the introduced cells in the test animals.
In vitro assays
In alternative aspects, further steps of the procedure involve in vitro assays to identify one or more compounds that have binding and activating or inhibitory properties that are similar to those of the biomolecules which have been found to have a phenotypic effect, such as inhibition of growth. That is, compounds that compete for binding to a target cell component with the biomolecule would then be structural analogs of the biomolecules. Assays to identify such compounds can take advantage of known methods to identify competing molecules in a binding assay. These steps comprise general step (3) of the method. In one method to identify such compounds, a biomolecular inhibitor (or activator) can be contacted with the isolated target cell component to allow binding, one or more compounds can be added to the milieu comprising the biomolecular inhibitor and the cell component under conditions that allow interaction and binding between the cell component and the biomolecular inhibitor, and any biomolecular inhibitor that is released from the cell component can be detected.
Fluorescence
One suitable system that allows the detection of released biomolecular inhibitor (or activator) is one in which fluorescence polarization of molecules in the milieu can be measured. The biomolecular inhibitor can have bound to it a fluorescent tag or label such as fluorescein or fluorescein attached to a linker. Assays for inhibition of the binding of the biomolecular inhibitor to the cell component can be done in microtiter plates to conveniently test a set of compounds at the same time. In such assays, a majority of the fluorescently labeled biomolecular inhibitor must bind to the protein in the absence of competitor compound to allow for the detection of small changes in the bound versus free probe population when a compound which is a competitor with a biomolecular inhibitor is added (B.A. Lynch, et al., Analytical Biochemistry 247:77-82 (1997)).
If a compound competes with the biomolecular inhibitor for a binding site on the target cell component, then fluorescently labeled biomolecular inhibitor is released from the target cell component, lowering the polarization measured in the milieu.
Radioactive isotope
In a further method for identifying one or more compounds that compete with a biomolecular inhibitor (or activator) for a binding site on a target cell component, the target cell component can be attached to a solid support, contacted with one or more compounds, and contacted with the biomolecular inhibitor. One or more washing steps can be employed to remove biomolecular inhibitor and compound not bound to the cell component. Either the biomolecular inhibitor bound to the target cell component or the compound bound to the target cell component can be measured. Detection of biomolecular inhibitor or compound bound to the cell component can be facilitated by the use of a label on either molecule type, wherein the label can be, for instance, a radioactive isotope either incorporated into the molecule itself or attached as an adduct, streptavidin or biotin, a fluorescent label, or a substrate for an enzyme that can produce from the substrate a colored or fluorescent product. An appropriate means of detection of the labeled biomolecular inhibitor or compound moiety of the biomolecular inhibitor-cell component complex or the compound-cell component complex can be applied. For example, a scintillation counter can be used to measure radioactivity. Radiolabeled streptavidin or biotin can be allowed to bind to biotin or streptavidin, respectively, and the resulting complexes detected in a scintillation counter. Alkaline phosphatase conjugated to streptavidin can be added to a biotin-labeled biomolecular inhibitor or compound. Detection and quantitation of a biotin-labeled complex can then be by addition of pNPP, a substrate of alkaline phosphatase, and detection by spectrophotometry of a product which absorbs light at a wavelength of 405 nm. A fluorescent label can also be used, in which case detection of fluorescent complexes can be by a fluorometer. Models are available that can read multiple samples, as in a microtiter plate.
For example, in one type of assay, the method for identifying compounds comprises attaching the target cell component to a solid support, contacting the biomolecular inhibitor with the target cell component under conditions suitable for binding of the biomolecular inhibitor to the cell component, removing unbound biomolecular inhibitor from the solid support, contacting one or more compounds (e.g., a mixture of compounds) with the cell component under conditions suitable for binding of the biomolecular inhibitor to the cell component, and testing for unbound biomolecular inhibitor released from the cell component, whereby if unbound biomolecular inhibitor is detected, one or more compounds that displace or compete with the biomolecular inhibitor for a particular site on the target cell component have been identified. Other methods for identifying compounds that are competitive binders with the biomolecule for a target can employ adaptations of fluorescence polarization methods. See, for instance, Anal. Biochem. 253(2):210-218 (1997), Anal. Biochem. 249(1):29-36 (1997), BioTechniques 17(3):585-589 (1994) and Nature 373:254-256 (1995). Those compounds that bind competitively to the target cell component can be considered to be drug candidates. Further appropriate testing can confirm that those compounds which bind competitively with biomolecular inhibitors (or activators) possess the same activity as seen in an intracellular test of the effect of the biomolecular inhibitor or activator upon the phenotype of cells. Derivatives of these compounds having modifications to confer improved solubility, stability, etc., can also be tested for a desired phenotypic effect.
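As an illustration of how a fluorescence polarization readout from a competition assay of this kind might be reduced to a displacement figure, the following Python sketch uses the standard definition of polarization, P = (I_par - G*I_perp) / (I_par + G*I_perp), reported in millipolarization units. The function names, the instrument G-factor default, the well values and the linear interpolation between "fully bound" and "fully free" reference wells are assumptions made for the example; the linear interpolation in particular is only an approximation.

```python
def polarization_mP(i_parallel, i_perpendicular, g_factor=1.0):
    """Fluorescence polarization in millipolarization units (mP).

    g_factor is the instrument correction factor; 1.0 is a placeholder.
    """
    num = i_parallel - g_factor * i_perpendicular
    den = i_parallel + g_factor * i_perpendicular
    return 1000.0 * num / den

def percent_displacement(mp_sample, mp_bound, mp_free):
    """Approximate percent of labeled inhibitor displaced by a competitor.

    mp_bound: reading with target and labeled inhibitor, no competitor.
    mp_free:  reading with labeled inhibitor alone.
    Assumes polarization varies linearly between the two references.
    """
    return 100.0 * (mp_bound - mp_sample) / (mp_bound - mp_free)

# Hypothetical plate readings (mP) for two candidate compounds.
mp_bound, mp_free = 250.0, 50.0
wells = {"compound_A": 150.0, "compound_B": 245.0}
for name, mp in wells.items():
    print(name, percent_displacement(mp, mp_bound, mp_free))
```

A compound that lowers the polarization toward the free-probe value, like the hypothetical compound_A here, would be carried forward as a candidate competitive binder.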
Combining steps
Combining steps for testing the phenotypic effects of a biomolecule, as can be produced in an intracellular test, with steps for identifying compounds that compete with the biomolecule for sites on a target cell component, yields a method for identifying a compound which is a functional analog of a biomolecule which produces a phenotypic effect on a cell. These steps can be to test, for the phenotypic effect, either in culture or in an animal model, or in both, a cell which produces a biomolecule by regulatable expression of an exogenous gene in the cell, and to identify, if the biomolecule caused the phenotypic effect, one or more compounds that compete with the biomolecule for binding to a target cell component. If a compound is found to compete with the biomolecule for binding to the target cell component, then the compound is a functional analog of a biomolecule which produces a phenotypic effect on the cell. Such a functional analog can cause a qualitatively similar effect on the cell, to a similar, lesser or greater degree than the biomolecule.
Method for determining whether a target component of a cell is essential to producing a phenotypic effect on the cell
A further aspect of the invention combining general steps (1) and (2) is a method for determining whether a target component of a cell is essential to producing a phenotypic effect on the cell, comprising isolating the target component from the cell, identifying a biomolecular binder of the isolated target component of the cell, constructing a second cell comprising the target component and a regulable, exogenous gene encoding the biomolecular binder, and testing the second cell in culture for an altered phenotypic effect, upon production of the biomolecular binder in the second cell, whereby, if the second cell shows the altered phenotypic effect upon production of the biomolecular binder, then the target component of the first cell is essential to producing the phenotypic effect on the first cell.
Inhibit the proliferation of the cells
The methods described herein are well suited to the identification of compounds that can inhibit the proliferation of the cells of infectious agents such as bacteria, fungi and the like. In addition, a procedure such as the one outlined below can be used in the identification of compounds to inhibit the proliferation of cancer cells. The two procedures described below further illustrate the use of the methods described herein and would provide proof of principle of these methods with a known target for anticancer therapy. Mammalian dihydrofolate reductase (DHFR) is a proven target for anticancer therapy. Methotrexate (MTX) is one of many existing drugs that inhibit DHFR. It is widely used for anticancer chemotherapy. NIH 3T3 is a mouse fibroblast cell line that is able to develop spontaneous transformed cells when cultured in a low concentration (2%) of calf serum in molecular, cellular and developmental biology medium 402 (MCDB 402) (M. Chow and H. Rubin, Proc. Natl. Acad. Sci. USA 95(8):4550-4555 (1998)).
The transformed cells, which can be selectively inhibited by MTX (Chow and Rubin), are isolated. Both the normal and transformed NIH 3T3 cells are transfected with pTet-On plasmid (Clontech; Palo Alto, CA). Stable cell lines that express high levels of reverse tetracycline-controlled activator (rtTA) are isolated and characterized for their normal or transformed phenotype (Chow and Rubin). The DHFR gene (Genbank Accession # L26316) from the NIH 3T3 cell line is amplified by reverse transcription-PCR (RT-PCR) using poly(A)+ RNA isolated from NIH 3T3 cells (Sambrook, J. et al., Molecular Cloning: A Laboratory Manual, 2nd edition, Cold Spring Harbor Laboratory Press, 1989). Active DHFR is expressed using the BacPAK Baculovirus Expression System (Clontech) or other appropriate systems. The expressed DHFR is purified and biotinylated and subjected to peptide binder identification as exemplified for bacterial proteins. The identified peptides are biochemically characterized for in vitro inhibition of DHFR activity. Peptides that inhibit DHFR are identified. A nucleic acid encoding each peptide can be cloned into a vector such as pGEX-4T2 (Pharmacia) to yield a vector which encodes a fusion polypeptide having the peptide fused to the N-terminus of GST. This can also be done by PCR amplification as exemplified herein for the peptide Pro-3. The fusion genes are cloned into plasmid pTRE (Clontech) for regulated expression. The constructed plasmid or the vector is co-transfected with pTK-Hyg into the stable NIH 3T3 cell line that expresses rtTA. The resulting cell lines, termed 3T3N-VITA (normal 3T3 cells that express rtTA and the DHFR inhibitory peptides), 3T3T-VITA (transformed 3T3 cells that express rtTA and the DHFR inhibitory peptides), or 3T3T-VITA control (transformed 3T3 cells that express rtTA and GST), are characterized for their normal or transformed phenotype (loss of contact inhibition, change in morphology, immortalization, etc.).
10^2-10^[?] of 3T3T-VITA or 3T3T-VITA control cells are mixed with 10^5 3T3N-VITA cells and are grown in MCDB 402 medium with 10% calf serum at 37°C for three days. Tetracycline is added to the medium to a final concentration of 0 to 1 µg/ml. In a control, 200 nM of MTX is added. The cultures are incubated for an additional eight days, and the number of foci formed is counted as described by M. Chow and H. Rubin, Proc. Natl. Acad. Sci. USA 95(8):4550-4555 (1998). Peptides that specifically inhibit foci formation of 3T3 transformed cells are identified.
A murine model of fibroblastoma (Kogerman, P. et al., Oncogene (12):1407-1416 (1997)) is used for evaluating the DHFR/peptide combination for identification of compounds for cancer therapy. Various amounts of 3T3T-VITA or 3T3T-VITA control cells (10^3, 10^4, 10^5, 10^6 cells) are injected subcutaneously into 5 groups (10 in each group) of athymic nude mice (4-6 weeks old, 18-22 g) to determine the minimal dose needed for development of fibroblastomas in all of the tested animals. Upon determination of the minimal tumorigenic dose, 6 groups of athymic nude mice (10 each) are injected subcutaneously (s.c.) with the minimal tumorigenic dose of 3T3T-VITA or 3T3T-VITA control cells to develop fibroblastoma. One week after injection, group 1 mice start receiving MTX s.c. at 2 mg/kg/day as a positive control, groups 2 to 5 start receiving 1, 2, 5, or 10 mg/kg/day of tetracycline, and group 6 starts receiving saline (vehicle) as a control. Five weeks after the introduction of cells, all of the mice are sacrificed and tumors are removed from them. Tumor mass is measured and compared among the groups. An effective peptide identified by these in vivo experiments can be used for screening libraries of compounds to identify those compounds that competitively bind to DHFR. One mechanism of tumorigenesis is overexpression of proto-oncogenes such as Ha-ras (reviewed by Suarez, H.G., Anticancer Research 9(5):1331-1343 (1989)). Compounds that inhibit the activities of the products of such proto-oncogenes can be used for cancer chemotherapy. What follows is a further illustration of the methods described herein, as applied to mammalian cells. Transgenic mice that overexpress human Ha-ras have been produced. Such transgenic mice develop salivary and/or mammary adenocarcinomas (Nielsen, L.L. et al., In Vivo 8(5):1331-1343 (1994)). Secondary transgenic mice that express rtTA can be generated using the pTet-On plasmid from Clontech.
Human Ha-ras open reading frame cDNA (Genbank Accession #GO0277) is amplified by RT-PCR using poly(A)+ RNA isolated from human mammary gland or other tissues. Active Ha-ras is expressed using the BacPAK Baculovirus Expression System (Clontech) or other appropriate systems. The expressed Ha-ras is purified and biotinylated and subjected to peptide binder identification as exemplified herein for bacterial proteins as target cell components. The identified peptides are biochemically characterized for in vitro inhibition of Ha-ras GTPase activity. Peptides that inhibit Ha-ras are cloned into plasmid pTRE (Clontech) for regulated expression as an N-terminal fusion of GST. Such constructs are used to generate tertiary transgenic mice using the secondary transgenic mice. Transgenic mice that are able to overexpress peptide genes are identified by Northern and Western analysis. Control mice that express GST are also identified. Various doses of tetracycline are administered to the tertiary transgenic mice by s.c. or i.p. injection before or after tumor onset. Prevention or regression of tumors resulting from expression of the peptide genes is analyzed as described above for murine fibroblastoma. Peptides found to be effective in in vivo experiments will be used to screen compounds that inhibit human Ha-ras activity for cancer therapy.
Disease targets
The method of the invention can be applied more generally to mammalian diseases caused by: (1) loss or gain of protein function, (2) over-expression or loss of regulation of protein activity. In each case the starting point is the identification of a putative protein target or metabolic pathway involved in the disease. The protocol can sometimes vary with the disease indication, depending on the availability of cell culture and animal model systems to study the disease. In all cases the process can deliver a validated target and assay combination to support the initiation of drug discovery.
Appropriate disease indications include, but are not limited to, Alzheimer's, arthritis, cancer, cardiovascular diseases, central nervous system disorders, diabetes, depression, hypertension, inflammation, obesity and pain. Appropriate protein targets putatively linked to disease indications include, but are not limited to (1) the leptin protein, putatively linked to obesity and diabetes; (2) a mitogen-activated protein kinase putatively linked to arthritis, osteoporosis and atherosclerosis; (3) the interleukin-1 beta converting protein putatively linked to arthritis, asthma and inflammation; (4) the caspase proteins putatively linked to neurodegenerative diseases such as Alzheimer's, Parkinson's and stroke, and (5) the tumor necrosis factor protein putatively linked to obesity and diabetes. Appropriate protein targets include also, but are not limited to, enzymes catalyzing the following types of reactions: (1) oxido-reductases, (2) transferases, (3) hydrolases, (4) lyases, (5) isomerases, and (6) ligases. The arachidonic acid pathway constitutes one of the main mechanisms for the production of pain and inflammation. The pathway produces different classes of end products, including the prostaglandins, thromboxane and leukotrienes. Prostaglandins, an end product of cyclooxygenase metabolism, modulate immune function, mediate vascular phases of inflammation and are potent vasodilators. The major therapeutic action of aspirin and other non-steroidal anti-inflammatory drugs (NSAIDs) is proposed to be inhibition of the enzyme cyclooxygenase (COX). Anti-inflammatory potencies of different NSAIDs have been shown to be proportional to their action as COX inhibitors. It has also been shown that COX inhibition produces toxic side effects such as erosive gastritis and renal toxicity. The knowledge base regarding the toxic side effects of COX inhibitors has been gained through years of monitoring human therapies and human suffering.
Two kinds of COX enzymes are now known to exist, with inhibition of COX1 related to toxicity, and inhibition of COX2 related to reduction of inflammation. Thus, selective COX2 inhibition is a desirable characteristic of new anti-inflammatory drugs. The method of the invention can provide a route from identification of potential drug targets, to validating these targets (for example, COX1 and COX2) as playing a role in disease (pain and inflammation), to an examination of the phenotype for the inhibition of one or both target isozymes without human suffering. Importantly, this information can be collected in vivo. As an alternative strategy, the method of the invention can be used to define the phenotype of "genes of unknown function" obtained from various human genome sequencing projects or to assess the phenotype resulting from inhibition of one isozyme subtype or one member of a family of related protein targets.
Definitions
Target: (also, "target component of a cell," or "target cell component") a constituent of a cell which contributes to and is necessary for the production or maintenance of a phenotype of the cell in which it is found. A target can be a single type of molecule or can be a complex of molecules. A target can be the product of a single gene, but can also be a complex comprising more than one gene product (for example, an enzyme comprising alpha and beta subunits, mRNA, tRNA, ribosomal RNA or a ribonucleoprotein particle such as a snRNP). Targets can be the product of a characterized gene (gene of known function) or the product of an uncharacterized gene (gene of unknown function).
Target Validation: the process of determining whether a target is essential to the maintenance of a phenotype of the cell type in which the target normally occurs.
For example, for pathogenic bacteria, researchers developing antimicrobials want to know if a compound which is potentially an antimicrobial agent not only binds to a target in vitro, but also binds to, and modulates the function of, a target in the bacteria in vivo, and especially under the conditions in which the bacteria are producing an infection, that is, those conditions under which the antimicrobial agent must work to inhibit bacterial growth in an infected animal or human. If such compounds can be found that bind to a target in vitro and alter the target's function in cells resulting in an altered phenotype, as found by testing cells in culture and/or as found by testing cells in an animal, then the target is validated.
Phenotypic Effect: a change in an observable characteristic of a cell which can include, e.g., growth rate, level or activity of an enzyme produced by the cell, sensitivity to various agents, antigenic characteristics, and level of various metabolites of the cell. A phenotypic effect can be a change away from wild type (normal) phenotype, or can be a change towards wild type phenotype, for example. A phenotypic effect can be the causing or curing of a disease state, especially where mammalian cells are referred to herein. For cells of a pathogen or tumor cells, especially, a phenotypic effect can be the slowing of growth rate or cessation of growth.
Biomolecule: a molecule which can be produced as a gene product in cells that have been appropriately constructed to comprise one or more genes encoding the biomolecule. Production of the biomolecule can be turned on, when desired, by an inducible promoter. A biomolecule can be a peptide, polypeptide, or an RNA or RNA oligonucleotide, a DNA or DNA oligonucleotide, but is preferably a peptide. The same biomolecules can also be made synthetically. For peptides, see Merrifield, J., J. Am. Chem. Soc. 85: 2140-2154 (1963).
For instance, an Applied Biosystems 431A Peptide Synthesizer (Perkin Elmer) can be used for peptide synthesis. Biomolecules produced as gene products intracellularly are tested for their interaction with a target in the intracellular steps described herein (tests performed with cells in culture and tests performed with cells that have been introduced into animals). The same biomolecules produced synthetically are tested for their binding to an isolated target in an initial in vitro method described herein. Synthetically produced biomolecules can also be used for a final step of the method for finding compounds that are competitive binders of the target.
Biomolecular Binder (of a target): a biomolecule which has been tested for its ability to bind to an isolated target cell component in vitro and has been found to bind to the target.
Biomolecular Inhibitor of Growth: a biomolecule which has been tested for its ability to inhibit the growth of cells constructed to produce the biomolecule in an "in culture" test of the effect of the biomolecule on growth of the cells, and has been found, in fact, to inhibit the growth of the cells in this test in culture.
Biomolecular Inhibitor of Infection: a biomolecule which has been tested for its ability to ameliorate the effects of infection, and has been found to do so. In the test, pathogen cells constructed to regulably express the biomolecule are introduced into one or more animals, the gene encoding the biomolecule is regulated so as to allow production of the biomolecule in the cells, and the effects of production of the biomolecule are observed in the infected animals compared to one or more suitable control animals.
Isolated: term used herein to indicate that the material in question exists in a physical milieu distinct from that in which it occurs in nature.
For example, an isolated target cell component of the invention may be substantially isolated with respect to the complex cellular milieu in which it naturally occurs. The absolute level of purity is not critical, and those skilled in the art can readily determine appropriate levels of purity according to the use to which the material is to be put. In many circumstances the isolated material will form part of a composition (for example, a more or less crude extract containing other substances), buffer system or reagent mix. In other circumstances, the material may be purified to essential homogeneity, for example as determined by PAGE or column chromatography (for example, HPLC).
Pathogen or Pathogenic Organism: an organism which is capable of causing disease, detectable by signs of infection or symptoms characteristic of disease. Pathogens can include prokaryotes (which include, for example, medically significant Gram-positive bacteria such as Streptococcus pneumoniae, Enterococcus faecalis and Staphylococcus aureus, Gram-negative bacteria such as Escherichia coli, Pseudomonas aeruginosa and Klebsiella pneumoniae, and "acid-fast" bacteria such as Mycobacteria, especially M. tuberculosis), eukaryotes such as yeast and fungi (for example, Candida albicans and Aspergillus fumigatus) and parasites. It should be recognized that pathogens can include such organisms as soil-dwelling organisms and "normal flora" of the skin, gut and orifices, if such organisms colonize and cause symptoms of infection in a human or other mammal, by abnormal proliferation or by growth at a site from which the organism cannot usually be cultured.
Methods for simultaneously identifying individual proteins in complex mixtures of biological molecules
The invention provides compositions (e.g., mixed bed multidimensional liquid chromatographs) and methods for simultaneously identifying individual proteins in complex mixtures of biological molecules and quantifying the expression levels of those proteins, e.g., proteome analyses. The methods compare two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation. The proteins in the standard and investigated samples are subjected separately to a series of chemical modifications, i.e., differential chemical labeling, and fragmentation, e.g., by proteolytic digestion and/or other enzymatic reactions or physical fragmenting methodologies. The chemical modifications can be done before, or after, or before and after fragmentation/digestion of the polypeptide into peptides. Peptides derived from the standard and the investigated samples are labeled with chemical residues of different mass, but of similar properties, such that peptides with the same sequence from both samples are eluted together in the separation procedure and their ionization and detection properties in mass spectrometry are very similar. Differential chemical labeling can be performed on reactive functional groups on some or all of the carboxy- and/or amino-termini of proteins and peptides and/or on selected amino acid side chains. A combination of chemical labeling, proteolytic digestion and other enzymatic reaction steps, physical fragmentation and/or fractionation can provide access to a variety of residues to generate different specifically labeled peptides to enhance the overall selectivity of the procedure. The standard and the investigated samples are combined, subjected to multidimensional chromatographic separation, and analyzed by mass spectrometry methods.
Mass spectrometry data is processed by special software, which allows for identification and quantification of peptides and proteins. Depending on the complexity and composition of the protein samples, it may be desirable, or necessary, to perform protein fractionation using such methods as size exclusion, ion exchange, reverse phase, or other methods of affinity purification prior to one or more chemical modification steps, proteolytic digestion or other enzymatic reaction steps, or physical fragmentation steps. The combined mixtures of peptides are first separated by a chromatography method, such as a multidimensional liquid chromatography system, before being fed into a coupled mass spectrometry device, such as a tandem mass spectrometry device. The combination of multidimensional liquid chromatography and tandem mass spectrometry can be called "LC-LC-MS/MS." LC-LC-MS/MS was first developed by Link A. and Yates J. R., as described, e.g., by Link (1999) Nature Biotechnology 17:676-682; Link (1999) Electrophoresis 18:1314-1334. In practicing the methods of the invention, proteins can be first substantially or partially isolated from the biological samples of interest. The polypeptides can be treated before selective differential labeling; for example, they can be denatured or reduced, preparations can be desalted, and the like. Conversion of samples of proteins into mixtures of differentially labeled peptides can include preliminary chemical and/or enzymatic modification of side groups and/or termini; proteolytic digestion or fragmentation; and post-digestion or post-fragmentation chemical and/or enzymatic modification of side groups and/or termini. The differentially modified polypeptides and peptides are then combined into one or more peptide mixtures. Solvent or other reagents can be removed, neutralized or diluted, if desired or necessary.
The buffer can be modified, or the peptides can be redissolved in one or more different buffers, such as a "MudPIT" (see below) loading buffer. The peptide mixture is then loaded onto a chromatography column, such as a liquid chromatography column, a 2D capillary column or a multidimensional chromatography column, to generate an eluate. The eluate is fed into a mass spectrograph, such as a tandem mass spectrograph. In one aspect, an LC ESI MS and MS/MS analysis is complete. Finally, data output is processed by appropriate software using database searching and data analysis. In practicing the methods of the invention, high yields of peptides can be generated for mass spectrograph analysis. Two or more samples can be differentially labeled by selective labeling of each sample. Peptide modifications, i.e., labeling, are stable. Reagents having differing masses or reactive groups can be chosen to maximize the number of reactive groups and differentially labeled samples, thus allowing for a multiplex analysis of samples, polypeptides and peptides. In one aspect, a "MudPIT" protocol is used for peptide analysis, as described herein. The methods of the invention can be fully automated and can analyze essentially every protein in a sample.
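The quantification idea described above (the same peptide from the standard and investigated samples carries labels of different mass, co-elutes, and is compared by signal) can be sketched as follows. This is a minimal illustration, not the software referred to in the specification: the function name, the peak-list layout, the mass tolerance and the 4.0 Da label difference are hypothetical values chosen for the example.

```python
def pair_labeled_peaks(peaks, label_delta, tol=0.01):
    """Pair co-eluting peaks whose masses differ by the label mass difference.

    peaks: list of (neutral_mass_Da, intensity) tuples from one elution window.
    label_delta: mass difference between the heavier and lighter label (Da).
    Returns a list of (lighter_mass, heavier_to_lighter_intensity_ratio).
    """
    ordered = sorted(peaks)
    pairs = []
    for i, (m_light, int_light) in enumerate(ordered):
        for m_heavy, int_heavy in ordered[i + 1:]:
            if abs((m_heavy - m_light) - label_delta) <= tol:
                pairs.append((m_light, int_heavy / int_light))
                break  # take the first partner found for this lighter peak
    return pairs

# Hypothetical peak list: two labeled pairs plus one unpaired peak.
peaks = [
    (1000.50, 2.0e5), (1004.50, 4.0e5),   # investigated sample is 2x standard
    (1500.75, 1.0e5), (1504.75, 5.0e4),   # investigated sample is 0.5x standard
    (1800.00, 3.0e4),                     # no partner at the label spacing
]
for mass, ratio in pair_labeled_peaks(peaks, label_delta=4.0):
    print(mass, ratio)
```

Here the lighter label is assumed to mark the standard sample, so each ratio reads as investigated over standard; peptides with several labeled sites would show multiples of the label spacing, which this sketch does not handle.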
High throughput, comparative proteome characterization
The invention provides apparatus (e.g., mixed bed multi-dimensional liquid chromatographs) and methods for high throughput, comparative proteome characterization. The invention provides a broad-based method for global profiling of protein expression, which is a combination of differential peptide labeling and multidimensional chromatography coupled with mass spectrometry for separation, identification and quantification. Proteins are identified in complex mixtures with rapid speed, high sensitivity and accurate quantitative information. First, using sets of labeling tags and modification methods, proteins are differentially and efficiently modified with stable and flexible labeling. Second, by combination with multidimensional Liquid Chromatography (LC) of the invention (e.g., mixed bed multi-dimensional liquid chromatographs) and tandem mass spectrometry, the invention provides methods for accurate and sensitive comparative proteomics in complex systems. The invention provides compositions (e.g., mixed bed multidimensional liquid chromatographs) and methods for high throughput, comparative proteome characterization. The goal is to provide a broad-based method for global profiling of protein expression, which is a combination of differential peptide labeling and multi-dimensional chromatography coupled with mass spectrometry for separation, identification and quantification. This method significantly improves over traditional methods. Proteins are identified in complex mixtures with rapid speed, high sensitivity and accurate quantitative information. First, by designing a set of labeling tags and modification methods, the invention provides novel approaches for modifying proteins differentially and efficiently with stable and flexible labeling.
Second, by combination with multidimensional liquid chromatography (LC) and tandem mass spectrometry, the methods provide the speed and sensitivity for accurate comparative proteomics in complex systems. In alternative aspects, the invention provides:
Differential peptide labeling: compare various modifications and identify the top candidate(s); optimize reaction conditions for the desired peptide/protein modification; method validation.
Optimization of the Multi-dimensional Protein Identification Technique (MudPIT) procedure for high throughput differential proteome profiling: reliable protein preparation; optimization of peptide separation and analysis; method validation on model protein mixtures.
The invention provides a high throughput proteomics technology with high speed, high efficiency and accurate quantitation, which can be employed for quantitative analysis of global protein expression in complex samples, and for the detection and quantitation of specific proteins in complex samples. An exemplary high throughput, comparative proteomics method uses a model pathway study of Streptomyces diversa (S. diversa). The use of mass spectrometry to identify proteins whose sequences are present in either DNA or protein databases is well established and integrated into the field of proteomics. One goal of proteomics is to define the expressed proteins associated with a given cellular state, and another goal is to quantify changes in protein expression between cellular states. Many techniques have been developed to achieve these goals (see below). The present invention provides a non-gel based method of simultaneously identifying individual proteins in complex protein mixtures and globally quantifying protein expression levels. It overcomes the limitations inherent in traditional techniques.
Comparative Proteomics Techniques
2D gel electrophoresis (2D GE) is the most commonly used technique in proteomics.
In 2D GE, proteins are separated by isoelectric focusing according to their pI difference in the first dimension and by electrophoretic mobility according to their molecular weight difference in the second dimension. Separated proteins are usually visualized by staining. Quantitation is achieved by comparing spot densities. For spot identification, the method involves spot cutting, in-gel digestion and peptide extraction. The next stage is analyzing these peptides using mass spectrometry or tandem mass spectrometry and database searching for identifications. The disadvantages of the 2D GE approach are that it is very time consuming and labor intensive, and it does not work well for hydrophobic proteins, proteins with extreme pI, and non-abundant proteins. Isotope-coded affinity tag (ICAT) is one of the new non-gel based methodologies that have had a great impact on proteome research¹. The method is based on a newly synthesized class of chemical reagents (ICAT) used in combination with tandem mass spectrometry. The ICAT reagent contains a biotin affinity tag and a thiol-specific reactive group (targeting the cysteine side chain), which are joined by a spacer domain available in two forms: regular (light), and isotopically heavy, which includes eight deuterium atoms. First, a reduced protein mixture representing one cell state is derivatized with the isotopically light version of the ICAT reagent, while the corresponding reduced protein mixture representing a second cell state is derivatized with the isotopically heavy version of the ICAT reagent. Second, the labeled samples are combined and proteolytically digested to produce peptide fragments. Third, the tagged cysteine-containing peptide fragments are isolated by avidin affinity chromatography. Finally, the isolated tagged peptides are separated and analyzed by microcapillary tandem mass spectrometry.
There are, however, limitations associated with this approach: (i) differential labeling reagents rely on stable isotopes, which are expensive and not readily adapted to multiplex differential labeling; (ii) the moieties attached to the original peptides have a mass of approximately 500 Daltons, which is heavier than some peptides and is likely to affect the peptide ionization and fragmentation process; (iii) some bonds in the labeling reagent are weak compared to the amide bond, which might complicate the MS/MS spectrum; (iv) protein expression profiling is limited to duplex comparison; (v) the affinity interaction between biotin and avidin is too strong to release the immobilized peptide efficiently; (vi) the efficiencies of protein reduction and alkylation are usually low; (vii) some proteins do not contain cysteines, so they are not labeled. Differential isotopic labeling of peptides for global quantification of proteins² is another method currently in use, in which two different protein mixtures for quantitative comparison were digested to peptide mixtures. The peptide mixtures were separately methylated using either d0- or d3-methanol, and the mixtures of methylated peptides were combined and subjected to microcapillary HPLC-MS/MS. Parent proteins of methylated peptides were identified by correlative database searching of fragment ion spectra using SEQUEST or automated de novo sequencing that compared all tandem mass spectra of d0- and d3-methylated peptide ion pairs. Ratios of proteins in the two original mixtures were calculated by normalization of the area under the curve for d0- to d3-methylated peptide pairs.
There are several limitations specific to this approach: (i) differential labeling reagents rely on stable isotopes, which are expensive and not readily adapted to differential labeling of more than two mixtures of peptides; (ii) labeling methods are limited to methylation of C-termini; (iii) protein expression profiling is limited to duplex comparison; (iv) one-dimensional capillary HPLC chromatography was employed to separate peptides, which does not have enough capacity and resolving power for complex mixtures of peptides. The invention overcomes the shortcomings of the currently available quantitative proteomics methods described above. The present method offers high speed, high efficiency and accurate quantitation, and is employed for quantitative analysis of global protein expression in complex samples. The basic approach described is employed for: (i) quantitative analysis of global protein expression in complex samples (such as cells, tissues, fractions, etc.), (ii) the detection and quantitation of specific proteins in complex samples, and (iii) quantitative measurement of specific enzymatic activities in complex samples. Novelties of this approach include: (i) design of differential labeling reagents for peptides and methods for efficient peptide modification; (ii) multiplex analysis; (iii) combination of labeling by chemical modifications of termini and/or side chains of peptides; (iv) combination of chemical modification and proteolytic digestions in order to achieve the most favorable and selective chemical modification of peptides; (v) improvement of multidimensional chromatography for better protein/peptide separation and identification.
Experimental Design and Methods
The present application provides a non-gel based method of simultaneously identifying individual proteins in complex protein mixtures and globally quantifying protein expression levels. It overcomes the limitations inherent in traditional techniques.
In detail, two or more samples of proteins are compared, one of which is considered the standard sample and all others are considered samples under investigation. First, the proteins in the standard and investigated samples are subjected to a sequence of proteolytic digestion and/or other enzymatic reactions in separate tubes. Then, these digested peptides are modified (novel differential chemical labeling). Peptides derived from the standard and the investigated samples are labeled with chemical residues of different mass but similar properties, such that the differentially labeled peptides elute together in the separation procedure and their ionization and fragmentation properties in the mass spectrometer are very similar. Next, the samples are combined, separated by multidimensional chromatography, and analyzed by mass spectrometry methods. Finally, mass spectrometry data are processed by special software for identification and quantification of proteins. This procedure is schematically illustrated in Figure 1. Differential characterization of post-translationally modified proteins is achieved by combining affinity separation techniques for enrichment of the modified proteins, or special MS monitoring or data analysis, with the above approaches.
Differential peptide labeling
Differential chemical labeling is performed on reactive functional groups on the termini of proteins and peptides and/or on the side chains of amino acids. A combination of chemical labeling, proteolytic digestion, and other enzymatic reaction steps can provide access to a variety of specifically labeled peptides, which enhances the overall selectivity of the procedure. The combined mixtures of peptides are separated using an improved version of a current chromatography method called Multidimensional Protein Identification Technique (MudPIT)³.
a. Chemical transformations involved in differential labeling: (1) Esterification of C-termini of the peptides and carboxylic acid groups in the side chains; (2) Amidation of C-termini of the peptides and carboxylic acid groups in the side chains (might require protection of amine groups first); (3) Acylation of N-termini of the peptides and amino and hydroxyl groups in the side chains. The esterification, amidation, and acylation reactions are performed on the mixtures of peptides in a fashion similar to other reactions of the types already described in the previous part, or modified as needed in each particular case.
b. Reagents for differential labeling: Mixtures of peptides coming from the standard protein samples and the investigated protein samples are labeled separately with differential reagents. These differential reagents differ in molecular mass, but do not differ in retention properties with respect to the separation method used, or in ionization and detection properties with respect to the mass spectrometry methods used. Thus, these differential reagents differ either in their isotope composition (isotopic reagents) or structurally by a rather small fragment, a change which does not alter the properties stated above (homologous reagents). The obvious choices for such reagents are aliphatic alcohols, aliphatic amines, and aliphatic acids. Isotopic reagents based on aliphatic alcohols, amines, or acids contain different numbers of protons and deuterons in different reagents, e.g., CH3CH2OH and CD3CD2OH (mass difference is 5 Da) or CH3CH2CO2H and CD3CD2CO2H (mass difference is 5 Da). The homologous reagents differ from each other by the number of CH2 moieties in their molecules, e.g., CH3OH and CH3CH2OH (mass difference is 14 Da) or CH3CO2H and CH3CH2CO2H (mass difference is 14 Da).
The alcohol reagents esterify peptide C-termini and/or Glu and Asp side chains, the amines form amide bonds with peptide C-termini and/or Glu and Asp side chains, and the acids form amide bonds with peptide N-termini and/or Lys and Arg side chains. Substituents may be introduced into the mass-labeling reagents in order to tune their retention, ionization, and detection properties.
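The mass differences quoted above for esterification reagents can be checked with a short calculation. The following is a minimal sketch, not part of the described method: it assumes that every carboxyl group (the C-terminus plus each Asp and Glu side chain) is fully esterified, and the function names and the peptide sequence "DAEFR" are hypothetical illustrations. It uses monoisotopic masses, so the results are close to, but not exactly, the nominal 14 Da and 5 Da values.

```python
# Monoisotopic atomic masses in Da (standard values).
H = 1.00782503
D = 2.01410178
C = 12.0


def alkyl_ester_shift(n_carbons: int, n_deuterium: int = 0) -> float:
    """Mass added to one -COOH group when it is esterified by a linear
    aliphatic alcohol CnH(2n+1)OH: the alkyl group replaces the acid hydrogen."""
    n_hydrogen_sites = 2 * n_carbons + 1
    if not 0 <= n_deuterium <= n_hydrogen_sites:
        raise ValueError("more deuterium atoms than hydrogen positions")
    alkyl = n_carbons * C + (n_hydrogen_sites - n_deuterium) * H + n_deuterium * D
    return alkyl - H


def count_carboxyl_sites(sequence: str) -> int:
    """C-terminus plus Asp (D) and Glu (E) side chains."""
    return 1 + sequence.count("D") + sequence.count("E")


def pair_mass_difference(sequence: str, shift_light: float, shift_heavy: float) -> float:
    """Mass gap between two fully esterified forms of the same peptide."""
    return count_carboxyl_sites(sequence) * (shift_heavy - shift_light)


methyl = alkyl_ester_shift(1)        # CH3OH
ethyl = alkyl_ester_shift(2)         # CH3CH2OH
ethyl_d5 = alkyl_ester_shift(2, 5)   # CD3CD2OH

print(round(ethyl - methyl, 4))      # homologous pair, per site
print(round(ethyl_d5 - ethyl, 4))    # isotopic pair, per site
print(round(pair_mass_difference("DAEFR", methyl, ethyl), 4))
```

Because the gap scales with the number of labeled sites, a peptide with several acidic residues shows a multiple of the per-site difference, which is worth keeping in mind when pairing peaks.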
Differential labeling progress: The peptide esterification is performed using different alcohols.
The labeling process has been optimized. Figure 2 shows one example: a peptide is differentially labeled by one of the homologous reagent pairs, in this case methanol and ethanol. The physical/chemical properties of those differentially labeled peptide pairs were further tested, and it was found that they are very similar in terms of reverse phase LC elution and ionization efficiency. Differentially labeled peptide pairs differing by one CH2 group serve as ideal mutual internal standards for quantification.
Advantages of this approach include the minimal cost of the reagents, the straightforward labeling procedure, and high product yield. All the other homologous and isotopic reagents are tested and the best one for proteomics application is chosen. Figure 2 is an illustration of a MALDI MS spectrum of a peptide pair.
These peptides are differentially esterified by either methanol or ethanol. They have the identical sequence before the labeling.
Methods for peptide/protein separation, detection and analysis:
a. Peptide separation and detection
The cutting edge methodology that represents a significant step forward in proteome analysis is the use of multidimensional liquid chromatography coupled to tandem mass spectrometry (LC-LC-MS/MS), which was first developed by Link A. and Yates J. R.⁴,⁵,⁶ and further improved by Washburn M., Wolters D., and Yates J. R.³. The existence and further improvement of this technique are critical factors in the present approach for complex peptide separation and full automation, which make it the most suitable technology for high throughput proteomics. MudPIT has been previously reported in various incarnations involving reverse phase columns coupled to either cation exchange columns or size exclusion columns⁸. However, it was only when the technique was employed with a mixed bed microcapillary column containing strong cation exchange (SCX) and reverse phase chromatography (RPC) resins that the true utility of MudPIT was demonstrated. First, a denatured and reduced protein mixture is digested with trypsin to produce peptide fragments. The mixture is loaded onto a microcapillary column containing SCX resin upstream of RPC resin, eluting directly into a tandem mass spectrometer. A discrete fraction of the adsorbed peptides is displaced from the SCX column onto the RPC column using a step gradient of salt, causing the peptides to be retained on the RPC column while contaminating salts and buffers are washed through. Peptides are then eluted from the RPC column using an acetonitrile gradient, and analyzed by MS/MS. This process is repeated using increasing salt concentration to displace additional fractions from the SCX column.
This is applied in an iterative manner, typically involving 10-20 steps, and the MS/MS data from all of the fractions are analyzed by database searching⁹,¹⁰ and combined to give an overall picture of the protein components present in the initial sample. The MudPIT technique can be run in a fully automated system. The use of two dimensions for chromatographic separation also greatly increases the number of peptides that can be identified from very complex mixtures. In one typical 14 step MudPIT run, up to 1,000 proteins can be identified with high confidence. In order to identify more proteins from complex protein samples, protein complexity must be reduced prior to proteolysis by pre-fractionation using techniques such as size exclusion, ion exchange, reverse phase, or any of the possible affinity purifications.
Mixed bed multi-dimensional liquid chromatographs
The invention provides novel three-dimensional microcapillary columns (e.g., a mixed bed multi-dimensional liquid chromatograph) comprising a reverse phase (RPC), a strong cation exchange (SCX) and a reverse phase (RPC), designated 3D LC MS/MS. In one aspect, the three-dimensional microcapillary columns of the invention are operably linked to tandem mass spectrographs (3D LC LC MS/MS), ion trap mass spectrographs or a combination of tandem mass spectrographs and ion trap mass spectrographs (LC-LCQ-MS/MS or LC-LTQ-MS/MS), as described herein. A three-dimensional microcapillary system of the invention can provide rapid metabolite identification and proteomic profiling to accelerate drug discovery and development. See Example 3, below, and Figures 4, 14 and 22 for exemplary 3D LC apparatus of the invention. Instead of using any of the pre-fractionation techniques discussed above, the novel three-dimensional microcapillary columns of the invention can be used to improve on MudPIT techniques.
In one aspect, the three-dimensional microcapillary columns of the invention also comprise tandem mass spectrometers ("3D LC LC MS/MS", as described herein), an ion trap mass spectrometer (LCQ or LTQ), such as a Finnigan LCQ Deca XP™ or MDLC LTQ™ (Thermo Electron Corporation, San Jose, CA) ion trap mass spectrometer, or Agilent's LC/MSD Trap (Agilent Technologies, Palo Alto, CA), or an equivalent mass spectrometer, or a combination of tandem mass spectrometry and ion trap mass spectrometry ("3D LC LCQ MS/MS" or "3D LC LTQ MS/MS", as described herein). In one aspect, the MDLC LTQ™ is the Finnigan LTQ FT™, which combines Ion Trap and Fourier Transform Ion Cyclotron
Resonance technologies. In one aspect, the Agilent LC/MSD Trap is an 1100 series LC/MSD TRAP™, or, the LC/MSD Trap SL™, or, the LC/MSD Trap XCT ™ (Agilent Technologies, Palo Alto, CA). In one aspect, using the 3D LC MS/MS apparatus and methods of the invention, the invention provides a rapid one-fraction protocol for protein extraction, e.g., a rapid one-fraction protocol for extraction, fractionation and/or isolation of proteins of a proteome. In one aspect, the 3D LC MS/MS, 3D LC LCQ MS/MS or 3D LC LTQ MS/MS apparatus and methods of the invention can be used to fractionate/ isolate 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%,
98%, or 99%, or more of a proteome, or all the proteins (100%) of a proteome. In one aspect, the 3D LC MS/MS, 3D LC LCQ MS/MS or 3D LC LTQ MS/MS apparatus and methods of the invention provide a one-fraction protocol to fractionate/isolate 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%, or more of a proteome, or all the proteins (100%) of a proteome. See, e.g., Examples 4 and 5, below. In one exemplary protocol, first, a denatured and reduced protein mixture is digested with trypsin to produce peptide fragments. Without desalting, the mixture is directly loaded onto a microcapillary column containing RPC resin, SCX resin and RPC resin, in that order, eluting directly into a tandem mass spectrometer. A discrete fraction of the adsorbed peptides is displaced from the first RPC section to the SCX section using a reverse phase gradient (0-X%). This fraction of peptides is retained on the SCX section and then sub-fractionated from the SCX section onto the last RPC section using a step gradient of salt, causing part of the peptides to be eluted and retained on the last RPC section while contaminating salts and buffers are washed through. Peptides are then eluted from the last RPC section using the same reverse phase gradient (0-X%), and analyzed by MS/MS. This process is repeated using increasing salt concentration to displace additional sub-fractions from the SCX section, following each salt step by a reverse phase gradient. Upon completion of the whole sequence of salt steps, the next cycle begins with a higher reverse phase gradient (0-Y%, Y>X).
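The nesting of reverse phase cycles and salt steps in this protocol can be made concrete with a small schedule generator. This is only an illustrative sketch: the `Step` type, the `build_schedule` function, the action labels, and the example gradient tops and salt concentrations are hypothetical and are not values taken from the specification.

```python
from typing import List, NamedTuple, Sequence


class Step(NamedTuple):
    cycle: int          # index of the reverse phase (acetonitrile) cycle
    action: str         # "rp_transfer", "salt_step" or "rp_elution"
    rp_top_pct: float   # top of the 0-X% reverse phase gradient in this cycle
    salt_mM: float      # salt concentration of the step (0 when no salt is used)


def build_schedule(rp_tops: Sequence[float], salt_steps: Sequence[float]) -> List[Step]:
    """Expand the RPC-SCX-RPC protocol into an ordered list of steps."""
    if list(rp_tops) != sorted(set(rp_tops)):
        raise ValueError("reverse phase gradient tops must strictly increase")
    if list(salt_steps) != sorted(set(salt_steps)):
        raise ValueError("salt concentrations must strictly increase")
    schedule: List[Step] = []
    for cycle, top in enumerate(rp_tops, start=1):
        # A 0-top% gradient moves one fraction from the first RPC bed onto SCX.
        schedule.append(Step(cycle, "rp_transfer", top, 0.0))
        for salt in salt_steps:
            # A salt step displaces a sub-fraction from SCX onto the last RPC bed.
            schedule.append(Step(cycle, "salt_step", top, salt))
            # The same 0-top% gradient elutes that sub-fraction into the MS/MS.
            schedule.append(Step(cycle, "rp_elution", top, 0.0))
    return schedule


# Hypothetical run: 3 acetonitrile cycles, 5 salt steps per cycle.
schedule = build_schedule([10, 20, 40], [50, 100, 250, 500, 1000])
print(len(schedule))                                           # 33
print(sum(1 for s in schedule if s.action == "rp_elution"))    # 15
```

Under these assumptions the number of MS/MS-analyzed sub-fractions is the product of the number of cycles and the number of salt steps, which is why run time grows quickly as either dimension is extended.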
Each cycle is applied in an iterative manner; depending on the complexity of the peptide mixture, a run involves 3-6 acetonitrile cycles, each followed by 5-10 salt steps, and the MS/MS data from all of the fractions are analyzed by database searching. Figure 3 illustrates the 3D LC set-up and process. In one aspect, the mixed bed multi-dimensional liquid chromatographs of the invention (designated 3D LC MS, or, 3D LC MS/MS; see Example 3, below, and Figures 3, 4, 14 and 22 for exemplary 3D LC apparatus of the invention) are fully automated apparatus using LC in combination with mass spectrometry and database searching for highly complex mixtures. The 3D LC MS, or, 3D LC MS/MS of the invention is competitive with the 2D GE technique in the following terms. It is universal: it identifies proteins with extremes in pI and MW, and a wide variety of protein classes. It can access hydrophobic proteins. It has high sensitivity and peak capacity and gives a dynamic range greater than 10,000 to 1. It is time and labor efficient with its automatic workflow. The mixed bed multi-dimensional liquid chromatographs (e.g., 3D LC) of the invention play an important role in both qualitative proteomics and quantitative proteomics in combination with the novel tagging method (see Examples 3, 4, and 5, below). For example, in one aspect, the chromatographs and methods of the invention are used to analyze the entire proteome of a cell, e.g., a microorganism, such as Bacillus anthracis and Desulfovibrio vulgaris.
b. Sequence analysis and quantification:
Both the quantity and the sequence identity of the protein from which the modified peptide originated are determined by multistage MS. This is achieved by operating the mass spectrometer in a dual mode in which it alternates in successive scans between measuring the relative quantities of peptides eluting from the capillary column and recording the sequence information of selected peptides.
Peptides are quantified by measuring in the MS mode the relative signal intensities for pairs or series of peptide ions of identical sequence that are tagged differentially, which therefore differ in mass by the mass differential encoded within the differential labeling reagents. Peptide sequence information is automatically generated by selecting peptide ions of a particular mass-to-charge (m/z) ratio for collision-induced dissociation (CID) in the mass spectrometer operating in the tandem MS mode⁶,¹¹,¹². The resulting tandem mass spectra are correlated to sequence databases to identify the protein from which the sequenced peptide originated. Commercially available software that may be used includes Turbo SEQUEST™ by Thermofinnigan, Mascot by Matrix Science, and Sonar MS/MS by Proteometrics. Special software will be developed for automated relative quantification. The present application provides a non-gel based method of simultaneously identifying individual proteins in complex protein mixtures and globally quantifying protein expression levels. It overcomes the limitations inherent in traditional techniques.
Literature Cited
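The pairing logic described in this passage can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: peaks are given as centroided (m/z, intensity) tuples at a single known charge state, the per-site mass differential is known, and single-scan intensities stand in for integrated elution peak areas. The function name `find_labeled_pairs`, its parameters, and the peak list are hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def find_labeled_pairs(
    peaks: List[Peak],
    mass_delta: float,
    charge: int = 1,
    max_sites: int = 3,
    tol: float = 0.01,
) -> List[Tuple[float, float, int, float]]:
    """Return (mz_light, mz_heavy, n_sites, heavy/light ratio) for every pair of
    peaks whose mass gap is a whole multiple (1..max_sites) of mass_delta."""
    ordered = sorted(peaks)
    pairs = []
    for i, (mz_light, int_light) in enumerate(ordered):
        if int_light <= 0:
            continue  # a ratio is undefined without a light signal
        for mz_heavy, int_heavy in ordered[i + 1:]:
            gap = (mz_heavy - mz_light) * charge      # gap in Da
            n_sites = round(gap / mass_delta)
            if 1 <= n_sites <= max_sites and abs(gap - n_sites * mass_delta) <= tol:
                pairs.append((mz_light, mz_heavy, n_sites, int_heavy / int_light))
    return pairs


# Hypothetical singly charged peaks; 14.01565 Da is one CH2 (methyl vs. ethyl ester).
peaks = [(1000.50, 200.0), (1014.5157, 100.0), (1200.00, 50.0)]
pairs = find_labeled_pairs(peaks, mass_delta=14.01565)
print(pairs)
```

A production implementation would additionally need to handle isotope envelopes, unknown charge states, and co-elution checks, none of which are attempted here.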
1. Gygi, Steven P.; Rist, Beate; Gerber, Scott A.; Turecek, Frantisek; Gelb, Michael H.; Aebersold, Ruedi. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. In: Nature Biotechnology Oct. 1999. 17 (10): 994-999.
2. Goodlett, David R.; Keller, Andrew; Watts, Julian D.; Newitt, Richard; Yi, Eugene C.; Purvine, Samuel; Eng, Jimmy K.; von Haller, Priska; Aebersold, Ruedi; Kolker, Eugene. Differential stable isotope labeling of peptides for quantitation and de novo sequence derivation. In: Rapid Communications in Mass Spectrometry 2001. 15 (14): 1214-1221.
3. Washburn, Michael P.; Wolters, Dirk; Yates, John R. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. In: Nature Biotechnology March, 2001. 19 (3): 242-247.
4. Yates, J. R.; Link, Andrew J.; Schieltz, David A.; Eng, Jimmy K.; Carmack, Edwin American Societies for Experimental Biology. (Annual Meeting of the American Societies for Experimental Biology on Biochemistry and Molecular Biology 99 San Francisco, California, USA May 16-20, 1999). Mining proteomes using mass spectrometry: New approaches to help define function. In: FASEB Journal April 23, 1999. 13 (7): A1431.
5. Link, Andrew J.; Robison, Keith; Church, George M. Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K-12. In: Electrophoresis 1997. 18 (8): 1259-1313.
6. Link, Andrew J.; Hays, Lara G.; Carmack, Edwin B.; Yates, John R. Identifying the major proteome components of Haemophilus influenzae type-strain NCTC 8143. In: Electrophoresis 1997. 18 (8): 1314-1334.
7. Rose, Donald J.; Opiteck, Gregory J. Two-dimensional gel electrophoresis/liquid chromatography for the micropreparative isolation of proteins. In: Analytical Chemistry 1994. 66 (15): 2529-2536.
8. Opiteck, Gregory J.; Ramirez, Suzanne M.; Jorgenson, James W.; Moseley, M. Arthur. Comprehensive two-dimensional high-performance liquid chromatography for the isolation of over expressed proteins and proteome mapping. In: Analytical Biochemistry May 1, 1998. 258 (2): 349-361.
9. Yates, JR 3rd; Eng, JK; McCormack, AL. Mining genomes: correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases. Analytical Chemistry 1995 Sep 15, 67(18):3202-10.
10. Yates, JR 3rd; Eng, JK; McCormack, AL; Schieltz, D. Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Analytical Chemistry 1995 Apr 15, 67(8):1426-36.
11. Gygi, SP; Rist, B; Gerber, SA; Turecek, F; Gelb, MH; Aebersold, R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nature Biotechnology 1999 Oct, 17(10):994-9.
12. Gygi, SP; Rochon, Y; Franza, BR; Aebersold, R. Correlation between protein and mRNA abundance in yeast. Molecular and Cellular Biology 1999 Mar, 19(3):1720-30.
Data Analysis
In one aspect, upon the acquisition of data using the methods and/or systems of the invention, a program such as SEQUEST™ (Thermo Finnigan, San Jose, CA) or equivalent, e.g., U.S. Patent Nos. 6,017,693 and 5,538,897, can be used to inspect/analyze the spectra with multiple peaks (e.g., more than 7 peaks/spectrum) for potential duplicates (see discussion in Example 3, below). In one aspect, the spectra comparisons are carried out using a dot-product criterion, e.g., as in Stein (1997) Am. Soc. Mass Spectrom. 5:859, in combination with the retention time, precursor m/z constraints, and index-peak matching. In one aspect, data acquired from the differentially labeled peptides are subjected to the following exemplary data analysis algorithm of the invention:
1. Component extraction, comprising the following sub-steps:
  a. For every MS spectrum from the beginning of the LC elution, select the "significant" ions, which are above the local noise background and contain predominantly ¹²C isotopes.
  b. For every "significant" ion, generate a "selected ion chromatogram" using the neighboring MS spectra. In one aspect, the width of the region is at least 2X the expected width of the peptide elution (D0).
  c. Determine the peak location, quality, area and baseline level based on the "selected ion chromatogram".
  d. Save the "valid" component, which exceeds the quality requirement for the LC elution peak and is located within the elution boundary of the "significant" ion.
  e. Link the components to the MS/MS spectra, if available, based on their m/z values and elution times, with consideration of appropriate tolerances.
2. In one aspect, concurrently, if the MS/MS spectra of the peptides are acquired, the intensities of the precursor ions are extracted as follows:
  a. The duplicated MS/MS spectra are identified using the following algorithm:
    i. For every MS/MS spectrum from the beginning of the LC elution, compare it to all MS/MS spectra;
    ii. Spectra equivalency is declared if the spectra pair satisfies the following requirements:
      1. Their precursor m/z values are within a pre-defined tolerance;
      2. Their elution times are within a pre-defined tolerance;
      3. Their "signature" peaks achieve a pre-defined degree of match; and
      4. Their "dot-products" in both the forward and backward directions exceed pre-defined thresholds.
  b. The duplicated spectra are merged based on the m/z position of the peaks. The elution times of the first (T1) and last (T2) spectra are stored as a part of the description of the merged spectrum.
  c. The intensity of the precursor ions is calculated from the MS1 spectra by integrating the region where the precursor ions are detected. This region is defined as (T1 - D0 / 2, T2 + D0 / 2), where D0 is defined as in 1.b.
Figure 17 is a schematic, a flow chart, illustrating an exemplary data analysis algorithm of the invention for quantitative proteomics. Figure 18 is a schematic, a flow chart, illustrating the "component extraction" section of the exemplary data analysis algorithm for quantitative proteomics illustrated in Figure 17. Figure 19 is a schematic, a flow chart, illustrating the "precursor integration" section of the exemplary data analysis algorithm for quantitative proteomics illustrated in Figure 17. Figure 20 is a schematic, a flow chart, illustrating the "spectra comparison" section of the exemplary data analysis algorithm for quantitative proteomics as illustrated in Figure 19. Figure 21 is a schematic, a flow chart, illustrating the "identity and merge of duplicates LC-MS spectra" section of the exemplary data analysis algorithm for quantitative proteomics as illustrated in Figure 19. Thus, the invention provides data analysis algorithms as illustrated in Figure 17, and further described in Figures 18 to 21, in whole, and/or, in part.
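The duplicate test and the precursor integration window in step 2 can be sketched in a few lines. This is an assumption-laden illustration, not the algorithm of Figures 19 to 21: spectra are represented as plain dictionaries, peaks are binned at unit m/z width, the "forward" and "backward" dot-products are interpreted here as a normalized dot-product restricted to the bins occupied by one spectrum and then by the other, and the "signature" peak criterion is omitted. All names, tolerances and thresholds are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def _binned(peaks: List[Peak], bin_width: float) -> Dict[int, float]:
    """Sum intensities into fixed-width m/z bins."""
    bins: Dict[int, float] = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def directional_dot(ref: List[Peak], other: List[Peak], bin_width: float = 1.0) -> float:
    """Normalized dot-product computed only over the bins occupied in `ref`."""
    vr, vo = _binned(ref, bin_width), _binned(other, bin_width)
    num = sum(vr[k] * vo.get(k, 0.0) for k in vr)
    den = math.sqrt(sum(x * x for x in vr.values())
                    * sum(vo.get(k, 0.0) ** 2 for k in vr))
    return num / den if den else 0.0


def is_duplicate(s1: dict, s2: dict, mz_tol: float = 0.5,
                 rt_tol: float = 1.0, min_dot: float = 0.8) -> bool:
    """Requirements 1, 2 and 4 of step 2.a.ii (signature peaks not modeled)."""
    if abs(s1["precursor_mz"] - s2["precursor_mz"]) > mz_tol:
        return False
    if abs(s1["rt"] - s2["rt"]) > rt_tol:
        return False
    forward = directional_dot(s1["peaks"], s2["peaks"])
    backward = directional_dot(s2["peaks"], s1["peaks"])
    return forward >= min_dot and backward >= min_dot


def integration_window(t1: float, t2: float, d0: float) -> Tuple[float, float]:
    """Region (T1 - D0/2, T2 + D0/2) over which the precursor is integrated."""
    return (t1 - d0 / 2, t2 + d0 / 2)


frag = [(175.1, 40.0), (262.2, 100.0), (375.3, 60.0)]
a = {"precursor_mz": 500.25, "rt": 30.0, "peaks": frag}
b = {"precursor_mz": 500.26, "rt": 30.4, "peaks": frag}
c = {"precursor_mz": 500.25, "rt": 45.0, "peaks": frag}
print(is_duplicate(a, b), is_duplicate(a, c))
print(integration_window(10.0, 12.0, 1.0))
```

Checking the cheap precursor and retention time constraints before computing any dot-product keeps the all-against-all comparison of step 2.a.i from being dominated by spectral scoring.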
The data analysis algorithm described in Figure 17, and further described in Figures 18 to 21, in whole, or, in part, can be used to analyze data generated by the systems and methods of the invention. For example, this analysis can be used to reconstruct a series of differentially labeled peptides based on a predictable elution behavior in combination with the predicted mass differences, which can be generated by the systems and methods of the invention. Alternatively, this algorithm, in whole, or, in part, can be used to analyze data generated by other applications, e.g., to analyze data generated by any LC, MS, LC-MS or other analytical system. Computer Systems and computer program products In one aspect, the invention provides computer program products comprising computer-implemented methods and/or programs comprising data analysis algorithms as described in Figure 17, and further described in Figures 18 to 21, in whole, or, in part. The invention provides computer systems, e.g., comprising computer program products, operably linked to the multidimensional columns of the invention, or the 3D LC LC MS/MS or 3D LC LCQ MS/MS systems of the invention. The invention provides a storage medium (e.g., a diskette, a tape, a CD, a hard drive, a memory chip) with a computer program of the invention (e.g., a computer-implemented method, a data-analysis algorithm of the invention) stored thereon. The invention provides computer program products comprising a computer useable medium having computer program logic recorded thereon, where computer program code logic is configured to perform operations comprising the computer- implemented methods, the data-analysis algorithms, of the invention. The invention provides computer systems comprising a processor and a computer program product of the invention. 
The invention provides a quantitative proteomics system comprising a chromatography system comprising a system of the invention or a mixed bed multi-dimensional liquid chromatograph of the invention, wherein the system is capable of outputting data to a processor; a processor; and a computer program product of the invention embodied within the processor. A computer/processor used to practice the methods of the invention can be a conventional general-purpose digital computer, e.g., a personal workstation or portable computer, including various computer devices such as microprocessors, machine-readable memory units, data transfer buses, a graphic controller, and one or more display devices such as CRT or LCD monitors. In addition, the computer may include a data acquisition interface with a sensing subsystem for receiving real-time measurement data and a control interface which sends out computer-generated control commands to the controllable cell environment or the cell modification subsystem, either directly or indirectly via some other control units. Examples of the memory units include any form of memory elements, such as dynamic random access memory, flash memory or the like, or mass storage devices such as a magnetic disk drive and an optical disk drive. Computer software of the invention may be, at least in part, stored in one or more suitable memory units. For example, a conventional personal computer such as those based on an Intel microprocessor and running a Windows operating system can be used. Any hardware or software configuration can be used to practice the methods of the invention. For example, computers based on other well-known microprocessors and running operating system software such as UNIX, Linux, MacOS and others are contemplated.
EXAMPLES The following examples are offered to illustrate, but not to limit the claimed invention.
Example 1: Identifying proteins by differential labeling of peptides An exemplary method for identifying proteins by differential labeling of peptides is provided, as described below. First, a denatured and reduced protein mixture is digested with trypsin to produce peptide fragments. The mixture is loaded onto a microcapillary column containing a sulfonated styrene resin (e.g., SCX resin, as from Dionex Corporation, Sunnyvale, CA) upstream of RPC resin (Rapid Prototyping Chemicals, Switzerland), eluting directly into a tandem mass spectrometer. A discrete fraction of the absorbed peptides is displaced from the SCX column onto the RPC column using a step gradient of salt, causing the peptides to be retained on the RPC column while contaminating salts and buffers are washed through. Peptides are then eluted from the RPC column using an acetonitrile gradient, and analyzed by MS/MS. This process is repeated using increasing salt concentration to displace additional fractions from the SCX column. This is applied in an iterative manner; it can be repeated 10 to 20, or more, times. The MS/MS data from all of the fractions are analyzed by database searching, as described, for example, by Yates, J. R., III, et al. (1995) Anal. Chem. 67, 1426-1436; Eng, J. et al. (1994) J. Amer. Mass Spectrom. 5, 976-989. The data are combined to give an overall picture of the protein components present in the initial sample. The MudPIT technique can be run in a fully automated system. The use of two dimensions for chromatographic separation also greatly increases the number of peptides that can be identified from very complex mixtures.
Example 2: Identifying proteins by differential labeling of peptides An exemplary method for synthesizing a differential labeling reagent is provided, as described below. The invention provides chimeric labeling reagents comprising biotin and an amino acid reactive moiety, such as succinimide, isothiocyanate, or isocyanate. The amino acid reactive moiety can be attached directly or indirectly (i.e., through a linker) to the biotin. The biotin can comprise up to 6 deuterium atoms or six hydrogen atoms. Alternatively, other isotopes, such as 13C or 18O, as described above, can be incorporated into the biotin moiety, the amino acid reactive moiety or the crosslinker moiety. The biotin facilitates purification, see, e.g., WO 00/11208, and, by comprising at least one isotope, simultaneously allows mass discrimination in the mass spectrometer. The activated group allows covalent bonding to amino acids, such as lysines or cysteines. An exemplary precursor to biotin that can be used is:
Figure imgf000188_0001
A Grignard reaction is performed with the following compound: XMg-(CD2)4-MgX, where X is chlorine or bromine. The reaction is similar to the one described in US Patent 4,876,350, which describes the chemical synthesis of regular biotin. Deuterated and undeuterated biotin, subsequently derivatized to a pentafluorophenyl ester, can then be attached to iodoacetic acid anhydride or as an NHS ester, or other amino acid reactive groups. For example,
Figure imgf000189_0001
This technology allows the direct comparison between two differential proteome samples. For example, protein samples are differentially tagged with the isotope-coded affinity tags of the invention. These tags are distinguishable only by their different isotope compositions. The isotope- (e.g., deuterium-) containing moiety can be the biotin, the linker or the amino acid reactive group, or any combination thereof. The biotin moiety facilitates purification of the peptides. Isotopically "heavy" and isotopically "light" tags are separately mixed with the denatured differential protein samples. The tagged proteins are digested with a protease before or after mixing of the samples. Tagged peptides are purified on an avidin column. The column is washed, and the tagged peptides are eluted. After elution of the tagged peptides, the peptide mixture is separated using capillary chromatography and the peptide mass is determined. Peptide masses that differ by exactly the mass difference between the isotopic tags correspond to the identical peptide species and can be directly compared quantitatively.
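The mass-difference matching described above can be illustrated with a short sketch. It is not the algorithm of Figure 17; it only shows one simple way to find "light"/"heavy" pairs in a list of measured peptide masses. The function name, the default of six deuterium atoms per tag, the 0.02 Da tolerance and the restriction to one tag per peptide are all assumptions made for illustration.

```python
from typing import List, Tuple

# Mass gained when one hydrogen is replaced by deuterium (2.014102 - 1.007825 Da).
DEUTERIUM_SHIFT_DA = 1.006277


def pair_light_heavy(
    masses: List[float],
    n_deuterium: int = 6,        # assumed label size; the text allows up to six
    tolerance_da: float = 0.02,  # assumed matching tolerance, not from the text
) -> List[Tuple[float, float]]:
    """Return (light, heavy) mass pairs separated by the expected tag shift.

    Assumes a single tag per peptide; peptides carrying several tags would
    show integer multiples of the shift and are not handled here.
    """
    expected = n_deuterium * DEUTERIUM_SHIFT_DA
    ordered = sorted(masses)
    pairs = []
    for i, light in enumerate(ordered):
        for heavy in ordered[i + 1:]:
            delta = heavy - light
            if delta > expected + tolerance_da:
                break  # masses are sorted, so later candidates are even further away
            if abs(delta - expected) <= tolerance_da:
                pairs.append((light, heavy))
    return pairs


# Hypothetical masses in Da: only the first two differ by about 6.0377 Da.
print(pair_light_heavy([1000.0, 1006.0377, 1500.5, 1200.0]))
```

Each returned pair would then be a candidate for quantitative comparison of the two samples, for example by taking the ratio of the two peak intensities.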
Example 3: Mixed resin multi-dimensional chromatography This example describes exemplary methods for making and using a mixed resin multi-dimensional chromatography system of the invention. The invention provides a novel proteomics platform comprising multidimensional chromatography, e.g., a three-dimensional chromatography, or 3D LC-MS/MS, system. In one aspect, it has the capacity to identify over two thousand unique proteins within one analysis. This new system can be a fully automated process. It has demonstrated high peak capacity and resolution with minimal sample loss as well as a certain tolerance to detergent. This comprehensive proteomics platform can perform large-scale proteome profiling with high sequence coverage and high-confidence characterization with small protein quantity and minimal effort. In one aspect, multi-dimensional chromatographs of the invention are detergent tolerant, and thus are excellent for membrane proteins or any hydrophobic compound, e.g., steroids, lipids and the like. With the multi-dimensional chromatography (e.g., 3D LC-MS/MS) apparatus and methods of the invention, a comprehensive view of the yeast proteome was obtained. In brief, yeast cells grown to log-phase were homogenized and separated into three categories: soluble proteins, urea-solubilized proteins, and SDS-solubilized proteins. Collectively, the three fractions identified more than 5000 total proteins (2575 unique ones) with an average coverage of 11 peptides per protein. Among them, 406 proteins have more than two transmembrane domains. The distribution of the proteins identified among the fractions overlaid well with the predictions for function. The protein abundance was estimated by the "coverage index". The majority of the primary metabolism pathways were fully identified. This developed technology is crucial for a detailed mapping of the proteome, as well as differential proteome profiling. Reagents. 
Endoproteinase Lys-C and recombinant trypsin were purchased from Roche Diagnostics (Indianapolis, IN). Dithiothreitol (DTT) was obtained from Pharmacia Biotech (Uppsala, Sweden). All other chemicals, unless otherwise noted, were obtained from Sigma Chemical (St. Louis, MO). Preparation of protein samples from S. cerevisiae cells. S. cerevisiae strain S150-2B (MATa, leu2-3, leu2-112, ura3-52, trp1-289, his3-Δ1, Ga) was cultured in YPD medium at 30°C to log growth phase. The cell pellet was resuspended in TNE extraction buffer (50 mM Tris-HCl, pH 8.0, 100 mM NaCl, 10 mM EDTA, 1 mM DTT, 0.7 μg/ml Pepstatin A, 5 μg/ml RNase and DNase). Cell lysis was performed by homogenization with glass beads in a Mini-BeadBeater (BioSpec Products, Bartlesville, OK) for 6 cycles of 1 min breakage and 1 min cooling on ice. The cell lysate was centrifuged at 1,000g for 5 min and the supernatant was subjected to ultracentrifugation at 100,000g for 30 min at 4°C. The supernatant was collected as the soluble protein sample. The pellet was washed with 4 M urea, 100 mM Tris-HCl, pH 8.0, 1 mM DTT buffer to generate the urea-solubilized protein sample. Optionally, the pellet was washed again with 8 M urea, 100 mM Tris-HCl, pH 8.0, 1 mM DTT buffer to yield an alternative urea (8 M)-solubilized fraction. The remaining pellet was resuspended with 1% SDS, 50 mM Tris-HCl, pH 8.0, 1 mM DTT buffer. After ultracentrifugation, the supernatant was collected as the SDS-solubilized protein sample. Protein concentrations were measured using a BCA protein assay kit (Pierce Biotechnology, Rockford, IL). The soluble protein sample was adjusted to 100 mM Tris-HCl, pH 8.0, 4 M urea; it and the urea-solubilized protein sample were then reduced with 1 mM DTT and alkylated with 0.4 mg/ml iodoacetamide. 
After digestion with endoproteinase Lys-C (1/200 of protein), the protein samples were diluted to 1 M urea and further digested twice with trypsin (1/100 of protein) by adding the second batch of trypsin 4 hours after the addition of the first batch. The digested proteins were concentrated in a DNA110 SpeedVac (Thermo Savant, Holbrook, NY). The SDS-solubilized protein sample was then diluted to 0.1% SDS.
The proteins were additionally reduced, alkylated, digested and concentrated as mentioned above. SDS was removed using the SDS-Out Sodium Dodecyl Sulfate Precipitation Kit from Pierce Biotechnology (Rockford, IL). Online 3D LC-MS/MS. The digested protein samples were analyzed by an exemplary 3D LC-MS/MS method of the invention comprising an Agilent 1100 series high pressure liquid chromatograph (HPLC) (Agilent, Palo Alto, CA), a three-dimensional microcapillary column, and a LCQ Deca XP mass spectrometer equipped with a nano-spray source (Thermo Finnigan, San Jose, CA), as illustrated in Figure 4. Another exemplary apparatus of the invention is illustrated in Figure 14; note the three-way valve connected to the first reverse phase resin bed (a PEEK Micro Cross, Micro-Tech Scientific, Inc., Vista, CA) in this exemplary apparatus. Another exemplary apparatus of the invention is illustrated in Figure 22. In one aspect, fused silica capillaries (Polymicro Technologies, LLC, Phoenix, AZ, or Agilent Technologies, Palo Alto, CA) are used as "housings" for the resin beds. In one aspect, tips are cut from the fused silica capillaries by a precision cutting instrument (e.g., a Sutter Instrument Company P-2000). In one aspect, apertures of the capillaries are approximately 5 nanometers across for capillaries that have an inner diameter of 100 microns (precision machinery can reproducibly generate these tips). The 3D microcapillary column was generated using a pressure bomb. Bombs can help load packing material and peptides into the columns. In one aspect, an Eppendorf tube holding the material to be loaded is placed inside a bomb. The bomb is then assembled and the column is inserted through the aperture at the top of the bomb into the Eppendorf tube below. Gas pressure forces the material in the tube into the column. 
The column was constructed with two microcapillaries coupled with an inline microfilter assembly (Upchurch Scientific, Oak Harbor, WA) and packed with three resin phases. In one aspect, the two microcapillaries are coupled by a low volume flow valve. The first microcapillary (180 μm i.d. x 365 μm o.d. x 30 cm) was packed with Zorbax SB-C18 reverse phase resin (Agilent, Palo Alto, CA) [RPC1]. The second microcapillary (100 μm i.d. x 365 μm o.d. x 15 cm) was first packed with 10 cm of Zorbax SB-C18 reverse phase resin [RPC2] and then 5 cm of Polysulfoethyl A strong cation exchange resin (PolyLC Inc., Columbia) [SCX]. The column was then connected to the HPLC through RPC1 and the RPC2 end was coupled to the LCQ. Without desalting, 200 μg of peptide mixture from each digested protein sample was directly loaded onto the RPC1 region of the microcapillary column using the pressure bomb. The absolute loading capacity of RPC1 was tested and found to be up to 400 μg of protein digests. In an alternative aspect, 5 micron spherical silica beads are used in the cation exchange column, e.g., Partisphere (Whatman). Referring to Figure 4, complex peptides are loaded onto the column directly in the digestion buffer. In an iterative process, a fraction of the peptides is displaced from RPC1 onto SCX using a RP gradient. This fraction of peptides was retained at the SCX phase and part of it was sub-fractionated onto the RPC2 region using a step salt gradient. The sub-fraction of peptides was retained in the
RPC2 front while salts and buffers were washed through. Peptides on the RPC2 were then separated using the same reverse phase gradient. The process was repeated with increasing salt concentration to displace additional sub-fractions from the SCX region, each salt step being followed by the same RP gradient. Upon the completion of analyzing all the sub-fractions of this fraction, the cycle was repeated, employing a higher reverse phase gradient range. Each of the reverse phase cycles was applied in an iterative manner, with the total number of cycles depending on the complexity of the peptides. A voltage of 1.3 kV applied to the pre-column liquid-metal interface produces a stable electrospray at the ion source of the mass spectrometer. The LC separation was carried out in a fully automated manner using four buffer solutions: buffer A (2% ACN/0.1% formic acid), buffer B (80% ACN/0.1% formic acid), buffer C (250 mM ammonium acetate/2% ACN/0.1% formic acid), and buffer D (2 M ammonium acetate/2% ACN/0.1% formic acid). A discrete fraction of the absorbed peptides was displaced from the RPC1 region onto the SCX region using a reverse phase gradient (Xn-Xn+1 %B) over 120 minutes with a flow rate of 250 nl/min. This fraction of peptides was retained on the SCX phase while part of it was sub-fractionated onto the RPC2 region using a step salt gradient of 10 minutes with a flow rate of 1 μl/min. This sub-fraction of peptides was retained in the RPC2 while the salts and buffers washed through. The peptides on the RPC2 were then separated using the same reverse phase gradient (Xn-Xn+1 %B) of 120 minutes and a flow rate of 150 nl/min. The eluted peptides were subsequently directly analyzed by the LCQ mass spectrometer. The process was repeated with increasing salt concentrations to transfer additional sub-fractions from the SCX region to RPC2, following each salt step with the same reverse phase gradient to elute the next fraction off RPC2. 
Upon the completion of the series of salt steps, the entire sequence was repeated, employing a higher reverse phase gradient (Xn+1-Xn+2 %B, where Xn+2 > Xn+1 and n = 0, 1, 2, 3, ...). Each of the reverse phase cycles was applied in an iterative manner, with the total number of cycles depending on the complexity of the peptides, as shown in Table 1 and Figures 5 and 6. In one aspect, the chromatography proceeds in cycles, each comprising an increase in salt concentration to "bump" peptides off of the SCX followed by a gradient of increasing hydrophobicity to progressively elute peptides from the RP. In one aspect, Agilent 1100 and ThermoFinnigan Surveyor Quaternary pumps are used. Pumps can be operated at flow rates of about 100-200 microliters/min, optionally, with a pre-column splitting of the flow to produce 100-200 nL/min flow rates at the column. For the yeast protein samples, the separation included 5 reverse phase cycles (X0 = 0%B, X1 = 8%B, X2 = 15%B, X3 = 30%B, X4 = 50%B, and X5 = 100%B), each one followed by 12 salt gradient steps (25 mM, 50 mM, 75 mM, 100 mM, 125 mM, 150 mM, 175 mM, 200 mM, 225 mM, 250 mM, and 2 M ammonium acetate). The LCQ mass spectrometer was set to divide the full MS scan into three smaller sections covering a range of 400 to 2000 m/z. Each of the smaller MS scans was followed by 4 to 6 MS/MS scans of the most intense ions from the preceding MS scan. The typical collision energy for collision-induced dissociation was set to 35% with a 30-ms activation time. Dynamic exclusion was enabled with a repeat count of 1 and a three-minute exclusion duration window. The MS/MS of the separated peptides is illustrated in Figure 23. These peptides were subsequently analyzed by database searching, as described below. In one aspect, after loading the first RP column, the peptides are eluted by reverse flow of the buffer. In other words, the peptides are loaded and eluted from the same end. In one mode, the peptides can be loaded at either end, and eluted from that end. 
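The nested structure of the yeast separation described above (reverse phase windows in the outer loop, salt steps in the inner loop) can be written out as a small sketch. The %B breakpoints and ammonium acetate concentrations are taken from the text; the leading 0 mM step is an assumption, added here only because the text states 12 salt steps while listing 11 concentrations. The function and field names are hypothetical and do not correspond to any instrument control software.

```python
from itertools import product

# %B breakpoints X0..X5 reported for the yeast samples.
RP_BREAKPOINTS = [0, 8, 15, 30, 50, 100]

# Ammonium acetate steps in mM. The leading 0 mM (no-salt) step is an
# assumption made so that the count matches the 12 steps stated in the text.
SALT_STEPS_MM = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 2000]


def build_schedule(breakpoints, salt_steps):
    """List every (reverse phase window, salt step) combination in run order."""
    # Consecutive breakpoints form the windows X0-X1, X1-X2, and so on.
    windows = list(zip(breakpoints[:-1], breakpoints[1:]))
    return [
        {"cycle": c + 1, "rp_start": lo, "rp_end": hi, "salt_mM": salt}
        for (c, (lo, hi)), salt in product(enumerate(windows), salt_steps)
    ]


schedule = build_schedule(RP_BREAKPOINTS, SALT_STEPS_MM)
print(len(schedule))  # 5 windows x 12 salt steps = 60 combinations
print(schedule[0])
print(schedule[-1])
```

Under these assumptions each entry corresponds to one RPC2 separation, which is the unit plotted as a single column in the statistics of Figure 5.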
Loading and eluting from the same end may, in some aspects, keep the column useful for longer periods before it gets gunked up (saturated). In one aspect, one end is loaded and the peptides are eluted out the other end. However, in some applications, eluted bands may be either too diffuse or not detectable when eluted at the nanoflow rates needed for some downstream two resin columns. Therefore, in alternative aspects, in some applications, by eluting from the same end as used for loading, a nanoflow rate can be used successfully. In another alternative aspect, where the cation exchange (CX) column and the second reversed phase column (RPC) are enclosed in one housing and the first reversed phase column (RPC) is enclosed in a second housing, or, the reverse phase resin of the first bed is enclosed in a separate housing from the cation exchange (second) resin bed and the reverse phase resin of the third bed, the first column (first dimension), or bed, can be detachable from the remaining two resin columns (e.g., the second housing). The connection can be direct or through a valve assembly, e.g., a two, three or multi-way valve. Thus, the first resin bed or resin column to be exposed to sample can be readily replaced when gunked up, rather than replacing all three resins. In one aspect, only the resin bed, or column, within a housing is replaced. In another aspect, the entire housing (with resin) is replaced. In one aspect, where a three-way or multi-way valve separates the first dimension (first reversed phase column, or RPC) from the second (CX) dimension, the first RPC can be washed (e.g., renewed) by flowing a reagent, e.g., a buffer, back through the first RPC after closing the valve leading into the second, or CX, dimension. After this process is completed, the valve leading from the buffer source can be closed and the valve leading into the CX column is opened. 
In one aspect, a second buffer can be passed through the first RP column out through the wash valve (with the CX opening valve closed) before doing an RP loading if the wash buffer is different from the RP loading buffer. In one aspect, this process is automated. Thus, the first resin bed or resin column can be cleaned (de-gunked) and, in one aspect, re-equilibrated with buffer, between RP loading runs without disconnecting the first column from the second dimension (and, in one aspect, in a completely automated manner). Data Analysis. A series of programs was developed to analyze the LC-MS/MS data generated from the proteomics experiments with the purpose of achieving high accuracy, throughput and comprehensive data management. Upon the acquisition of the data, a program such as SEQUEST™ (Thermo Finnigan, San Jose, CA) or equivalent (see, e.g., U.S. Patent Nos. 6,017,693 and 5,538,897, and discussion below) inspects the spectra with more than 7 peaks/spectrum for potential duplicates. The spectra comparisons are carried out using dot-product criteria (see, e.g., Stein (1997) Am. Soc. Mass Spectrom. 5:859) in combination with the retention time, precursor m/z constraints, and index-peak matching. For this study, the minimum dot product was 0.6, the RT window was six minutes, and the precursor tolerance was 1.5 amu. The index peaks for each spectrum were generated from the most-intense peak within seven equally-spaced regions, where only one mismatch was allowed. The spectra duplicates were merged according to their m/z positions. For peptide identification, the spectra were searched at all plausible charge states against all yeast proteins in the "rupep" collection from NCBI (dated 4/3/2002). The sequences from the human RefSeq (05/14/2002) were also searched as the control. For each match candidate, the estimated number of peptide hits (MP) expected from a random match between the spectrum and a database of the given size was calculated. 
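The duplicate test described above combines three of the stated criteria: a minimum dot product of 0.6, a six-minute retention time window and a 1.5 amu precursor tolerance. A minimal sketch of such a test follows. It is not the program used in the study: the unit-width m/z binning, the cosine normalization, the reading of the RT window as a maximum absolute difference, and the dictionary field names are all assumptions, and index-peak matching is omitted.

```python
import math
from collections import defaultdict

MIN_DOT = 0.6         # minimum dot product used in the study
RT_WINDOW_MIN = 6.0   # retention time window in minutes (read here as a max difference)
PRECURSOR_TOL = 1.5   # precursor tolerance in amu


def binned(peaks, width=1.0):
    """Sum peak intensities into fixed-width m/z bins (the bin width is an assumption)."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(mz // width)] += intensity
    return bins


def normalized_dot(peaks_a, peaks_b):
    """Cosine-style dot product between two binned spectra, between 0 and 1."""
    a, b = binned(peaks_a), binned(peaks_b)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_duplicate(spec_a, spec_b):
    """Apply the precursor, retention time and dot-product constraints together."""
    return (
        abs(spec_a["precursor_mz"] - spec_b["precursor_mz"]) <= PRECURSOR_TOL
        and abs(spec_a["rt_min"] - spec_b["rt_min"]) <= RT_WINDOW_MIN
        and normalized_dot(spec_a["peaks"], spec_b["peaks"]) >= MIN_DOT
    )


# Hypothetical spectra: (m/z, intensity) peak lists with precursor and retention time.
s1 = {"precursor_mz": 650.3, "rt_min": 42.0, "peaks": [(175.1, 30.0), (288.2, 80.0), (401.3, 55.0)]}
s2 = {"precursor_mz": 650.6, "rt_min": 44.0, "peaks": [(175.1, 28.0), (288.2, 85.0), (401.3, 50.0)]}
print(is_duplicate(s1, s2))
```

Spectra that pass such a test would then be merged, as the text describes, before the database search.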
Once the candidate pool is finished for a spectrum, the "best" false match, defined as the non-homologous match below the top match, is located. From there, the S/N ratio (SN) for the top match is calculated. An error rate (ER) is then assigned by a series of filters created from a standard protein mixture under equivalent matching conditions. The filters are given limits using MP, SN and the peptide length as the parameters. Protein identification is based on the combined error rate calculated on the assumption that each peptide identification is mutually independent. For this study, a protein identification is made when the confidence level, which is equal to 100% minus the combined error rate, from the peptide detections exceeds a predefined threshold of 90%. In the case where multiple occurrences of a peptide are observed, the instance with the best matching probability is selected. When results from multiple protein isolations are combined, at least one of the isolations must produce a valid identification. The coverage index (CI) of a protein identification is calculated as (1000 * Np / NAA), where Np is the number of distinct peptides detected from the protein and NAA is the number of amino acid residues of the protein. The Anchor Plot (AP) for the protein identifications among the protein fractions was constructed as described by Ankerst (1996) Circle segments: A technique for visually exploring large dimensional data sets. In Proceedings of the IEEE Visualization Conference, Conference Proceeding. Described briefly, protein isolations are assigned to the anchors, whose X,Y coordinates are calculated such that the anchors are equally distributed on a circle. The coordinates for the protein identifications are calculated as (Σ d_i x_i, Σ d_i y_i), where d_i is the normalized number of the distinct peptides from the ith isolation, and (x_i, y_i) are the coordinates of the ith anchor. The areas of the bubbles are proportional to the coverage indices. 
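The coverage index, the confidence level and the anchor plot coordinates defined above are simple enough to sketch directly. The coverage index follows the stated formula. The other two functions rest on assumptions that the text does not spell out: the combined error rate is taken as the product of the per-peptide error rates (one reading of "mutually independent"), the peptide counts are normalized so that they sum to one, and the anchors are placed on a unit circle starting at angle zero. All function names are hypothetical.

```python
import math


def coverage_index(n_distinct_peptides, n_residues):
    """CI = 1000 * Np / NAA, as defined in the text."""
    return 1000.0 * n_distinct_peptides / n_residues


def protein_confidence(peptide_error_rates):
    """Confidence (%) = 100 - combined error rate (%).

    Assumption: under independence, the combined error rate is the product of
    the per-peptide error rates (the chance that every peptide match is wrong).
    """
    combined = math.prod(peptide_error_rates)
    return 100.0 * (1.0 - combined)


def anchor_position(peptide_counts):
    """Place a protein at (sum d_i * x_i, sum d_i * y_i) among anchors on a unit circle.

    peptide_counts[i] is the number of distinct peptides seen in the ith
    isolation; d_i is that count divided by the total (an assumed normalization).
    """
    total = sum(peptide_counts)
    n = len(peptide_counts)
    x = y = 0.0
    for i, count in enumerate(peptide_counts):
        d = count / total
        angle = 2.0 * math.pi * i / n  # anchors equally spaced around the circle
        x += d * math.cos(angle)
        y += d * math.sin(angle)
    return x, y


# Hypothetical protein: 11 distinct peptides, 440 residues, two peptide error rates.
print(coverage_index(11, 440))
print(protein_confidence([0.2, 0.3]) > 90.0)
print(anchor_position([5, 0, 0]))  # found in one isolation only: sits on that anchor
```

With three isolations (soluble, urea-solubilized and SDS-solubilized), a protein detected equally in all three would land at the center of the plot under this placement, and one detected in two would land between their anchors.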
The functional predictions from the MIPS database (see, e.g., Mewes (2002) Nucleic Acids Res. 30:31-34) and pathway designations from the KEGG database (see, e.g., Kanehisa (2003) Nucleic Acids Res. 30:42-46) were downloaded from the sources and merged into the header of the ORF predictions from SGD based on the ORF numbers. The membrane predictions from the Yeast Membrane Protein Library (Ward J. YMPL: Yeast Membrane Protein Library) were merged based on sequence identity. The resulting sequences were scanned against the peptide identifications to reconstruct the protein identification.
Peptide and protein identification. Soluble proteins, urea-solubilized proteins, and SDS-solubilized proteins were extracted from yeast cells at log-phase. The three samples were independently digested and analyzed using the 3D LC-MS/MS system. From a single analysis of the soluble protein sample, 50,248 final spectra were annotated and assigned to 10,580 unique peptides. Using a probability-based scoring system, 1,363 proteins were clearly identified. Similarly, a single analysis of the urea-solubilized protein sample yielded 75,201 final spectra, 15,579 unique peptides, and 2,050 identified proteins, while the SDS-solubilized protein sample generated 17,070 final spectra, 4,100 unique peptides, and 612 identified proteins. After merging the overlapping identifications and combining the data, 2,553 unique proteins were identified with a confidence level above 90%. Among them, 2,397 proteins were linked to a confidence level of 99% or above. On average, amino acid coverage was 25 percent across the whole protein sample, where the coverage spanned a range of 0.8 percent to 97 percent. This coverage equates to approximately 11 peptides for each identified protein. The increased peptide and protein identifications and higher protein sequence coverage compared with current 2D LC-MS/MS platforms show that this exemplary system and method of the invention (the "3D LC-MS/MS system") provide high peak capacity and separation resolution. To emulate the aforementioned whole cell lysate analysis, where no initial fractionation was performed, the MS/MS spectra were combined from all fractions. The pool yielded 2,789 protein identifications at 90% confidence level (where 2,435 were given 99%). On average, amino acid coverage was 25% of the whole protein sample, where the coverage spanned a range of 0.8% to 97%. This coverage amounted to approximately 11 peptides for each identified protein. 
From our yeast analysis, 15% of the final spectra were matched to peptide sequences derived from the 6,300-gene Saccharomyces Genome Database
(SGD). Similar percentages were observed using a standard protein mixture under the same acquisition conditions. Protein or peptide modifications, precursors at 4+ or above and low quality spectra account for most of the remainder. Multi-dimensional Separation. The exemplary separation technique of the invention employed three consecutive separation phases that can provide superior peak capacity and separation resolution. The functions of these three phases can be complementary. The first phase, or first dimension, termed RPC1 (for details see Experimental protocols above) was utilized for sample loading, desalting, and most importantly, fractionating with a reverse phase gradient. This first dimension can be in the same or a separate housing from the second and/or third dimension. The second phase, or second dimension, termed SCX, in one aspect, further separates the mixture (a "sub-fractionation" step). In one aspect, the SCX can contribute to the separation via salt gradient steps, e.g., increasing salt step gradients. The last phase, or third dimension, termed RPC2, was employed for dynamically binding the peptides from the SCX phase (or those directly off RPC1 that did not bind to the SCX) to perform the high resolution separation. This last dimension (designated RPC2) contributed to the high-resolution separation of each sub-fraction from the SCX section. RPC2 can be directly (or indirectly) coupled to the mass spectrometer and, in one aspect, functioned as the nano-ESI source. To help gain a closer look at the peptides eluting off the column, the total MS scan range was divided into three smaller ranges, thus utilizing the gas phase fractionation power of the instrument. Each of the smaller ranges was followed by four to six MS/MS data dependent scans. 
With these settings, the number and quality of the unique MS/MS spectra were significantly improved, helping to mitigate the ion suppression effect as well as the inherently limited ion capacity of the instrument. Among the three protein samples, the urea-solubilized protein sample consistently yielded the most protein identifications. Since the majority of the abundant proteins were relocated from the membrane pellet to the soluble protein sample, the urea-washed fraction retained both hydrophilic and some hydrophobic proteins. This may explain why the urea-solubilized protein sample had the highest complexity while reducing the dynamic range of the other two samples. Because of this characteristic, data acquired from the urea-solubilized protein sample was used as a benchmark for evaluation of the chromatography performance. Figure 5 graphically depicts the segment partitioning of 306,235 total MS/MS spectra that were obtained from the mass spectrometer. The total 306,235 MS/MS spectra were quite evenly distributed over the RP gradients up to 50% buffer B, whereas the number of spectra was reduced significantly in the higher RP gradient fractions. Poor spectra with fewer than 7 peaks/spectrum were filtered out by software before data analysis started; these amounted to less than 10% of the total, and most of them originated from the higher RP gradient fractions. The number of spectra was reduced again by merging the redundant spectra. The redundant spectra, which are likely due to the extreme complexity of the samples, were merged on the basis of absolute ion intensities and peak position. After the above exclusions, 178,665 spectra were ultimately sent for peptide identification, e.g., peptide search against the 6,300-gene Saccharomyces Genome Database ("SGD"), as from the Saccharomyces Genome Database (SacchDB) at Stanford University Genome Center. 
Spectra from the fractions obtained with RP gradients of less than 30 percent buffer B yielded 82 percent of the total peptide identifications and 96 percent of the unique protein identifications, whereas spectra from the higher RP gradient fractions provided the remaining unique protein identifications and additional peptides to boost the confidence of the protein identifications from the earlier RP gradient fractions. To shorten the time of 3D separation, the higher RP gradient fractionations can be eliminated. In one aspect, this is a strategy to maintain protein identifications with less analysis time. It was also observed that a large number of spectra were generated from the zero salt step within each RPC1 fraction, suggesting that a significant number of peptides did not bind to the SCX resin. Figure 5 graphically depicts the statistics of this exemplary mixed resin chromatography analysis and protein identifications of the yeast urea (8M) solubilized fraction. Each column represents one combination of reverse-phase and salt concentration. The reverse-phase steps are 8%, 15%, 30%, 50% and 100%, while salt steps (a to l) correspond to the concentration of ammonium acetate at 25, 50, 75, 100, 125, 150, 175, 200, 225, 250 and 2000 mM, respectively. The quality breakdown of the MS/MS spectra is shown (lower: poor, middle: duplicates and upper: final) in (A). The percentages of the spectra that were matched to peptides are given in (B). Lastly, the protein identifications for each RP1/SCX combination are plotted in (C). More spectra were generated from the beginning of the fraction and sub-fractions, indicating that more peptides elute during low RP gradients (where %B < 30%). Not surprisingly, at higher RP gradients, the number of poor spectra increased while the total number decreased. To give a numerical idea of the degree to which this occurred, 82% of the total peptide IDs and 96% of the total protein IDs were identified from RP gradients of less than 30% buffer B. 
Within each RP fraction, the number of spectra generated from the first salt step suggests that a significant number of peptides never bind to the SCX phase in the first place. To effectively take advantage of the capacity of each separation phase, a column with a large loading capacity was used upstream as the sample-trapping phase. This column was found to have a trapping capacity of 400 μg of protein. The SCX phase was much shorter in length as it has a much higher binding capacity. The second reverse-phase (RPC2) was also quite compact to preserve good peak resolution. To measure the efficiency of this exemplary separation method of the invention, the degree of redundancy was examined by forming a 5 x 5 lattice of separation steps, pooling together spectra from 5 RP gradients and 5 of their associated salt gradients, i.e., 5 sub-fractions from salt gradients b, c, d, e, and f (25 mM, 50 mM, 75 mM, 100 mM, 125 mM salt, respectively). Specifically, spectra from the 8, 15, 30, 50, and 100% buffer B RP gradient steps (i.e., 5 RP gradient fractions of 0-8, 8-15, 15-30, 30-50, and 50-100% buffer B) and the related salt steps b, c, d, e, and f were chosen for this measurement. Spectral clustering was performed using a precursor ion tolerance level of m/z = 2.5 with no constraints on retention time. Of the 127,840 raw MS/MS spectra, 7,303 were removed as poor and 70,920 were found to be unique. The remaining 49,617 were duplicates and were merged into 6,476 unique sets. The center of the lattice, RP30-salt step d, was studied very closely as it was felt to be representative of the events taking place. It was found to contain 741 sets of spectra. Upon examination of these sets, it was apparent that co-elution is much more likely to occur between salt steps than vertically across RP steps (RP gradient fractionations). 
The average number of times co-elution occurred across rows, or RP gradients, was 1.8, whereas the average column spread, or SCX gradients, was 3.3. Furthermore, RP distribution of spectra appeared to be rather sporadic, meaning that there were large disparities between scores. Only a small number of peptides, equal to less than 10 percent of the total spectra, eluted at multiple salt sub-fractions. These peptides mainly came from abundant proteins. Additionally, most peptides maintained chromatographic fidelity. This indicated the extensive and efficient separation by 3D LC. Column distribution was quite diffuse across all positions. This clearly indicates that RPC1 performance is quite good while the resolution of SCX was much worse, in line with observations from conventional 2D-LC methods. It was also noted that while most peptides maintained their chromatographic fidelity, a small number did not, many of which were determined to be abundant species. For instance, of the total raw spectra only 11,707 were found to have overlapping spectra, amounting to just 143 sets that had 9 or more cells occupied. This indicated that our multidimensional separation of peptides is indeed extensive and efficient. Spectral quality can be measured as the percentage of amino acids that can be successfully matched to a peptide sequence. Figure 5B (annotated spectra %) depicts how the peptide matching percentages were strongly related to the portion of the reverse phase gradient. It was found that the average length of peptides increased as the percentage of acetonitrile was increased. For instance, with the length calculated using a more conservative filter, smaller peptides were found to elute at a lower ACN%, which often produced ambiguous matches and thus lowered the success rate. Using search algorithms, 15 percent of the final spectra were matched to peptide sequences derived from SGD. 
Similar percentages were observed using a standard protein mixture under the same acquisition conditions. The other 85 percent may be explained by peptide modifications, precursors at charges of 4+ or above, and low-quality spectra. Protein Distribution. Figure 6 gives a three-dimensional view of proteins identified using the exemplary apparatus and methods of the invention and their location within the sample fractions. Each spot represents a distinct protein identification, while the area of the spot refers to the "coverage index" of the protein. The location of the spot describes where along the separation the protein was found: if it lies near an anchor, or line, it implies that the protein was found primarily in that sample. If it lies between two anchors, or in other words in the triangle formed by the two fractions, that protein was detected in both samples. Last, but not least, if the spot lies near the center of the map, at the axis of the three fractions, it suggests that there was no specific connection to any fraction. As can be seen, there were very few identifications that were exclusive to one sample, and it was also rare to see many proteins in the triangle made by the soluble protein and SDS-soluble protein samples. This last observation is faithful to the biology, in that such a partition of the proteins within a cell would be very unlikely. Figure 6A shows an overlay of the predicted (and also observed) membrane proteins (solid circles) over the total population (open circles). The spots are mainly grouped into two clusters where the majority lie near or on the solubilized sample anchor, while the other group is in the triangle between the urea and SDS-solubilized samples. Their location indicates a strong relationship to their presence in the cell. The location of the other cluster (pyruvate decarboxylase isozyme 1) suggests a relationship to the cellular presence in the cytosolic region, as has been reported in the literature.
Certain functional classes may also have preferred locations within the cell, as depicted by the overlays in Figures 6B, 6C, and 6D, illustrating the class of proteins belonging to "protein synthesis", "glycolysis" and "protein glycosylation", respectively. The proteins from protein synthesis are predominantly cytosolic, while those involved in protein glycosylation and glycolysis are mostly membrane bound. Functional categories and pathways. While detection of the proteins was found to be relatively uniform across their predicted functional categories, many of the genes with an "unknown" designation were also detected, thus laying the groundwork for a study of functional categories. More specifically, the coverage index of the detected enzymes was overlaid on the glycolysis pathway, as described in the KEGG database. The enzymes that are required for the major pathways were all detected comprehensively, whereas those along non-essential pathways were less abundant. Consequently, although the major pathways were all successfully detected, coverage was less than 100% for the pathway as a whole. Similar observations were made for other primary metabolic pathways, where the coverage percentages ranged from 21 to 97%. Figure 15 describes the metabolic pathways identified in the yeast proteome using this system (using an exemplary apparatus and method of the invention). Figure 16 illustrates proteins (highlighted in blue) from the glycolysis pathway identified using this system. Membrane proteins. It is well known that membrane proteins play a very important role in cellular functions. For example, they are crucial to activities such as signal transduction, vehicle trafficking, energy generation, transportation, and intercellular communication. They are related to numerous diseases and are of great diagnostic and therapeutic importance, and one third of sequenced genes code for membrane proteins.
All in all, membrane proteins are an important area of focus for proteomics, but until now they have been difficult to analyze because they need to be solubilized with detergents. The novel mixed resin bed multi-dimensional chromatography system of the invention (e.g., the 3D LC-MS/MS apparatus of the invention), because it is amenable to small amounts of detergent, has opened a new dimension for the analysis of these important proteins. Integral membrane proteins are tightly bound by hydrophobic forces and may only be solubilized using various detergents. Thus, 8M urea was used to extract numerous integral membrane proteins along with some peripheral membrane proteins. 1% SDS was used to solubilize the remaining membrane proteins. The SDS-solubilized membrane proteins were most likely tightly associated with lipids in the membrane; thus the proteins were digested under 0.1% SDS first, and then the detergent was removed to avoid precipitation of the membrane-bound molecules. With this method, the SDS sample generated 612 unique protein identifications of high-quality spectra. To the best of our knowledge, this is the first time that proteins solubilized with SDS were successfully identified using an LC-MS/MS technology. This result demonstrates the tolerance of the mixed bed multi-dimensional chromatography system of the invention to detergents and the potential it has for other applications utilizing SDS. This advantage may be attributed to the preferred binding that SDS has to the RPC1 column, allowing the SDS-free peptides to achieve good separation from the RPC2 column. Also, to effectively compare two different samples, it is good to use the same column, which has until now been problematic due to memory effects and contaminants. Because we replace RPC1 after each use, we can reuse the other two dimensions for successful comparison.
A library of 1,221 protein sequences with two or more predicted transmembrane domains (termed TMDs) was produced using the predictions from the Yeast Membrane Protein Library (YMPL) that were also found in SGD. Our experiments matched 406, or 33%, of these proteins that contained two or more of the predicted TMDs, as noted in Table 1, below, although the percentage of identified membrane proteins was slightly lower than the percentage (41%) of the total genes identified in our experiments. It is the highest percentage of membrane proteins reported thus far. Of the 406 proteins, 97% were found in urea and SDS-solubilized protein fractions, suggesting the actual membrane properties of these proteins. Among the proteins detected, 118 had been solubilized in 1% SDS, indicating the need for a method such as the mixed resin multi-dimensional chromatography of the invention (e.g., the 3D LC-MS/MS of the invention) that can tolerate the necessary detergents for the complete resolution of the membrane proteome.
Table 1. Proteins with two or more TMDs were detected in SDS-solubilized and urea-solubilized protein samples.

Number of TMD   Predicted   Detected   Detected/    Detected in SDS&urea-         SDS&urea/
in proteins     proteins    proteins   Predicted    solubilized protein samples   Total detected
2               437         129        30%          124                           96%
3               198         37         19%          36                            97%
4               124         43         35%          42                            98%
5               51          19         37%          19                            100%
6               53          24         45%          22                            92%
7               52          21         40%          21                            100%
8               32          16         50%          16                            100%
9               34          16         47%          15                            94%
10              42          21         50%          19                            90%
11              37          11         30%          11                            100%
12              108         43         40%          43                            100%
13              22          9          41%          9                             100%
14              13          7          54%          7                             100%
15              8           4          50%          4                             100%
16              5           3          60%          3                             100%
17              3           3          100%         3                             100%
18              1           0          0%           0                             0%
20              1           0          0%           0                             0%
Sum             1221        406        33%          394                           97%
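The two percentage columns of Table 1 can be re-derived from the counts. The following is a minimal sketch, not part of the original disclosure: the names `TABLE_1`, `percent`, and `summarize` are hypothetical, only three rows are copied from the table for brevity, and rounding to the nearest whole percent is an assumption that happens to reproduce the tabulated values for these rows.

```python
# Selected rows of Table 1, keyed by TMD count:
# (predicted proteins, detected proteins, detected in SDS&urea-solubilized samples)
TABLE_1 = {
    "2": (437, 129, 124),
    "12": (108, 43, 43),
    "Sum": (1221, 406, 394),
}


def percent(part, whole):
    """Return part/whole as a whole-number percentage (0 when whole is 0)."""
    if whole == 0:
        return 0
    return int(part * 100 / whole + 0.5)


def summarize(row):
    """Return (Detected/Predicted %, SDS&urea/Total detected %) for one row."""
    predicted, detected, in_sds_urea = row
    return percent(detected, predicted), percent(in_sds_urea, detected)


for label, row in TABLE_1.items():
    det_pct, sds_urea_pct = summarize(row)
    print(f"{label}: detected/predicted {det_pct}%, "
          f"SDS&urea/total detected {sds_urea_pct}%")
```

Returning 0 for an empty denominator mirrors the 18-TMD and 20-TMD rows, where no proteins were detected and the table reports 0%.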
Additionally, it is well known that transmembrane regions of integral membrane proteins present a great challenge to existing technologies, and yet are vital to understanding membrane proteins as a whole. These experiments using mixed resin multi-dimensional chromatography and exemplary methods of the invention identified 95 peptides from 53 proteins that covered part or all of the transmembrane domains of these membrane proteins. As depicted in the sequence set forth in SEQ ID NO:1, Figure 7, pyruvate decarboxylase contained two predicted transmembrane domains that were fully represented by identified peptides. Figure 7 shows the sequence coverage (87%) of pyruvate decarboxylase (Pdc1p) based on the peptide identifications from the "water-soluble" fraction. Each residue that was contained in one or more peptide identifications is highlighted in blue. The regions of predicted transmembrane domains are underlined. The N-terminus of the mature protein, after the cleavage of the initial Met residue, was reported as acetylated (see, e.g., Garrels (1997) Electrophoresis 18:1347-1360), and was therefore not expected to be detectable through our default search definition. The membrane protein-friendly mixed resin multi-dimensional chromatography and exemplary methods of the invention, as well as the improved peptide coverage, provided an advantage for detection, isolation and analysis of transmembrane peptides. Membrane proteins can also be classified by their localization in subcellular membranes, such as plasma, ER, Golgi, mitochondria, vacuoles, and integral membranes. 456 proteins located in these subcellular membranes were collected from the Comprehensive Yeast Genome Database (CYGD). 62%, or 284, of these proteins were detected by the mixed resin multi-dimensional chromatography and exemplary methods of the invention, as shown in Table 2.
Table 2. Proteins localized in subcellular membranes were detected in SDS-solubilized and urea-solubilized protein samples.

Subcellular membrane            Known      Detected   Detected/   Detected in SDS&urea-         SDS&urea/
location                        proteins   proteins   Known       solubilized protein samples   Total detected
ER membrane                     75         55         73%         55                            100%
Golgi membrane                  49         34         69%         31                            91%
Integral membrane               15         7          47%         7                             100%
Mitochondrial inner membrane    117        72         62%         70                            97%
Mitochondrial outer membrane    18         14         78%         14                            100%
Plasma membrane                 146        76         52%         76                            100%
Vacuolar membrane               36         26         72%         26                            100%
Sum                             456        284        62%         279                           98%

Among the detected proteins, 98% were in urea and SDS-solubilized protein fractions, confirming their actual membrane localization. Again, 110 of the 284 proteins identified were from the 1% SDS-solubilized protein sample, emphasizing the requirement of detergents for membrane protein analysis. Other applications involving SDS, such as protein-protein interactions, protein-lipid interactions, raft proteins, protein trafficking, and others could be subjected to MS analysis using our 3D LC-MS/MS technology, thus demonstrating that our technology greatly widens the application of mass spectrometry. Summary: The mixed resin multi-dimensional chromatography and methods of the invention were developed from a need for exploring the whole cell proteome with high resolution, high sensitivity, and high throughput via a fully automated process. At the protein level, cells are processed and separated non-chromatographically into three parts according to their biological localization. This straightforward approach simplifies the whole cell proteome with high efficiency and minimal effort. Once digested, the samples are loaded onto a micro-column and analyzed by a mixed resin multi-dimensional chromatography of the invention (3D LC-MS/MS). During the 3D separation, the additional C18 phase (RPC1) as the first separation step can have advantages as compared to current 2D separation methods. First, peptides bind much more efficiently to the C18 phase than to the SCX. These data show that a significant amount of peptide does not bind to the SCX at all (see discussion, above). Also, there is no need for the desalting step required by 2D technology, as RPC1 binds detergents such as SDS much more tightly than it binds peptides, making it possible to analyze detergent-treated samples. As noted in our results, this advantage has been shown to have a great effect on membrane protein studies.
Most importantly, RPC1 adds another dimension of separation and gives very high peak capacity and resolution, resulting in high sequence coverage and high confidence in protein identifications. Furthermore, the 3D LC method is a fully automated process. The novel apparatus and methods of the invention for three-dimensional separation of peptides are extensive and efficient. In one aspect, they are capable of identifying more than 2,500 unique proteins in one 3D LC-MS/MS analysis at one cell state (a single analysis includes three fractions), including many integral membrane proteins. Notably, the urea-solubilized fraction consistently yielded the highest number of protein identifications. The reasons might be that this sample has the highest complexity among the three and/or that the dynamic range has been significantly reduced after the relocation of abundant proteins into the solubilized fraction. In summary, the exemplary mixed resin multi-dimensional chromatography and methods of the invention demonstrated high separation power with minimal effort and sample loss as well as a certain tolerance to detergent. Combined with appropriate algorithms and software, the comprehensive proteomics platform of the invention allows large-scale proteome profiling with high sequence coverage and high confidence in both qualitative and semi-quantitative characterization. The high protein coverage identification should make quantification possible using chemical/isotope labeling without separation prior to 3D analysis. The extensive separation power of the mixed resin multi-dimensional chromatography and methods of the invention will also improve the detection of protein modifications as well as protein-protein interactions.
Example 4: Multi-dimensional chromatography to analyze a complete proteome
This example describes exemplary methods for using a chromatography system of the invention to analyze the complete Bacillus anthracis proteome (analysis of the yeast proteome was discussed in Example 3, above). An exemplary chromatography system and method of the invention as described in Example 3, and illustrated in Figure 8, was used for "global profiling" analysis of the complete Bacillus anthracis proteome. Five protein fractions were analyzed using a chromatography system and exemplary method of the invention: water soluble, membrane associated, integral membrane, spore-surface and secreted protein fractions. 1177 distinct proteins were identified among the fractions. Figure 9 illustrates an exemplary sample preparation protocol for both Bacillus anthracis spores and cells used in this "global profiling" proteome analysis. Figure 10 illustrates the result of salt extraction sub-fractions in a reverse phase sub-fraction, using the protocol described above in Example 3. A total of 929 proteins were identified in the water soluble fraction, 580 proteins were identified in the urea-solubilized fraction, 184 proteins were identified in the SDS-solubilized fraction, 203 proteins were identified in the spore-coat protein fraction and 179 proteins were identified in the secreted protein fraction. A total of 1177 proteins were identified using the apparatus and methods of the invention. Figure 11 illustrates the results of an analysis of a B. anthracis proteome using a chromatography system of the invention. Figure 12 summarizes a "matrix" of protein distribution from these samples. Figure 13 summarizes the discovered protein distribution by "role" category. In summary, using an exemplary chromatography system and method of the invention as described in Example 3, and illustrated in Figure 8, a comprehensive portrait of the B. anthracis proteome was obtained.
From these five fractions, over one thousand unique proteins with an average coverage of greater than ten peptides per protein sample were obtained. Amino acid coverage was 25% of the whole protein sample. The comprehensive sequence coverage and high confidence identification yielded detailed mapping of the whole B. anthracis proteome. In one aspect, the chromatography systems and methods of the invention are used to analyze microbial surface and secreted proteins, including B. anthracis spore surface proteins and secreted proteins, to provide targets for therapeutic intervention.
Example 5: Multi-dimensional chromatography to analyze a complete proteome
This example describes exemplary methods for using a chromatography system of the invention (e.g., Figure 3 illustrates an exemplary 3D LC apparatus and process of the invention, and Figures 4, 14, and 22 illustrate exemplary 3D LC apparatus of the invention) to analyze the complete Desulfovibrio vulgaris proteome (analysis of the yeast proteome was discussed in Example 3, and analysis of the B. anthracis proteome was discussed in Example 4, above). Oxygen is a common stress factor for anaerobes like Desulfovibrio vulgaris in the environment. To better understand the cellular responses to such oxidative stress, the proteome composition and expression level of D. vulgaris upon exposure to air was studied. Whole cell proteomes were extracted from cells before and after they were sparged with air or N2. In one aspect, the sparging involved agitating a liquid cell proteome preparation by means of compressed air or gas (e.g., air or N2). Protein compositions were assessed using the 3D LC-MS/MS technology of the invention (see, e.g., Figures 3, 4, 14, and 22). Proteins corresponding to over 50% of the predicted 3674 open reading frames (ORFs) were identified among the four samples. Protein abundances in each of the samples were calculated using the sequence coverage of each identified protein. Lists of potential candidates of either up-regulated or down-regulated proteins were obtained. The expression levels of some previously studied cytoplasmic oxidative stress protection proteins were also compared. The 3D proteomics technology provided tools for detailed mapping of the D. vulgaris proteome and differential profiling of protein expression in stress response.
The design of this oxidative stress experiment is schematically illustrated in Figure 24, showing that 1 ml of Desulfovibrio vulgaris freezer stock is placed in a starter culture (350 ml) and grown at about 30°C for 48 hours, after which about 60 ml (10% v/v) is used to inoculate a seed culture (600 ml), which is grown at 30°C to a density of about 10^9 cells/ml; further inoculations are made with 60 ml at 10% v/v and grown at 30°C to a density of about 10^9 cells/ml; aliquots are harvested, then either air or N2 sparged and incubated at 30°C, and aliquots are again harvested as 3D LC samples. Thus, four aliquots were harvested at the C0, C1, V0, and V1 times, as noted in Figure 24. A summary of data from the 3D LC analyses of these aliquots is shown in Figure 26, discussed below. The harvested aliquots (fractions) were analyzed by an online 3D LC-MS/MS apparatus of the invention as illustrated, e.g., in Figures 4 and 14. About 0.5 to 1 million MS/MS spectra per injection were collected. As discussed above, the 3D LC-MS/MS apparatus of the invention allows for minimal sample handling and SDS detergent tolerance. Sample preparation comprised the steps of culturing (as discussed, above), solubilization in 1% RAPIGEST™ (Waters Corporation, Milford, MA), centrifugation, harvesting a whole cell lysate, protease digestion (e.g., of about 350 ug protein), fractionating hydrophilic peptides from hydrophobic peptides, and subjecting each fraction to 3D LC-MS/MS analysis (i.e., 3D LC-MS/MS a and 3D LC-MS/MS b, respectively), as summarized in Figure 25. Figure 26 graphically illustrates data representing the number of protein identifications from these 3D LC-MS/MS analyses of the C0, C1, V0, and V1 samples. The C0, C1 (N2 sparged), V0, and V1 (air sparged) time samples represent the four aliquots harvested as noted in Figure 24. For C0, the numbers of identifications (IDs) from the first and second runs were 1343 and 1060, respectively. For C1 (N2 sparged), they were 1495 and 1226, respectively. For V0, they were 1338 and 1238, respectively. For V1 (air sparged), they were 1350 and 1140, respectively. 3674 ORFs were predicted, with 1766 total unique proteins identified. Many proteins exhibited significant changes (e.g., changes in protein levels, including upregulation or downregulation) that coincided with oxidative stress. Data representing significant differences (changes in protein levels) between non-stressed and stressed cell samples are summarized in Figure 27, and Table 3, below.
The number of proteins as a function of significant differences in proteins (non-oxidized versus oxidized) can be summarized as 16 (lower cutoff) and 1 (higher cutoff) for C0 versus V0, 31 (lower cutoff) and 4 (higher cutoff) for C0 versus C1, 81 (lower cutoff) and 45 (higher cutoff) for C1 versus V1, and 100 (lower cutoff) and 54 (higher cutoff) for V0 versus V1. The origins of the differences are cell growth (small) for C0 vs V0, cell growth (large) for C0 vs C1, oxygen stress (large) + cell growth (small) for C1 vs V1, and oxygen stress (large) + cell growth (large) for V0 vs V1. With the "low cutoff", the confidence (S) of the protein identification is calculated based on the expected error rate of the individual peptide identifications (s). Typically the threshold is set at 90%, i.e., the protein identification is made when its calculated confidence is 90% or higher. For a given protein, S = 1 - (s1*s2*...*sn), where n distinct peptides were observed. The expected error rate of the peptide was established based on the behavior of a standard protein mixture that was subjected to the same experimental and computational procedures as the sample of interest here. More conservatively, additional constraints can be added so that at least one peptide with a low expected error rate (<= 0.9%) must be present and the protein confidence must meet or exceed 99%. This is referred to as the "high cutoff".
Table 3
Figure imgf000210_0001
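The protein-level confidence described above, S = 1 - (s1*s2*...*sn), together with the two cutoffs, can be sketched as follows. This is an illustrative sketch only and not the software used in the study: the function and constant names are hypothetical, and "<= 0.9%" is read literally as an expected peptide error rate of at most 0.009.

```python
from math import prod

LOW_CUTOFF = 0.90           # protein confidence threshold for the "low cutoff"
HIGH_CUTOFF = 0.99          # protein confidence threshold for the "high cutoff"
MAX_PEPTIDE_ERROR = 0.009   # assumption: "<= 0.9%" read as an error rate <= 0.009


def protein_confidence(error_rates):
    """S = 1 - (s1 * s2 * ... * sn) over the n distinct peptides observed."""
    return 1.0 - prod(error_rates)


def passes_low_cutoff(error_rates):
    """Protein identification is accepted when S is 90% or higher."""
    return protein_confidence(error_rates) >= LOW_CUTOFF


def passes_high_cutoff(error_rates):
    """Require one low-error peptide and a protein confidence of at least 99%."""
    has_reliable_peptide = any(s <= MAX_PEPTIDE_ERROR for s in error_rates)
    return has_reliable_peptide and protein_confidence(error_rates) >= HIGH_CUTOFF


# Two mediocre peptides: S = 1 - 0.2 * 0.3, which clears the low cutoff only.
print(protein_confidence([0.2, 0.3]), passes_low_cutoff([0.2, 0.3]))
```

Because the error rates are multiplied, each additional distinct peptide can only raise S, which is why several moderately reliable peptides can clear the low cutoff while still failing the high cutoff.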
In this study, the 3D LC-MS/MS analysis detected a down-regulation in superoxide reductase ("Sor") after oxidative stress of Desulfovibrio vulgaris cells, as illustrated in Figure 28. In particular, Sor was detected by the 3D LC-MS/MS of the invention in samples C1/V1 and V0/V1 (note: C1 is N2 sparged, V1 is air sparged). In Figure 28, C1 levels were highest (over 400), followed by C0 (over 350), with V1 being the lowest (at about 150). C1, C0, V0 and V1 are defined as described above, see Figure 24. These results are similar to those found by Fournier et al. (2003) J. Bacteriol. pg 71-79, where 2D IEF gels showed Sor to be down-regulated. Some of the proteins found to be upregulated (C1/V1) by oxidative stress (note: C1 is N2 sparged, V1 is air sparged) were detected and quantified by the 3D LC-MS/MS system of the invention. Some of the proteins found to be downregulated (C1/V1) by oxidative stress (note: C1 is N2 sparged, V1 is air sparged) were detected and quantified by the 3D LC-MS/MS system of the invention. Proteins found to be upregulated by oxidative stress, detected and quantified by the 3D LC-MS/MS system of the invention, include those of the polyglucose utilization pathway. In fact, these studies found that there was a concerted down-regulation of proteins along the polyglucose utilization pathway, as summarized and schematically illustrated in Figure 29. The down-regulated proteins of the polyglucose utilization pathway (see, e.g., Fareleira, et al. (1997) J. Bacteriol. 179:3972-3980) were L-lactate dehydrogenase, pyruvate ferredoxin oxidoreductase, phosphate acetyltransferase, acetate kinase and pyruvate carboxylase. In summary, using the 3D LC-MS/MS system of the invention, about 50% of the Desulfovibrio vulgaris proteome can be assessed for expression changes, and more than 50 proteins could be detected displaying changes upon oxidative stress.
Many of the changes in protein levels (including upregulation or downregulation) are observed along the same pathway, e.g., the polyglucose utilization pathway.
Example 6: Multi-dimensional chromatography systems with LCQ spectrometry
This example describes exemplary chromatography systems comprising tandem mass spectrometers and ion trap mass spectrometers (3D LC LCQ MS/MS) and methods of using them for, e.g., analysis of a proteome of a cell. Figure 30 summarizes the results of proteome analysis from different organisms (P. fumarii, D. vulgaris, Y. pestis, B. anthracis, S. cerevisiae, Streptomyces diversa) using an exemplary 3D LC LCQ MS/MS system of the invention. For example, for P. fumarii, ORF (open reading frame) predictions are at about 2000, protein identifications (IDs) from one sample are just under 1000, and total IDs from different biological conditions are at just above 1000. For these organisms, the 3D LC LCQ MS/MS system of the invention achieved from 24% to 47% gene coverage. Figure 31 summarizes the results of proteome analysis comparing two exemplary 3D LC MS/MS systems of the invention: 3D LC LCQ MS/MS versus 3D LC LTQ MS/MS (Finnigan MDLC LTQ™ or LTQ FT™, Thermo Electron Corporation, San Jose, CA). Figure 31 summarizes protein identifications for: (a) 2 times (2X) the separation length for 3D LC LCQ MS/MS runs of 2 days and 4 days (resulting in 960 and 1122 protein IDs, respectively); (b) 2X separation length vs using "faster scanning MS" comparing 3D LC LCQ MS/MS and 3D LC LTQ MS/MS, where the 3D LC LCQ MS/MS has a run of 4 days and the 3D LC LTQ MS/MS has a run of 2 days (resulting in 1122 and 1724 protein IDs, respectively); and (c) "faster scanning MS" comparing 3D LC LCQ MS/MS and 3D LC LTQ MS/MS, where the 3D LC LCQ MS/MS has a run of 2 days and the 3D LC LTQ MS/MS has a run of 2 days (resulting in 960 and 1724 protein IDs, respectively). For (a) the total protein ID is 1257 and the overlap is 825, or 86%; for (b) the total protein ID is 1834 and the overlap is 1021, or 91%; for (c) the total protein ID is 1772 and the overlap is 921, or 96%.
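The overlap percentages quoted for comparisons (a) to (c) are consistent with dividing the number of shared protein IDs by the ID count of the smaller of the two runs. The sketch below works under that assumption, which the text does not state explicitly; the names `overlap_percent` and `COMPARISONS` are hypothetical.

```python
# (IDs in first run, IDs in second run, shared IDs) for comparisons (a)-(c)
COMPARISONS = {
    "a": (960, 1122, 825),
    "b": (1122, 1724, 1021),
    "c": (960, 1724, 921),
}


def overlap_percent(ids_first, ids_second, shared):
    """Shared IDs as a whole-number percentage of the smaller run's ID count."""
    smaller = min(ids_first, ids_second)
    return int(shared * 100 / smaller + 0.5)


for label, (ids_first, ids_second, shared) in COMPARISONS.items():
    pct = overlap_percent(ids_first, ids_second, shared)
    print(f"({label}) overlap {shared} of {min(ids_first, ids_second)}: {pct}%")
```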
Example 7: Multi-dimensional chromatography to analyze a complete proteome
This example describes exemplary methods for using a chromatography system of the invention (e.g., Figure 3 illustrates an exemplary 3D LC apparatus and process of the invention, and Figures 4, 14, and 22 illustrate exemplary 3D LC apparatus of the invention) to analyze the Human Embryonic Kidney HEK293 (see, e.g., Shaw (2002) FASEB J. 16:869-871) cell line proteome. HEK293 cells were cultured and protein extracted with a one-step RAPIGEST™ (Waters Corporation, Milford, MA) protocol, followed by tryptic digestion of two peptide samples - the hydrophobic and the hydrophilic parts (samples) (500 ug total). This protocol is summarized in Figure 32, which also summarizes the results of a peptide analysis with a four day run using an LTQ (3D LC LTQ MS/MS) apparatus of the invention (ID of 6478 proteins, 32545 peptides) and a two day run using an LCQ (3D LC LCQ MS/MS) apparatus of the invention (ID of 1177 proteins, 3297 peptides), for a combined total of 6931 proteins, at a greater than 90% confidence, for about 18% gene coverage. For a greater than 99% confidence in IDs, the combined data found 3671 proteins and 23281 peptides.
A number of embodiments of the invention have been described.
Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.

Claims

WHAT IS CLAIMED IS: 1. A chromatography system comprising a first reverse phase column (RPC), an ion exchange column, a second reverse phase column (RPC), wherein the first reverse phase column (RPC), the ion exchange column and the second reverse phase column (RPC) are connected in series; the first reverse phase column (RPC) has a free distal end and a proximal end connected to the ion exchange column, or, either the distal end or the proximal end are connected to the ion exchange column such that a sample can be loaded into and eluted out of first reverse phase column (RPC) to the ion exchange column from the same end; and, the second reverse phase column (RPC) has a free distal end and a proximal end connected to the ion exchange column, and the first reverse phase column (RPC) has a greater capacity than the second reverse phase column (RPC).
2. The chromatography system of claim 1, wherein the second reverse phase column (RPC), or the first reverse phase column (RPC), or both, are connected to an analytical device such that an eluate can be fed into the analytical device.
3. The method of claim 2, wherein the analytical device comprises a mass spectrometer.
4. The chromatography system of claim 3, wherein the mass spectrometer further comprises a nano-spray apparatus.
5. The chromatography system of claim 3, wherein the mass spectrometer comprises a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof.
6. The chromatography system of claim 1, wherein the ion exchange column and the second reverse phase column (RPC) are enclosed in one housing and the first reverse phase column (RPC) is enclosed in a second housing, or, the first reverse phase column (RPC), the ion exchange column and the first reverse phase column (RPC) are all enclosed in separate housings.
7. The chromatography system of claim 6, wherein a valve connects one or all of the housings.
8. The chromatography system of claim 7, wherein the valve comprises a low volume flow valve and/or an inline microfilter assembly.
9. The chromatography system of claim 6, wherein a valve connects the first housing and the second housing, or the first reverse phase column (RPC) and the ion exchange column.
10. The chromatography system of claim 1, wherein a valve connects the first reverse phase column (RPC) to the ion exchange column and the second reverse phase column (RPC).
11. The chromatography system of claim 10, wherein a low volume flow valve and/or an inline microfilter assembly connects the first reverse phase column (RPC) to the ion exchange column and the second reverse phase column (RPC).
12. The chromatography system of claim 1, wherein the first reverse phase column (RPC), the ion exchange column and the second reverse phase column (RPC) are enclosed in one housing.
13. The chromatography system of claim 1, wherein the first, second or both reverse phase columns are packed with a reverse phase resin or equivalent.
14. The chromatography system of claim 13, wherein the first, second or both reverse phase resins comprise a C18 reverse phase resin or equivalent.
15. The chromatography system of claim 1, wherein the ion exchange column comprises a cation exchange (CX) column or an anion exchange column.
16. The chromatography system of claim 15, wherein the cation exchange (CX) column comprises a strong cation exchange (SCX) resin or equivalent.
17. The chromatography system of claim 16, wherein the strong cation exchange (SCX) resin comprises a polysulfoethyl A strong cation exchange resin.
18. The chromatography system of claim 1, wherein the first reverse phase column (RPC), the second first reverse phase column (RPC), or both are connected to an HPLC on a distal end.
19. The chromatography system of claim 1, wherein the first reverse phase column (RPC) has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more, greater capacity than the second reverse phase column (RPC), or, wherein the first reverse phase column (RPC) has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more of the same or equivalent resin than the second reverse phase column (RPC).
20. The chromatography system of claim 1, further comprising a computer system operatively linked to the chromatography system, thereby making the chromatography system an automated operation.
21. The chromatography system of claim 3, further comprising a computer system operatively linked to the mass spectrometer for quantifying the amount of each peptide by use of data from the mass spectrometer.
22. The chromatography system of claim 3, further comprising a computer system operatively linked to the mass spectrometer for generating the sequence of each peptide by use of data from the mass spectrometer.
23. The chromatography system of claim 1, further comprising on-line sample collection apparatus.
24. The chromatography system of claim 1, wherein the operation is fully automated.
25. A mixed bed multi-dimensional liquid chromatograph comprising a first resin bed, a second resin bed and a third resin bed connected in series, wherein the first resin bed comprises a reverse phase resin, the second resin bed comprises an ion exchange resin bed and the third resin bed comprises a reverse phase resin; and the reverse phase resin of the first bed has a free distal end and a proximal end connected to the ion exchange bed, or, either the distal end or the proximal end are connected to the ion exchange column such that a sample can be loaded into and eluted out of the first reverse phase column (RPC) to the ion exchange column from the same end; and, the reverse phase resin of the third bed has a free distal end and a proximal end connected to the ion exchange bed.
26. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein the reverse phase resin of the first bed has a greater capacity than the reverse phase resin of the third bed, or, the reverse phase resin of the third bed has a greater capacity than the reverse phase resin of the first bed.
27. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein the loading capacity is proportional to column dimension and roughly 100 µg digest per 10 cm × 180 µm C18 column, up to milligrams of sample.
28. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein the reverse phase resin of the first bed, the reverse phase resin of the third bed, or both, are connected to an analytical device such that an eluate can be fed into the analytical device.
29. The mixed bed multi-dimensional liquid chromatograph of claim 28, wherein the analytical device comprises a mass spectrometer.
30. The mixed bed multi-dimensional liquid chromatograph of claim 29, wherein the mass spectrometer further comprises a nano-spray apparatus.
31. The mixed bed multi-dimensional liquid chromatograph of claim 29, wherein the mass spectrometer comprises a tandem mass spectrometer or an ion trap mass spectrometer or a combination thereof.
32. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein each resin bed is enclosed in a separate housing.
33. The mixed bed multi-dimensional liquid chromatograph of claim 32, wherein each resin bed or housing is independently detachable and replaceable.
34. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein the second resin bed and a third resin bed are enclosed in one housing and the first resin bed is enclosed in a second housing.
35. The mixed bed multi-dimensional liquid chromatograph of claim 32, wherein a flow valve connects each or all of the housings.
36. The mixed bed multi-dimensional liquid chromatograph of claim 35, wherein the flow valve comprises a low volume flow valve, a directional flow valve and/or an inline microfilter assembly.
37. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein a low volume flow valve and/or an inline microfilter assembly connects the first bed to the second and third resin beds.
38. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein the first reverse phase resin bed, the ion exchange resin bed and the second reverse phase resin bed are enclosed in one housing.
39. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein the reverse phase resin of the first bed, the reverse phase resin of the third bed or both reverse phase resin beds are packed with a Cx reverse phase resin or equivalent, wherein X is an integer between five and thirty.
40. The mixed bed multi-dimensional liquid chromatograph of claim 39, wherein the Cx reverse phase resin or equivalent comprises a C18 reverse phase resin or equivalent.
41. The mixed bed multi-dimensional liquid chromatograph of claim 20, wherein the ion exchange bed comprises a cation exchange bed.
42. The mixed bed multi-dimensional liquid chromatograph of claim 20, wherein the ion exchange bed comprises an anion exchange bed.
43. The mixed bed multi-dimensional liquid chromatograph of claim 41, wherein the cation exchange bed is packed with a strong cation exchange (SCX) resin or equivalent.
44. The mixed bed multi-dimensional liquid chromatograph of claim 43, wherein the strong cation exchange resin (SCX) comprises a polysulfoethyl A strong cation exchange resin.
45. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein the reverse phase resin of the first bed, or the reverse phase resin of the third bed, or both, are connected to an HPLC.
46. The mixed bed multi-dimensional liquid chromatograph of claim 25, wherein the first reverse phase resin bed has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more, greater capacity than the second reverse phase resin bed, or, the first reverse phase resin bed has about 10%, 20%, 30%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190%, 200%, 210%, 220%, 230%, 240%, 250%, 260%, 270%, 280%, 290%, 300%, 325%, 350%, 375%, 400%, 425%, 450%, 475%, 500%, 525%, 550%, 575%, 600%, 625%, 650%, 675%, 700%, 725%, 750%, 775%, 800%, 825%, 850%, 875%, 900%, 925%, 950%, 975%, 1000%, or more of the same or equivalent resin than the second reverse phase resin bed.
47. The mixed bed multi-dimensional liquid chromatograph of claim 25, further comprising a computer system operatively linked to the chromatography system, thereby making the chromatography system an automated operation.
48. The mixed bed multi-dimensional liquid chromatograph of claim 25, further comprising a computer system.
49. The mixed bed multi-dimensional liquid chromatograph of claim 48, wherein the computer is operatively linked to an analytical device or a sample collection device.
50. The mixed bed multi-dimensional liquid chromatograph of claim 49, wherein the analytical device comprises a mass spectrometer.
51. The mixed bed multi-dimensional liquid chromatograph of claim 50, wherein the mass spectrometer is used for quantifying the amount of each peptide in an eluate.
52. The mixed bed multi-dimensional liquid chromatograph of claim 23, further comprising a computer system operatively linked to the mass spectrometer for generating the sequence of each peptide by use of data from the mass spectrometer.
53. A method for separating proteins comprising the following steps: (a) providing a sample comprising a polypeptide; (b) fragmenting the polypeptide into peptide fragments; and (c) separating the peptides by chromatography to generate an eluate using a chromatography system as set forth in claim 1 or a mixed bed multidimensional liquid chromatograph of claim 25, wherein the peptide fragments are loaded into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system.
54. The method of claim 53, wherein the peptide fragments are eluted through the distal end of the reverse phase resin of the first bed and/or the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph.
55. The method of claim 53, wherein some or all of the peptide fragments are eluted through the same end of the first RP column of the chromatography system or the same end of the reverse phase resin of the first bed of the mixed bed multidimensional liquid chromatograph from which they were loaded.
56. The method of claim 53, wherein the peptide fragments are generated by enzymatic digestion or by non-enzymatic fragmentation.
57. The method of claim 56, wherein the enzymatic digestion is by trypsin, endoproteinase or a combination thereof.
58. The method of claim 53, wherein the peptide fragments are loaded into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system without desalting.
59. The method of claim 53, wherein the peptide fragments are loaded into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system without removing detergent in a sample.
60. The method of claim 53, wherein the peptide fragments are solubilized in a detergent or a denaturing agent before loading into the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system.
61. The method of claim 60, wherein the detergent or denaturing agent is SDS or urea.
62. The method of claim 53, wherein the peptide fragments are loaded into reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph or the first reverse phase column (RPC) of the chromatography system using a pressure bomb.
63. The method of claim 53, further comprising feeding the eluate into a mass spectrometer and quantifying the amount of each peptide.
64. The method of claim 53, further comprising feeding the eluate into a mass spectrometer and generating the sequence of each peptide by use of the mass spectrometer.
65. The method of claim 64, further comprising inputting the sequence into a computer program product to compare the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which a sequenced peptide originated.
66. The method of claim 53, wherein the separating of step (c) comprises (i) loading a labeled peptide mixture into the first reverse phase column (RPC) of the chromatography system or the reverse phase resin of the first bed of the mixed bed multi-dimensional liquid chromatograph, wherein the first RPC or first reverse phase resin bed absorbs a plurality of peptides; (ii) eluting a fraction of the first RPC-absorbed or first resin bed-absorbed plurality of peptides to the ion exchange column of the chromatography system or the ion exchange resin bed of the mixed bed multi-dimensional liquid chromatograph, using a reverse phase gradient; (iii) eluting a fraction of the ion exchange column-absorbed or ion exchange resin bed-absorbed plurality of peptides onto the second reverse phase column (RPC) of the chromatography system or the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph using a salt gradient; and (iv) eluting a fraction of the second RPC-absorbed or second reverse phase resin bed-absorbed plurality of peptides.
67. The method of claim 66, wherein the plurality of peptides eluted in step (iv) is eluted through the distal end of the second reverse phase column (RPC) of the chromatography system or the distal end of the reverse phase resin of the third bed of the mixed bed multi-dimensional liquid chromatograph.
68. The method of claim 66, wherein in step (iv) the fraction of the second RPC-absorbed or third resin bed-absorbed plurality of peptides are eluted using the same reverse phase gradient used to elute the first RPC-absorbed or first resin bed-absorbed fraction of peptides in step (ii).
69. The method of claim 66, further comprising: after step (iii) is completed and before the step (iv) eluting a fraction of the second RPC-absorbed or second reverse phase resin bed-absorbed plurality of peptides is begun, washing the column free of the salts and buffers used to elute a fraction of the ion exchange column-absorbed or ion exchange resin bed-absorbed plurality of peptides.
70. The method of claim 66, wherein a discrete fraction of the first RPC-absorbed or first resin bed-absorbed plurality of peptides is eluted to the ion exchange column of the chromatography system or the ion exchange resin bed of the mixed bed multi-dimensional liquid chromatograph using a reverse phase gradient.
71. The method of claim 70, wherein the reverse phase gradient comprises (Xn-Xn+1%B) over 120 minutes with a flow rate of 250 nl/min, and B comprises a buffer B comprising 80% ACN/0.1% formic acid, or equivalent, and n is an integer, n=0, 1, 2, 3, etc.
72. The method of claim 66, wherein the salt gradient comprises a series of salt elution steps.
73. The method of claim 72, wherein upon the completion of a series of salt elution steps, the entire elution sequence is repeated, employing a higher reverse phase gradient comprising Xn+1-Xn+2%, Xn+2-Xn+3%, n=0, 1, 2, 3, etc.
74. The method of claim 66, wherein the separation comprises 5 reverse phase cycles comprising X0%=0%B, X1%=8%B, X2%=15%B, X3%=30%B, X4%=50%B, and X5%=100%B, each one followed by a salt gradient step.
75. The method of claim 72, wherein the salt gradient steps comprise 12 salt gradient steps comprising 25 mM, 50 mM, 75 mM, 100 mM, 125 mM, 150 mM, 175 mM, 200 mM, 225 mM, 250 mM, and 2 M ammonium acetate, or equivalent.
76. The method of claim 53, further comprising labeling the peptide fragments before loading them into the chromatography system or the mixed bed multi-dimensional liquid chromatograph.
77. The method of claim 53, wherein the sample is derived from a cell, a seed or a spore.
78. The method of claim 77, wherein the cell is a prokaryotic cell or a eukaryotic cell.
79. The method of claim 77, wherein the cell, seed or spore is derived from a bacteria, a yeast, an insect, a plant, a fungus, a protozoa or a mammal.
80. The method of claim 79, wherein the mammalian cell is a human cell or a mouse cell.
81. The method of claim 79, wherein the bacterial cell or spore is a Bacillus anthracis.
82. A method for separating proteins by differential labeling of peptides, the method comprising the following steps: (a) providing at least two samples comprising a polypeptide; (b) providing at least one pair of labeling reagents, wherein each member of the pair differs in molecular mass and the difference in molecular mass is distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptides into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), wherein each sample is labeled with a different labeling reagent, thereby differentially labeling the peptides; and (e) separating the labeled peptides by chromatography to generate an eluate using a chromatography system as set forth in claim 1 or a mixed bed multidimensional liquid chromatograph as set forth in claim 25.
83. The method of claim 82, further comprising a step (f) comprising feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer.
84. The method of claim 82, further comprising providing two or more samples from different sources.
85. The method of claim 84, wherein one sample is derived from a wild type cell and one sample is derived from an abnormal or a modified cell.
86. The method of claim 85, wherein the abnormal cell is a cancer cell.
87. The method of claim 82, wherein the peptide fragments are labeled with a reagent comprising a general formula selected from the group consisting of: ZAOH for labeling at least a first sample and ZBOH for labeling at least a second sample, to esterify peptide C-terminals and/or Glu and Asp side chains; ZANH2 for labeling at least a first sample and ZBNH2 for labeling at least a second sample, to form amide bond with peptide C-terminals and/or Glu and Asp side chains; and ZACO2H for labeling at least a first sample and ZBCO2H for labeling at least a second sample to form amide bond with peptide N-terminals and/or Lys and Arg side chains; wherein ZA and ZB independently of one another comprise the general formula R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-, Z1, Z2, Z3, and Z4, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR')O)n, SnRR1, Sn(RR')O, BR(OR'), BRR1, B(OR)(OR'), OBR(OR'), OBRR1, and OB(OR)(OR'), and R and R1 is an alkyl group, A1, A2, A3, and A4 independently of one another, are selected from the group consisting of nothing or (CRR')n, wherein R, R1, independently from other R and R1 in Z1 to Z4 and independently from other R and R1 in A1 to A4, are selected from the group consisting of a hydrogen atom, a halogen atom and an alkyl group; n in Z1 to Z4, independent of n in A1 to A4, is an integer having a value selected from the group consisting of 0 to about 51; 0 to about 41; 0 to about 31; 0 to about 21, 0 to about 11 and 0 to about 6.
88. The method of claim 87, wherein the alkyl group is selected from the group consisting of an alkenyl, an alkynyl and an aryl group.
89. The method of claim 87, wherein one or more C-C bonds from (CRR1)n are replaced with a double or a triple bond.
90. The method of claim 87, wherein an R and/or an R1 group are absent.
91. The method of claim 87, wherein (CRR1)n is selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein each group has none or up to 6 substituents.
92. The method of claim 87, wherein (CRR1)n is selected from the group consisting of a carbocyclic, a bicyclic and a tricyclic fragment, wherein the fragment has up to 8 atoms in the cycle with or without a heteroatom selected from the group consisting of an O atom, a N atom and an S atom.
93. The method of claim 82, wherein two or more labeling reagents have the same structure but a different isotope composition.
94. The method of claim 87, wherein ZA has the same structure as ZB, but ZA has a different isotope composition than ZB.
95. The method of claim 93 or claim 94, wherein the isotope is boron-10 and boron-11.
96. The method of claim 93 or claim 94, wherein the isotope is carbon-12 and carbon-13.
97. The method of claim 93 or claim 94, wherein the isotope is nitrogen-14 and nitrogen-15.
98. The method of claim 93 or claim 94, wherein the isotope is sulfur-32 and sulfur-34.
99. The method of claim 93 or claim 94, wherein, where the isotope with the lower mass is x and the isotope with the higher mass is y, and x and y are integers, x is greater than y.
100. The method of claim 99, wherein x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51.
101. The method of claim 82, wherein the labeling reagent of step (b) comprises the general formula selected from the group consisting of: i. CD3(CD2)nOH for labeling at least a first sample and CH3(CH2)nOH for labeling at least a second sample, to esterify peptide C-terminals, where n = 0, 1, 2 or y; ii. CD3(CD2)nNH2 for labeling at least a first sample and CH3(CH2)nNH2, to form amide bond with peptide C-terminals for labeling at least a second sample, where n = 0, 1, 2 or y; and iii. D(CD2)nCO2H for labeling at least a first sample and H(CH2)nCO2H for labeling at least a second sample, to form amide bond with peptide N-terminals, where n = 0, 1, 2 or y; wherein D is a deuteron atom, and y is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51.
102. The method of claim 82, wherein the labeling reagent of step (b) comprises the general formula selected from the group consisting of: i. ZAOH for labeling at least a first sample and ZBOH for labeling at least a second sample to esterify peptide C-terminals; ii. ZANH2 for labeling at least a first sample and ZBNH2 for labeling at least a second sample to form an amide bond with peptide C-terminals; and iii. ZACO2H for labeling at least a first sample and ZBCO2H for labeling at least a second sample to form an amide bond with peptide N-terminals; wherein ZA and ZB have the general formula R-Z1-A1-Z2-A2-Z3-A3-Z4-A4-, Z1, Z2, Z3, and Z4, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR1, S, SC(O), SC(S), SS, S(O), S(O2), NR, NRR1+, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR1, (Si(RR')O)n, SnRR1, Sn(RR')O, BR(OR'), BRR1, B(OR)(OR'), OBR(OR'), OBRR1, and OB(OR)(OR'); A1, A2, A3, and A4, independently of one another, are selected from the group consisting of nothing and the general formulae (CRR1)n, and, R and R1 is an alkyl group.
103. The method of claim 102, wherein a single C-C bond in a (CRR1)n group is replaced with a double or a triple bond.
104. The method of claim 102, wherein R and R1 are absent.
105. The method of claim 102, wherein (CRR1)n comprises a moiety selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein the group has none or up to 6 substituents.
106. The method of claim 105, wherein the (CRR1)n group comprises a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom.
107. The method of claim 102, wherein R, R1, independently from other R and R1 in Z1 - Z4 and independently from other R and R1 in A1 - A4, are selected from the group consisting of a hydrogen atom, a halogen and an alkyl group.
108. The method of claim 107, wherein the alkyl group is selected from the group consisting of an alkenyl, an alkynyl and an aryl group.
109. The method of claim 102, wherein n in Z1 - Z4 is independent of n in A1 - A4 and is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11 and about 6.
110. The method of claim 102, wherein ZA has the same structure as ZB but ZA further comprises x number of -CH2- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer.
111. The method of claim 102, wherein ZA has the same structure as ZB but ZA further comprises x number of -CF2- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer.
112. The method of claim 102, wherein ZA comprises x number of protons and ZB comprises y number of halogens in the place of protons, wherein x and y are integers.
113. The method of claim 102, wherein ZA contains x number of protons and ZB contains y number of halogens, and there are x - y number of protons remaining in one or more A1 - A4 fragments, wherein x and y are integers.
114. The method of claim 102, wherein ZA further comprises x number of -O- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer.
115. The method of claim 86, wherein ZA further comprises x number of -S- fragment(s) in one or more A1 - A4 fragments, wherein x is an integer.
116. The method of claim 102, wherein ZA further comprises x number of -O- fragment(s) and ZB further comprises y number of -S- fragment(s) in the place of -O- fragment(s), wherein x and y are integers.
117. The method of claim 102, wherein ZA further comprises x - y number of -O- fragment(s) in one or more A1 - A4 fragments, wherein x and y are integers.
118. The method of claim 113, claim 116 or claim 117, wherein x and y are integers independently selected from the group consisting of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21, between 1 and about 11 and between 1 and about 6, wherein x is greater than y.
119. The method of claim 82, wherein the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: i. CH3(CH2)nOH for labeling at least a first sample and CH3(CH2)n+mOH for labeling at least a second sample, to esterify peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; ii. CH3(CH2)nNH2 for labeling at least a first sample and CH3(CH2)n+mNH2 for labeling at least a second sample, to form amide bond with peptide C-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; and, iii. H(CH2)nCO2H for labeling at least a first sample and H(CH2)n+mCO2H for labeling at least a second sample, to form amide bond with peptide N-terminals, where n = 0, 1, 2, ..., y; m = 1, 2, ..., y; wherein n, m and y are integers.
120. The method of claim 119, wherein the labeling reagent of step (b) comprises an N, N, dimethyl-iodoacetamide and an N, N, d6-dimethyl-iodoacetamide.
121. A method for separating a hydrophobic protein or a hydrophobic compound, the method comprising the following steps: (a) providing a sample comprising the hydrophobic protein or the hydrophobic compound; (b) solubilizing the hydrophobic protein or the hydrophobic compound in a detergent or urea; (c) loading the detergent-solubilized or urea-solubilized hydrophobic protein or hydrophobic compound into a chromatography system as set forth in claim 1 or a mixed bed multi-dimensional liquid chromatograph of claim 20; and (d) separating the hydrophobic proteins or the hydrophobic compounds by chromatography to generate an eluate using the chromatography system as set forth in claim 1 or the mixed bed multi-dimensional liquid chromatograph of claim 25.
122. The method of claim 121, wherein the hydrophobic protein is a membrane protein.
123. The method of claim 122, wherein the membrane protein is an integral membrane protein.
124. The method of claim 123, wherein the integral membrane protein is a protein expressed on the surface of a pathogenic cell or a cancer cell.
125. The method of claim 121, wherein the hydrophobic compound is a lipid or a steroid.
126. A computer program product comprising a computer useable medium having computer program logic recorded thereon for analyzing data generated by a chromatography system, said computer program logic comprising computer program code logic configured to perform operations as set forth in Figure 17, Figure 18, Figure 19, Figure 20 or Figure 21.
127. The computer program product of claim 126, wherein the chromatography system comprises a system as set forth in claim 1 or a mixed bed multi-dimensional liquid chromatograph of claim 25.
128. A computer-implemented method for analyzing data generated by a chromatography system comprising the following steps: (a) providing a chromatography system capable of outputting data to a computer; (b) providing a computer capable of storing and analyzing data input from the chromatography system comprising a computer program product embodied therein, wherein the computer program product comprises a computer program product as set forth in claim 126; (c) inputting the data from the chromatography system into the computer and analyzing data input from the chromatography system.
129. The computer-implemented method of claim 128, wherein the chromatography system comprises a system as set forth in claim 1 or a mixed bed multi-dimensional liquid chromatograph of claim 25.
130. A quantitative proteomics system comprising: (a) a chromatography system comprising a system as set forth in claim 1 or a mixed bed multi-dimensional liquid chromatograph of claim 25, wherein the system is capable of outputting data to a processor; (b) a processor; and (c) a computer program product as set forth in claim 126 embodied within the processor.
131. A labeling reagent pair comprising N, N, dimethyl-iodoacetamide and N, N, d6-dimethyl-iodoacetamide, having the structures:
[Figure imgf000228_0001: structures of N,N-dimethyliodoacetamide and N,N-dimethyl-d6-iodoacetamide]
132. A method for fractionating a proteome of a cell comprising (a) providing a chromatography system comprising a system as set forth in claim 1 or a mixed bed multi-dimensional liquid chromatograph of claim 25; (b) providing a proteome preparation; and (c) fractionating the proteome preparation with the chromatography system, wherein 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%, or more of the proteome is fractionated.
133. The method of claim 132, wherein 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%, or more of the proteome is fractionated in a one-fraction protocol.
134. The method of claim 132, further comprising use of a computer-implemented method for analyzing data generated by a chromatography system comprising the following steps: (a) providing a chromatography system capable of outputting data to a computer; (b) providing a computer capable of storing and analyzing data input from the chromatography system comprising a computer program product embodied therein, wherein the computer program product comprises a computer program product as set forth in claim 126; (c) inputting the data from the chromatography system into the computer and analyzing data input from the chromatography system.
135. A quantitative proteomics system comprising: (a) a chromatography system comprising a system as set forth in claim 1 or a mixed bed multi-dimensional liquid chromatograph of claim 25, and a mass spectrometer, wherein the system is capable of outputting data to a processor; (b) a processor; and (c) a computer program product as set forth in claim 126 embodied within the processor.
136. The quantitative proteomics system of claim 135, wherein the mass spectrometer comprises an ion trap mass spectrometer.
137. The quantitative proteomics system of claim 136, wherein the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
138. The chromatography system of claim 5, wherein the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
139. The mixed bed multi-dimensional liquid chromatograph of claim 31, wherein the ion trap mass spectrometer comprises a Finnigan LCQ Deca XP MAX™, a Finnigan MDLC LTQ™ or a Finnigan LTQ FT™.
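
The stepwise elution recited in claims 66 and 71 to 75 can be sketched as a nested schedule. The Python below is an illustrative sketch only and is not part of the claims. The gradient boundaries and salt concentrations are the values listed in claims 74 and 75 (claim 75 recites 12 steps but lists 11 concentrations, which are used here as listed). The function name, the step labels, and the exact interleaving of the wash and the second-RPC elution after each salt step are assumptions based on one reading of claims 66, 68, 69 and 73.

```python
# Illustrative sketch only; names and step labels are hypothetical.
# Reverse phase boundaries X0..X5 in % buffer B (claim 74).
RP_BOUNDARIES_PCT_B = [0, 8, 15, 30, 50, 100]
# Ammonium acetate salt steps in mM, as listed in claim 75 (2 M = 2000 mM).
SALT_STEPS_MM = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 2000]


def elution_schedule(boundaries=RP_BOUNDARIES_PCT_B, salt_steps=SALT_STEPS_MM):
    """Return an ordered list of (cycle, step, detail) tuples."""
    schedule = []
    # Consecutive boundary pairs give the Xn -> Xn+1 gradient of each cycle.
    for cycle, (start, end) in enumerate(zip(boundaries, boundaries[1:]), start=1):
        # Step (ii): move a fraction from the first RPC onto the ion exchange bed.
        schedule.append((cycle, "rp_gradient_first_rpc_to_iex", (start, end)))
        for conc_mm in salt_steps:
            # Step (iii): a salt step moves a fraction onto the second RPC.
            schedule.append((cycle, "salt_step_iex_to_second_rpc", conc_mm))
            # Claim 69: wash out salts and buffers before step (iv).
            schedule.append((cycle, "wash", None))
            # Step (iv), assumed to reuse the cycle's gradient (claim 68).
            schedule.append((cycle, "rp_gradient_elute_second_rpc", (start, end)))
    return schedule


if __name__ == "__main__":
    steps = elution_schedule()
    print(len(steps), "steps; first:", steps[0], "last:", steps[-1])
```

With the listed values, each of the 5 reverse phase cycles contributes 1 + 3 × 11 entries, for 170 entries in total.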
PCT/US2004/017647 2003-06-06 2004-06-04 Mixed bed multi-dimensional chromatography systems and methods of making and using them WO2005000226A2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US47654003P 2003-06-06 2003-06-06
US60/476,540 2003-06-06
US49202703P 2003-08-01 2003-08-01
US60/492,027 2003-08-01

Publications (2)

Publication Number Publication Date
WO2005000226A2 (en) 2005-01-06
WO2005000226A3 (en) 2005-05-06

Family

ID=33555420

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/017647 WO2005000226A2 (en) 2003-06-06 2004-06-04 Mixed bed multi-dimensional chromatography systems and methods of making and using them

Country Status (1)

Country Link
WO (1) WO2005000226A2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020155614A1 (en) * 2001-02-21 2002-10-24 Tomlinson Andrew J. Peptide esterification
US20020164809A1 (en) * 2000-10-23 2002-11-07 Genetics Institute, Inc. Acid-labile isotope-coded extractant (ALICE) and its use in quantitative mass spectrometric analysis of protein mixtures

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007139471A3 (en) * 2006-05-31 2008-01-24 Ge Healthcare Bio Sciences Ab Method for chromatography
WO2007139471A2 (en) * 2006-05-31 2007-12-06 Ge Healthcare Bio-Sciences Ab Method for chromatography
EP2215460A1 (en) * 2007-11-26 2010-08-11 Waters Technologies Corporation Internal standards and methods for use in quantitatively measuring analytes in a sample
EP2215460A4 (en) * 2007-11-26 2010-12-29 Waters Technologies Corp Internal standards and methods for use in quantitatively measuring analytes in a sample
CN102498394A (en) * 2009-09-07 2012-06-13 邦尼克本鲁克公司 Separation body for three-dimensional chromatography
CN102498394B (en) * 2009-09-07 2014-11-26 邦尼克本鲁克公司 Separation body for three-dimensional chromatography
US10865224B2 (en) 2012-06-29 2020-12-15 Emd Millipore Corporation Purification of biological molecules
CN110869768B (en) * 2017-08-01 2023-11-21 安进公司 System and method for preparing polypeptide samples for mass spectrometry in real time
CN110869768A (en) * 2017-08-01 2020-03-06 安进公司 System and method for preparing polypeptide samples for mass spectrometry in real time
CN107607642A (en) * 2017-09-06 2018-01-19 上海烟草集团有限责任公司 The multidimensional liquid chromatography mass of albumen and protein groups combination method in a kind of identification tobacco
CN107607642B (en) * 2017-09-06 2020-12-29 上海烟草集团有限责任公司 Multidimensional liquid chromatography-mass spectrometry combined method for identifying protein and proteome in tobacco
CN109813824B (en) * 2017-11-22 2021-11-26 中国科学院大连化学物理研究所 Pretreatment method of plant sample
CN109813824A (en) * 2017-11-22 2019-05-28 中国科学院大连化学物理研究所 A kind of plant sample pre-treating method
CN111699392A (en) * 2018-01-31 2020-09-22 瑞泽恩制药公司 Dual column LC-MS system and method of use
US10908166B2 (en) 2018-01-31 2021-02-02 Regeneron Pharmaceuticals, Inc. Dual-column LC-MS system and methods of use thereof
JP2021513060A (en) * 2018-01-31 2021-05-20 リジェネロン・ファーマシューティカルズ・インコーポレイテッド Dual column LC-MS system and how to use it
WO2019152352A1 (en) * 2018-01-31 2019-08-08 Regeneron Pharmaceuticals, Inc. A dual-column lc-ms system and methods of use thereof
US11435359B2 (en) 2018-01-31 2022-09-06 Regeneron Pharmaceuticals, Inc. Dual-column LC-MS system and methods of use thereof
US11740246B2 (en) 2018-01-31 2023-08-29 Regeneron Pharmaceuticals, Inc. Dual-column LC-MS system and methods of use thereof
WO2024112940A1 (en) * 2022-11-26 2024-05-30 Enveda Therapeutics, Inc. Methods and systems for identifying compounds for forming, stabilizing or disrupting molecular complexes

Similar Documents

Publication Publication Date Title
Dreger Subcellular proteomics
CA2424178A1 (en) Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating
CA2714641C (en) Methods for identification of an antibody or a target
EP2209893B1 (en) Use of aptamers in proteomics
JP4166572B2 (en) New proteome analysis method and apparatus therefor
SK13242001A3 (en) Protein isolation and analysis
US20030044864A1 (en) Cellular engineering, protein expression profiling, differential labeling of peptides, and novel reagents therefor
AU2001266978A1 (en) Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating
EP1739424B1 (en) Rapid and quantitative proteome analysis and related methods
WO2005000226A2 (en) Mixed bed multi-dimensional chromatography systems and methods of making and using them
WO2002039120A1 (en) A method for identifying the proteome of cells using an antibody library microarray
Thannhauser et al. A workflow for large‐scale empirical identification of cell wall N‐linked glycoproteins of tomato (Solanum lycopersicum) fruit by tandem mass spectrometry
CA2470083A1 (en) Methods for protein analysis using protein capture arrays
JP2003520940A (en) Modification of molecular interaction sites of RNA and other biomolecules
Chia et al. Knockout of the Hmt1p arginine methyltransferase in Saccharomyces cerevisiae leads to the dysregulation of phosphate-associated genes and processes
Bradshaw On the development of proteomics: a brief history
AU2002360240A1 (en) Cellular engineering, protein expression profiling, differential labeling of peptides, and novel reagents therefor
Shrestha et al. SECRET AGENT O-GlcNAcylates hundreds of proteins involved in diverse cellular processes in Arabidopsis
SEQUENCING Article Watch: September 2021
Hessmann Development of analytical strategies in quantitative proteomic: quantitation of host cell proteins by mass spectrometry as a quality control tool for the biopharmaceutical industry
Thompson et al. Article Watch: March, 2020
Zürbig et al. Peptidomics approach to proteomics
Mahajan et al. Proteomics: taking over where genomics leaves off
Tryggvason Proteome studies, with emphasis on the kidney glomerulus
Baerenfaller et al. Sample Preparation and Data Processing in Plant Proteomics

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase