insight overview

From genomics to proteomics


Mike Tyers* & Matthias Mann†
*Samuel Lunenfeld Research Institute, Mount Sinai Hospital, and Department of Medical Genetics and Microbiology, University of Toronto,
Toronto, Canada M5G 1X5 (e-mail: tyers@mshri.on.ca)
†Center for Experimental BioInformatics, Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55,
DK-5230 Odense M, Denmark (e-mail: mann@bmb.sdu.dk)

Proteomics is the study of the function of all expressed proteins. Tremendous progress has been made in the
past few years in generating large-scale data sets for protein–protein interactions, organelle composition,
protein activity patterns and protein profiles in cancer patients. But further technological improvements,
organization of international proteomics projects and open access to results are needed for proteomics to
fulfil its potential.

The term proteome was first coined to describe the set of proteins encoded by the genome [1]. The study of the proteome, called proteomics, now evokes not only all the proteins in any given cell, but also the set of all protein isoforms and modifications, the interactions between them, the structural description of proteins and their higher-order complexes, and for that matter almost everything ‘post-genomic’. In this overview we will use proteomics in an overall sense to mean protein biochemistry on an unprecedented, high-throughput scale. The hope, now being realized, is that this high-throughput biochemistry will contribute at a direct level to a full description of cellular function.

Proteomics complements other functional genomics approaches, including microarray-based expression profiles [2], systematic phenotypic profiles at the cell and organism level [3,4], systematic genetics [5,6] and small-molecule-based arrays [7] (Fig. 1). Integration of these data sets through bioinformatics will yield a comprehensive database of gene function that will serve as a powerful reference of protein properties and functions, and a useful tool for the individual researcher to both build and test hypotheses. Moreover, large-scale data sets will be crucial for the emerging field of systems biology [8].

Challenges and approaches in proteomics
Proteomics would not be possible without the previous achievements of genomics, which provided the ‘blueprint’ of possible gene products that are the focal point of proteomics studies. Although almost trite, the tasks of proteomics can usefully be contrasted with the huge but straightforward challenges initially facing the genome projects. Unlike the scalable exercise of DNA sequencing, with its attendant enabling technologies such as the polymerase chain reaction and automated sequencing, proteomics must deal with unavoidable problems of limited and variable sample material, sample degradation, vast dynamic range (more than 10⁶-fold for protein abundance alone), a plethora of post-translational modifications, almost boundless tissue, developmental and temporal specificity, and disease and drug perturbations. While proteomics is by definition expected to yield direct biological insights, all of these difficulties render any comprehensive proteomics project an inherently intimidating and often humbling exercise.

In this Nature Insight, five central pillars of proteomics research are discussed with an emphasis on technological developments and applications. These areas are mass spectrometry-based proteomics, proteome-wide biochemical assays, systematic structural biology and imaging techniques, proteome informatics, and clinical applications of proteomics. As is apparent from the reviews, the divisions between these areas are somewhat arbitrary, not least because technological breakthroughs often find immediate application on several fronts. More important, biologically useful insights into protein function often emerge from the combination of different proteomic approaches.

Mass spectrometry-based proteomics
The ability of mass spectrometry to identify ever smaller amounts of protein from increasingly complex mixtures is a primary driving force in proteomics, as described in the review on page 198 by Aebersold and Mann. Initial proteomics efforts relied on protein separation by two-dimensional gel electrophoresis, with subsequent mass spectrometric identification of protein spots. An inherent limitation of this approach is the depth of coverage, which is necessarily constrained to the most abundant proteins in the sample. The rapid developments in mass spectrometry have shifted the balance to direct mass spectrometric analysis, and further developments will increase sensitivity, robustness and data handling.

The past year has seen partial analysis of the yeast interactome, the malaria proteome, bacterial proteomes and various organellar proteomes (see review by Aebersold and Mann, page 198). These vast data sets represent but the tip of the iceberg for biological discovery and drug development.

An enormous challenge resides in the obvious fact that the proteome is a dynamic, not a static, entity. Initial efforts to gauge proteome-wide regulatory events in single experiments have been directed at the yeast phosphoproteome [9] and the ubiquitin-mediated ‘degradome’ (S. P. Gygi, personal communication). Much higher throughput and sensitivity will be needed to enable true proteome dynamics and moment-by-moment snapshots of cellular responses. Nascent methods for gel-free analysis of complex mixtures hold great promise in this regard [10]. Further needs will include more complete sequence coverage of each individual protein, robust and varied methods for sample preparation, and sophisticated algorithms for automated protein identification and detection of post-translational modifications. The ambitious goals of systems biology, which aims to comprehensively model cellular behaviour at the whole-system level [8,11], will also require reliable quantitative methods.

Array-based proteomics
A number of established and emergent proteome-wide platforms complement mass spectrometric methods, as
NATURE | VOL 422 | 13 MARCH 2003 | www.nature.com/nature
© 2003 Nature Publishing Group 193

[Figure 1 diagram. Centre: Systems. Base: informatics, databases, systems biology.
Proteomics methods: mass spectrometry; two-hybrid; GFP + FRET fluorescence; protein arrays; chemical arrays; antibody arrays.
Proteomics data sets: dynamics; abundance; interactions; structure; clinical profiles; localization; isoforms; modifications.
Functional genomics data sets: expression; splicing; genomic alterations; cellular phenotype; organismal phenotype.
Functional genomics methods: DNA microarrays; RDA/ROMA; barcode arrays; cell arrays; RNAi; forward/reverse genetics; synthetic genetics.]

Figure 1 Platforms for proteomics and functional genomics. Methodology is shown in the outer columns, resultant data sets in the middle columns, and model systems in the centre.

reviewed on page 208 of this issue by Stan Fields and co-workers. The forerunner amongst these efforts is the systematic two-hybrid screen developed by Fields [12]. Unlike direct biochemical methods that are constrained by protein abundance, two-hybrid methods can often detect weak interactions between low-abundance proteins, albeit at the expense of false positives.

More recently, various protein-array formats promise to allow rapid interrogation of protein activity on a proteomic scale. These arrays may be based on either recombinant proteins or, conversely, reagents that interact specifically with proteins, including antibodies, peptides and small molecules [13]. Readouts for protein-based arrays can derive from protein interactions, protein modifications or enzymatic activities. A current challenge is to effectively couple high-end mass spectrometry to array formats. Array-based approaches can also use in vivo readouts, for example in the systematic analysis of protein localization in the cell through green fluorescent protein (GFP) signals or protein association through fluorescence resonance energy transfer (FRET) between protein fusions to different wavelength variants of GFP. Finally, cell- and tissue-based arrays enable yet another layer of functional interrogation.

One practical bottleneck to these approaches, and indeed to most systematic approaches, has been the limited availability of validated genome-wide complementary DNA for use in the capture of protein complexes with epitope tags. The FlexGene consortium between academic institutions and industry aims to develop complete cDNA collections in recombination-based cloning formats for the biomedical community (see http://www.hip.harvard.edu).

Structural proteomics
Beyond a description of protein primary structure, abundance and activities, the ambitious goal of systematically understanding the structural basis for protein interactions and function is reviewed by Baumeister et al. on page 216 of this issue. Through literary metaphor, the authors make a compelling argument that a full description of cell behaviour necessitates structural information at the level not only of all single proteins, but of all salient protein complexes and the organization of such complexes at a cellular scale. This all-encompassing structural endeavour spans several orders of magnitude in measurement scale and requires a battery of structural techniques, from X-ray crystallography and nuclear magnetic resonance (NMR) at the protein level, to electron microscopy of mega-complexes and electron tomography for high-resolution visualization of the entire cellular milieu. The recurrent proteomic theme of throughput and sensitivity runs through each of these structural methods, and Baumeister et al. suggest novel solutions, even including eliminating the crystals from crystallography! NMR and in silico docking will be necessary to build in dynamics of protein interactions, much of which may be controlled through largely unstructured regions [14].

Informatics
As with any data-rich enterprise, informatics issues loom large on several proteomics fronts. On page 233 of this issue, Boguski and McIntosh highlight the importance of sample documentation, the implementation of rigorous standards and proper annotation of gene function [15]. It is crucial that software development is linked at an early stage through agreed documentation, XML-based definitions and controlled vocabularies that allow different tools to exchange primary data sets. Considerable effort has already gone into interaction databases [16] and systems biology software infrastructure [17] that should be built upon by future proteomics initiatives. The development of statistically sound methods for assignment of protein identity from incomplete mass spectral data will be critical for automated deposition into databases, which is currently a painstaking manual and error-prone process. Lessons learned from analysis of DNA microarray data, including clustering, compendium and pattern-matching approaches, should be transportable to proteomic analysis [2], and it is encouraging that the European Bioinformatics Institute and the Human Proteome Organisation (HUPO) have together started an initiative on the exchange of protein–protein interaction and other proteomic data (see http://psidev.sourceforge.net/).

Clinical proteomics
Proteomics is set to have a profound impact on clinical diagnosis and drug discovery, as is fittingly reviewed on page 226 by Sam Hanash, the inaugural president of HUPO. Because most drug targets are proteins, it is inescapable that proteomics will enable drug discovery, development and clinical practice. The form(s) in which proteomics will best fulfil this mandate is in a state of flux owing to a multitude of factors, not the least of which are the varied technological platforms in different stages of implementation.

The detection of protein profiles associated with disease states dates back to the very beginning of proteomics, when two-dimensional gel electrophoresis was first applied to clinical material. The advent of mass spectrometers now able to resolve many tens of thousands of protein and peptide species in body fluids is set to revolutionize protein-based diagnostics, as demonstrated in recent retrospective studies of cancer patients [18]. The robust and high-throughput nature of mass spectrometric instrumentation is eminently suited to clinical applications. Protein- and antibody-based arrays with validated diagnostic readouts may also become amenable to the clinical

[Figure 2 network graph. Labelled complexes and groups include: 26S core proteasome; 20S core proteasome; mRNA splicing; mRNA processing; APC complex; SCF; histone deacetylase complex; tubulin-binding complex; Arp 2/3 complex; TRAPP complex; RNase complex; v-SNARE; COP II vesicle coat; TAF IID complex; eukaryotic translation initiation factor 3 complex; DNA replication factor C complex; Pol II mediator complex; mitochondrial large ribosomal subunit.]

Figure 2 Visualization of combined, large-scale interaction data sets in yeast. A total of 14,000 physical interactions obtained from the GRID database were represented with the Osprey
network visualization system (see http://biodata.mshri.on.ca/grid). Each edge in the graph represents an interaction between nodes, which are coloured according to Gene Ontology
(GO) functional annotation. Highly connected complexes within the data set, shown at the perimeter of the central mass, are built from nodes that share at least three interactions within
other complex members. The complete graph contains 4,543 nodes of ~6,000 proteins encoded by the yeast genome, 12,843 interactions and an average connectivity of 2.82 per
node. The 20 highly connected complexes contain 340 genes, 1,835 connections and an average connectivity of 5.39.
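
The connectivity figures in the caption can be checked with a few lines of arithmetic. The sketch below assumes that "average connectivity" means interactions divided by nodes, which is an interpretation rather than something the caption states. The function names and the toy graph are hypothetical, and `well_connected` is only one plausible single-pass reading of the "at least three interactions" rule, not the procedure used by Osprey or GRID.

```python
from collections import Counter


def average_connectivity(n_interactions, n_nodes):
    """Interactions per node (assumed meaning of 'average connectivity')."""
    return n_interactions / n_nodes


def well_connected(edges, members, min_links=3):
    """Return the members that have at least `min_links` interactions
    with other members of the same candidate set (single pass)."""
    members = set(members)
    links = Counter()
    for a, b in edges:
        if a != b and a in members and b in members:
            links[a] += 1
            links[b] += 1
    return {m for m in members if links[m] >= min_links}


# Counts taken from the Figure 2 caption.
print(f"{average_connectivity(12_843, 4_543):.3f}")  # 2.827
print(f"{average_connectivity(1_835, 340):.3f}")     # 5.397

# Hypothetical toy graph: A-D are fully interconnected, E touches only A.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D"), ("E", "A")]
print(sorted(well_connected(toy_edges, {"A", "B", "C", "D", "E"})))
# ['A', 'B', 'C', 'D']
```

Under this reading the caption's 2.82 and 5.39 correspond to the ratios truncated, rather than rounded, to two decimal places. If each interaction were instead counted at both of its endpoints (mean node degree), the values would be twice as large.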

setting. As with all clinical interfaces, issues of standardized sample preparation, storage and annotation must be addressed.

Proteomics will inevitably accelerate drug discovery, although the pace of progress in this area has been slower than was initially envisaged. Identification of new disease-specific targets, often those present on the cell surface, has been greatly enabled with current technology. An understanding of the biological networks that lie below the cell’s exterior will provide a rational basis for preliminary decisions on target suitability.

Orthogonal omics
A caveat of all high-throughput approaches, including proteomics, is that the very scale of experimentation often precludes repetition and rigorous confirmation that is the essence of sound research. However, the intersection between proteomic data sets from different species or between proteomic and other genome-wide data sets often allows robust cross-validation (Fig. 1). This point is aptly illustrated by recent proteomic analysis of the yeast and human nucleolus, in which both directed and undirected efforts uncovered a vast network of protein interactions, many of which impinge on the conserved process of ribosome biogenesis [19]. Independent systematic analysis of yeast-cell size mutants (phenomics) and the gene set regulated by one of these size-control genes (transcriptomics) revealed an unanticipated regulatory relationship between ribosome biogenesis and commitment to cell division [20].

Similarly, the integration of interactome, phenome and transcriptome data sets has been used to deduce a new regulatory network in the nematode germline [21]. The combined use of physical, phenotypic and expression data sets can generate non-obvious hypotheses that would otherwise not arise from any individual approach. Even with limited data sets, educated guesses can be made based on simple parameters. For example, an algorithm called ScanSite was used to identify tuberous sclerosis complex-2 as a physiologically relevant substrate of protein kinase B (PKB), based solely on the apparent mass by electrophoresis of the phosphorylated species and an abundance of PKB consensus site sequences [22]. Finally, new information can often be gained by re-investigating known complexes with new methods. For example, three new components of the heavily studied anaphase-promoting complex have recently been found by multidimensional mass spectrometry [23].
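
The cross-validation idea in this section, that interactions seen in two independent data sets deserve more confidence, can be illustrated as a set intersection. This is a minimal sketch under stated assumptions: the protein identifiers (`P1` to `P5`), the two example data sets and the function names are all hypothetical, and real pipelines would also need to map identifiers between species or platforms before intersecting.

```python
def normalise(pairs):
    """Treat A-B and B-A as the same undirected interaction and drop
    self-interactions."""
    return {frozenset(pair) for pair in pairs if pair[0] != pair[1]}


def cross_validated(dataset_a, dataset_b):
    """Interactions reported by both data sets."""
    return normalise(dataset_a) & normalise(dataset_b)


# Hypothetical interaction lists from two independent methods.
two_hybrid = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
mass_spec = [("P2", "P1"), ("P3", "P4"), ("P5", "P4")]

shared = cross_validated(two_hybrid, mass_spec)
print(sorted(tuple(sorted(pair)) for pair in shared))
# [('P1', 'P2'), ('P4', 'P5')]
```

Storing each interaction as a `frozenset` makes the comparison independent of the order in which the two partners were reported, which is usually what is wanted for undirected physical interactions.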
With the numerous initiatives to systematically correlate phenotype with loss of gene function in many model organisms including yeast, nematode, fruitfly, zebrafish, mouse and human, the insights gained from the combined use of large-scale cell biological, transcriptional and proteomic data sets should become synergistic as coverage increases. Most recently, the rapid acquisition of phenotypic data by RNA interference methods, with which it is now possible to systematically interrogate the human genome in tissue-culture cells [6], will greatly accelerate functional discovery when coupled to proteomic data sets.

Future developments and challenges
As the highly successful effort to sequence the human genome has illustrated, faster and cheaper is the inevitable mantra of any large-scale enterprise. This rhetoric applies doubly so to proteomics, although there is far more to proteomics than just throughput. In its absolute sense, the proteome will be as unreachable as the horizon; rather proteomics will coalesce with other technologies in as yet unimagined ways to converge on an accurate description of cellular properties.

By all criteria, current instrumentation is far from optimal, in part because manufacturers have not yet had the necessary lead time to build machines and associated hardware that are perfectly tailored to protein analysis. Mass spectrometry-based proteomics is nowhere near the physical limit of the few ions needed to register a peak and so a huge increase in performance can be expected in the coming years. As refinements are made in next-generation proteomic instruments, it will be possible to monitor many relevant post-translational modifications and protein interactions in ever more complex mixtures [24]. As one anticipated example of innovation, throughput and coverage could be greatly enabled by storing mass spectrometric signatures of every protein for real-time data-dependent analysis of highly complex mixtures.

At the level of the individual laboratory, there is undoubtedly a huge market for sensitive and affordable bench-top mass spectrometers for routine applications as analytical devices in all aspects of biological research. Developments in robotic sample preparation, alternative readouts for protein interactions, and microfluidics to minimize sample losses will all factor into achieving the goal of delivering high-powered proteomics to the masses. Equally important, availability of reasonably complete sets of expression and antibody reagents for all proteins would improve the speed and scope of both small- and large-scale proteomics.

With regard to the proteomes of even simple model organisms, all indications are that extant interaction maps are far from saturated. As the density of known interactions increases, testable hypotheses should emerge from the data set at an increasing rate, especially in combination with other genome-wide data sets, including predictions from structural data. Once sufficient dynamics data become available to build first-draft models of cellular behaviour, model refinement will require reiteration of proteomic analyses in numerous mutant and drug-treated conditions. If modelling of simple Boolean networks is a guide, the systems-level behaviour of bona fide protein interaction networks is sure to yield some surprises [25].

All this information must obviously be presented in a form that can be processed by the human user. To this end, a great deal more effort must be placed on development of visualization tools, including automated integration with other genome-wide data sets (Fig. 2). There is much room here for novel approaches, many of which are likely to come from other fields that are also suffering from information overload. Examples include sophisticated tools for clustering DNA microarray data and multivariate graphical representations that use coloured readouts to highlight overall trends [26], as well as the sophisticated, three-dimensional interfaces used in modern computer games.

On the clinical front, comprehensive proteomic analysis of small amounts of diseased tissue will facilitate diagnosis and therapeutic monitoring, particularly as patterns of disease prediction are recognized empirically from large clinical data sets. Application of phosphoproteomic methods to clinical samples promises what may be the most informative and discriminating readout of cellular status, which can then be used to advantage in diagnosis, drug discovery and elucidation of mechanisms of drug action. The proteomics of host–pathogen interactions should also be an area rich in new drug targets. Regardless of the exact format, robust mass spectrometry and protein-array platforms must be moved into clinical medicine to replace the more expensive and less reliable biochemical assays that are the basis of traditional clinical chemistry. Finally, the nascent area of chemiproteomics will not only allow mechanism of action to be discovered for many drugs, but also has the potential to resurrect innumerable failed small molecules that have dire off-target effects of unknown basis. Relatively little investment in well characterized leads hidden in the archives of pharmaceutical companies may leverage huge therapeutic returns.

Open-access proteomics
An all too common refrain of proteomics has been the limited or non-existent access for the individual biomedical researcher. Although virtually all academic centres have a mass spectrometry facility of some sort, lost samples, failed identifications and inadequate throughput are commonplace. In part, these problems represent the teething stages of a complex technology; additional factors are unaffordable equipment costs and a dearth of highly trained personnel to oversee facilities. As a consequence, most breakthroughs and the generation of raw data in proteomics derive from the work of only a handful of technically inclined laboratories. The burden of improving this circumstance falls on instrument manufacturers, proteomics leaders, funding agencies, academic institutions and the individual user alike. National proteome centres have also been proposed as a way to ensure availability of both expertise and equipment [27].

The common effort to map and understand the proteome in its various guises can benefit from lessons learned by genome-sequencing consortia. First and foremost, public access to on-line raw data is essential if there is to be a sense of participation across the biomedical research community. Agreements similar to the Bermuda guidelines issued at a critical juncture of the genome projects [28] that mandate public accessibility and non-patenting of basic proteomic data would facilitate research in both the academic and industrial sectors. Such data should include the primary structure, post-translational modification, localization and protein–protein interaction pattern of all proteins.

It is important that large-scale proteomics efforts are coordinated, both to avoid duplication and to provide strong rationale for funding agencies. These bodies are in principle willing to support proteomics as a way to reap the rewards of the genome projects, but they will have to be presented with clear goals and rationales of how proteomics will build an infrastructure to advance biomedical science. HUPO is one body that is positioned to play an important coordinating role. HUPO has proclaimed five initial goals for world-wide proteomics research: definition of the plasma proteome, proposals for an in-depth proteomics assault on specific cell types, formation of a consortium to generate antibodies to all human proteins, development of new technologies and formation of an informatics infrastructure. To this list we would add cataloguing the primary structure of all proteins, mapping all organelles that can be purified, and generating protein interaction maps of model organisms, for both comparative proteomics and integration with on-going functional genomics projects.

To meet these laudable goals, it seems that a dedicated funding pool must be established for proteomics research, analogous to that created for the human and model-organism genome sequencing projects, or ongoing funding for these projects should be made available to proteomics. Given the cost of proteomic-scale projects, it
benefits academia and industry to collaborate as much as possible on method development, data acquisition and project coordination. Finally, a way must be established to integrate proteome-scale experiments with efforts of the many individual biology laboratories to develop and test biological models, the final key step in the discovery process that may always defy automation. Whatever the future holds, proteomics will yield great returns for all in what promises to be a knowledge watershed in biology and medicine.

doi:10.1038/nature01510

1. Wilkins, M. R. et al. From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology 14, 61–65 (1996).
2. Shoemaker, D. D. & Linsley, P. S. Recent developments in DNA microarrays. Curr. Opin. Microbiol. 5, 334–337 (2002).
3. Giaever, G. et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418, 387–391 (2002).
4. Gerlai, R. Phenomics: fiction or the future? Trends Neurosci. 25, 506–509 (2002).
5. Tong, A. H. et al. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294, 2364–2368 (2001).
6. Hannon, G. J. RNA interference. Nature 418, 244–251 (2002).
7. Kuruvilla, F. G., Shamji, A. F., Sternson, S. M., Hergenrother, P. J. & Schreiber, S. L. Dissecting glucose signalling with diversity-oriented synthesis and small-molecule microarrays. Nature 416, 653–657 (2002).
8. Csete, M. E. & Doyle, J. C. Reverse engineering of biological complexity. Science 295, 1664–1669 (2002).
9. Ficarro, S. B. et al. Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae. Nature Biotechnol. 20, 301–305 (2002).
10. Liu, H., Lin, D. & Yates, J. R. III Multidimensional separations for protein/peptide analysis in the post-genomic era. Biotechniques 32, 898–911 (2002).
11. Ideker, T. et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292, 929–934 (2001).
12. Fields, S. & Song, O. A novel genetic system to detect protein–protein interactions. Nature 340, 245–246 (1989).
13. MacBeath, G. Protein microarrays and proteomics. Nature Genet. 32(Suppl.), 526–532 (2002).
14. Wright, P. E. & Dyson, H. J. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol. 293, 321–331 (1999).
15. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 25, 25–29 (2000).
16. Bader, G. D. & Hogue, W. V. C. in Genomics and Bioinformatics (ed. Sensen, C. W.) 399–413 (Wiley-VCH, Weinheim, 2001).
17. Kitano, H. Systems biology: a brief overview. Science 295, 1662–1664 (2002).
18. Petricoin, E. F., Zoon, K. C., Kohn, E. C., Barrett, J. C. & Liotta, L. A. Clinical proteomics: translating benchside promise into bedside reality. Nature Rev. Drug Discov. 1, 683–695 (2002).
19. Andersen, J. S. et al. Directed proteomic analysis of the human nucleolus. Curr. Biol. 12, 1–11 (2002).
20. Jorgensen, P., Nishikawa, J. L., Breitkreutz, B. J. & Tyers, M. Systematic identification of pathways that couple cell growth and division in yeast. Science 297, 395–400 (2002).
21. Walhout, A. J. et al. Integrating interactome, phenome, and transcriptome mapping data for the C. elegans germline. Curr. Biol. 12, 1952–1958 (2002).
22. Manning, B. D., Tee, A. R., Logsdon, M. N., Blenis, J. & Cantley, L. C. Identification of the tuberous sclerosis complex-2 tumor suppressor gene product tuberin as a target of the phosphoinositide 3-kinase/akt pathway. Mol. Cell 10, 151–162 (2002).
23. Yoon, H. J. et al. Proteomics analysis identifies new components of the fission and budding yeast anaphase-promoting complexes. Curr. Biol. 12, 2048–2054 (2002).
24. Mann, M. & Jensen, O. N. Proteomic analysis of post-translational modifications. Nature Biotechnol. (in the press).
25. Huang, S. & Ingber, D. E. Shape-dependent control of cell growth, differentiation, and apoptosis: switching between attractors in cell regulatory networks. Exp. Cell Res. 261, 91–103 (2000).
26. Ball, P. Data visualization: picture this. Nature 418, 11–13 (2002).
27. Aebersold, R. & Watts, J. D. The need for national centers for proteomics. Nature Biotechnol. 20, 651 (2002).
28. Marshall, E. Bermuda rules: community spirit, with teeth. Science 291, 1192 (2001).

Acknowledgements We thank B.-J. Breitkreutz for preparing Fig. 2, D. Figeys and members of the Center for Experimental BioInformatics (CEBI) for critical reading of the manuscript. CEBI is supported by a grant from the Danish National Research Foundation.
