Mass-Spectrometry-Based Spatial Proteomics Data Analysis Using

Vol. 30 no. 9 2014, pages 1322–1324

BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btu013
Gene expression Advance Access publication January 11, 2014
Mass-spectrometry-based spatial proteomics data analysis using pRoloc and pRolocdata
Laurent Gatto1,2,*, Lisa M. Breckels1,2, Samuel Wieczorek3, Thomas Burger3 and
Kathryn S. Lilley2
1Computational Proteomics Unit and 2Cambridge Centre for Proteomics, Department of Biochemistry, University of
Cambridge, Tennis Court Road, CB2 1QR, Cambridge, UK and 3Université Grenoble-Alpes, CEA (iRSTV/BGE),
INSERM (U1038), CNRS (FR3425), 38054 Grenoble, France
Associate Editor: Dr Janet Kelso
*To whom correspondence should be addressed.

ABSTRACT
Motivation: Experimental spatial proteomics, i.e. the high-throughput assignment of proteins to sub-cellular compartments based on quantitative proteomics data, promises to shed new light on many biological processes given adequate computational tools.
Results: Here we present pRoloc, a complete infrastructure to support and guide the sound analysis of quantitative mass-spectrometry-based spatial proteomics data. It provides functionality for unsupervised and supervised machine learning for data exploration and protein classification, and for novelty detection to identify new putative sub-cellular clusters. The software builds upon existing infrastructure for data management and data processing.
Availability: pRoloc is implemented in the R language and available under an open-source license from the Bioconductor project (http://www.bioconductor.org/). A vignette with a complete tutorial describing data import/export and analysis is included in the package. Test data is available in the companion package pRolocdata.
Contact: lg390@cam.ac.uk

Received on September 10, 2013; revised on November 25, 2013; accepted on January 5, 2014

1 INTRODUCTION
Knowledge of the spatial distribution of proteins is of critical importance to elucidate their role and refine our understanding of cellular processes. Mis-localization of proteins has been associated with cellular dysfunction and disease states (Kau et al., 2004; Laurila et al., 2009; Park et al., 2011), highlighting the importance of localization studies. Spatial or organelle proteomics is the systematic study of the proteins and their sub-cellular localization; these compartments can be organelles, i.e. structures defined by lipid bi-layers, macro-molecular assemblies of proteins and nucleic acids or large protein complexes. Despite technological advances in spatial proteomics experimental designs and progress in mass-spectrometry (Gatto et al., 2010), software support is lacking. To address this, we developed the pRoloc package that provides a wide range of thoroughly documented analysis methodologies. The software includes state-of-the-art statistical machine-learning algorithms and bundles them in a consistent framework, accommodating any experimental designs and quantitation strategies.

2 AVAILABLE FUNCTIONALITY
pRoloc makes use of the architecture implemented in the MSnbase package (Gatto and Lilley, 2012) for data storage, feature and sample annotation (meta-data) and data processing, such as scaling, normalization and missing data imputation. We also distribute 16 annotated datasets in the pRolocdata package, which are used for illustration of different pipelines as well as algorithm testing and development. Algorithms for (i) clustering, (ii) novelty detection and (iii) classification are proposed along with visualization functionalities.

2.1 Clustering
The unsupervised machine-learning techniques are used, among other aims, as exploration and quality control tools. Several critical factors such as feature-level quantitation values, the extent of missing values and organelle markers can be overlaid on the data clusters to support effective data exploration and quality control.

2.2 Novelty detection
An essential step for reliable classification is the availability of well-characterized labeled data, termed 'marker proteins'. These reliable organelle residents define the set of observed organelles and are used to train a classifier. It is, however, laborious and extremely difficult to manually define reliable markers for all possible sub-cellular structures. As such, any organelles without any suitable markers will be completely omitted from subsequent classification. pRoloc provides the implementation for the phenoDisco novelty detection algorithm (Breckels et al., 2013) that, based on a minimal set of markers and unlabeled data, can be used to effectively detect new putative clusters in the data, beyond those that were initially manually described (Fig. 1).

2.3 Classification
Since the development and refinement of spatial proteomics experiments, several classification methods have been used: partial least-square discriminant analysis (Dunkley et al., 2006), SVMs (Trotter et al., 2010), random forest (Ohta et al., 2010), neural
networks (Tardif et al., 2012) and naive Bayes (Nikolovski et al., 2012), all available in pRoloc. In addition, other novel algorithms are proposed, such as PerTurbo (Courty et al., 2011). We have compared and contrasted these algorithms using reliable marker sets and demonstrate in the package documentation that the driving factor for good classification is the intrinsic quality of the data itself, i.e. efficient cellular content separation, accurate quantitation (Jakobsen et al., 2011), etc., illustrating the minor importance of the classification algorithm with respect to thorough data exploration and quality control. While the exact algorithm might not be the major reason for a good analysis, it is essential to guarantee optimal application of the algorithm. A central design decision in the development of the classification schema was to explicitly implement model parameter optimization routines to maximize the generalization power of the results.

Fig. 1. Current state-of-the-art experimental organelle proteomics data analysis with pRoloc. On the left, we replicated the original findings from Tan et al. (2009) on Drosophila embryos. On the right, we present results of the same data set obtained with pRoloc, utilizing the novelty discovery functionality (new color-coded organelles) and a class-weighted support vector machine (SVM) algorithm with classifier posterior probabilities (point sizes)

3 A TYPICAL PIPELINE
A typical pipeline is summarized below using data from Arabidopsis thaliana callus (Dunkley et al., 2006). We first load the required packages and example data. The phenoDisco function is then run to identify new putative clusters that, after validation (the pd.markers feature meta-data), can be used for the classification using the SVM algorithm (with a Gaussian kernel). The algorithm's parameters are first optimized and then applied in the actual classification. Finally, the plot2D function is used to generate an annotated scatter plot along the first two principal components (Fig. 1).

# Load the packages and the example data
library(pRoloc)
library(pRolocdata)
data(dunkley2006)
# Novelty detection: identify new putative clusters
res <- phenoDisco(dunkley2006)
# Optimize the SVM parameters using the validated pd.markers
p <- svmOptimisation(res, fcol = "pd.markers")
# Classify with the optimized parameters
res <- svmClassification(res, p, fcol = "pd.markers")
# Annotated scatter plot along the first two principal components
plot2D(res, fcol = "svm")

4 CONCLUSIONS
The need for statistically sound proteomics data analysis has spawned interest in the proteomics community (Gatto and Christoforou, 2013) for R and Bioconductor (Gentleman et al., 2004). pRoloc is a mature R package that provides users with dedicated data infrastructure, visualization functionality and state-of-the-art machine-learning methodologies, enabling unparalleled insight into experimental spatial proteomics data. It is also a framework to further develop spatial proteomics data analysis and novel pipelines. Multiple organelle proteomics datasets illustrating various and diverse experimental designs are available in pRolocdata. Both packages come with thorough documentation and represent a unique framework for sound and reproducible organelle proteomics data analysis.

Funding: European Union 7th Framework Program (PRIME-XS project, grant agreement number 262067); BBSRC Tools and Resources Development Fund (Award BB/K00137X/1); Prospectom project (Mastodons 2012 CNRS challenge).

Conflict of Interest: none declared.

© The Author 2014. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

REFERENCES
Breckels,L. et al. (2013) The effect of organelle discovery upon sub-cellular protein localisation. J. Proteom., 88, 129–140.
Courty,N. et al. (2011) Perturbo: a new classification algorithm based on the spectrum perturbations of the laplace-beltrami operator. In: Gunopulos,D. et al. (ed.) The Proceedings of ECML/PKDD (1). Vol. 6911 of Lecture Notes in Computer Science, pp. 359–374. Springer-Verlag, Berlin Heidelberg.
Dunkley,T. et al. (2006) Mapping the arabidopsis organelle proteome. Proc. Natl Acad. Sci. USA, 103, 6518–6523.
Gatto,L. and Christoforou,A. (2013) Using R and Bioconductor for proteomics data analysis. Biochim. Biophys. Acta., 1844 (1 Pt A), 42–51.
Gatto,L. and Lilley,K.S. (2012) MSnbase – an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics, 28, 288–289.
Gatto,L. et al. (2010) Organelle proteomics experimental designs and analysis. Proteomics, 10, 3957–3969.
Gentleman,R.C. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, 80.
Jakobsen,L. et al. (2011) Novel asymmetrically localizing components of human centrosomes identified by complementary proteomics methods. EMBO J., 30, 1520–1535.
Kau,T. et al. (2004) Nuclear transport and cancer: from mechanism to intervention. Nat. Rev. Cancer, 4, 106–117.
Laurila,K. et al. (2009) Prediction of disease-related mutations affecting protein localization. BMC Genomics, 10, 122.
Nikolovski,N. et al. (2012) Putative glycosyltransferases and other plant golgi apparatus proteins are revealed by LOPIT proteomics. Plant Physiol., 160, 1037–1051.
Ohta,S. et al. (2010) The protein composition of mitotic chromosomes determined using multiclassifier combinatorial proteomics. Cell, 142, 810–821.
Park,S. et al. (2011) Protein localization as a principal feature of the etiology and comorbidity of genetic diseases. Mol. Syst. Biol., 7, 494.
Tan,D. et al. (2009) Mapping organelle proteins and protein complexes in drosophila melanogaster. J. Proteome Res., 8, 2667–2678.
Tardif,M. et al. (2012) PredAlgo: a new subcellular localization prediction tool dedicated to green algae. Mol. Biol. Evol., 29, 3625–3639.
Trotter,M. et al. (2010) Improved sub-cellular resolution via simultaneous analysis of organelle proteomics data across varied experimental conditions. Proteomics, 10, 4213–4219.
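
The parameter optimization step described in Section 2.3 can be illustrated with a small, self-contained sketch. This is not pRoloc's implementation. To stay runnable in base R it swaps the SVM for a simple k-nearest-neighbour classifier and uses simulated profiles rather than a pRolocdata dataset. The helper names (knn_predict, macro_f1, stratified_split), the 80/20 stratified split, the macro-averaged F1 score and the grid of k values are all assumptions chosen for illustration.

```r
set.seed(42)

# Simulated marker profiles (hypothetical data): 3 classes, 4 fractions each.
n_per_class <- 30
centres <- rbind(ER    = c(0.6, 0.2, 0.1, 0.1),
                 Golgi = c(0.1, 0.6, 0.2, 0.1),
                 PM    = c(0.1, 0.1, 0.2, 0.6))
x <- do.call(rbind, lapply(seq_len(nrow(centres)), function(i) {
  matrix(rnorm(n_per_class * ncol(centres),
               mean = rep(centres[i, ], each = n_per_class), sd = 0.08),
         ncol = ncol(centres))
}))
y <- factor(rep(rownames(centres), each = n_per_class))

# k-nearest-neighbour prediction by majority vote (Euclidean distance).
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))
    nn <- train_y[order(d)[seq_len(k)]]
    names(which.max(table(nn)))
  })
}

# Macro-averaged F1: per-class F1 scores, averaged with equal class weight.
macro_f1 <- function(truth, pred) {
  f1 <- sapply(levels(truth), function(cl) {
    tp <- sum(pred == cl & truth == cl)
    fp <- sum(pred == cl & truth != cl)
    fn <- sum(pred != cl & truth == cl)
    if (tp == 0) return(0)
    precision <- tp / (tp + fp)
    recall <- tp / (tp + fn)
    2 * precision * recall / (precision + recall)
  })
  mean(f1)
}

# Stratified split: sample the same fraction of indices from every class.
stratified_split <- function(y, frac = 0.8) {
  unlist(lapply(split(seq_along(y), y), function(idx) {
    sample(idx, size = floor(frac * length(idx)))
  }))
}

# Grid search: score each candidate k on repeated stratified hold-out sets.
k_grid <- c(1, 3, 5, 7, 9)
n_rounds <- 20
scores <- sapply(k_grid, function(k) {
  mean(replicate(n_rounds, {
    train <- stratified_split(y)
    pred <- knn_predict(x[train, ], y[train], x[-train, ], k)
    macro_f1(y[-train], pred)
  }))
})
names(scores) <- paste0("k=", k_grid)

# Keep the parameter value with the best average held-out score.
best_k <- k_grid[which.max(scores)]
print(round(scores, 3))
print(best_k)
```

The point of the design is that the parameter is chosen on data held out from training, so the selected value reflects generalization rather than fit to the markers. In pRoloc itself, the corresponding steps are the svmOptimisation and svmClassification calls shown in Section 3.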
