Mass-Spectrometry-Based Spatial Proteomics Data Analysis Using
Fig. 1. Current state-of-the-art experimental organelle proteomics data analysis with pRoloc. On the left, we replicated the original findings from Tan
et al. (2009) on Drosophila embryos. On the right, we present results of the same data set obtained with pRoloc, utilizing the novelty discovery
functionality (new color-coded organelles) and a class-weighted support vector machine (SVM) algorithm with classifier posterior probabilities (point
sizes)
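The point-size encoding described in the caption of Fig. 1 can be sketched in a few lines of R. This is a minimal illustration, not the code used to draw the figure: it assumes the dunkley2006 example data rather than the Tan et al. (2009) data, skips the novelty discovery step, and fixes the SVM parameters (sigma = 0.1, cost = 1) to arbitrary values chosen only to keep the example short. The exp() scaling is likewise just one possible mapping from classification score to point size.

```r
library(pRoloc)
library(pRolocdata)
data(dunkley2006)

## Assumption: fixed, non-optimized SVM parameters, for brevity only.
## In a real analysis they would come from svmOptimisation().
res <- svmClassification(dunkley2006, fcol = "markers",
                         sigma = 0.1, cost = 1)

## Classification scores are stored in the 'svm.scores' feature variable.
## Map them to point sizes so that confident assignments are drawn larger.
ptsze <- exp(fData(res)$svm.scores) - 1

## Scatter plot along the first two principal components,
## coloured by predicted class and sized by score.
plot2D(res, fcol = "svm", cex = ptsze)
```

Proteins assigned with low scores are drawn as small points, which makes uncertain assignments easy to spot before any score threshold is applied.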
networks (Tardif et al., 2012) and naive Bayes (Nikolovski et al., 2012), all available in pRoloc. In addition, other novel algorithms are proposed, such as PerTurbo (Courty et al., 2011). We have compared and contrasted these algorithms using reliable marker sets and demonstrate in the package documentation that the driving factor for good classification is the intrinsic quality of the data itself, i.e. efficient cellular content separation, accurate quantitation (Jakobsen et al., 2011), etc., illustrating the minor importance of the classification algorithm relative to thorough data exploration and quality control. While the exact algorithm might not be the major reason for a good analysis, it remains essential to apply the chosen algorithm optimally. A central design decision in the development of the classification schema was to explicitly implement model parameter optimization routines to maximize the generalization power of the results.

3 A TYPICAL PIPELINE

A typical pipeline is summarized below using data from Arabidopsis thaliana callus (Dunkley et al., 2006). We first load the required packages and example data. The phenoDisco function is then run to identify new putative clusters that, after validation (the pd.markers feature meta-data), can be used for the classification using the SVM algorithm (with a Gaussian kernel). The algorithm's parameters are first optimized and then applied in the actual classification. Finally, the plot2D function is used to generate an annotated scatter plot along the first two principal components (Fig. 1).

library(pRoloc)
library(pRolocdata)
data(dunkley2006)
res <- phenoDisco(dunkley2006)
p <- svmOptimisation(res, fcol="pd.markers")
res <- svmClassification(res, p,
                         fcol="pd.markers")
plot2D(res, fcol="svm")

4 CONCLUSIONS

The need for statistically sound proteomics data analysis has spawned interest in the proteomics community (Gatto and Christoforou, 2013) for R and Bioconductor (Gentleman et al., 2004). pRoloc is a mature R package that provides users with dedicated data infrastructure, visualization functionality and state-of-the-art machine-learning methodologies, enabling unparalleled insight into experimental spatial proteomics data. It is also a framework to further develop spatial proteomics data analysis and novel pipelines. Multiple organelle proteomics datasets illustrating diverse experimental designs are available in pRolocdata. Both packages come with thorough documentation and represent a unique framework for sound and reproducible organelle proteomics data analysis.

Funding: European Union 7th Framework Program (PRIME-XS project, grant agreement number 262067); BBSRC Tools and Resources Development Fund (Award BB/K00137X/1); Prospectom project (Mastodons 2012 CNRS challenge).

Conflict of Interest: none declared.

REFERENCES

Breckels,L. et al. (2013) The effect of organelle discovery upon sub-cellular protein localisation. J. Proteom., 88, 129–140.
Courty,N. et al. (2011) Perturbo: a new classification algorithm based on the spectrum perturbations of the laplace-beltrami operator. In: Gunopulos,D. et al. (ed.) The Proceedings of ECML/PKDD (1). Vol. 6911 of Lecture Notes in Computer Science, pp. 359–374. Springer-Verlag, Berlin Heidelberg.
Dunkley,T. et al. (2006) Mapping the arabidopsis organelle proteome. Proc. Natl Acad. Sci. USA, 103, 6518–6523.
Gatto,L. and Christoforou,A. (2013) Using R and Bioconductor for proteomics data analysis. Biochim. Biophys. Acta., 1844 (1 Pt A), 42–51.
Gatto,L. and Lilley,K.S. (2012) MSnbase – an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics, 28, 288–289.
Gatto,L. et al. (2010) Organelle proteomics experimental designs and analysis. Proteomics, 10, 3957–3969.
Gentleman,R.C. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, 80.
Jakobsen,L. et al. (2011) Novel asymmetrically localizing components of human centrosomes identified by complementary proteomics methods. EMBO J., 30, 1520–1535.
Kau,T. et al. (2004) Nuclear transport and cancer: from mechanism to intervention. Nat. Rev. Cancer, 4, 106–117.
Laurila,K. et al. (2009) Prediction of disease-related mutations affecting protein localization. BMC Genomics, 10, 122.
Nikolovski,N. et al. (2012) Putative glycosyltransferases and other plant golgi apparatus proteins are revealed by LOPIT proteomics. Plant Physiol., 160, 1037–1051.
Ohta,S. et al. (2010) The protein composition of mitotic chromosomes determined using multiclassifier combinatorial proteomics. Cell, 142, 810–821.
Park,S. et al. (2011) Protein localization as a principal feature of the etiology and comorbidity of genetic diseases. Mol. Syst. Biol., 7, 494.
Tan,D. et al. (2009) Mapping organelle proteins and protein complexes in drosophila melanogaster. J. Proteome Res., 8, 2667–2678.
Tardif,M. et al. (2012) PredAlgo: a new subcellular localization prediction tool dedicated to green algae. Mol. Biol. Evol., 29, 3625–3639.
Trotter,M. et al. (2010) Improved sub-cellular resolution via simultaneous analysis of organelle proteomics data across varied experimental conditions. Proteomics, 10, 4213–4219.