Multivariate Data Integration Using R: Methods and Applications with the mixOmics Package, 1st Edition, Kim-Anh Lê Cao
Multivariate Data
Integration Using R
Computational Biology Series
About the Series:
This series aims to capture new developments in computational biology, as well as high-quality work summarizing or
contributing to more established topics. Publishing a broad range of reference works, textbooks, and handbooks, the series is
designed to appeal to students, researchers, and professionals in all areas of computational biology, including genomics,
proteomics, and cancer computational biology, as well as interdisciplinary researchers involved in associated fields, such
as bioinformatics and systems biology.
Virus Bioinformatics
Dmitrij Frishman, Manuela Marz
Multivariate Data Integration Using R: Methods and Applications with the mixOmics Package
Kim-Anh Lê Cao, Zoe Marie Welham
Bioinformatics
A Practical Guide to NCBI Databases and Sequence Alignments
Hamid D. Ismail
For more information about this series please visit:
https://www.routledge.com/Chapman--HallCRC-Computational-Biology-Series/book-series/CRCCBS
Multivariate Data
Integration Using R
Methods and Applications with the
mixOmics Package
Kim-Anh Lê Cao
Zoe Marie Welham
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including
photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003026860
This book has been prepared from camera-ready copy provided by the authors.
From Kim-Anh Lê Cao:
To my parents, Betty and Huy Lê Cao
To the mixOmics team and our mixOmics users,
And to my co-author Zoe Welham without whom this book would not have existed.
Contents
Preface
Authors
Bibliography
Index
Preface
This book is suitable for biologists, computational biologists, and bioinformaticians who
generate and work with high-throughput omics data. Such data include – but are not
restricted to – transcriptomics, epigenomics, proteomics, metabolomics, the microbiome,
and clinical data. Our book is dedicated to research postgraduate students and scientists at
any career stage, and can be used for teaching specialised multi-disciplinary undergraduate
and Master’s courses. Data analysts with a basic level of R programming will benefit most
from this resource. The book is organised into three distinct parts, where each part can be
skimmed according to the level and interest of the reader. Each chapter contains different
levels of information, and the most technical chapters can be skipped during a first read.
The mixOmics package focuses on multivariate analysis which examines more than two
variables simultaneously to integrate different types of variables (e.g. genes, proteins,
metabolites). We use dimension reduction techniques applicable to a wide range of data analysis
types. Our analyses can be descriptive, exploratory, or focus on modeling or prediction.
[Figure 1: schematic with columns Single ‘omics, N-integration and P-integration, and rows Input,
Multivariate methods (e.g. PCA*, PLS*), Graphics and Variable plots. Legend: supervised method;
* variable selection.]
FIGURE 1: Overview of the methods implemented in the mixOmics package for the exploration
and integration of multiple data sets. This book aims to guide the data analyst in
constructing the research question, applying the appropriate multivariate techniques, and
interpreting the resulting graphics.
Our aim is to summarise these large biological data sets to elucidate similarities between
samples, between variables, and the relationship between samples and variables. The mixOmics
package provides a range of methods to answer different kinds of biological questions, for
example to:
• Highlight patterns pertaining to the major sources of variation in the data (e.g. Principal
Component Analysis),
• Segregate samples according to their known group and predict group membership of new
samples (e.g. Partial Least Squares Discriminant Analysis),
• Identify agreement between multiple data sets (e.g. Canonical Correlation Analysis,
Partial Least Squares regression, and other variants),
• Identify molecular signatures across multiple data sets with sparse methods that achieve
variable selection.
Methods in mixOmics are based on matrix factorisation techniques, which offer great
flexibility in analysing and integrating multiple data sets in a holistic manner. We use
dimension reduction combined with feature selection to summarise the main characteristics
of the data and posit novel biological hypotheses.
Dimension reduction is achieved by combining all original variables into a smaller number of
artificial components that summarise patterns in the original data.
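As a rough illustration of this idea, the sketch below builds a single component as a weighted sum of the centred variables, with the weights found by power iteration on the sample covariance matrix (that is, a first principal component). It is written in plain Python rather than with R or mixOmics, and the function name `first_component` and the toy data are assumptions made for this example only.

```python
import math

def first_component(X, n_iter=200):
    """Return (weights, scores) of the first principal component of X.

    X is a list of samples (rows), each a list of variables (columns).
    """
    n, p = len(X), len(X[0])
    # Centre each variable (column) on its mean.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # Sample covariance matrix (p x p).
    cov = [[sum(r[j] * r[k] for r in Xc) / (n - 1) for k in range(p)]
           for j in range(p)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        v = [sum(cov[j][k] * w[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    # Each sample's score is a linear combination of its centred variables.
    scores = [sum(w[j] * r[j] for j in range(p)) for r in Xc]
    return w, scores

# Toy data: 6 samples, 3 variables; the first two variables are strongly related.
X = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.4],
    [3.0, 6.2, 0.6],
    [4.0, 7.8, 0.5],
    [5.0, 10.1, 0.4],
    [6.0, 12.0, 0.6],
]
weights, scores = first_component(X)
print("weights:", [round(w, 3) for w in weights])
print("scores:", [round(s, 3) for s in scores])
```

The three original variables are replaced by one score per sample, and the weights indicate how much each variable contributes to that component.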
The mixOmics package is unique in providing novel multivariate techniques that enable
feature selection to identify molecular signatures. Feature selection refers to identifying
variables that best explain, or predict, the outcome variable (e.g. group membership, or
disease status) of interest. Variables deemed irrelevant according to the specific statistical
criterion we use in the methods are not taken into account when calculating the components.
Data integration methods use data projection techniques to maximise the covariance, or the
correlation between, omics data sets. We propose two types of data integration, whether on
the same N samples, or on the same P variables (Figure 1).
Finally, our methods can provide either unsupervised or supervised analyses. Unsupervised
analyses are exploratory: any information about sample group membership, or outcome,
is disregarded, and data are explored based on their variance or correlation structure.
Supervised analyses aim to segregate sample groups known a priori (e.g. disease status,
treatments) and identify variables (i.e. biomarker candidates, or molecular signatures) that
either explain or separate sample groups.
These concepts will be explained further in Part I.
To aid in interpreting analysis results, mixOmics provides insightful graphical plots designed
to highlight patterns in both the sample and variable dimensions uncovered by each method
(Figure 1).
Each mixOmics method corresponds to an underlying statistical model. However, the methods
we present radically differ from univariate formulations as they do not test one variable at a
time, or produce p-values. In that sense, multivariate methods can be considered exploratory
as they do not enable statistical inference. Our wide range of methods comes in many different
flavours and can also be applied for predictive purposes, as we detail in this book. ‘Classical’
univariate statistical inference methods can still be used in our analysis framework after
the identification of molecular signatures, as our methods aim to generate novel biological
hypotheses.
Who is ‘mixOmics’?
The mixOmics project has been developed between France, Australia and Canada since 2009,
when the first version of the package was submitted to CRAN (the Comprehensive R Archive
Network, https://www.cran.r-project.org). Our team is composed of
core members from the University of Melbourne, Australia, and the Université de Toulouse,
France. The team also includes several key contributors and collaborators.
The package implements more than nineteen multivariate and sparse methodologies for
omics data exploration, integration, and biomarker discovery for different biological settings,
amongst which thirteen were developed by our team (see our list of publications in Section
14.8). Originally, all methods were designed for omics data; however, their application is
not limited to biological data. Other applications where integration is required can be
considered, but mostly for cases where the predictor variables are continuous.
The package is currently available from Bioconductor, with a development version available
on GitHub. We continue to maintain and improve the package via new methods, code
optimisation and efficient memory storage of R objects.
In addition to the R package, the mixOmics project includes a website with extensive
tutorials at http://www.mixOmics.org. The R code of each chapter is also available on
the website. Our readers can also register for our newsletter mailing list, and be part of
the mixOmics community on GitHub and via our discussion forum
https://mixomics-users.discourse.group/.
Authors
Dr Kim-Anh Lê Cao develops novel methods, software and tools to interpret big biological
data and answer research questions efficiently. She is committed to statistical education to
instill best analytical practice and has taught numerous statistical workshops for biologists
and leads collaborative projects in medicine, fundamental biology or microbiology disciplines.
Dr Kim-Anh Lê Cao has a mathematical engineering background and graduated with a PhD
in Statistics from the Université de Toulouse, France. She is currently an Associate Professor
in Statistical Genomics at the University of Melbourne. In 2019, Kim-Anh received the
Australian Academy of Science’s Moran Medal for her contributions to Applied Statistics in
multidisciplinary collaborations. She has contributed to a leadership program for women
in STEMM, including the international Homeward Bound which culminated in a trip to
Antarctica, and Superstars of STEM from Science Technology Australia.
Zoe Welham completed a BSc in molecular biology and during this time developed an
interest in the analysis of big data. She completed a Master of Bioinformatics with a focus
on the statistical integration of different omics data in bowel cancer. She is currently a PhD
candidate at the Kolling Institute in Sydney where she is furthering her research into bowel
cancer with a focus on integrating microbiome data with other omics to characterise early
bowel polyps. Her research interests include bioinformatics and biostatistics for many areas
of biology and making that information accessible to the general public.
Part I
1
Multi-omics and biological systems
DOI: 10.1201/9781003026860-1
[Figure 1.1: schematic of four omics layers (RNA: transcriptomics; protein: proteomics;
metabolite: metabolomics; microbe: metagenomics) across three stages: conventional molecular
biology (reductionist), single-omics (hypothesis free) and multi-omics (holistic), linked by
hypothesis generating.]
FIGURE 1.1: From reductionism to holism. Until recently, only a few molecules of a
given omics type were analysed and related to other omics. The advent of high-throughput
biology has ushered in an era of hypothesis-free approaches within a single type of omics
data, and across multiple omics from the same set of samples. A holistic approach is now
required to understand the different omic functional layers in a biological system and posit
novel hypotheses that can be further validated with a traditional reductionist approach.
We have omitted DNA as this data type needs to be handled differently in mixOmics, see
Section 4.2.3.
become cumbersome when dealing with thousands of variables that are considered in a
pairwise manner.
A multivariate analysis examines more than two variables simultaneously and potentially
thousands at a time. In omics studies, this approach can lead to computational issues and
inaccurate results, especially when the number of samples is much smaller than the number of
variables. Several computational and statistical techniques have been revisited or developed
for high-dimensional data. This book focuses on multivariate analyses and extends this to
include the integration of multi-omics data sets.
The fundamental difference between multivariate and univariate analysis lies in the scope of
the results obtained. Multivariate analysis can unravel groups of variables that share similar
patterns in expression across different phenotypes, thus complementing each other to describe
an outcome. A univariate analysis may declare the same variables as non-significant, as a
variable’s ability to explain the phenotype may be subtle and can be masked by individual
variation, or confounders (Saccenti et al., 2014). However, with sufficiently powered data,
univariate and multivariate methods are complementary and can help make sense of the
data. For example, several multivariate and exploratory methods presented in this book can
suggest promising candidate variables that can be further validated through experiments,
reductionist approaches, and inferential statistics.
Despite the potential advantages of high-dimensional data, we should keep in mind that
quantity does not equal quality. Multivariate data integration is not straightforward: the
analyses cannot be reduced to a mere concatenation of variables of different types, or by
overlapping information between single data sets, as we illustrate in Figure 1.2. As such, we
must shift our traditional view of analysing data.
[Figure 1.2: schematic of four families of integration methods applied to transcriptomics,
proteomics and metabolomics data: matrix factorisation, Bayesian (priors leading to an inferred
posterior), network-based, and multiple steps (single-omics analysis followed by
correlation/overlap).]
FIGURE 1.2: Types of methods for data integration. Methods for multi-omics data
integration are still in active development, and can be broadly categorised into matrix
factorisation techniques (the focus of this book), Bayesian, network-based, and multiple-
step approaches. The latter deviates from data integration as it considers each data set
individually before combining the results.
Biological experimentation often employs univariate statistics to answer clear hypotheses
about the potential causal effect of a given molecule of interest. In high-dimensional data
sets, this reductionist approach may not hold due to the sheer amount of molecules that are
monitored, and their interactions that might be of biological interest. Therefore, exploratory,
data-driven approaches are needed to extract information from noise and generate new
hypotheses and knowledge. However, the lack of a clear, causal-driven hypothesis presents a
challenging new paradigm in statistical analyses.
In univariate hypothesis testing, we report p-values to determine the significance of a
statistical test conducted on a single variable. In a multivariate setting, however, a p-
value assesses the statistical significance of a result while taking into account all variables
simultaneously. In such analyses, permutation-based tests are common to assess how far
from random a result is when the data are reshuffled, but other inference-based methods
are currently being developed in the field of multivariate analysis (Wang and Xu, 2021). In
mixOmics we do not offer such tests, but related methods propose permutation approaches
to choose the parameters in the method (see Section 10.7.5).
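The reshuffling logic behind a permutation test can be sketched in a few lines. The example below is a generic illustration in plain Python, not a mixOmics function: the statistic (an absolute difference in group means), the function name `permutation_p_value` and the toy values are all assumptions chosen for simplicity. The same scheme applies to any statistic, including one computed from all variables at once.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=2000, seed=1):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reshuffle the group labels
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical measurements of one quantity in two well-separated groups.
treated = [5.1, 4.9, 5.3, 5.0, 5.2]
control = [3.0, 3.2, 2.9, 3.1, 3.3]
p_separated = permutation_p_value(treated, control)
print("p-value, separated groups:", p_separated)
```

A small p-value indicates that the observed statistic is rarely matched once the labels are reshuffled, i.e. that the result is far from random.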
1.4.1 Overfitting
Multivariate omics analysis assesses many molecules that individually, or in combination, can
explain the biological outcome of interest. However, these associations may be spurious, as the
large number of features can often be combined in different ways to explain the outcome well,
despite having no biological relevance. Overfitting occurs when a statistical model captures
the noise along with the underlying pattern in the data: if we apply the same statistical
model fitted on a high-dimensional data set to a similar but external study, we might obtain
different results (see footnote 1). The problem of overfitting is a well-known issue in high-throughput biology
(Hawkins, 2004). We can assess the amount of overfit using cross-validation or subsampling
of the data, as described in Chapter 7.
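To make the risk concrete, the sketch below fits a simple nearest-centroid classifier to pure noise with many more variables than samples, then compares its accuracy on the training data with its leave-one-out cross-validated accuracy. The classifier, the simulated data and the function names are assumptions chosen for brevity; this is not one of the methods in mixOmics.

```python
import random

def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose training centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def accuracy_train_and_loo(X, y):
    """Return (training accuracy, leave-one-out cross-validated accuracy)."""
    n = len(X)
    train_hits = sum(nearest_centroid_predict(X, y, X[i]) == y[i]
                     for i in range(n))
    loo_hits = 0
    for i in range(n):
        # Refit without sample i, then predict the held-out sample.
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        loo_hits += nearest_centroid_predict(rest_X, rest_y, X[i]) == y[i]
    return train_hits / n, loo_hits / n

rng = random.Random(0)
n_samples, n_vars = 20, 500        # many more variables than samples
X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_samples)]
y = [0] * 10 + [1] * 10            # labels unrelated to X: pure noise
train_acc, loo_acc = accuracy_train_and_loo(X, y)
print("training accuracy:", train_acc)
print("leave-one-out accuracy:", loo_acc)
```

Because the labels carry no signal, a large gap between the two accuracies reflects overfitting alone: the model has captured noise that does not generalise to held-out samples.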
As the number of variables increases, the number of pairwise correlations also increases.
Multi-collinearity poses a problem in most statistical analyses as correlated variables bring redundant
and noisy information that decreases the precision of the statistical model. Correlations
in high-throughput data sets are often spurious, especially when the number of biological
samples, or individuals N, is small compared to the number of variables P (see footnote 2). The ‘small N
large P’ problem is defined as ill-posed, as standard statistical inference methods assume N
is much greater than P to generalise the results to the population the sample was drawn
from. Ill-posed problems also lead to inaccurate computations.
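Both points can be checked numerically. The sketch below first counts the number of distinct variable pairs, P(P - 1)/2, and then simulates independent random variables to show how large the strongest spurious correlation becomes when N is small. The simulation settings and function names are assumptions made for illustration only.

```python
import math
import random

def n_pairs(p):
    """Number of distinct variable pairs among p variables."""
    return p * (p - 1) // 2

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_abs_correlation(n, p, seed=0):
    """Largest absolute correlation among p independent variables on n samples."""
    rng = random.Random(seed)
    variables = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    return max(abs(pearson(variables[j], variables[k]))
               for j in range(p) for k in range(j + 1, p))

print(n_pairs(10), n_pairs(1000), n_pairs(20000))  # 45 499500 199990000

small_n = max_abs_correlation(n=10, p=100)    # few samples
large_n = max_abs_correlation(n=200, p=100)   # many samples
print("max |r| with N = 10:", round(small_n, 2))
print("max |r| with N = 200:", round(large_n, 2))
```

Although every variable is generated independently, the strongest pairwise correlation is much larger with few samples than with many, which is why correlations found in small N, large P data should be interpreted with caution.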
Data sets may contain a large number of zeros, depending on the type of omics studied
and the platform that is used. This is particularly the case for microbiome, proteomics, and
metabolomics data: a large number of zeros results in zero-inflated (skewed) data, which
can impair methods that assume a normal distribution of the data. Structural zeros, or true
zeros, reflect a true absence of the variable in the biological environment while sampling
zeros, or false zeros, may not reflect reality due to experimental error, technological reasons,
or an insufficient sample size (Blasco-Moreno et al., 2019). The challenge is whether to
consider these zeros as a true zero or missing (coded as NA in R).
Methods that can handle missing values often assume they are ‘missing at random’, i.e.
missingness is not related to a specific sample, individual, molecule, or type of omics platform.
Some methods can estimate missing values, as we present in Appendix 9.A.
1 Statistical models that overfit have low bias and high variance, meaning that they tend to be complex to
fit the training data well, but do not predict well on test data (more details about the bias-variance tradeoff
can be found in Friedman et al. (2001) Chapter 2).
2 In our context, N can also refer to the number of cells in single cell assays, as we briefly mention in
Section 14.6.
Examining data holistically may lead to better biological understanding, but integrating
multiple omics data sets is not a trivial task and raises another series of challenges.
Different omics rely on different laboratory techniques and data extraction platforms, resulting
in data sets of different formats, complexity, dimensionalities, information content, and scale,
and may be processed using different bioinformatics tools. Therefore, data heterogeneity
arises from biological and technical reasons and is the main analytical challenge to overcome.
Integrating multiple omics results in a drastic increase in the number of variables. A filtering
step is often applied to remove irrelevant and noisy variables (see Section 8.1). However, the
number of variables P still remains extremely large compared to the number of samples N,
which raises computational as well as analytical issues.
1.5.3 Platforms
The data integration field is constantly evolving due to ever-advancing technologies with new
platforms and protocols, each containing inherent technical biases and analytical challenges.
It is crucial that data analysts swiftly adapt their analysis framework to keep apace with
these omics-era demands. For example, single cell techniques are rapidly advancing, as are
new protocols for their multi-omics analysis.
The field of data integration has no set definition. Data integration can be managed
biologically, bioinformatically, statistically, or at the interpretation steps (i.e. by overlapping
biological interpretation once the statistical results are obtained). Expectations for data
integration are therefore diverse, ranging from exploration to a low- or high-level understanding
of the different omics data types. Despite recent advances in single cell sequencing, current
technologies are still limited in their ability to parse omics interactions at precise functional
levels. Thus, our expectations for data integration are limited, not only by the statistical
methods but also by the technologies available to us.
Integrative techniques fully suited to multi-omics biological data are still in development and
continue to expand (see footnote 3). Different types of techniques can be considered and broadly categorised
into (Huang et al. (2017), Figure 1.2):
• Matrix factorisation techniques, where large data sets are decomposed into smaller
sub-matrices to summarise information. These techniques use algebra and analysis to
optimise specific statistical criteria and integrate different levels of information. Methods
in mixOmics fit into this category and will be detailed in Chapter 3 and subsequent
chapters,
• Bayesian methods, which use assumptions of prior distributions for each omics type to
find correlations between data layers and infer posterior distributions,
• Network-based approaches, which use visual and symbolic representations of biological
systems, with nodes representing molecules and edges as correlations between molecules,
if they exist. Network-based methods are mostly applied for detecting significant genes
within pathways, discovering sub-clusters, or finding co-expression network modules,
• Multiple-step approaches that first analyse each single omics data set individually
before combining the results based on their overlap (e.g. at the gene level of a molecular
signature) or correlation. This type of approach technically deviates from data integration
but is commonly used.
1.6 Summary
Modern biological data are high dimensional; they include up to thousands of molecular
entities (e.g. genes, proteins, or epigenetic markers) per sample. Integrating these rich
data sets can potentially uncover the hierarchical and holistic mechanisms that govern
biological pathways. While classical, reductionist, univariate methods ignore these molecular
interactions, multivariate, integrative methods offer a promising alternative to obtain a
more complete picture of a biological system. Thus, univariate and multivariate methods
are different approaches with very little overlap in results but have the advantage of
complementarity.
The advent of high-throughput technology has revealed a complex world of multi-omics
molecular systems that can be unraveled with appropriate integration methods. However,
multivariate methods able to manage high-dimensional and multi-omics data are yet to
be fully developed. The methods presented in this book mitigate some of these challenges
and will help to reveal patterns in omics data, thus forging new insights and directions for
understanding biological systems as a whole.
awesome-multi-omics.
2
The cycle of analysis
The Problem, Plan, Data, Analysis, Conclusion (PPDAC) cycle is a useful framework for
answering an experimental question effectively (Figure 2.1). The mixOmics project emphasises
crafting a well-defined biological question (Chapter 4), as this guides data acquisition
and preparation (Chapter 8), as well as choosing appropriate multivariate techniques for
analysis (Chapter 4). Although this book is focused on analysis and interpretation, careful
consideration of each step will maximise the chance of a successful analytical outcome.
[Figure 2.1: cycle diagram linking Problem, Plan, Data, Analysis and Conclusions.]
FIGURE 2.1: PPDAC. The Problem, Plan, Data, Analysis, Conclusion cycle proposed
by MacKay and Oldford (2000) will guide our multivariate analysis process.
Multivariate analysis is appropriate for large data sets where the biological question
encompasses a broad domain, rather than parsing the action of a single or small number of
variables. Thus, we often require a hypothesis-free investigation based on a data-driven
approach. However, this does not imply that multivariate analysis is a fishing expedition
with no underlying biological question. The experimental design, driven by a well-formulated
biological question and the choice of statistical method, will ensure a successful analysis
(Shmueli, 2010). Chapter 4 lists several types of biological questions that can be answered
with multivariate and integrative methods.
DOI: 10.1201/9781003026860-2
– Sir Ronald Fisher, Statistician, Presidential Address to the First Indian Statistical
Congress, 1938
The underlying biological question will narrow the choice of appropriate omics technology
(see Chapter 4), keeping in mind that each technology has its own artefacts and generates
noise that may mask the effect of interest. The type of organism under study will also affect
the amount of biological variation we expect to uncover. Similarly, the nature of the effect of
interest, whether subtle, or strong, will also impact the amount of variation we can extract
from the data. Taken together, these issues will affect the sample size needed to detect the
effect of interest (as discussed in Saccenti and Timmerman (2016) and Lenth (2001)).
Statistical power analyses are mostly valid for inferential and univariate tests but difficult to
estimate when the method considered for statistical analysis does not assume any specific
data distribution. As such, methods in mixOmics that are exploratory in nature do not fit
into a classical power analysis framework. This limitation is a double-edged sword. On the
one hand, any exploration on any sample size can be conducted to understand the amount
of variation present in the data, and whether this variation coincides with the biological
variation of interest. Such an approach can be useful for a pilot study to choose an omics
technology, for example. On the other hand, a small sample size will limit any follow-up
analysis: as the number of variables becomes very large, a small sample size is likely to lead
to overfitting if the analysis goes beyond exploration (discussed further in Section 1.4).
Appropriate sample sizes can be calculated if pilot data are available. However, if the aim
is to use multivariate methods, the calculation will rely on an empirical approach using
permutation tests (see footnote 1).
TABLE 2.1
      A    B
F    10    0
M     0   10
TABLE 2.2: Example where the number of samples with respect to the variable
of interest (treatment) and the covariate (sex) are balanced across treatments.
      A    B
F     5    5
M     5    5
Covariates are observed variables that affect the outcome of interest but may not be of
primary interest. For example, consider an experiment examining the effect of weight (primary
variable of interest) on gene expression. Subject sex, age, ethnicity, or medication intake are
all examples of covariates, that if assessed as having an effect on the data, must be taken
into account in the analysis.
Confounding occurs when a covariate is intimately linked with the primary variable of
interest. A typical example is when all females receive treatment A and all males treatment
B (Table 2.1). In this case, we are not able to differentiate whether gene expression is affected
by treatment, or sex, or both. Confounders can be avoided during the experimental planning
stage by ensuring that both treatments are evenly assigned across sex (Table 2.2).
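A design can be checked for this kind of confounding before any data are generated by cross-tabulating the covariate against the treatment. The sketch below encodes the two designs of Table 2.1 and Table 2.2 in plain Python; the helper names `cross_tab` and `is_fully_confounded` are hypothetical and introduced only for this example.

```python
from collections import Counter

def cross_tab(covariate, treatment):
    """Count samples for every (covariate level, treatment) combination."""
    return Counter(zip(covariate, treatment))

def is_fully_confounded(covariate, treatment):
    """True if each covariate level appears with only one treatment."""
    seen = {}
    for c, t in zip(covariate, treatment):
        seen.setdefault(c, set()).add(t)
    return all(len(treatments) == 1 for treatments in seen.values())

# Design of Table 2.1: all females receive A, all males receive B.
sex_1 = ["F"] * 10 + ["M"] * 10
trt_1 = ["A"] * 10 + ["B"] * 10

# Design of Table 2.2: treatments evenly assigned across sex.
sex_2 = ["F"] * 10 + ["M"] * 10
trt_2 = (["A"] * 5 + ["B"] * 5) * 2

for name, sex, trt in [("Table 2.1", sex_1, trt_1), ("Table 2.2", sex_2, trt_2)]:
    print(name, dict(cross_tab(sex, trt)),
          "fully confounded:", is_fully_confounded(sex, trt))
```

In the first design every cell off the diagonal is empty, so the effects of sex and treatment cannot be separated; in the second design every combination is observed.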
Batch effects refer to variation introduced by factors that are unrelated to the biological
variable of interest. These factors can be technical (e.g. day of experiment, technician,
sequencing run), computational (bioinformatics methods) or biological (birth dam, animal
facility), as described in Figure 2.2 and Table 2.3. The nature of the variation is defined as
systematic across batches; however, depending on the omics technology, this may not be the
case. For example, specific genes or micro-organisms might be more affected by one type of
batch compared to others (Wang and Lê Cao, 2019).
Batch effects can be mitigated with an appropriate experimental design so that their variation
does not overwhelm the biological variation. When strong batch effects cannot be avoided,
methods exist to correct or to account for batch effects (see Section 8.1).
1 Permutation tests build a sampling distribution based on the existing data by resampling them randomly.
TABLE 2.3: Example of a batch factor. Number of samples with respect to the
variable of interest (treatment) and the batch (sequencing run). Such a table gives a better
understanding of the experimental design when the batch information is recorded.
        A    B
Run1    5    1
Run2    5    9
[Figure 2.2: diagram of technical and computational sources of batch effects (e.g. sample
processing, DNA sequencing, software) and the types of methods that account or correct for them.]
FIGURE 2.2: Potential sources of batch effects. Examples of factors that may affect
the quality of microbiome data, from Wang and Lê Cao (2019).
Prior to analysis, the data sets should be normalised and pre-filtered using appropriate
techniques specific to the type of omics technology platform used. We briefly describe these
steps and further discuss relevant techniques in Chapter 8.
2.3.1 Normalisation
High-throughput biological data often results in samples with different sequencing depths
(RNA-sequencing) or overall expression levels (microarrays) due to technological platform
artefacts rather than true biological differences. Thus, data must be normalised for samples
to be comparable to one another. Moreover, highly-expressed genes may take up a large part
of sequenced reads in an experiment, with fewer reads remaining for other genes, leading to
a possibly incorrect conclusion that the latter genes have low expression levels (Robinson and
Oshlack, 2010). Many normalisation techniques have been proposed that are platform-specific
and constantly evolve with the latest technological, computational, and statistical advances.
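To make the idea of sequencing depth concrete, here is a minimal plain-Python sketch of one simple scaling, counts per million, in which each sample is divided by its total count. The sample names, gene names and counts are hypothetical, and this scaling is only an illustration: it does not address the composition effect described above, for which more refined platform-specific methods have been proposed.

```python
def counts_per_million(counts):
    """Scale each sample so that its counts sum to one million."""
    normalised = {}
    for sample, gene_counts in counts.items():
        depth = sum(gene_counts.values())  # sequencing depth of this sample
        if depth == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        normalised[sample] = {
            gene: 1e6 * count / depth for gene, count in gene_counts.items()
        }
    return normalised


# Hypothetical raw counts: sample_2 was sequenced twice as deeply as sample_1.
raw = {
    "sample_1": {"gene_a": 100, "gene_b": 300, "gene_c": 600},
    "sample_2": {"gene_a": 200, "gene_b": 600, "gene_c": 1200},
}
cpm = counts_per_million(raw)

# After scaling, the two samples are directly comparable.
for sample, values in cpm.items():
    print(sample, values)
```

In this toy example the raw counts differ by a factor of two only because of depth, so the scaled values of the two samples coincide.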
Analysis: Choose the right approach 15
2.3.2 Filtering
Whilst most multivariate methods can manage a large number of molecules (several tens of
thousands), it is often beneficial to remove those that are not expressed, or vary very little,
across samples. These variables are usually not relevant for the biological problem, add noise
to the multivariate analyses, and may increase computational time during parameter tuning.
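The filtering step can be sketched as follows, assuming a simple variance threshold as the criterion. The threshold value, gene names and expression values are hypothetical; in practice the cut-off depends on the technology and on the scale of the data.

```python
from statistics import pvariance


def filter_variables(data, min_variance=1e-8):
    """Keep variables whose variance across samples exceeds a threshold."""
    kept = {}
    for name, values in data.items():
        if pvariance(values) > min_variance:
            kept[name] = values
    return kept


# Hypothetical expression values for three genes measured on four samples.
expression = {
    "gene_a": [0.0, 0.0, 0.0, 0.0],  # not expressed in any sample
    "gene_b": [5.1, 5.1, 5.1, 5.1],  # expressed but constant
    "gene_c": [2.0, 8.5, 4.1, 9.7],  # varies across samples
}
filtered = filter_variables(expression)
print(list(filtered))
```

Only the gene that varies across samples is retained; the other two carry no information for a multivariate analysis.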
In biological terms, missing values refer to data that are not measured (see Section 1.4).
However, in practice, the definition of missing is often unclear. For example, a value might
be missing because it is not relevant biologically or because it did not pass the detection
threshold in a mass spectrometry experiment. Therefore, the data analyst must make
the choice of setting the missing value to a numerical 0 (i.e. the ‘absence’ is biologically
relevant), or ‘NA’ (i.e. technically missing), keeping in mind that these decisions will affect
the statistical analysis.
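The consequence of this choice can be seen on a small example. The sketch below uses hypothetical intensities, with Python's None standing in for 'NA', and compares the sample mean obtained under each coding.

```python
from statistics import mean


def mean_ignoring_missing(values):
    """Mean of the observed values only (missing values coded as None)."""
    observed = [v for v in values if v is not None]
    return mean(observed)


# Hypothetical intensities for one molecule; None marks an unmeasured value.
intensities = [4.0, None, 6.0, None, 8.0]

# Choice 1: the absence is biologically relevant, so code it as 0.
as_zero = [0.0 if v is None else v for v in intensities]
mean_zero = mean(as_zero)

# Choice 2: the value is technically missing, so leave it out.
mean_na = mean_ignoring_missing(intensities)

print(mean_zero, mean_na)
```

The two codings give different summaries of the same molecule, which is why the decision should be made explicitly before any analysis.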
Data analysis can be descriptive, exploratory, inferential, or include modelling and prediction.
A well-framed biological question will clarify which type of analysis is better suited to address
a biological problem.
Descriptive statistics precede any other type of analysis. They help to anticipate future
analytical challenges and obtain a basic understanding of the data. Descriptive statistics
solely describe the properties or characteristics of the data without making any assumptions
that the data were sampled from a population. In descriptive analysis, we use summary
statistics or visualisations to either obtain an initial description of the data, or investigate
a particular aspect of the data. Univariate summary statistics can include calculating the
sample mean and standard deviation, or graphing boxplots of a single variable. Bivariate
summary statistics can include the calculation of a correlation coefficient, or a scatterplot
representing the expression values of two genes. When dealing with a large number of
variables, such summaries can become cumbersome and difficult to interpret, which may be
overcome by using exploratory methods.
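As an illustration of these summaries, the sketch below computes a sample mean, a standard deviation and a Pearson correlation coefficient for two genes. The expression values are hypothetical, and the correlation is written out by hand so that the calculation is visible.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient between two variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical expression values of two genes on four samples.
gene_1 = [2.0, 4.0, 6.0, 8.0]
gene_2 = [1.0, 3.0, 2.0, 5.0]

# Univariate summaries of a single variable.
gene_1_mean = mean(gene_1)
gene_1_sd = stdev(gene_1)

# Bivariate summary of the pair of variables.
r = pearson(gene_1, gene_2)
print(gene_1_mean, gene_1_sd, r)
```

With P variables there are P(P − 1)/2 such pairwise correlations, which shows why these summaries quickly become cumbersome.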
Exploratory statistics aim to summarise and provide insights into the data before proceeding
with the analysis. This step does not lay any emphasis on an underlying biological hypothesis
and may not require a specific data distribution. Typically, such analyses can be conducted
to better understand the major sources of variation in the data by using dimension reduction
methods such as Principal Component Analysis (Jolliffe, 2005). More sophisticated integrative
methods can also be applied to highlight the correlation structure between variables from
different data sets, for example with Canonical Correlation Analysis (Vinod, 1976), or
Projection to Latent Structures, also called Partial Least Squares (Wold, 1966).
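To give a feel for dimension reduction, here is a minimal plain-Python sketch that extracts the first principal component of a small data set by power iteration on the sample covariance matrix. This is one of several ways to compute the component and is not the algorithm used by any particular package; the data values and the number of iterations are assumptions made for illustration.

```python
from math import sqrt


def first_principal_component(rows, n_iter=200):
    """Return the unit loading vector of the first component and the centred data."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    v = [1.0] * p  # arbitrary starting vector
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then rescale to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centred


# Hypothetical data: four samples, two variables that are strongly correlated.
samples = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
loadings, centred = first_principal_component(samples)

# Project each sample onto the component to obtain its score.
scores = [sum(c[j] * loadings[j] for j in range(2)) for c in centred]
print(loadings, scores)
```

The two variables are summarised by a single score per sample, which captures the major source of variation in these toy data.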
Inferential statistics allow us to draw conclusions about a population from a sample taken
from it. They are typically used with univariate statistics. If we assume that a sample is representative of the
population it was drawn from, then the conclusions reached from the statistical test should
generalise to the whole population. In univariate methods, we conduct hypothesis testing,
calculate test statistics and compare them to a theoretical statistical distribution in order to
infer that our conclusions indeed could be true for the whole population.
In the omics era, statistical inference is difficult to achieve and is often unreliable due to
an insufficient sample size. Inferential statistics can be achieved with multivariate methods,
but only when the number of samples, or individuals, is much larger than the number of
variables, and when the multivariate methods fit into an inferential statistical framework.
Another approach in data analysis is to fit a statistical model, i.e. a mathematical and
often simplified formula, to the data. The ultimate aim of modelling is to use the fitted
model to make predictions on a new set of data, when applicable. Here is an example of
an equation for a simple linear regression that models a dependent variable (or outcome
variable, e.g. a phenotype) with an explanatory variable (also referred to as an independent
variable, or predictor, e.g. the expression levels of a gene):

$$\text{phenotype} = a \times \text{gene} + \text{intercept} \qquad (2.1)$$

In model (2.1), a is a coefficient representing the slope of the line fitted if we were to plot
the gene expression levels along the x-axis and the phenotype along the y-axis (Figure 2.3).
Therefore, we assume the relationship between the dependent variable phenotype and the
independent variable gene expression is linear. Using an ordinary least squares approach
(OLS²), we can estimate the intercept or offset value (here estimated at −58.95), and the
slope a of this linear relationship (here estimated at 6.71). We often refer to this model as a
univariate linear regression.
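The OLS estimates of a simple linear regression have a closed form, sketched below in plain Python. The gene expression values are hypothetical, and the phenotype values are generated without noise from the slope and intercept quoted above, purely so that the fit can be checked against known values; they are not the data shown in Figure 2.3.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return the OLS slope and intercept of the regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical gene expression levels for five individuals.
gene = [30.0, 32.0, 35.0, 38.0, 40.0]
# Noise-free phenotype values built from the estimates quoted in the text.
phenotype = [6.71 * g - 58.95 for g in gene]

slope, intercept = fit_simple_ols(gene, phenotype)
print(slope, intercept)
```

With real data the points scatter around the line, and the same formulas return the slope and intercept that minimise the sum of squared differences between observed and fitted values.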
A multivariate linear regression model includes several explanatory variables or predictors.
For example in a transcriptomics experiment, if we had measured the expression levels of P
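genes:

$$\text{phenotype} = a_1 \times \text{gene}_1 + a_2 \times \text{gene}_2 + \cdots + a_P \times \text{gene}_P + \text{intercept} \qquad (2.2)$$

where the intercept and regression coefficients $(a_1, a_2, \ldots, a_P)$ can be estimated using the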
OLS method. However, when the number of predictors P becomes much larger than the
number of samples, the algebra to estimate these parameters becomes complex and sometimes
numerically impossible to solve due to multi-collinearity (see Section 1.4). A scatterplot
representation of all possible pairs of variables is often impractical and not easy to interpret.

FIGURE 2.3: Example of a linear regression fit between the expression levels of a
given gene (along the x-axis) and the phenotype value (along the y-axis) for 20 individuals.
The model from the linear regression equation is fitted to the data points and is represented
as a blue line using the Ordinary Least Squares method. Red full circles represent new gene
expression values from which we can predict the phenotype of two new individuals, based
on this fitted model.

2 OLS estimates the unknown parameters in a linear regression model (e.g. the slope (a) and the intercept)
by estimating values for these parameters that minimise the sum of the squares of the differences between
the observed values of the variable and those predicted by the linear equation.

The methods available in mixOmics use similar types of models as Equation (2.2), but
instead of a separate OLS approach, we use Partial Least Squares approaches to manage high
dimensional data, as described in detail in Chapter 5. We also use other techniques such as
ridge and lasso regularisation to manage the large number of variables (see Section 3.3).
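The numerical problem caused by multi-collinearity, and the way a ridge penalty sidesteps it, can be illustrated with two predictors. The sketch below is a toy example in plain Python and not the mixOmics implementation: the data are hypothetical, both predictors and the response are centred so that no intercept is needed, and the penalty value of 1 is an arbitrary choice.

```python
def gram_matrix(x1, x2):
    """Return the 2 x 2 matrix X^T X for two centred predictors."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    return [[s11, s12], [s12, s22]]


def determinant(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def ridge_coefficients(x1, x2, y, penalty):
    """Solve (X^T X + penalty * I) a = X^T y for two predictors."""
    g = gram_matrix(x1, x2)
    g[0][0] += penalty
    g[1][1] += penalty
    det = determinant(g)
    b1 = sum(a * t for a, t in zip(x1, y))
    b2 = sum(a * t for a, t in zip(x2, y))
    # Explicit inverse of the 2 x 2 penalised matrix.
    a1 = (g[1][1] * b1 - g[0][1] * b2) / det
    a2 = (g[0][0] * b2 - g[1][0] * b1) / det
    return a1, a2


# Hypothetical centred data: gene_2 is an exact multiple of gene_1.
gene_1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
gene_2 = [2.0 * v for v in gene_1]
y = [-4.1, -1.9, 0.2, 2.0, 3.8]

# OLS would need to invert X^T X, but its determinant is zero here.
ols_det = determinant(gram_matrix(gene_1, gene_2))

# Adding a penalty to the diagonal makes the system solvable.
a1, a2 = ridge_coefficients(gene_1, gene_2, y, penalty=1.0)
print(ols_det, a1, a2)
```

The zero determinant means OLS has no unique solution for these data, whereas the penalised system returns finite coefficients shared between the two collinear predictors.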
2.4.5 Prediction
Predictive statistics refers to the process of applying a statistical model or data mining
algorithm to training data to make predictions about new test data, often in terms of a
phenotype or disease status of those new samples³. In the univariate regression model in
Equation (2.1) that is fitted on training data, let us assume we also measured the expression
levels of that same gene but in another cohort of patients. We can then predict the phenotype
of those new patients based on that same equation. The new gene expression values for
the new samples are represented in full red circles in Figure 2.3, with their y-axis values
corresponding to the predicted phenotype.
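Prediction with the fitted univariate model amounts to plugging new gene expression values into the equation. The sketch below reuses the slope and intercept estimates quoted earlier in this section; the two new expression values are hypothetical and are not read from Figure 2.3.

```python
SLOPE = 6.71        # slope estimate quoted in the text
INTERCEPT = -58.95  # intercept estimate quoted in the text


def predict_phenotype(gene_expression):
    """Point prediction of the phenotype from the fitted linear model."""
    return SLOPE * gene_expression + INTERCEPT


# Hypothetical expression levels of the same gene in two new patients.
new_expression = [35.0, 40.0]
predicted = [predict_phenotype(x) for x in new_expression]
print(predicted)
```

Each prediction is a single value on the fitted line, with no accompanying interval.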
3 In our methods we focus on point estimates, i.e. single-value estimates of a parameter, rather than
interval estimates, i.e. ranges of values where the parameter is expected to lie.