PCA for hidden variable inference in QTL mapping
We have shown that PCA is faster, better performing, and much easier to interpret and use than popular hidden variable inference methods for QTL mapping, including SVA, PEER, and HCP (2022). Here we aim to provide some guidance on how to use PCA for hidden variable inference in QTL mapping, and in particular how to choose K (the number of PCs), using the functions provided in this package. This package implements two simple, highly interpretable methods for choosing the number of PCs: an automatic elbow detection method (based on distance to the diagonal line) and the Buja and Eyuboglu (BE) algorithm (1992), a permutation-based approach. Detailed descriptions of both methods can be found in our paper (2022).
#install.packages("devtools")
devtools::install_github("heatherjzhou/PCAForQTL")
We start with the fully processed molecular phenotype matrix. In this tutorial, we use the fully processed gene expression matrix for Colon - Transverse from GTEx V8 (2020) as an example. In GTEx’s case, “fully processed” means TPM normalized, filtered, TMM normalized, and inverse normal transformed (see link; depending on your specific situation, your procedure may be different). The data set can be obtained directly from link and is also made available under the folder named Example_Data in this repository in a format that can be easily read into R.
First, download the example data set and load it into the environment (change the path as necessary on your device).
dataGeneExpressionFP<-readRDS("./Example_Data/Colon_Transverse.v8.normalized_expression.rds") #25,379*372.
#The first four columns are chr, start, end, and gene_id.
Next, make sure that the data matrix is observation by feature and contains no auxiliary information.
expr<-t(dataGeneExpressionFP[,-(1:4)]) #368*25,379. 368 samples, 25,379 genes.
# dim(expr)
Now we are ready to run PCA. In general, centering is essentially mandatory, and scaling is almost always preferred, when running PCA. Since GTEx has already performed the inverse normal transform on each feature, whether centering and scaling are performed on our example data does not affect the PCA result much, but we suggest always centering and scaling to be safe.
prcompResult<-prcomp(expr,center=TRUE,scale.=TRUE) #This should take less than a minute.
PCs<-prcompResult$x #368*368. The columns are PCs.
# dim(PCs)
Note that the approach we use in this tutorial constitutes PCA_direct rather than PCA_resid (2022). The two approaches perform similarly in our simulation studies, so we choose PCA_direct because it is simpler. In addition, PCA_direct can better hedge against the possibility that the known covariates are not actually important confounders because in PCA_direct, the known covariates do not affect the calculation of the PCs.
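For reference, PCA_resid would residualize the molecular phenotypes on the known covariates before running PCA. A minimal sketch is shown below; it assumes the known covariates have already been loaded as a numeric matrix (they are loaded later in this tutorial) and is an illustration only, not part of the recommended workflow.
#Illustration only (PCA_resid): residualize each molecular phenotype on the known
#covariates before running PCA. Assumes knownCovariates is a numeric
#observation-by-covariate matrix whose rows match those of expr.
exprResid<-residuals(lm(expr~as.matrix(knownCovariates))) #368*25,379.
prcompResultResid<-prcomp(exprResid,center=TRUE,scale.=TRUE)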
Optionally, you may inspect the proportion of variance explained by the PCs at this point. We omit this here because we will perform a more thorough analysis in the next section.
# importanceTable<-summary(prcompResult)$importance
# PVEs<-importanceTable[2,]
# sum(PVEs) #Theoretically, this should be 1.
# plot(PVEs,xlab="PC index",ylab="PVE")
There are four main functions in this package: runElbow(), runBE(), makeScreePlot(), and filterKnownCovariates(). The first three functions help us choose K. The last function will be discussed in the next section. For full details of these functions, run library(PCAForQTL) followed by ?runElbow, ?runBE, etc.
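For example:
library(PCAForQTL)
?runElbow
?runBE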
runElbow() implements an automatic elbow detection method (based on distance to the diagonal line). In our experience, the number of PCs chosen using this method tends to match the visual elbow point reasonably well. To run this function, simply input the previously obtained prcompResult. This method selects K = 12 for our example data.
resultRunElbow<-PCAForQTL::runElbow(prcompResult=prcompResult)
print(resultRunElbow)
## [1] 12
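For intuition, the distance-to-the-diagonal idea can be sketched as follows. This is an illustration based on the scree plot only and is not necessarily identical to runElbow()'s internal implementation; see ?runElbow for details.
#Illustration only: choose the PC index whose scree plot point is farthest from
#the straight line connecting the first and last points of the scree plot.
PVEs<-summary(prcompResult)$importance[2,]
x<-seq_along(PVEs)
x1<-x[1];y1<-PVEs[1]
x2<-x[length(x)];y2<-PVEs[length(PVEs)]
distances<-abs((y2-y1)*x-(x2-x1)*PVEs+x2*y1-y2*x1)/sqrt((y2-y1)^2+(x2-x1)^2)
which.max(distances)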
runBE() implements the BE algorithm, a permutation-based approach for choosing K in PCA. Intuitively, the BE algorithm retains PCs that explain more variance in the data than expected by random chance and discards those that do not. In our experience, the number of PCs chosen via BE tends to signal an upper bound of the reasonable number of PCs to choose. To run this function, we need to input the data matrix (which must be observation by feature) and may optionally specify B, the number of permutations (default is 20), and alpha, the significance level (default is 0.05). We may also specify mc.cores to change how many cores are used for parallel computing (default is B or the number of available cores minus one, whichever is smaller). For reproducibility, Linux and Mac users must change the random number generator (RNG) type (unless mc.cores is 1) and set the seed, whereas Windows users must set mc.cores=1 to avoid errors and set the seed for reproducibility (no need to change the RNG type). The BE method selects K = 29 for our example data.
Linux and Mac users:
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
resultRunBE<-PCAForQTL::runBE(expr,B=20,alpha=0.05)
print(resultRunBE$numOfPCsChosen)
## [1] 29
Windows users:
set.seed(1) #No need to change the RNG type since mc.cores will need to be 1.
resultRunBE<-PCAForQTL::runBE(expr,B=20,alpha=0.05,
                              mc.cores=1)
print(resultRunBE$numOfPCsChosen)
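For intuition, one permutation round of the BE idea can be sketched roughly as follows. This is an illustration of the general parallel-analysis idea, not the package's exact implementation; see ?runBE for details.
#Illustration only: permute each feature (column) independently, rerun PCA, and
#compare the observed PVEs with the PVEs obtained under permutation. The full
#algorithm repeats this B times and keeps the PCs whose permutation p-values
#are below alpha.
exprPermuted<-apply(expr,2,sample)
PVEsObserved<-summary(prcompResult)$importance[2,]
PVEsPermuted<-summary(prcomp(exprPermuted,center=TRUE,scale.=TRUE))$importance[2,]
head(PVEsObserved>PVEsPermuted)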
After running runElbow() and/or runBE(), we recommend using makeScreePlot() to visualize the selected K's. If you have candidate K's chosen via other methods as well, you may also include them here.
K_elbow<-resultRunElbow #12.
K_BE<-resultRunBE$numOfPCsChosen #29.
K_GTEx<-60 #GTEx uses 60 PEER factors, and they are almost identical to the top 60 PCs.
PCAForQTL::makeScreePlot(prcompResult,labels=c("Elbow","BE","GTEx"),values=c(K_elbow,K_BE,K_GTEx),
                         titleText="Colon - Transverse")
We recommend saving the plot for each tissue type (using code similar to the following) so that you can compare across all tissue types before making a final decision on how to choose K.
ggplot2::ggsave("./Colon_Transverse.jpg",width=16,height=11,unit="cm")
Optionally, you may run your entire pipeline using a few different choices of K (for example, 0, the number of PCs chosen via runElbow(), and the number of PCs chosen via runBE()) and visualize the number of discoveries versus K before making a final decision on how to choose K (2017).
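As a rough sketch of such a check (the discovery counts below are hypothetical placeholders; replace them with the numbers of discoveries from your own QTL pipeline runs):
candidateKs<-c(0,K_elbow,K_BE)
numOfDiscoveries<-c(5000,8000,8500) #Hypothetical placeholders, not real results.
plot(candidateKs,numOfDiscoveries,type="b",xlab="K",ylab="Number of discoveries")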
Lastly, it is good practice to filter out the known covariates that are captured well by the inferred covariates (PCs) in order to avoid redundancy. GTEx V8 (2020) uses eight known covariates: the top five genotype PCs, WGS sequencing platform (HiSeq 2000 or HiSeq X), WGS library construction protocol (PCR-based or PCR-free), and donor sex. Similar to the gene expression matrix, these variables can be obtained directly from link and are also made available under the folder named Example_Data in this repository in a format that can be easily read into R.
First, download the covariates and load them into the environment (change the path as necessary on your device). Make sure that the known covariate matrix is observation by feature and that the observations match those in the molecular phenotype matrix (and hence the PCs). If necessary, rename the known covariates to avoid potential confusion.
dataCovariates<-readRDS("./Example_Data/Colon_Transverse.v8.covariates.rds") #368*68.
#The columns are the top five genotype PCs, 60 PEER factors, pcr, platform, and sex.
knownCovariates<-dataCovariates[,c(1:5,66:68)] #368*8. 368 samples, 8 known covariates.
identical(rownames(knownCovariates),rownames(expr)) #TRUE is good.
## [1] TRUE
colnames(knownCovariates)[1:5]<-paste0("genotypePC",1:5) #This is to avoid potential confusion.
Suppose we have decided to use the number of PCs chosen via BE for our example data.
PCsTop<-PCs[,1:K_BE] #368*29.
Now we use filterKnownCovariates() to filter out the known covariates that are captured well by the top PCs (unadjusted R² ≥ 0.9 by default). This function returns the known covariates that should be kept. We use unadjusted R² instead of adjusted R² because we do not want to penalize for model complexity here. The cutoff value may be customized via the argument unadjustedR2_cutoff.
knownCovariatesFiltered<-PCAForQTL::filterKnownCovariates(knownCovariates,PCsTop,unadjustedR2_cutoff=0.9)
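For intuition, the R² check can be sketched roughly as follows. This is an illustration only, assuming the known covariates are numeric; see ?filterKnownCovariates for the actual behavior.
#Illustration only: regress each known covariate on the top PCs and compute the
#unadjusted R^2; covariates with R^2 at or above the cutoff are filtered out.
unadjustedR2s<-apply(as.matrix(knownCovariates),2,function(covariate){
  summary(lm(covariate~PCsTop))$r.squared
})
unadjustedR2s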
Finally, we combine the remaining known covariates and the top PCs and use them as covariates in the QTL analysis. To avoid potential numerical inaccuracies in the QTL analysis, you may optionally scale the PCs to unit variance, though theoretically this would not change the QTL result from regression-based methods such as Matrix eQTL and FastQTL.
PCsTop<-scale(PCsTop) #Optional. Could be helpful for avoiding numerical inaccuracies.
covariatesToUse<-cbind(knownCovariatesFiltered,PCsTop)
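As a sketch of how the combined covariates might then be handed to a QTL mapping tool, the code below assumes, for illustration, a tab-delimited covariate file with covariates in rows and samples in columns; check the exact format required by your specific tool (e.g., Matrix eQTL or FastQTL) before using it.
#Illustration only: write the combined covariates to a tab-delimited file with
#covariates in rows and samples in columns. Adjust to match the format required
#by your QTL mapping tool.
covariatesToWrite<-t(covariatesToUse) #Covariates in rows, samples in columns.
write.table(covariatesToWrite,file="./Colon_Transverse.covariates.txt",
            sep="\t",quote=FALSE,col.names=NA)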
To acknowledge this package or this tutorial, please cite our paper (2022): https://doi.org/10.1186/s13059-022-02761-4. For questions, please email us at lijy03@g.ucla.edu or heatherjzhou@ucla.edu.
Buja, Andreas, and Nermin Eyuboglu. 1992. “Remarks on Parallel Analysis.” Multivariate Behavioral Research 27 (4): 509–40.
GTEx Consortium. 2017. “Genetic Effects on Gene Expression Across Human Tissues.” Nature 550 (7675): 204–13.
———. 2020. “The GTEx Consortium Atlas of Genetic Regulatory Effects Across Human Tissues.” Science 369 (6509): 1318–30.
Johnson, Richard A., and Dean W. Wichern. 2007. Applied Multivariate Statistical Analysis. Sixth. Upper Saddle River, NJ: Pearson Prentice Hall.
Jolliffe, Ian T. 2002. Principal Component Analysis. Second. New York: Springer.
Leek, Jeffrey T., and John D. Storey. 2008. “A General Framework for Multiple Testing Dependence.” Proceedings of the National Academy of Sciences 105 (48): 18718–23.
Leek, Jeffrey T., and John D. Storey. 2007. “Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis.” PLoS Genetics 3 (9): e161.
Mostafavi, Sara, Alexis Battle, Xiaowei Zhu, Alexander E. Urban, Douglas Levinson, Stephen B. Montgomery, and Daphne Koller. 2013. “Normalizing RNA-sequencing Data by Modeling Hidden Covariates with Prior Knowledge.” Edited by Panayiotis V. Benos. PLoS ONE 8 (7): e68141.
Ongen, Halit, Alfonso Buil, Andrew Anand Brown, Emmanouil T. Dermitzakis, and Olivier Delaneau. 2016. “Fast and Efficient QTL Mapper for Thousands of Molecular Phenotypes.” Bioinformatics 32 (10): 1479–85.
Shabalin, Andrey A. 2012. “Matrix eQTL: Ultra Fast eQTL Analysis via Large Matrix Operations.” Bioinformatics 28 (10): 1353–58.
Stegle, Oliver, Leopold Parts, Richard Durbin, and John Winn. 2010. “A Bayesian Framework to Account for Complex Non-Genetic Factors in Gene Expression Levels Greatly Increases Power in eQTL Studies.” Edited by Aviv Regev. PLoS Computational Biology 6 (5): e1000770.
Stegle, Oliver, Leopold Parts, Matias Piipari, John Winn, and Richard Durbin. 2012. “Using Probabilistic Estimation of Expression Residuals (PEER) to Obtain Increased Power and Interpretability of Gene Expression Analyses.” Nature Protocols 7 (3): 500–507.
Zhou, Heather J., Lei Li, Yumei Li, Wei Li, and Jingyi Jessica Li. 2022. “PCA Outperforms Popular Hidden Variable Inference Methods for Molecular QTL Mapping.” Genome Biology 23 (1): 210.