
Author manuscript; available in PMC: 2023 Feb 9.
Published in final edited form as: J Comput Graph Stat. 2021 Jul 19;31(1):163–175. doi: 10.1080/10618600.2021.1935971

Latent Network Estimation and Variable Selection for Compositional Data Via Variational EM

Nathan Osborne, Christine B. Peterson, and Marina Vannucci
PMCID: PMC9909885  NIHMSID: NIHMS1845956  PMID: 36776345

Abstract

Network estimation and variable selection have been extensively studied in the statistical literature, but only recently have those two challenges been addressed simultaneously. In this article, we seek to develop a novel method to simultaneously estimate network interactions and associations to relevant covariates for count data, and specifically for compositional data, which have a fixed sum constraint. We use a hierarchical Bayesian model with latent layers and employ spike-and-slab priors for both edge and covariate selection. For posterior inference, we develop a novel variational inference scheme with an expectation–maximization step, to enable efficient estimation. Through simulation studies, we demonstrate that the proposed model outperforms existing methods in its accuracy of network recovery. We show the practical utility of our model via an application to microbiome data. The human microbiome has been shown to contribute to many of the functions of the human body, and also to be linked with a number of diseases. In our application, we seek to better understand the interaction between microbes and relevant covariates, as well as the interaction of microbes with each other. We call our algorithm simultaneous inference for networks and covariates (SINC) and provide a Python implementation, which is available online.

Keywords: Bayesian hierarchical model, Count data, EM algorithm, Graphical model, Microbiome data, Variational inference

1. Introduction

Variable selection, also known as feature selection, is a well-studied subject in the statistical literature, particularly in the context of regression models, where many approaches have been proposed. Feature selection offers an opportunity to both improve model predictions, by avoiding the inclusion of noisy or irrelevant predictors, and to identify interpretable parsimonious models. Frequentist approaches often use a penalized likelihood to obtain sparse estimates of the regression coefficients, and include methods such as LASSO (Tibshirani 1996), adaptive LASSO (Zou 2006), and SCAD (Fan and Li 2001). Alternatively, Bayesian approaches employ carefully constructed priors on the regression coefficients to identify the relevant variables. Spike-and-slab priors, first proposed by Mitchell and Beauchamp (1988), are a popular class of priors that use a latent indicator to represent variable inclusion. Conditional on the indicators, the regression coefficients are assumed to come from a mixture prior representing important vs. negligible effects (George and McCulloch 1997; Brown, Vannucci, and Fearn 1998). In addition to sparse estimation of the coefficients, these priors produce posterior probabilities of inclusion (PPIs) for each covariate that capture the uncertainty in the selection. Spike-and-slab priors have been extended to regression models for non-Gaussian data, including binary, multinomial, and count responses (Raftery 1996; Ntzoufras, Dellaportas, and Forster 2003; Sha et al. 2004; Wadsworth et al. 2017; Koslovsky and Vannucci 2020).

A parallel development has happened in the graphical model literature: in this framework, nodes correspond to variables, and edges connecting these nodes represent conditional dependence relations. In the Gaussian setting, the problem of selecting edges in the graph reduces to the estimation of a sparse inverse covariance matrix, since exact zeros in this matrix, which is also known as the precision matrix, correspond to conditional independence relations (Dempster 1972). In frequentist settings, penalized likelihood methods, such as neighborhood selection (Meinshausen and Bühlmann 2006) and the graphical LASSO (Yuan and Lin 2007; Friedman, Hastie, and Tibshirani 2008), have been proposed. These methods have been extended to count data by using data transformations (Kurtz et al. 2015) or penalized log-likelihood methods (Fang et al. 2017). In Bayesian inference, the G-Wishart prior (Roverato 2002), which is the conjugate prior that imposes exact zeros in the precision matrix, has been explored by several authors for inference in Gaussian graphical models, but poses significant computational challenges (Lenkoski and Dobra 2011). As a result, this prior is not easily scalable. Alternative shrinkage constructions that employ continuous priors on the off-diagonal elements of the precision matrix have been proposed, such as the Bayesian graphical lasso (Wang 2012), which relies on double exponential priors, and mixture priors (Wang 2015), inspired by the spike-and-slab priors in the regression framework discussed above. To enable estimation of the spike-and-slab model of Wang (2015) in high-dimensional settings, Li and McCormick (2019) recently proposed an efficient expectation conditional maximization method, which offers an attractive alternative to stochastic search approaches.

In this article, we propose a novel Bayesian hierarchical model for count data that allows for simultaneous estimation of covariate dependence and network interactions. Methods for simultaneous estimation are gaining popularity, with approaches including penalized likelihood methods (Rothman, Levina, and Zhu 2010; Yang, Chen, and Chen 2017), and, most recently, spike-and-slab lasso prior models (Deshpande, Ročková, and George 2019). By accounting for covariate selection, simultaneous estimation methods are able to control for those variables, which ultimately leads to more accurate network estimation. Moreover, simultaneous estimation can improve the detection of covariate effects, as noted by Deshpande, Ročková, and George (2019). However, with the exception of Yang, Chen, and Chen (2017), these methods are not suitable for count data. In our approach, we consider multivariate count data, and specifically compositional data that have a fixed sum constraint. We model the data using a Dirichlet-multinomial likelihood and then introduce a latent layer by modeling the log concentration parameters via a Gaussian distribution. We account for covariates through the mean function of the latent layer and employ multivariate variable selection spike-and-slab priors that allow each covariate to be relevant for individual response variables (Richardson, Bottolo, and Rosenthal 2010; Stingo et al. 2010). We also capture a network of latent dependence relationships by estimating the inverse covariance matrix via the mixture prior of Wang (2015). For posterior inference, we implement a novel variational Bayes approach that includes an expectation–maximization (EM) step to estimate the model. This allows us to gain flexibility by using a Bayesian model, while still remaining computationally efficient. Additionally, the algorithm is developed so that multiple steps can be run in parallel, achieving larger computational gains. We show through simulations that our method outperforms the LASSO-based approach of Yang, Chen, and Chen (2017). We refer to our model as simultaneous inference of networks and covariates (SINC).

Compositional data are often collected in chemistry, geology, and biology applications. In biomedicine, modern genomic sequencing technologies have allowed investigators to collect samples on the human microbiome. Microbes associated with the human body include eukaryotes, archaea, bacteria, and viruses, which have been shown to contribute to important bodily functions including food digestion and energy supply. The human microbiome has also been implicated in many diseases including colorectal cancer, inflammatory bowel disease, and immunologically mediated skin diseases. The observed data from a microbiome study are typically short reads of DNA sequences, which are clustered to create operational taxonomic units (OTUs). The abundances across samples of these OTUs, which represent genetically close groups of microbes assumed to have similar functions, are taken as input to downstream analysis. A challenge to modeling these data is that the number of counts for a particular OTU depends on the number of sequences collected for that sample, meaning that the observed counts are dependent on each other, as they constitute proportions of a whole. This results in data that are compositional. For these reasons, Dirichlet-multinomial distributions are particularly appropriate to model microbiome data, as demonstrated by several authors (Chen and Li 2013; Wadsworth et al. 2017; Tang, Ma, and Nicolae 2018). In the application of this article, we focus on two questions of interest in the understanding of the microbiome: which variables influence the microbial abundances, and what are the dependence relationships among microbes. The abundance of microbes or groups of microbes is dependent on many factors. Microbial abundance may be related to external covariates, such as diet, cytokines, or use of medication. These factors influence the microbiome by introducing new organisms, changing the abundance of metabolites, or altering the pH of their environment. For example, consumption of an animal-based diet high in meat has been shown to increase production of bile acid, which inhibits growth of bacteria belonging to the Bacteroidetes and Firmicutes phyla (David et al. 2014). Antibiotics can alter the microbiome substantially, by killing off components of the microbiome in addition to the bacteria triggering the infection (Edwards et al. 2019). As we understand more about the importance of the microbiome, it is also critical to understand what factors lead to the prevalence of different microbes. Here, we apply the proposed method to real data from the Multi-Omic Microbiome Study: Pregnancy Initiative (MOMS-PI) study, to estimate the interaction between microbes in the vagina, as well as the interplay between vaginal cytokines and microbial abundances, providing insight into mechanisms of host-microbial interaction during pregnancy.

The article is outlined as follows: in Section 2, we describe the proposed hierarchical model, followed by the variational EM estimation method in Section 3. We provide a simulation study in Section 4, and then showcase the proposed model in an application to multi-omic data from a study on the role of the microbiome in pregnancy in Section 5. Finally, we discuss the advantages of the proposed model in Section 6.

2. Proposed Model

Suppose we have observed multivariate counts arranged in an n × p matrix, X, where p is the number of observed variables measured across n samples. We then let the p-vector Xi correspond to the measurements for observation i, and the matrix entry xi,j correspond to the jth variable measurement for the ith observation. We also observe q covariates for each of the n observations, with these q additional factors possibly influencing the measured counts for each observation. We arrange the covariate data in an n × q matrix, M.

We are interested in understanding the conditional dependence relationships among the p variables while simultaneously selecting the relevant covariates. We adopt a hierarchical model formulation with a latent Gaussian layer, similarly to Yang, Chen, and Chen (2017), as

$$Z_i \mid B_0, M_i, B, \Omega \sim \text{MVNorm}\big(B_0 + M_i B,\ \Omega^{-1}\big), \qquad \alpha_i = \exp\{Z_i\}, \qquad h_i \mid \alpha_i \sim \text{Dirichlet}(\alpha_i), \qquad X_i \mid h_i \sim \text{Multinomial}(h_i, N_i). \tag{1}$$

In this hierarchical formulation, we introduce a latent normal variable Zi, which is a direct transformation of the concentration parameter αi and therefore controls the observed counts Xi. This model has several important features: the Dirichlet-multinomial likelihood for the count data, Xi, allows us to account for overdispersion as well as the compositional nature of the data. The dependence on covariates is incorporated through the mean of the multivariate normal, where the observed covariates Mi have effects B. The dependence among the Zi is captured by the inverse covariance matrix, also known as the precision matrix, Ω. The 1 × p vector B0 accounts for the mean of each column of the latent matrix Z. In our modeling approach, careful consideration of the priors on the covariate effects B, the intercepts B0 and the precision matrix Ω allows us to construct a directed graph between covariates M and latent variables Z, as well as an undirected graph between the columns of Z.
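As a concrete illustration, the following is a minimal numpy sketch of the generative process in Equation (1); the dimensions, hyperparameter values, and the identity precision matrix are illustrative placeholders, not settings from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 10, 5                                       # illustrative sizes

M = rng.normal(size=(n, q))                               # observed covariates
B = rng.normal(size=(q, p)) * (rng.random((q, p)) < 0.2)  # sparse covariate effects
B0 = rng.uniform(2, 4, size=p)                            # column intercepts
Omega = np.eye(p)                                         # latent precision matrix

# Latent Gaussian layer: Z_i ~ MVNorm(B0 + M_i B, Omega^{-1})
Z = B0 + M @ B + rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
alpha = np.exp(Z)                                         # Dirichlet concentrations

X = np.empty((n, p), dtype=int)
for i in range(n):
    h_i = rng.dirichlet(alpha[i])                         # latent composition h_i
    X[i] = rng.multinomial(3000, h_i)                     # counts with total N_i = 3000
```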

For microbiome studies, Gloor et al. (2017) noted that the observed compositional data have a different correlation structure than the true underlying abundances. More specifically, due to the fixed-sum constraint, compositional data tend to exhibit negative correlations. In model formulation (1), we interpret the latent layer hi to be the relative abundances, and αi to be the absolute abundances (Yang, Chen, and Chen 2017). By estimating a network on the latent Z, we capture the network of the underlying, absolute abundances through the precision matrix Ω. Therefore, even though the latent Gaussian layer does not allow us to recover relationships directly among the observed counts, the inferred dependences do provide some insights into the relationships among the underlying processes. Latent graphical models with Gaussian layers were used by Vinci et al. (2018) for Poisson-distributed spike-count data. See also Talhouk, Doucet, and Murphy (2012) and Li, McCormick, and Clark (2020) for latent graphical model constructions for binary data.

2.1. Prior on Covariate Effects B

Here, we describe the prior on the covariate effects, which enables selection of the important associations between X and other potentially related factors M. We consider the effects of the covariates M on each column of Z separately, which means that we will be able to update the columns of B independently of each other. Here B is a q × p matrix, where each column of B represents the vector of regression coefficients for the q covariates of M on the jth column of Z. We use a spike-and-slab prior on each element of the matrix B, which shrinks the effects of features that do not influence Z to zero. Remember that we are looking at the columns of Z one at a time, and can thus say that any entry from the jth column, $Z_{i,j}$, comes from a $\text{Normal}(B_{0j} + M_i B_j, \sigma_j^*)$, where $\sigma_j^*$ is the standard deviation of the jth column of Z, found by using the properties of the multivariate normal distribution shown in Equation (1). The prior on B is as follows:

$$B_{k,j} \mid \gamma_{k,j}, \nu_B^2 \sim \gamma_{k,j}\,\text{Normal}(0, \nu_B^2) + (1 - \gamma_{k,j})\,\delta_0, \qquad \gamma_{k,j} \mid \theta_{\gamma_j} \sim \text{Bernoulli}(\theta_{\gamma_j}), \qquad \theta_{\gamma_j} \mid a_\gamma, b_\gamma \sim \text{Beta}(a_\gamma, b_\gamma), \tag{2}$$

for j = 1, …, p and k = 1, …, q, and with δ0 a point mass at 0, indicating that when γk,j is 0, Bk,j is exactly 0. Here, θγj is the probability of a variable being relevant in Bj. Notice that the mixture prior (2) allows each variable to be relevant for individual responses (Richardson, Bottolo, and Rosenthal 2010; Stingo et al. 2010), as opposed to spike-and-slab constructions that select variables as relevant to either all or none of the responses (Brown, Vannucci, and Fearn 1998). We also put a non-informative prior on each element of B0, that is, B0j ∝ 1.
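For illustration, a single draw of one column $B_j$ from prior (2) can be sketched as follows, assuming numpy; the hyperparameter values are placeholders, not recommendations from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
q, a_gamma, b_gamma, v_B = 50, 2.0, 2.0, 1.0    # illustrative values

theta_j = rng.beta(a_gamma, b_gamma)            # column-level inclusion probability
gamma_j = rng.random(q) < theta_j               # Bernoulli(theta_j) indicators
B_j = np.where(gamma_j, rng.normal(0.0, v_B, size=q), 0.0)  # slab draw or point mass at 0
```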

2.2. Prior on Precision Matrix Ω

Next we introduce the prior on the precision matrix Ω, which allows us to learn a sparse association network. We consider the prior of Wang (2015) in the formulation proposed by Li and McCormick (2019)

$$\pi(\Omega \mid \delta, \nu_1, \nu_0, \lambda, \tau) \propto \prod_{i<j}\Big\{(1 - \delta_{i,j})\,\text{Normal}(\omega_{i,j} \mid 0, \nu_0^2\tau) + \delta_{i,j}\,\text{Normal}(\omega_{i,j} \mid 0, \nu_1^2\tau)\Big\} \times \prod_i \text{Exp}(\omega_{i,i} \mid \lambda/2)\, 1_{\Omega \in M^+}, \tag{3}$$

where ν0 and ν1 are fixed standard deviations that take small and large values, respectively, δi,j is a latent variable indicating whether or not an edge is present between nodes i and j, and τ is a scaling parameter, with a hyperprior Gamma(aτ, bτ) that allows us to adaptively learn the standard deviations. The original prior of Wang (2015) is obtained by setting τ = 1. Additional complexity can be added to the prior on τ to include existing knowledge about variable associations, as shown in Li and McCormick (2019). The mixture of normals on the off-diagonal precision matrix entries enables the selection of interactions, represented by edges in a network, since nonzero precision matrix entries reflect conditional dependence relationships (Dempster 1972). Here, entries reflecting conditional independence relations do not equal exactly zero, but get shrunk to close to zero. The diagonal entries are drawn from a common exponential prior. The final term in Equation (3) restricts Ω to the space of positive definite matrices M+. This prior is particularly advantageous in our model, as it allows for efficient estimation via the EM algorithm and leads to less bias in estimation of the off-diagonal precision matrix elements than the graphical LASSO, as shown by Li and McCormick (2019).

We complete the modeling of the precision matrix Ω by setting the prior on the graph structure, assuming independent Bernoulli distributions on the inclusion of each edge as follows:

$$\delta_{i,j} \mid \pi \sim \text{Bernoulli}(\pi), \qquad \pi \mid a_\pi, b_\pi \sim \text{Beta}(a_\pi, b_\pi). \tag{4}$$
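To make the prior concrete, the following is a hedged scipy sketch of the unnormalized log density implied by Equations (3) and (4) for a given (Ω, δ) pair; the default hyperparameter values are illustrative.

```python
import numpy as np
from scipy import stats

def log_prior_omega(Omega, delta, v0=0.01, v1=10.0, lam=150.0, tau=1.0):
    """Unnormalized log density of prior (3), up to the normalizing constant."""
    p = Omega.shape[0]
    if np.any(np.linalg.eigvalsh(Omega) <= 0):       # constraint 1_{Omega in M+}
        return -np.inf
    iu = np.triu_indices(p, k=1)
    w, d = Omega[iu], delta[iu]
    sd = np.where(d == 1, v1, v0) * np.sqrt(tau)     # slab vs. spike scales
    off_diag = stats.norm.logpdf(w, loc=0.0, scale=sd).sum()
    diag = stats.expon.logpdf(np.diag(Omega), scale=2.0 / lam).sum()  # Exp(lambda/2)
    return off_diag + diag
```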

3. Posterior Inference

We now discuss how to obtain posterior estimates of the parameters in the model outlined in Section 2. Instead of a traditional Markov chain Monte Carlo (MCMC) sampler, which can be computationally quite expensive, we rely on a variational inference (VI) approach, which aims to find an approximation of the posterior using optimization methods. VI works by specifying a family of approximate distributions 𝒬, which are densities over latent variables W that are dependent on free parameters ξ, and then seeking to find the values of ξ that minimize the Kullback–Leibler (KL) divergence between the approximate distribution and the true posterior. As discussed in Blei, Kucukelbir, and McAuliffe (2017), minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO), which is defined as follows:

$$\text{ELBO} = E_\xi[\log p(X, W)] - E_\xi[\log q(W)], \tag{5}$$

with p(X, W) as the joint distribution of the observed data and the latent variables, and q(W) the variational distributions of the latent variables.

The most common approach to obtain an approximating distribution when applying a variational Bayes approach is mean field approximation, where the approximating distribution is assumed to factorize over some partition of the parameters. This is the approach that we adopt for the coefficient vector B. However, a mean field approach for the elements of the precision matrix Ω is not appropriate, due to the dependence among the parameters induced by the fact that this matrix is constrained to be symmetric and positive definite. For this reason, the choice of an appropriate approximating distribution for the precision matrix is an open research question. To circumvent this issue, similar to Miao et al. (2020), we adopt a hybrid VI algorithm, with an EM step to estimate Ω and δ.

Specifically, for B we use the mean field variational distributions $q(B, \gamma) = \prod_{j=1}^{p}\prod_{k=1}^{q} q(B_{k,j}, \gamma_{k,j} \mid \phi_{k,j}, \mu_{k,j}, \sigma_{k,j})$, where

$$q(B_{k,j}, \gamma_{k,j} \mid \phi_{k,j}, \mu_{k,j}, \sigma_{k,j}) = \begin{cases} \phi_{k,j}\,\text{Normal}(B_{k,j} \mid \mu_{k,j}, \sigma_{k,j}) & \text{if } \gamma_{k,j} = 1, \\ (1 - \phi_{k,j})\,\delta_0(B_{k,j}) & \text{otherwise,} \end{cases}$$

with free parameters ξ = {ϕk,j, μk,j, σk,j}. We then define the ELBO as

$$\text{ELBO} = E_\xi\Bigg[\log \prod_{i=1}^{n}\Big(p(X_i \mid Z_i)\, p(Z_i \mid B_0, M_i, B, \Omega)\Big) \prod_{j=1}^{p}\prod_{k=1}^{q}\Big(p(B_{k,j} \mid \gamma_{k,j}, \nu_B^2)\, p(\gamma_{k,j} \mid \theta_{\gamma_j})\, p(\theta_{\gamma_j} \mid a_\gamma, b_\gamma)\Big) \times p(\Omega \mid \delta, \nu_1, \nu_0, \lambda, \tau) \prod_{i<j} p(\delta_{i,j} \mid \pi)\, p(\pi \mid a_\pi, b_\pi)\Bigg] - E_\xi\Bigg[\log \prod_{j=1}^{p}\prod_{k=1}^{q} q(B_{k,j}, \gamma_{k,j} \mid \phi_{k,j}, \mu_{k,j}, \sigma_{k,j})\Bigg],$$

where the first expectation is equivalent to $E_\xi[\log p(X, W)]$ and the second expectation is $E_\xi[\log q(W)]$ of Equation (5), for W = {Z, B0, B, Ω, δ, π, γ, θγ}.

The hybrid scheme we use to maximize the ELBO, where the first part is a VI step and the second part is an EM step, is described in detail in the following subsections. In the VI step, we update the free parameters, ξ, by setting the partial derivative of the ELBO with respect to the desired parameters equal to zero. This maximizes the ELBO with respect to ξ. We then further maximize the ELBO by finding the optimal values for the remainder of the latent parameters. For this, we rely on an EM step, by treating δ as latent parameters and taking the expectation of the ELBO with respect to δ, or equivalently setting

$$Q\big(\theta \mid \theta^{(t)}, \xi^{(t)}\big) = E_\delta[\text{ELBO}]$$

at iteration t, and optimizing Q(θ | θ(t), ξ(t)) by finding the maximum a posteriori (MAP) estimate of the remaining parameters θ = {Z, B0, Ω, π, θγ}. The resulting algorithm, which we call SINC, is described in Algorithm 1. As with traditional EM and VI schemes, parameter updates at each iteration are made with the most current estimates of all other parameters. The algorithm results in MAP estimates for the parameters in θ. Additionally, since no uncertainty about these parameters is used in the updates of the other parameters, the proposed algorithm is only suitable for point estimation.

[Algorithm 1: The SINC algorithm. In the published version this appears as an image (nihms-1845956-f0001.jpg), not reproduced here.]

Our proposed hybrid algorithm builds upon the similarities between the VI and EM algorithms. As noted in Blei, Kucukelbir, and McAuliffe (2017), the first term of Equation (5) is the expected complete log-likelihood, which is optimized by the EM algorithm. Since no variational distributions are proposed for the parameters in θ, updating those parameters is achieved by optimizing $\log p(X, W)$ in Equation (5). As an alternative perspective to highlight the similarity, we could say that we have assigned a point mass as our variational distribution for these latent parameters. Optimizing the ELBO would then lead to the same result, since taking the partial derivative of the ELBO with respect to the variables with point mass variational distributions would result in optimizing $E_\xi[\log p(X, W)]$. By stating the algorithm in this way, we can interpret our approach as a proper VI scheme, solved via an EM step similar to Titsias and Lázaro-Gredilla (2011), which allows us to draw on the VI guarantees established in the previous literature (Blei, Kucukelbir, and McAuliffe 2017).
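Since Algorithm 1 appears only as an image in this version, the overall flow can be summarized in the following sketch; the update callables stand in for the steps derived in Sections 3.1–3.3 and are not functions from the released package.

```python
def hybrid_vi_em(state, vi_step, e_step, m_step, elbo, tol=1e-4, max_iter=500):
    """Iterate the VI and EM updates until the ELBO change falls below tol."""
    elbo_old = float("-inf")
    for _ in range(max_iter):
        state = vi_step(state)   # Section 3.1: update phi, mu, sigma for B
        state = e_step(state)    # Section 3.2: update p* and d* (Eqs. 9-10)
        state = m_step(state)    # Section 3.3: update B0, Omega, pi, theta_gamma, tau, Z
        elbo_new = elbo(state)
        if abs(elbo_new - elbo_old) < tol:
            break
        elbo_old = elbo_new
    return state
```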

3.1. VI Step

Here, we use a VI step to estimate the regression coefficients B by updating the free parameters μk,j, σk,j, and ϕk,j, where μk,j and σk,j are the mean and variance, respectively, of Bk,j when γk,j = 1, and ϕk,j is interpreted as the probability that γk,j = 1, that is, that Bk,j ≠ 0. Following the work of Titsias and Lázaro-Gredilla (2011) and Carbonetto and Stephens (2012), the free parameters can then be updated as

$$\mu_{k,j} = \frac{\sigma_{k,j}^2}{\sigma_j^*}\left[\left\{M^\top(Z_j - B_{0j})\right\}_k - \sum_{l \neq k}(M^\top M)_{l,k}\,\phi_{l,j}\,\mu_{l,j}\right], \tag{6}$$
$$\sigma_{k,j}^2 = \frac{\sigma_j^*}{(M^\top M)_{k,k} + 1/\nu_B^2}, \tag{7}$$
$$\phi_{k,j} = \text{Logit}^{-1}\left(\log\frac{\theta_{\gamma_j}}{1 - \theta_{\gamma_j}} + \frac{1}{2}\log\frac{\sigma_{k,j}^2}{\nu_B^2\,\sigma_j^*} + \frac{\mu_{k,j}^2}{2\sigma_{k,j}^2}\right). \tag{8}$$

Updating each column of B can then be done independently and, when resources are available, these updates can be run in parallel. While updating a column Bj, each component of μj, σj, ϕj is updated given all other components. This component-wise update of μj, σj, ϕj is repeated until ELBO(μj, σj, ϕj) has converged. Once all μ and ϕ have been updated, the individual elements of B are assigned E(Bk,j) = μk,j ϕk,j. Once the SINC algorithm has converged, as is common in the variational spike-and-slab literature (Huang, Wang, and Liang 2016; Miao et al. 2020), we set Bk,j = μk,j if ϕk,j > 0.5 and Bk,j = 0 if ϕk,j ≤ 0.5. The threshold of 0.5 is equivalent to selecting the median probability model of Barbieri and Berger (2004); it can be adjusted to include or exclude more covariates, but 0.5 is the most commonly used value.
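A numpy sketch of the coordinate-ascent updates (6)–(8) for a single column j, following the reconstruction of the equations above; the code treats sigma_star_j as the residual variance of column j, in the Carbonetto–Stephens form, and all names are illustrative.

```python
import numpy as np

def logit_inv(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_column_B(M, Z_j, B0_j, mu, sigma2, phi, sigma_star_j, v_B2=1.0, theta_j=0.5):
    """One sweep of the component-wise updates (6)-(8) for column j of B.

    sigma2 holds the variances sigma_{k,j}^2 and v_B2 is the slab variance nu_B^2.
    """
    q = M.shape[1]
    MtM = M.T @ M
    Mtz = M.T @ (Z_j - B0_j)
    for k in range(q):
        sigma2[k] = sigma_star_j / (MtM[k, k] + 1.0 / v_B2)      # Eq. (7)
        # Eq. (6): the sum over l != k excludes the k-th term
        total = MtM[k] @ (phi * mu) - MtM[k, k] * phi[k] * mu[k]
        mu[k] = sigma2[k] / sigma_star_j * (Mtz[k] - total)
        phi[k] = logit_inv(np.log(theta_j / (1 - theta_j))       # Eq. (8)
                           + 0.5 * np.log(sigma2[k] / (v_B2 * sigma_star_j))
                           + mu[k] ** 2 / (2 * sigma2[k]))
    return mu, sigma2, phi
```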

3.2. E Step

In this step, we focus on updates to the edge inclusion parameter δi,j. For the first step we take the expectation of the posterior distribution, treating δ as the latent variable. We define $Q(\theta \mid \theta^{(t)}, \xi^{(t)})$ as $E_{\delta \mid \Omega^{(t)}, \pi, X}\big[\log p(Z, \Omega, B, B_0, \pi \mid X, M)\big]$. Following the results shown in Li and McCormick (2019), the E step can be broken into two steps:

$$E_{\delta \mid \Omega^{(t)}, \pi, X}\big[\delta_{i,j}\big] = p_{i,j}^* = \frac{a_{i,j}}{a_{i,j} + b_{i,j}}, \tag{9}$$
$$E_{\delta \mid \Omega^{(t)}, \pi, X}\left[\frac{1 - \delta_{i,j}}{\nu_0^2\,\tau} + \frac{\delta_{i,j}}{\nu_1^2\,\tau}\right] = d_{i,j}^* = \left(\frac{1 - p_{i,j}^*}{\nu_0^2} + \frac{p_{i,j}^*}{\nu_1^2}\right)\tau^{-1}, \tag{10}$$

where $a_{i,j} = p(\omega_{i,j} \mid \delta_{i,j} = 1)\,\pi$ and $b_{i,j} = p(\omega_{i,j} \mid \delta_{i,j} = 0)(1 - \pi)$, and $\omega_{i,j}$ is the (i, j)th entry of the precision matrix, with $i, j \in \{1, \ldots, p\}$.
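The two expectations can be computed entry-wise; a brief scipy sketch, assuming the quantities defined above and illustrative default hyperparameter values:

```python
import numpy as np
from scipy.stats import norm

def e_step_entry(omega_ij, pi, v0=0.01, v1=10.0, tau=1.0):
    """Compute p*_{i,j} and d*_{i,j} of Eqs. (9)-(10) for one entry omega_ij."""
    a = norm.pdf(omega_ij, scale=v1 * np.sqrt(tau)) * pi          # slab component
    b = norm.pdf(omega_ij, scale=v0 * np.sqrt(tau)) * (1.0 - pi)  # spike component
    p_star = a / (a + b)                                          # Eq. (9)
    d_star = ((1.0 - p_star) / v0**2 + p_star / v1**2) / tau      # Eq. (10)
    return p_star, d_star
```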

3.3. M Step

The remainder of the unknown parameters can be found by maximizing the posterior distribution with respect to each of the parameters we are interested in. Here, we first update the column-wise centering parameters B0j independently as

$$B_{0j} = \frac{\sum_{i=1}^{n}\big(Z_{i,j} - M_i B_j\big)}{n}. \tag{11}$$

Next, we update the precision matrix, Ω. Following Li and McCormick (2019) and Wang (2015), the conditional distribution of each column of Ω can be found in closed form. For this, let

$$\Omega = \begin{pmatrix} \Omega_{11} & \omega_{12} \\ \omega_{12}^\top & \omega_{22} \end{pmatrix}, \qquad \big(Z - (MB + B_0)\big)^\top\big(Z - (MB + B_0)\big) = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^\top & s_{22} \end{pmatrix}.$$

Then, the conditional distributions are

$$\omega_{12} \sim \text{Normal}(-C s_{12},\, C), \qquad \omega_{22} - \omega_{12}^\top \Omega_{11}^{-1}\omega_{12} \sim \text{Gamma}\left(1 + \frac{n}{2},\, \frac{\lambda + s_{22}}{2}\right), \qquad C = \left((s_{22} + \lambda)\,\Omega_{11}^{-1} + \text{diag}\big(\nu_{\delta_{12}}^{-2}\big)\right)^{-1}. \tag{12}$$

We can then do a column-by-column update as

$$\omega_{12} = -\left((s_{22} + \lambda)\,\Omega_{11}^{-1} + \text{diag}\big(d_{i,j}^*\big)\right)^{-1} s_{12}, \qquad \omega_{22} = \omega_{12}^\top \Omega_{11}^{-1}\omega_{12} + \frac{n}{\lambda + s_{22}}. \tag{13}$$
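A numpy sketch of the column-by-column update (13), assuming the partition above and the minus sign following Wang (2015); S denotes the residual cross-product matrix and D_star the matrix of d*_{i,j} values from the E step.

```python
import numpy as np

def update_column_omega(Omega, S, D_star, j, lam, n):
    """Update column/row j of Omega via Eq. (13)."""
    p = Omega.shape[0]
    idx = np.delete(np.arange(p), j)              # indices of the "11" block
    inv11 = np.linalg.inv(Omega[np.ix_(idx, idx)])
    s12, s22 = S[idx, j], S[j, j]
    C_inv = (s22 + lam) * inv11 + np.diag(D_star[idx, j])
    w12 = -np.linalg.solve(C_inv, s12)            # off-diagonal update
    w22 = w12 @ inv11 @ w12 + n / (lam + s22)     # diagonal update
    Omega[idx, j] = Omega[j, idx] = w12
    Omega[j, j] = w22
    return Omega
```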

The point estimates of θγ and π are also updated as follows:

$$\theta_{\gamma_j} = \frac{\sum_{k=1}^{q} \phi_{k,j} + a_\gamma - 1}{q + a_\gamma + b_\gamma - 2}, \tag{14}$$
$$\pi = \frac{a_\pi + \sum_{i<j} \delta_{i,j} - 1}{a_\pi + b_\pi + \frac{p(p-1)}{2} - 2}. \tag{15}$$

If using the adaptive scale parameter, τ, then an additional update is done by setting

$$\tau = \frac{a_\tau - 1 + \frac{1}{2}\cdot\frac{p(p-1)}{2}}{b_\tau + \frac{1}{2}\sum_{i<j} \omega_{i,j}^2\, d_{i,j}^*}. \tag{16}$$
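Assuming the forms of Equations (14)–(16) given above, the scalar updates can be sketched as follows; phi is the q × p matrix of inclusion probabilities, and p_star and d_star are p × p matrices holding the E-step quantities in their upper triangles.

```python
import numpy as np

def scalar_updates(phi, p_star, d_star, Omega, a_g=2.0, b_g=2.0,
                   a_p=2.0, b_p=2.0, a_t=2.0, b_t=2.0):
    """Point updates for theta_gamma (per column), pi, and tau."""
    q, p = phi.shape
    iu = np.triu_indices(p, k=1)
    n_pairs = p * (p - 1) / 2
    theta = (phi.sum(axis=0) + a_g - 1) / (q + a_g + b_g - 2)      # Eq. (14)
    pi = (a_p + p_star[iu].sum() - 1) / (a_p + b_p + n_pairs - 2)  # Eq. (15)
    tau = (a_t - 1 + 0.5 * n_pairs) / \
          (b_t + 0.5 * (Omega[iu] ** 2 * d_star[iu]).sum())        # Eq. (16)
    return theta, pi, tau
```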

Finally, the matrix of latent variables can be estimated by finding a point estimate for each entry of the matrix. This is done by updating each row of the matrix independently of the others. As shown in Yang, Chen, and Chen (2017), the objective function to optimize with respect to Z is

$$\log p(Z \mid X, M, B_0, B, \Omega) = \frac{1}{n}\sum_{i=1}^{n}\left[\sum_{j=1}^{p}\tilde{\Gamma}(\alpha_{i,j} + x_{i,j}) - \tilde{\Gamma}\big(s(\alpha_i) + s(X_i)\big) - \sum_{j=1}^{p}\tilde{\Gamma}(\alpha_{i,j}) + \tilde{\Gamma}\big(s(\alpha_i)\big)\right] + \frac{1}{2}\log|\Omega| - \frac{1}{2n}\sum_{i=1}^{n}\big(Z_i - (B_0 + M_i B)\big)\,\Omega\,\big(Z_i - (B_0 + M_i B)\big)^\top, \tag{17}$$

where $\tilde{\Gamma}$ is the log-gamma function and $s(X_i) = \sum_{j=1}^{p} x_{i,j}$. To optimize each Zi, we use the limited-memory BFGS (L-BFGS) algorithm, a quasi-Newton gradient-based method that maintains an approximation to the inverse Hessian to determine the search direction through the variable space.
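A sketch of the row-wise update with scipy's L-BFGS implementation, dropping the terms of (17) that are constant in Z_i; gammaln plays the role of the log-gamma function and all inputs are assumed precomputed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_log_post_row(z, x_i, mean_i, Omega):
    """Negative of the Z_i-dependent part of objective (17)."""
    alpha = np.exp(z)
    dm = (gammaln(alpha + x_i).sum() - gammaln(alpha.sum() + x_i.sum())
          - gammaln(alpha).sum() + gammaln(alpha.sum()))   # Dirichlet-multinomial term
    r = z - mean_i                                         # mean_i = B0 + M_i B
    return -(dm - 0.5 * r @ Omega @ r)

def update_Z_row(z0, x_i, mean_i, Omega):
    res = minimize(neg_log_post_row, z0, args=(x_i, mean_i, Omega),
                   method="L-BFGS-B")
    return res.x
```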

For posterior inference, we iterate through the VI step, which alternates between updating ϕk,j, μk,j, and σk,j, and the E and M steps, which update Ω one column at a time, until the algorithm has converged. For both the VI and the M steps, we run each step until the respective parameter estimates have converged. We consider the algorithm to have converged when the ELBO changes by less than a predefined tolerance from one iteration to the next. To obtain a selected network and set of covariates based on these posterior estimates, we select edges (i, j) with p*i,j in Equation (9) ≥ 0.50, and covariate associations with ϕk,j in Equation (8) ≥ 0.50. In practice, both the p*i,j and ϕk,j values, which reflect the posterior probabilities for the selection of edges and covariates, tend to converge to values close to 0 or close to 1. A similar trend was noted by Kook et al. (2021), who reported that the variational parameters for the marginal posterior probabilities of inclusion tended to become more widely separated as the algorithm converged.

4. Simulation Study

We now compare the performance of our method to existing approaches in a simulation setting designed to mimic the application to microbiome data described later in the article.

4.1. Simulation Setup

Simulated data were generated with the following steps. First, the covariates, M, were generated from a normal distribution MVNorm(0, Ip), and subsequently scaled. The values of the regression coefficients B, related to M, were then sampled. Each element of B, Bk,j, was assigned either a random value in [−1, −0.5] with probability 0.1, a value in [0.5, 1] with probability 0.1, or else 0. Each B0j was then sampled from the interval [6, 8] with probability 0.2, and from [2, 4] with probability 0.8. This allowed for some variables to have larger counts and others to be sparser, as is common in microbiome data. Simulated counts were then sampled by first drawing Z from MVNorm(MB + B0, Ω), with Ω as described below, and then assigning α as exp(Z). Finally, hi was sampled as a random draw from Dirichlet(αi), and Xi drawn from a Multinomial(hi, nint(Ni)), with Ni generated from a Normal(3000, 250), allowing for samples to have different numbers of total counts, where nint() represents the nearest integer function. We set p = 100, q = 50, and n = 300.

To explore performance for a range of possible network structures, we simulated a variety of configurations. These networks were created using the R package huge (Jiang et al. 2019). For this simulation, we used a band, cluster, hub, and random graph structure. An example of what these networks look like can be seen in Figure 1. Band graphs and random graphs are common test cases for network learning, while the hub and cluster graphs capture some aspects of biological networks, such as highly connected nodes and community structure (Girvan and Newman 2002). The probability of an edge in the network was set to 0.025 for the random graph and 0.30 for each cluster in the cluster graph. The bandwidth in the band graph was set to 3 and the number of hubs in the hub graph was set to 3. The precision matrix Ω used to generate the simulated data was also constructed using the function huge.generator of the R package huge (Jiang et al. 2019). Parameters v and u of huge.generator, which control the off-diagonal elements of the precision matrix and the magnitude of the partial correlations, were set to 1 and 0.0001, respectively.

Figure 1.
Simulation study: Example networks used in the simulation studies. Black boxes represent true edges, and light gray boxes correspond to no edge. For the cluster and random graphs, the actual networks that generated the data were different for each simulated dataset, but each simulation kept the blocked shape or random network, respectively.

The results we report below for our proposed model were obtained with the following hyperparameter settings. Fairly non-informative priors were set by choosing aγ = bγ = 2 in the prior probability of inclusion (2) of each covariate. The same setting was used for the hyperparameters aπ and bπ in Equation (4), which determines the prior probability of an edge being included. The standard deviation of the prior on the selected regression coefficients, νB in Equation (2), was set to 1. Following guidelines given by Wang (2015) and Li and McCormick (2019), we set ν1 = 10, the standard deviation in prior (3) on the off-diagonal precision matrix entries corresponding to selected edges, and fit the model across a grid of values, ranging from 0.0001 to 0.1, of ν0, the prior standard deviation of off-diagonal elements of the precision matrix corresponding to nonselected edges. We then chose the final model by using the ν0 value that gave sparsity closest to 0.10. This sparsity level was selected arbitrarily, and did not result in any specific advantage for our method, as none of the simulation networks had sparsity equal to 0.10. Finally, the rate parameter λ of Equation (3), which appears in the prior on diagonal elements of the precision matrix, was set to 150. We also compare the model when the scaling parameter, τ, is learned, and use a Gamma(2,2) prior to do so. We comment on the sensitivity of the results to parameter choices in Section 4.3.

4.2. Simulation Results

We compare the performance of our method to several existing alternative approaches in terms of accuracy in network estimation and covariate selection. For comparison, we used mLDM (Yang, Chen, and Chen 2017), which is specifically designed for estimating networks of compositional data while controlling for covariates, and mSSL-DPE (Deshpande, Ročková, and George 2019), which we applied to the centered log ratio (CLR) transformed version of the simulated count data, a common method to account for the compositionality (Fang et al. 2017; Kurtz et al. 2015). For network estimation, we also considered SpiecEasi (Kurtz et al. 2015), which applies graphical LASSO to the CLR-transformed data. These methods were applied by using the default selection criteria in their respective R packages. To more precisely characterize factors contributing to the network estimation performance of the SINC method, we apply a version of SINC with the covariate effects B constrained to be 0. A comparison of the results from this constrained version of SINC to those of SpiecEasi reflects the performance advantages arising from differences in the network estimation procedure, while comparison to the full unconstrained SINC method provides quantitative insight on the benefit of simultaneous estimation of covariate effects on network recovery. Similarly, for the comparison of variable selection accuracy, we apply a constrained version of SINC with Ω fixed to the identity matrix. Comparison of the results from this approach to those of the full unconstrained SINC method illustrates the added value of accounting for the residual covariance in estimation of B.

We report results in terms of true positive rate (TPR), false positive rate (FPR), F1 score, and Matthews correlation coefficient (MCC). For edge selection, we also report the area under the curve (AUC). This was calculated, for the SINC method, over a grid of ν0 values, and for the mLDM and mSSL-DPE methods by using the LASSO penalization parameter for the coefficients associated with the best selected graph, and then varying the graph penalization parameter over a grid of values. The AUC for SpiecEasi was calculated by varying the penalization parameter. Tables 1 and 2 show the results for network estimation and variable selection, respectively. From Table 1, we can see that mSSL-DPE and SpiecEasi are generally not competitive in terms of their performance, with low F1, MCC, and AUC values. This is likely because mSSL-DPE was not designed for compositional data, and SpiecEasi is not able to account for the effects of the covariates on the counts. We also see that mLDM performs better than the other two methods but is still outperformed by the proposed model, which achieves better F1, MCC, and AUC scores across all of the network structures except the hub structure. Finally, we see that when the proposed model does not control for additional covariates, the network estimation scores decrease and are comparable to those of the other methods. Across all methods, SINC with τ learned performed best in all network structures in terms of F1, MCC, and AUC. The performance metrics for SINC are broadly similar across all network types, though the best performance is achieved on the random graph, while the hub and cluster settings are more challenging. From Table 2, we can see that mSSL-DPE performs quite well in selecting the covariates of interest. In fact, its performance is very close to that of the proposed model, SINC, on all metrics. Similarly, the proposed model with the estimated network and precision matrix held fixed performs well for coefficient estimation, but not as well as mSSL-DPE and the full version of the proposed model. Additionally, we did not see any significant difference in performance between the full SINC models with τ fixed or learned. SpiecEasi is not included in Table 2, as it is not able to select or adjust for relevant covariates, which is a limitation of the method.

Table 1.

Simulation results for network selection.

Network estimation accuracy

Random graph:
                   TPR    FPR    F1     MCC     AUC
mSSL-DPE           0.000  0.015  0.000  −0.019  0.499
SpiecEasi          0.013  0.009  0.018  0.005   0.563
mLDM               0.277  0.181  0.067  0.039   0.560
SINC (B = 0)       0.276  0.225  0.056  0.020   0.542
SINC (τ = 1)       0.420  0.093  0.167  0.169   0.750
SINC (τ learned)   0.598  0.091  0.237  0.263   0.838

Hub graph:
                   TPR    FPR    F1     MCC     AUC
mSSL-DPE           0.001  0.024  0.001  −0.021  0.450
SpiecEasi          0.011  0.009  0.015  0.003   0.528
mLDM               0.440  0.248  0.080  0.069   0.602
SINC (B = 0)       0.253  0.241  0.038  0.004   0.507
SINC (τ = 1)       0.175  0.096  0.059  0.037   0.613
SINC (τ learned)   0.294  0.094  0.098  0.094   0.689

Cluster graph:
                   TPR    FPR     F1      MCC      AUC
mSSL-DPE           0.005  0.0181  0.008   −0.0230  0.446
SpiecEasi          0.013  0.010   0.0182  0.005    0.563
mLDM               0.232  0.155   0.126   0.051    0.544
SINC (B = 0)       0.272  0.229   0.110   0.024    0.534
SINC (τ = 1)       0.294  0.084   0.223   0.169    0.678
SINC (τ learned)   0.411  0.080   0.306   0.261    0.741

Band graph:
                   TPR    FPR    F1     MCC     AUC
mSSL-DPE           0.003  0.022  0.004  −0.032  0.468
SpiecEasi          0.012  0.009  0.020  0.007   0.544
mLDM               0.440  0.248  0.080  0.069   0.602
SINC (B = 0)       0.294  0.242  0.114  0.028   0.533
SINC (τ = 1)       0.311  0.088  0.230  0.175   0.685
SINC (τ learned)   0.446  0.089  0.312  0.269   0.737

NOTE: mSSL-DPE refers to the method of Deshpande, Ročková, and George (2019), SpiecEasi to the method of Kurtz et al. (2015), mLDM to Yang, Chen, and Chen (2017), SINC (B = 0) to the modified version of the proposed model with the covariate estimates fixed, and SINC to the proposed model. Random, Hub, Cluster, and Band refer to the underlying shape of the network, as illustrated in Figure 1. Bold values reflect top performing methods.

Table 2.

Simulation results for covariate selection.

Variable selection accuracy

Random graph:
                   TPR    FPR    F1     MCC
mSSL-DPE           0.840  0.000  0.912  0.898
mLDM               0.609  0.003  0.751  0.734
SINC (Ω = I)       0.808  0.003  0.871  0.889
SINC (τ = 1)       0.914  0.001  0.943  0.953
SINC (τ learned)   0.917  0.000  0.947  0.956

Hub graph:
                   TPR    FPR    F1     MCC
mSSL-DPE           0.868  0.000  0.929  0.915
mLDM               0.612  0.003  0.754  0.738
SINC (Ω = I)       0.813  0.001  0.887  0.894
SINC (τ = 1)       0.925  0.000  0.952  0.960
SINC (τ learned)   0.926  0.001  0.952  0.960

Cluster graph:
                   TPR    FPR    F1     MCC
mSSL-DPE           0.837  0.000  0.910  0.900
mLDM               0.605  0.003  0.747  0.730
SINC (Ω = I)       0.791  0.006  0.854  0.873
SINC (τ = 1)       0.908  0.000  0.941  0.951
SINC (τ learned)   0.910  0.000  0.942  0.952

Band graph:
                   TPR    FPR    F1     MCC
mSSL-DPE           0.853  0.000  0.920  0.906
mLDM               0.597  0.003  0.741  0.725
SINC (Ω = I)       0.808  0.000  0.887  0.893
SINC (τ = 1)       0.921  0.001  0.947  0.957
SINC (τ learned)   0.922  0.000  0.950  0.959

NOTE: mSSL-DPE refers to the method of Deshpande, Ročková, and George (2019), mLDM to the method of Yang, Chen, and Chen (2017), SINC (Ω = I) to the modified version of the proposed model with the precision matrix fixed, and SINC to the proposed model. SpiecEasi is omitted from the comparison, as it does not perform selection or adjustment for covariates. Random, Hub, Cluster, and Band refer to the underlying shape of the network, as illustrated in Figure 1. Bold values reflect top performing methods.

We did not compare our model to MCMC approaches because of the computational complexity resulting from a lack of conjugacy. We did, however, experiment by using a Monte Carlo draw to update Ω at each iteration of SINC, and found that the point estimates of SINC without a Monte Carlo step were very close to the mean of the Monte Carlo draws.

4.3. Influence of Parameters

An advantage of using spike-and-slab priors for covariate and network edge selection, over penalized methods, is the flexible level of sparsity induced on the regression coefficients and the precision matrix entries. For example, Li and McCormick (2019) showed that d*i,j from Equation (10) is comparable to the penalty parameter λ in the graphical LASSO (Friedman, Hastie, and Tibshirani 2008). However, d*i,j is unique to each edge and is adaptively learned from the data. In Figure 2, we show this advantage over penalized methods by plotting the estimated coefficients and precision matrix values, for a smaller simulation scenario with p = 10, q = 15, and n = 5000, using SINC (with τ fixed at 1) and the penalization-based method mLDM, while varying the sparsity-inducing parameters. The top-left plot shows the estimated B coefficients from the proposed model when increasing the variance parameter νB of the spike-and-slab prior in Equation (2). The top-right plot shows the estimated B coefficients from mLDM when increasing the LASSO penalty parameter. The bottom-left plot shows the estimated off-diagonal values of the precision matrix Ω when increasing the variance parameter ν0 in Equation (3), and the bottom-right plot shows the estimated off-diagonal entries of the precision matrix when increasing the graphical LASSO penalty parameter. In all plots, -o- lines correspond to true associations in the simulated data, and -x- lines correspond to coefficients representing no underlying association. The flat trend in the -o- lines of the SINC plots shows that the estimated covariate effects and precision matrix entries corresponding to true associations are stable, while for the penalized methods, shown in the right-hand plots, they are shrunk to zero as the penalty parameters increase.

Figure 2.
Simulation study: The top-left plot shows the estimated B coefficients by the proposed model when increasing the variance parameter νB of the spike-and-slab prior (2). The top-right plot shows the estimated B coefficients via mLDM when increasing the LASSO penalty parameter. The bottom-left plot shows the estimated off-diagonal values of the precision matrix when increasing the variance parameter ν0 in Equation (3) and the bottom-right plot shows the estimated off-diagonal estimates of the precision matrix when increasing the graphical LASSO penalty parameter. In all plots, -o- lines correspond to true associations in the simulated data, and -x- lines correspond to coefficients representing no underlying association.

We conclude this section by providing some comments on the sensitivity of the results to the choice of the hyperparameters. As shown in Figure 2, with sufficient data, the estimates of Ω and B are stable for increasing values of ν0, in the prior of Equation (3), and νB, in the prior of Equation (2), respectively. These parameters, however, affect the sparsity of the selection. In particular, as ν0 increases, holding all other parameters constant, the selected network becomes sparser. Similarly, as ν1, which appears in the prior of Equation (3), increases, holding ν0 constant, the network sparsity increases. In recent work using this type of mixture prior, Li and McCormick (2019) and Ročková and George (2014) have suggested holding ν1 constant while varying ν0. It follows that increasing νB while holding all other parameters constant decreases the number of selected coefficients, as increasing νB is analogous to increasing ν1. The remaining hyperparameters influence sparsity as well, but to a lesser extent. For example, changing aπ and bπ in the prior given in Equation (4) to put more weight on larger values of π results in sparser networks. Similarly, selecting aγ and bγ in the prior of Equation (2) to reflect a stronger prior belief in larger θγ values results in an increase in the number of selected coefficients. Since π and θγ are both updated at each iteration of the SINC algorithm, selecting relatively non-informative priors, such as the aγ = bγ = 2 used in the simulations, allows the sparsity levels to be primarily controlled by ν0 and νB. Alternative choices that would also be appropriate include aγ = bγ = 1, a more non-informative setting corresponding to a uniform prior on the unit interval, or aγ = 1 and bγ = p, which would more strongly favor sparsity, as discussed in Ročková and George (2014). We found that our variable selection results were not overly sensitive to the choice of these parameters. Increasing λ also increases the network sparsity, because it changes the scale of the estimated Ω values by making them smaller. Appropriate λ values need to be selected based on the scale of the data being used.

5. Application to a Study of the Vaginal Microbiome in Pregnancy

In this section, we apply our proposed method to data from the Multi-Omic Microbiome Study: Pregnancy Initiative (MOMS-PI), an NIH-funded study aimed at characterizing the microbiome and its role in shaping maternal and infant health. Previous research has demonstrated that immune and metabolic changes during pregnancy reshape the microbiome, which undergoes large shifts during the course of pregnancy. The vaginal microbiome in particular has been shown to change early in pregnancy (Serrano et al. 2019) and be predictive of pregnancy outcomes such as preterm birth (Fettweis et al. 2019).

The MOMS-PI study involved following pregnant women throughout pregnancy and for a short term after childbirth. Participants in the MOMS-PI study were asked to provide samples from the mouth, skin, vagina, and rectum. Multiple omic technologies were used to process the collected samples, including microbiome profiling, metabolomics, and quantification of cytokine abundances via immunoproteomics. Cytokines, in particular, are one mechanism by which the host regulates the composition of the vaginal microbiome. The data were obtained from the R package HMP2Data and consist of 596 subjects that were sampled across multiple visits. For our analysis, we focus on samples collected at the first baseline visit. Of the 596 subjects, 225 had both microbiome and cytokine profiling of the vagina available at this time point, and we consider these 225 subjects in the analysis. To avoid the inclusion of very rare taxa, the OTUs were filtered for inclusion in the analysis using the following rule: the absolute abundance of an OTU had to be greater than 1 for at least 10 percent of the subjects, resulting in 90 OTUs. All 29 cytokines profiled were included as covariates. For the analysis, the cytokine data were transformed to the log scale and centered.
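The stated filtering rule is straightforward to apply; a small numpy sketch, assuming a subjects-by-OTUs count matrix:

```python
import numpy as np

def filter_otus(counts, min_count=1, min_frac=0.10):
    """Keep OTUs with absolute abundance > min_count in at least min_frac of subjects."""
    keep = (counts > min_count).mean(axis=0) >= min_frac
    return counts[:, keep], keep
```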

We applied the SINC method to estimate the interaction between vaginal microbes, as well as the interplay between vaginal cytokines and microbial abundances. We used the same hyperparameter settings as in the simulation: νB = 1, aγ = 2, bγ = 2, ν1 = 10, λ = 150, aπ = 2, bπ = 2, and set ν0 = 0.01, a value that, in the simulations, achieved a sparsity level of 0.10, and fixed τ to 1. Since we are controlling for cytokine counts when estimating the microbiome network, we are more confident in the selection as we do not expect to select an edge between two microbes that may be related only via their common dependence on a cytokine.

5.1. MOMS-PI Results

Figure 3(a) shows the adjacency matrix of the microbial network inferred from the MOMS-PI data, with filled boxes representing selected edges, together with a plot of the number of edges for each OTU in Figure 3(b). A network diagram of the inferred microbial network is shown in Figure 3(c), with node sizes representing the degree of the nodes: the larger a node, the more edges that node shares with other nodes. In these plots, OTUs are grouped based on their phylum (Firmicutes, Actinobacteria, Bacteroidetes, Proteobacteria, Fusobacteria, and TM7). Looking at the adjacency matrix and network representation, we notice that Actinobacteria have few shared edges with Bacteroidetes, Proteobacteria, and Fusobacteria, instead sharing the majority of their inter-phylum edges with Firmicutes, while the other phyla (Firmicutes, Bacteroidetes, Proteobacteria, and Fusobacteria) show no trend in inter-phylum edges. We also notice that within the Firmicutes subnetwork, OTUs of the genus Lactobacillus (OTUs 1 through 26 in the adjacency matrix) form their own subnetwork with very few inter-genus connections. Also, from the node degree plot we can see that many Firmicutes have large numbers of edges. Indeed, when looking at the most connected nodes of the inferred microbial network, we found that 5 of the 6 most connected OTUs belong to the Firmicutes phylum.

Figure 3.
Case study: Plot (a): Adjacency matrix of the microbial network inferred from the MOMS-PI data, with filled boxes representing selected edges. Plot (b): Number of edges for each OTU, listed in the same order as in the adjacency matrix. Plot (c): Network diagram of the inferred microbial network, with node sizes representing the degree of the nodes.

Next, we show the inference on the microbe-cytokine association network. Figure 4 shows the adjacency matrix of the selected microbe-cytokine associations, with microbes colored based on phylum. We observe clear patterns of association, with both cytokines that show relationships to many OTUs, and OTUs that show relationships with several cytokines. In particular, we see that the cytokines IP-10, IL-1b, IL-17A, FGF basic, and IL-8 have the most associations with OTU abundances. When looking at which OTUs have the most associations with cytokines, we found that 6 of the 10 most connected are Lactobacillus. This is also seen in Figure 4, where many of the first 26 microbes (columns) have several cytokine associations. Lactobacillus has previously been shown to be largely influenced by cytokines (Valenti et al. 2018).

Figure 4.
Case study: Adjacency matrix of the cytokine and microbial associations inferred from the MOMS-PI data. Filled boxes indicate that there is an association between a cytokine (plotted on the y-axis) and an OTU (plotted on the x-axis).

We also compared the results from SINC to those from SpiecEasi (Kurtz et al. 2015) and the B-constrained version of SINC, and found that the two methods that do not control for covariates shared 22 edges that were not selected by the proposed model. Two of these edges can be seen in Figure 5, which shows the network of a subset of three microbes and three cytokines estimated by SINC, as well as the network of the same subset of microbes estimated by SpiecEasi and by the variant of SINC with the B coefficients not estimated. We hypothesize that SINC did not select these two edges, that is, the edge between OTUs 14 and 19 and the edge between OTUs 19 and 20, because methods that do not control for cytokines may incorrectly select an edge when a pair of microbes has a mutual association with a cytokine. This can be seen in Figure 5, where OTUs 14, 19, and 20 all have an association with IP-10. This illustrates the ability of our model to discover covariate effects and a sparse network accounting for these effects.

Figure 5.
Case study: Plot (a) Subnetwork of selected edges in a microbe-microbe network and selected associations between microbes and cytokines, found when using SINC. Plot (b) Subnetwork of selected edges in a microbe-microbe network, found when using both SpiecEasi and a variation of SINC with B fixed to 0. Cytokines are not included in the subnetwork of plot (b), as neither model accounts for cytokines.

6. Discussion

In this article, we have introduced a novel Bayesian hierarchical model for count data that allows for simultaneous estimation of covariate dependence and network interactions. By accounting for covariate selection, simultaneous estimation methods are able to control for those variables, which ultimately leads to more accurate network estimation. We have considered multivariate count data, and specifically compositional data that have a fixed sum constraint, and have modeled the data using a Dirichlet-multinomial likelihood. We have accounted for covariates by modeling the log concentration parameters via a Gaussian distribution, and achieved simultaneous covariate and edge selection via spike-and-slab priors. For posterior inference, we have implemented a variational Bayes approach that includes an EM step to enable efficient estimation. We have shown through simulations that the proposed model outperforms existing methods in its accuracy of network recovery. This is due, in part, to the flexibility of the hierarchical model, as discussed in Section 4.3, which avoids some of the over-shrinkage typical of penalized approaches, as well as the added accuracy from doing simultaneous covariate and network selection. Finally, we have applied the proposed method to data from the MOMS-PI study, to estimate the microbial interactions in the vagina, as well as the interplay between vaginal cytokines and microbial abundances, providing insight into mechanisms of host-microbial interaction during pregnancy.

Although other estimation methods, such as EM, could potentially be applied, we found that our unique hybrid algorithm offers several advantages. As we noted in Section 3, the VI and EM approaches are similar in many ways. VI, however, can enable additional insight on the uncertainty of the parameter estimates, as one can think of the EM as a special case of the VI algorithm when the variational distributions are point estimates. Although this is one potential advantage of the VI estimation of B, we primarily prefer the VI approach for pragmatic reasons regarding performance, since we found in practice that the VI algorithm for variable selection is less sensitive to the choice of the hyperparameters (specifically, the standard deviations of the spike and slab distributions) than alternative approaches for variable selection in the EM framework. This makes the application of our approach simpler, since we can focus on tuning the parameters for the EMGS portion of the algorithm. Moreover, Ray and Szabo (2021) recently demonstrated that the VI algorithm of Carbonetto and Stephens (2012) generally outperformed the EMVS algorithm of Ročková and George (2014) across various simulation settings, suggesting we may be able to obtain more accurate estimates of B under this approach. Finally, VI methods can be used with discrete spike-and-slab priors, whereas continuous spike-and-slab priors are used with EM methods.

Although our VI scheme is more computationally efficient than MCMC sampling, estimating the latent variable matrix Z is still computationally expensive and a bottleneck for this problem. For the case study of this article, the model ran on a cluster using 25 cores and took 16 min (approximately 6.2 CPU hr). Using the case study data and the default R package settings, mSSL-DPE took 49.4 min, and SpiecEasi took 57 sec. Even though SpiecEasi and mSSL-DPE were much faster, they resulted in less accurate predictions. Finally, mLDM was much slower and took over 120 hr per simulation. The dramatic difference in time between mLDM and SINC is due, in part, to SINC being designed for parallel computing. Not only does the computational complexity of our model scale in p, but, because there are n × p latent variables in Z, the speed of our algorithm also scales with n. Avoiding estimation of these latent variables, or finding computationally more efficient estimates, would allow for further scalability of the implementation.

Python code implementing the SINC method is available at https://github.com/Nathan-Osborne/SINC/.

Funding

This work was supported by NSF/DMS 1811568/1811445. CBP was partially supported by NIH/NCI Cancer Center Support Grant (CCSG) P30CA016672.

Footnotes

This article is the winner of the ASA’s Statistical Computing and Graphics student paper competition.

References

  1. Barbieri MM, and Berger JO (2004), “Optimal Predictive Model Selection,” The Annals of Statistics, 32, 870–897.
  2. Blei DM, Kucukelbir A, and McAuliffe JD (2017), “Variational Inference: A Review for Statisticians,” Journal of the American Statistical Association, 112, 859–877.
  3. Brown PJ, Vannucci M, and Fearn T (1998), “Multivariate Bayesian Variable Selection and Prediction,” Journal of the Royal Statistical Society, Series B, 60, 627–641.
  4. Carbonetto P, and Stephens M (2012), “Scalable Variational Inference for Bayesian Variable Selection in Regression, and its Accuracy in Genetic Association Studies,” Bayesian Analysis, 7, 73–108.
  5. Chen J, and Li H (2013), “Variable Selection for Sparse Dirichlet-Multinomial Regression With an Application to Microbiome Data Analysis,” The Annals of Applied Statistics, 7, 418–442.
  6. David LA, Maurice CF, Carmody RN, Gootenberg DB, Button JE, Wolfe BE, Ling AV, Devlin AS, Varma Y, Fischbach MA, Biddinger SB, Dutton RJ, and Turnbaugh PJ (2014), “Diet Rapidly and Reproducibly Alters the Human Gut Microbiome,” Nature, 505, 559–563.
  7. Dempster A (1972), “Covariance Selection,” Biometrics, 28, 157–175.
  8. Deshpande SK, Ročková V, and George EI (2019), “Simultaneous Variable and Covariance Selection With the Multivariate Spike-and-Slab Lasso,” Journal of Computational and Graphical Statistics, 28, 921–931.
  9. Edwards VL, Smith SB, McComb EJ, Tamarelle J, Ma B, Humphrys MS, Gajer P, Gwilliam K, Schaefer AM, Lai SK, Terplan M, Mark KS, Brotman RM, Forney LJ, Bavoil PM, and Ravel J (2019), “The Cervicovaginal Microbiota–Host Interaction Modulates Chlamydia trachomatis Infection,” mBio, 10, e01548-19.
  10. Fan J, and Li R (2001), “Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties,” Journal of the American Statistical Association, 96, 1348–1360.
  11. Fang H, Huang C, Zhao H, and Deng M (2017), “gCoda: Conditional Dependence Network Inference for Compositional Data,” Journal of Computational Biology, 24, 699–708.
  12. Fettweis JM, Serrano MG, Brooks JP, Edwards DJ, Girerd PH, Parikh HI, Huang B, Arodz TJ, Edupuganti L, Glascock AL, Xu J, Jimenez NR, Vivadelli SC, Fong SS, Sheth NU, Jean S, Lee V, Bokhari YA, Lara AM, Mistry SD, Duckworth RA 3rd, Bradley SP, Koparde VN, Orenda XV, Milton SH, Rozycki SK, Matveyev AV, Wright ML, Huzurbazar SV, Jackson EM, Smirnova E, Korlach J, Tsai YC, Dickinson MR, Brooks JL, Drake JI, Chaffin DO, Sexton AL, Gravett MG, Rubens CE, Wijesooriya NR, Hendricks-Muñoz KD, Jefferson KK, Strauss JF 3rd, and Buck GA (2019), “The Vaginal Microbiome and Preterm Birth,” Nature Medicine, 25, 1012–1021.
  13. Friedman J, Hastie T, and Tibshirani R (2008), “Sparse Inverse Covariance Estimation With the Graphical Lasso,” Biostatistics, 9, 432–441.
  14. George EI, and McCulloch RE (1997), “Approaches for Bayesian Variable Selection,” Statistica Sinica, 7, 339–373.
  15. Girvan M, and Newman ME (2002), “Community Structure in Social and Biological Networks,” Proceedings of the National Academy of Sciences, 99, 7821–7826.
  16. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, and Egozcue JJ (2017), “Microbiome Datasets Are Compositional: And This Is Not Optional,” Frontiers in Microbiology, 8, 2224.
  17. Huang X, Wang J, and Liang F (2016), “A Variational Algorithm for Bayesian Variable Selection,” arXiv:1602.07640.
  18. Jiang H, Fei X, Liu H, Roeder K, Lafferty J, Wasserman L, Li X, and Zhao T (2019), huge: High-Dimensional Undirected Graph Estimation, R package version 1.3.3.
  19. Kook JH, Vaughn KA, DeMaster DM, Ewing-Cobbs L, and Vannucci M (2021), “BVAR-Connect: A Variational Bayes Approach to Multi-Subject Vector Autoregressive Models for Inference on Brain Connectivity Networks,” Neuroinformatics, 19, 39–56.
  20. Koslovsky M, and Vannucci M (2020), “MicroBVS: Dirichlet-Tree Multinomial Regression Models With Bayesian Variable Selection – an R Package,” BMC Bioinformatics, 21, 301.
  21. Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, and Bonneau RA (2015), “Sparse and Compositionally Robust Inference of Microbial Ecological Networks,” PLOS Computational Biology, 11, 1–25.
  22. Lenkoski A, and Dobra A (2011), “Computational Aspects Related to Inference in Gaussian Graphical Models With the G-Wishart Prior,” Journal of Computational and Graphical Statistics, 20, 140–157.
  23. Li ZR, McCormick TH, and Clark SJ (2020), “Using Bayesian Latent Gaussian Graphical Models to Infer Symptom Associations in Verbal Autopsies,” Bayesian Analysis, 15, 781–807.
  24. Li ZR, and McCormick TH (2019), “An Expectation Conditional Maximization Approach for Gaussian Graphical Models,” Journal of Computational and Graphical Statistics, 28, 767–777.
  25. Meinshausen N, and Bühlmann P (2006), “High-Dimensional Graphs and Variable Selection With the Lasso,” The Annals of Statistics, 34, 1436–1462.
  26. Miao Y, Kook JH, Lu Y, Guindani M, and Vannucci M (2020), “Scalable Bayesian Variable Selection Regression Models for Count Data,” in Flexible Bayesian Regression Modelling, eds. Fan Y, Smith M, Nott D, and Dortet-Bernadet J, London: Elsevier, pp. 187–219.
  27. Mitchell TJ, and Beauchamp JJ (1988), “Bayesian Variable Selection in Linear Regression,” Journal of the American Statistical Association, 83, 1023–1032.
  28. Ntzoufras I, Dellaportas P, and Forster JJ (2003), “Bayesian Variable and Link Determination for Generalised Linear Models,” Journal of Statistical Planning and Inference, 111, 165–180.
  29. Raftery AE (1996), “Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear Models,” Biometrika, 83, 251–266.
  30. Ray K, and Szabo B (2021), “Variational Bayes for High-Dimensional Linear Regression With Sparse Priors,” Journal of the American Statistical Association.
  31. Richardson S, Bottolo L, and Rosenthal JS (2010), “Bayesian Models for Sparse Regression Analysis of High Dimensional Data,” in Bayesian Statistics (Vol. 9), Oxford: Oxford University Press, pp. 539–569.
  32. Rothman AJ, Levina E, and Zhu J (2010), “Sparse Multivariate Regression With Covariance Estimation,” Journal of Computational and Graphical Statistics, 19, 947–962.
  33. Roverato A (2002), “Hyper Inverse Wishart Distribution for Non-Decomposable Graphs and Its Application to Bayesian Inference for Gaussian Graphical Models,” Scandinavian Journal of Statistics, 29, 391–411.
  34. Ročková V, and George EI (2014), “EMVS: The EM Approach to Bayesian Variable Selection,” Journal of the American Statistical Association, 109, 828–846.
  35. Serrano MG, Parikh HI, Brooks JP, Edwards DJ, Arodz TJ, Edupuganti L, Huang B, Girerd PH, Bokhari YA, and Bradley SP (2019), “Racioethnic Diversity in the Dynamics of the Vaginal Microbiome During Pregnancy,” Nature Medicine, 25, 1001–1011.
  36. Sha N, Vannucci M, Tadesse MG, Brown PJ, Dragoni I, Davies N, Roberts TC, Contestabile A, Salmon M, Buckley C, and Falciani F (2004), “Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage,” Biometrics, 60, 812–819.
  37. Stingo F, Chen Y, Vannucci M, Barrier M, and Mirkes P (2010), “A Bayesian Graphical Modeling Approach to microRNA Regulatory Network Inference,” The Annals of Applied Statistics, 4, 2024–2048.
  38. Talhouk A, Doucet A, and Murphy K (2012), “Efficient Bayesian Inference for Multivariate Probit Models With Sparse Inverse Correlation Matrices,” Journal of Computational and Graphical Statistics, 21, 739–757.
  39. Tang Y, Ma L, and Nicolae D (2018), “A Phylogenetic Scan Test on a Dirichlet-Tree Multinomial Model for Microbiome Data,” The Annals of Applied Statistics, 12, 1–26.
  40. Tibshirani R (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288.
  41. Titsias MK, and Lázaro-Gredilla M (2011), “Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning,” in Advances in Neural Information Processing Systems, Red Hook, NY: Curran Associates, Inc., pp. 2339–2347.
  42. Valenti P, Rosa L, Capobianco D, Lepanto MS, Schiavi E, Cutone A, Paesano R, and Mastromarino P (2018), “Role of Lactobacilli and Lactoferrin in the Mucosal Cervicovaginal Defense,” Frontiers in Immunology, 9, 376.
  43. Vinci G, Ventura V, Smith M, and Kass R (2018), “Adjusted Regularization in Latent Graphical Models: Application to Multiple-Neuron Spike Count Data,” The Annals of Applied Statistics, 12, 1068–1095.
  44. Wadsworth WD, Argiento R, Guindani M, Galloway-Pena J, Shelburne SA, and Vannucci M (2017), “An Integrative Bayesian Dirichlet-Multinomial Regression Model for the Analysis of Taxonomic Abundances in Microbiome Data,” BMC Bioinformatics, 18, 94.
  45. Wang H (2012), “Bayesian Graphical Lasso Models and Efficient Posterior Computation,” Bayesian Analysis, 7, 771–790.
  46. Wang H (2015), “Scaling It Up: Stochastic Search Structure Learning in Graphical Models,” Bayesian Analysis, 10, 351–377.
  47. Yang Y, Chen N, and Chen T (2017), “Inference of Environmental Factor-Microbe and Microbe–Microbe Associations from Metagenomic Data Using a Hierarchical Bayesian Statistical Model,” Cell Systems, 4, 129–137.
  48. Yuan M, and Lin Y (2007), “Model Selection and Estimation in the Gaussian Graphical Model,” Biometrika, 94, 19–35.
  49. Zou H (2006), “The Adaptive Lasso and Its Oracle Properties,” Journal of the American Statistical Association, 101, 1418–1429.