In this paper, we formulate a hierarchical Bayesian version of the Mixture of Unigrams model for text clustering and approach its posterior inference through variational inference. We compute the explicit expression of the variational objective function for our hierarchical model under a mean-field approximation. We then derive the update equations of a suitable algorithm based on coordinate ascent to find local maxima of the variational target, and estimate the model parameters through the optimized variational hyperparameters. The advantages of variational algorithms over traditional Markov Chain Monte Carlo methods based on iterative posterior sampling are also discussed in detail.
Data availability
Available on the websites referenced in the article.
Code availability
Upon request.
Appendix A The Dirichlet-multinomial distribution
The j-th class-conditional distribution of the proposed hierarchical model can be written in closed form by integrating out the Multinomial parameters (in what follows \(z_i = j\)):
where the inverse of the normalization constant c has expression:
Using the standard notation for the multivariate Beta function:
the class-conditional likelihood can be rewritten as:
This probability mass function (pmf) defines the Dirichlet-Multinomial distribution. It was studied, among others, by Mosimann (1962), who showed that the variance of each marginal component of the j-th class-conditional distributions is given by:
Thus, the variance of each class-conditional marginal likelihood exhibits overdispersion with respect to the standard Multinomial distribution. The magnitude of this overdispersion, which depends on the semantic heterogeneity of the underlying documents, is controlled by the term \(p\theta\), with higher values corresponding to lower overdispersion.
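As a numerical illustration of this overdispersion, the following minimal sketch evaluates the Dirichlet-Multinomial pmf through the multivariate Beta function and compares the exact marginal variance with the closed-form expression. It assumes the symmetric prior \(\textsf{Dirichlet}_p(\mathbbm{1}_p\theta)\) of the model, for which the variance of each component of a document of length \(n\) takes the standard form \(n \frac{1}{p}\left(1-\frac{1}{p}\right)\frac{n+p\theta}{1+p\theta}\). All function names are hypothetical, and the enumeration is feasible only for very small \(n\) and \(p\).

```python
import math
from itertools import product


def log_multivariate_beta(a):
    """Log of the multivariate Beta function: sum(lgamma(a_l)) - lgamma(sum(a))."""
    return sum(math.lgamma(x) for x in a) - math.lgamma(sum(a))


def dirichlet_multinomial_logpmf(y, theta):
    """Log pmf of a count vector y under a symmetric Dirichlet-Multinomial.

    Assumes the symmetric prior Dirichlet_p(theta * 1_p) on the Multinomial parameters.
    """
    n, p = sum(y), len(y)
    # Multinomial coefficient n! / prod(y_l!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in y)
    # Ratio of multivariate Beta functions B(y + theta) / B(theta)
    log_ratio = (log_multivariate_beta([c + theta for c in y])
                 - log_multivariate_beta([theta] * p))
    return log_coef + log_ratio


def marginal_variance(n, p, theta):
    """Closed-form variance of one component for a document of length n."""
    pi = 1.0 / p
    return n * pi * (1.0 - pi) * (n + p * theta) / (1.0 + p * theta)


def enumerate_counts(n, p):
    """All count vectors of length p summing to n (small n and p only)."""
    return [y for y in product(range(n + 1), repeat=p) if sum(y) == n]


if __name__ == "__main__":
    n, p, theta = 4, 3, 0.5
    support = enumerate_counts(n, p)
    probs = [math.exp(dirichlet_multinomial_logpmf(y, theta)) for y in support]
    mean = sum(pr * y[0] for pr, y in zip(probs, support))
    var = sum(pr * (y[0] - mean) ** 2 for pr, y in zip(probs, support))
    # Total probability, exact variance by enumeration, closed-form variance
    print(sum(probs), var, marginal_variance(n, p, theta))
```

The inflation factor \((n+p\theta)/(1+p\theta)\) relative to the Multinomial variance decreases toward 1 as \(p\theta\) grows, which is the behaviour described above.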
Appendix B Calculating the ELBO in explicit form
We begin by writing the joint distribution of the latent variables and model parameters that appears in the first term of the ELBO (11):
that is:
We calculate the expected values of these quantities.
\(\boxed {\text {A1}\ }\) By definition, \(y_i \vert \beta , z_i \sim \textsf {Multinomial}_p(\beta _{s})\), where the index s corresponds to the index of the only component of the vector \(z_i\) that is equal to 1. It follows that:
and that:
given (14), since the term \({\textsf{E}}_q \left[ \log \beta _{j\ell } \right]\) is a function of the random variable \(z_i\) through the index s. We now observe that the variational distribution of \(\beta _j\) can be written as:
which is a multiparametric exponential family with:
\(\log \beta _{j\ell }\): minimal sufficient statistics for \(\ell =1,2,\ldots ,p\).
\(u_{j\ell } =\phi _{j\ell } -1\): natural (or canonical) parameters for \(\ell =1,2,\ldots ,p\).
By defining:
it is well known that (in what follows \(\phi _j - 1 \equiv u_j\) componentwise):
Putting everything together, we get the summand (16) of ELBO. \(\square\)
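For reference, the exponential-family argument used in this step can be sketched as follows. The log-normalizer symbol \(A(u_j)\) is our own notation and may differ from the one used in the article; the identity itself is the standard result for the Dirichlet distribution, with \(\Psi\) denoting the Digamma function.

\[
q(\beta_j \vert \phi_j) = \exp\left\{ \sum_{\ell=1}^p u_{j\ell} \log \beta_{j\ell} - A(u_j) \right\}, \qquad
A(u_j) = \sum_{\ell=1}^p \log \Gamma(u_{j\ell}+1) - \log \Gamma\left( \sum_{\ell=1}^p (u_{j\ell}+1) \right),
\]

\[
{\textsf{E}}_q\left[ \log \beta_{j\ell} \right] = \frac{\partial A(u_j)}{\partial u_{j\ell}} = \Psi(\phi_{j\ell}) - \Psi\left( \sum_{\ell'=1}^p \phi_{j\ell'} \right).
\]

The second line uses the fact that, in an exponential family, the expected value of each sufficient statistic equals the partial derivative of the log-normalizer with respect to the corresponding natural parameter.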
\(\boxed {\text {A2}}\) Using the independence between the latent indicator variables \(z_i\) and \(\lambda\) under the variational distribution, and exploiting the representation of the variational distribution of \(\lambda\) as a multiparametric exponential family, we easily obtain the term (17):
\(\boxed {\text {A3}}\) From \(\beta _j \vert \theta \sim {\textsf{Dirichlet}}_p(\mathbbm {1}_p\theta )\) it readily follows that:
which is the expression in (18). \(\square\)
\(\boxed {\text {A4}}\) As in the previous point, from \(\lambda \vert \alpha \sim {\textsf {Dirichlet}}_k (\mathbbm {1}_k \alpha )\) we have:
from which (19) follows:
If we consider the second addend of the ELBO we have the following factorization:
that is:
If we compute the expected value of \(\log q(\beta ,z,\lambda \vert \nu )\) with respect to the variational distribution q, using simple algebra and the representation of the Dirichlet distribution as a multiparametric exponential family introduced above, we find that the expected values with respect to q of \(\boxed {\text {B1}}, \boxed {\text {B2}} {\mbox{ and }} \boxed {\text {B3}}\) coincide, up to sign, with (20), (21) and (22), respectively.
Appendix C Maximizing the ELBO
Since coordinate ascent maximizes the ELBO with respect to one variational parameter at a time, holding all the others constant, we first isolate the terms in the ELBO that depend on the parameter being updated, and then compute the maximum point.
\(\boxed {\gamma _{ij}}\) (\(i=1,2,\ldots ,n\), \(j=1,2,\ldots , k\)). It appears in (16), (17), and (21). We isolate the terms containing \(\gamma _{ij}\) and add a Lagrange multiplier term to the objective function to enforce the constraint that these Multinomial parameters sum to 1 for fixed i:
We take the partial derivatives with respect to \(\gamma _{ij}\) and set them equal to zero:
from which we obtain:
that is:
which must be normalized to 1 for each fixed i according to (26). \(\square\)
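The update of \(\gamma_{ij}\) can be sketched in code as follows. The sketch assumes that the update takes the usual mean-field form \(\gamma_{ij} \propto \exp\{ \Psi(\eta_j) - \Psi(\sum_{j'} \eta_{j'}) + \sum_{\ell} y_{i\ell} [\Psi(\phi_{j\ell}) - \Psi(\sum_{\ell'} \phi_{j\ell'})] \}\), which is what the terms (16), (17) and (21) yield under the Dirichlet variational factors. The names `digamma`, `update_gamma`, `counts`, `phi` and `eta` are hypothetical, and the Digamma function is approximated with a recurrence and an asymptotic expansion so that only the standard library is required.

```python
import math


def digamma(x):
    """Approximate Digamma function for x > 0 (recurrence + asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift the argument upward: psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += (math.log(x) - 0.5 / x
               - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))
    return result


def update_gamma(counts, phi, eta):
    """One coordinate-ascent update of the responsibilities gamma[i][j].

    counts[i][l]: word counts y_il of document i
    phi[j][l]:    Dirichlet parameters of the variational factor of beta_j
    eta[j]:       Dirichlet parameters of the variational factor of lambda
    """
    psi_eta_sum = digamma(sum(eta))
    # Expected log mixing weights under the variational Dirichlet factor
    e_log_lambda = [digamma(e) - psi_eta_sum for e in eta]
    # Expected log word probabilities for each cluster
    e_log_beta = []
    for phi_j in phi:
        psi_sum = digamma(sum(phi_j))
        e_log_beta.append([digamma(v) - psi_sum for v in phi_j])

    gamma = []
    for y in counts:
        log_w = [e_log_lambda[j] + sum(c * e for c, e in zip(y, e_log_beta[j]))
                 for j in range(len(eta))]
        m = max(log_w)  # subtract the maximum before exponentiating, for stability
        w = [math.exp(v - m) for v in log_w]
        total = sum(w)
        gamma.append([v / total for v in w])  # normalize to 1 for each fixed i
    return gamma
```

Working on the log scale and subtracting the maximum avoids underflow when documents are long, since the unnormalized weights are exponentials of sums over all word counts.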
\(\boxed {\eta _{j}}\) (\(j=1,2,\ldots , k\)). Isolating \(\eta _j\), which appears in (17), (19) and (22), we have:
As above, taking the partial derivatives with respect to \(\eta _j\) and setting them to 0, we have:
which is equivalent to the following equation in \(\eta _j\):
For positive arguments, the trigamma function \(\Psi '\) is strictly positive, so \(\Psi '(\eta _j)\) and \(\Psi '\left( \sum _{j=1}^k \eta _j \right)\) can never be zero. Therefore, this equation admits a unique solution if and only if:
that is if and only if:
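As a check on this step, the stationarity condition and its solution can be sketched as follows. We assume here that the \(\eta\)-dependent part of the ELBO has the usual mean-field form for a symmetric \(\textsf{Dirichlet}_k(\mathbbm{1}_k\alpha)\) prior; the shorthand \(\eta_0\) is our own notation and does not come from the article.

\[
\eta_0 = \sum_{j=1}^k \eta_j, \qquad
\Psi'(\eta_j) \left( \alpha + \sum_{i=1}^n \gamma_{ij} - \eta_j \right) - \Psi'(\eta_0) \sum_{j'=1}^k \left( \alpha + \sum_{i=1}^n \gamma_{ij'} - \eta_{j'} \right) = 0,
\]

\[
\eta_j = \alpha + \sum_{i=1}^n \gamma_{ij}, \qquad j=1,2,\ldots,k.
\]

At this point every bracket vanishes, so the condition is satisfied for all j simultaneously. \(\square\)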
\(\boxed {\phi _{j\ell }}\) (\(j=1,2,\ldots ,k\), \(\ell =1,2,\ldots , p\)). Isolating \(\phi _{j\ell }\) in (16), (18) and (20):
Taking the first derivative and setting it to 0:
which, as in the previous case, has a unique solution in \({\phi _{j\ell }}\) given by:
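The two closed-form updates of this appendix can be sketched in code as follows. The sketch assumes the standard conjugate forms \(\eta_j = \alpha + \sum_i \gamma_{ij}\) and \(\phi_{j\ell} = \theta + \sum_i \gamma_{ij} y_{i\ell}\), which are what symmetric Dirichlet priors with parameters \(\alpha\) and \(\theta\) yield under the mean-field approximation. The function and argument names are hypothetical.

```python
def update_eta(gamma, alpha):
    """eta_j = alpha + sum_i gamma_ij: prior mass plus expected cluster size."""
    k = len(gamma[0])
    return [alpha + sum(row[j] for row in gamma) for j in range(k)]


def update_phi(gamma, counts, theta):
    """phi_jl = theta + sum_i gamma_ij * y_il: prior mass plus expected word counts.

    gamma[i][j]:  responsibility of cluster j for document i
    counts[i][l]: number of occurrences of word l in document i
    """
    k, p = len(gamma[0]), len(counts[0])
    return [[theta + sum(g[j] * y[l] for g, y in zip(gamma, counts))
             for l in range(p)]
            for j in range(k)]
```

A full coordinate-ascent sweep would alternate these two updates with the update of \(\gamma_{ij}\) until the ELBO stops increasing; the stopping rule is left to the user in this sketch.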
