Abstract
In this paper, we formulate a hierarchical Bayesian version of the Mixture of Unigrams model for text clustering and carry out its posterior inference through variational methods. We compute the explicit expression of the variational objective function for our hierarchical model under a mean-field approximation. We then derive the update equations of a coordinate ascent algorithm that finds local maxima of the variational objective, and estimate the model parameters through the optimized variational hyperparameters. The advantages of variational algorithms over traditional Markov Chain Monte Carlo methods based on iterative posterior sampling are also discussed in detail.
Data availability
Available on the websites referenced in the article.
Code availability
Upon request.
References
Aggarwal CC, Zhai C (2012) Mining text data. Springer, New York. https://doi.org/10.1007/978-1-4614-3223-4
Airoldi EM, Blei D, Erosheva EA et al (2014) Handbook of mixed membership models and their applications. Chapman and Hall, Boca Raton. https://doi.org/10.1201/b17520
Anastasiu DC, Tagarelli A, Karypis G (2014) Document clustering: the next frontier. In: Aggarwal CC, Reddy CK (eds) Data clustering: algorithms and applications. Chapman & Hall, Boca Raton, pp 305–338
Anderlucci L, Viroli C (2020) Mixtures of Dirichlet-multinomial distributions for supervised and unsupervised classification of short text data. Adv Data Anal Classif 14:759–770. https://doi.org/10.1007/s11634-020-00399-3
Andrews N, Fox E (2007) Recent developments in document clustering. http://hdl.handle.net/10919/19473, Virginia Tech computer science technical report, TR-07-35
Apté C, Damerau F, Weiss SM (1994) Automated learning of decision rules for text categorization. ACM Trans Inf Syst 12:233–251. https://doi.org/10.1145/183422.183423
Awasthi P, Risteski A (2015) On some provably correct cases of variational inference for topic models. In: Cortes C, Lawrence N, Lee D et al (eds) Advances in neural information processing systems, vol 28. Curran Associates, Inc., New York
Baudry JP, Celeux G (2015) EM for mixtures: initialization requires special care. Stat Comput 25:713–726. https://doi.org/10.1007/s11222-015-9561-x
Baudry JP, Maugis C, Michel B (2012) Slope heuristics: overview and implementation. Stat Comput 22:455–470. https://doi.org/10.1007/s11222-011-9236-1
Blanchard P, Higham DJ, Higham NJ (2021) Accurately computing the log-sum-exp and softmax functions. IMA J Numer Anal 41:2311–2330. https://doi.org/10.1093/imanum/draa038
Blei DM (2012) Probabilistic topic models. Commun ACM 55:77–84. https://doi.org/10.1145/2133806.2133826
Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat. https://doi.org/10.1214/07-AOAS114
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J Am Stat Assoc 112:859–877. https://doi.org/10.1080/01621459.2017.1285773
Celeux G, Hurn M, Robert CP (2000) Computational and inferential difficulties with mixture posterior distributions. J Am Stat Assoc 95:957–970. https://doi.org/10.1080/01621459.2000.10474285
Celeux G, Frühwirth-Schnatter S, Robert CP (2018a) Model selection for mixture models—perspectives and strategies. In: Frühwirth-Schnatter S, Celeux G, Robert CP (eds) Handbook of mixture analysis. Chapman & Hall, New York, pp 118–154. https://doi.org/10.1201/9780429055911
Celeux G, Kamary K, Malsiner-Walli G et al (2018b) Computational solutions for Bayesian inference in mixture models. In: Frühwirth-Schnatter S, Celeux G, Robert CP (eds) Handbook of mixture analysis. Chapman & Hall, New York, pp 73–115. https://doi.org/10.1201/9780429055911
Chandra NK, Canale A, Dunson DB (2020) Escaping the curse of dimensionality in Bayesian model-based clustering. arXiv:2006.02700
Dayton CM, Macready GB (1988) Concomitant-variable latent-class models. J Am Stat Assoc 83:173. https://doi.org/10.2307/2288938
Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42:143–175. https://doi.org/10.1023/A:1007612920971
Diebolt J, Robert CP (1994) Estimation of finite mixture distributions through Bayesian sampling. J R Stat Soc Ser B (Methodol) 56:363–375. https://doi.org/10.1111/j.2517-6161.1994.tb01985.x
Feinerer I, Hornik K, Meyer D (2008) Text mining infrastructure in R. J Stat Softw. https://doi.org/10.18637/jss.v025.i05
Feinerer I, Hornik K (2020) tm: text mining package. https://CRAN.R-project.org/package=tm, R package version 0.7-8
Frühwirth-Schnatter S (2004) Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. Econom J 7:143–167. https://doi.org/10.1111/j.1368-423X.2004.00125.x
Frühwirth-Schnatter S (2006) Finite mixture and Markov switching models. Springer, New York. https://doi.org/10.1007/978-0-387-35768-3
Gelman A, Carlin J, Stern H et al (2013) Bayesian data analysis, 3rd edn. Chapman and Hall, Boca Raton
Ghahramani Z (2015) Probabilistic machine learning and artificial intelligence. Nature 521:452–459. https://doi.org/10.1038/nature14541
Greene D, Cunningham P (2006) Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd international conference on machine learning (ICML’06). ACM Press, pp 377–384
Harris ZS (1954) Distributional structure. WORD 10:146–162. https://doi.org/10.1080/00437956.1954.11659520
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer, New York. https://doi.org/10.1007/978-0-387-84858-7
Hornik K, Feinerer I, Kober M et al (2012) Spherical \(k\)-means clustering. J Stat Softw. https://doi.org/10.18637/jss.v050.i10
Jordan MI, Ghahramani Z, Jaakkola TS et al (1999) An introduction to variational methods for graphical models. Mach Learn 37:183–233. https://doi.org/10.1023/A:1007665907178
Kaggle (2022) Sports dataset(bbc). https://www.kaggle.com/datasets/maneesh99/sports-datasetbbc. Accessed 04 Nov 2022
Keribin C (2000) Consistent estimation of the order of mixture models. Sankhyā Indian J Stat Ser A (1961–2002) 62:49–66
Kunkel D, Peruggia M (2020) Anchored Bayesian Gaussian mixture models. Electron J Stat. https://doi.org/10.1214/20-EJS1756
Lee SY (2021) Gibbs sampler and coordinate ascent variational inference: a set-theoretical review. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2021.1921214
Li H, Fan X (2016) A pivotal allocation-based algorithm for solving the label-switching problem in Bayesian mixture models. J Comput Graph Stat 25:266–283. https://doi.org/10.1080/10618600.2014.983643
Maechler M (2022) Rmpfr: R mpfr—multiple precision floating-point reliable. https://cran.r-project.org/package=Rmpfr, R package version 0.8-9
Malsiner-Walli G, Frühwirth-Schnatter S, Grün B (2016) Model-based clustering based on sparse finite Gaussian mixtures. Stat Comput 26:303–324. https://doi.org/10.1007/s11222-014-9500-2
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Marin JM, Robert C (2008) Approximating the marginal likelihood in mixture models. Indian Bayesian Soc Newslett 5:2–7
Mosimann JE (1962) On the compound multinomial distribution, the multivariate Beta-distribution, and correlations among proportions. Biometrika 49:65–82. https://doi.org/10.1093/biomet/49.1-2.65
Murphy KP (2012) Machine learning: a probabilistic perspective. The MIT Press, Cambridge
Nielsen F, Garcia V (2009) Statistical exponential families: a digest with flash cards. arXiv:0911.4863
Nigam K, Mccallum AK, Thrun S et al (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39:103–134. https://doi.org/10.1023/A:1007692713085
Nikita M (2020) ldatuning: tuning of the latent Dirichlet allocation models parameters. https://CRAN.R-project.org/package=ldatuning, R package version 1.0.2
Plummer S, Pati D, Bhattacharya A (2020) Dynamics of coordinate ascent variational inference: a case study in 2D Ising models. Entropy 22:1263. https://doi.org/10.3390/e22111263
Pollice A, Bilancia M (2000) A hierarchical finite mixture model for Bayesian classification in the presence of auxiliary information. Metron Int J Stat LVIII:109–131
R Core Team (2022) R: a language and environment for statistical computing. https://www.R-project.org/
Rakib MRH, Zeh N, Jankowska M et al (2020) Enhancement of short text clustering by iterative classification. In: Métais E, Meziane F, Horacek H et al (eds) Natural language processing and information systems. Springer, Berlin, pp 105–117. https://doi.org/10.1007/978-3-030-51310-8_10
Robert CP (2007) The Bayesian choice. Springer, New York. https://doi.org/10.1007/0-387-71599-1
Sankaran K, Holmes SP (2019) Latent variable modeling for the microbiome. Biostatistics 20:599–614. https://doi.org/10.1093/biostatistics/kxy018
Silverman J (2022) RcppHungarian: solves minimum cost bipartite matching problems. https://CRAN.R-project.org/package=RcppHungarian, R package version 0.2
Stephens M (2000) Dealing with label switching in mixture models. J R Stat Soc Ser B (Stat Methodol) 62:795–809. https://doi.org/10.1111/1467-9868.00265
Titterington DM, Wang B (2006) Convergence properties of a general algorithm for calculating variational Bayesian estimates for a Normal mixture model. Bayesian Anal. https://doi.org/10.1214/06-BA121
Tran MN, Nguyen TN, Dao VH (2021) A practical tutorial on variational Bayes. arXiv:2103.01327
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Wainwright MJ, Jordan MI (2007) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1:1–305. https://doi.org/10.1561/2200000001
Wallach H, Mimno D, McCallum A (2009) Rethinking LDA: why priors matter. In: Bengio Y, Schuurmans D, Lafferty J et al (eds) Advances in neural information processing systems, vol 22. Curran Associates Inc., New York
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193. https://doi.org/10.1007/s40745-015-0040-1
Zhang C, Butepage J, Kjellstrom H et al (2019) Advances in variational inference. IEEE Trans Pattern Anal Mach Intell 41:2008–2026. https://doi.org/10.1109/TPAMI.2018.2889774
Zhang C, Kjellström H (2015) How to supervise topic models. In: Agapito L, Bronstein MM, Rother C (eds) Computer vision—ECCV 2014 workshops. Springer, Cham, pp 500–515. https://doi.org/10.1007/978-3-319-16181-5_39
Acknowledgements
We wish to thank the Associate Editor for his help and support and the two anonymous referees for their careful and constructive reviews.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
The authors contributed to the manuscript equally.
Ethics declarations
Conflict of interest
The authors have no conflict of interest to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A The Dirichlet-multinomial distribution
The j-th class-conditional distribution of the proposed hierarchical model can be written in closed form by integrating out the Multinomial parameters (in what follows \(z_i = j\)):
where the inverse of the normalization constant \(c\) has the expression:
Using the standard notation for the multivariate Beta function:
the class-conditional likelihood can be rewritten as:
This probability mass function (pmf) defines the Dirichlet-Multinomial distribution. It was studied, among others, by Mosimann (1962), who showed that the variance of each marginal component of the j-th class-conditional distribution is given by:
Thus, each class-conditional marginal exhibits overdispersion with respect to the corresponding standard Multinomial distribution. The magnitude of this overdispersion, which reflects the semantic heterogeneity of the underlying documents, is controlled by the term \(p\theta\), with larger values corresponding to less overdispersion.
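For the reader's convenience, the standard form of the Dirichlet-Multinomial pmf and of its marginal variance can be stated as follows; the notation used here (term counts \(y_{i\ell}\), document length \(M_i = \sum_{\ell=1}^p y_{i\ell}\), symmetric hyperparameter \(\theta\)) is our own and may differ in minor details from the paper's numbered displays.

\[
p(y_i \vert z_i = j, \theta) = \frac{M_i!}{\prod_{\ell=1}^{p} y_{i\ell}!}\, \frac{B(\mathbbm{1}_p\theta + y_i)}{B(\mathbbm{1}_p\theta)}, \qquad B(a) = \frac{\prod_{\ell=1}^{p}\Gamma(a_\ell)}{\Gamma\left( \sum_{\ell=1}^{p} a_\ell \right)},
\]
\[
\textsf{Var}(y_{i\ell} \vert z_i = j) = \frac{M_i}{p}\left( 1 - \frac{1}{p}\right) \frac{M_i + p\theta}{1 + p\theta}.
\]

The second factor, \((M_i + p\theta)/(1 + p\theta) \ge 1\), is the overdispersion factor relative to the Multinomial variance, and it decreases towards 1 as \(p\theta\) grows.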
Appendix B Calculating the ELBO in explicit form
We begin by writing the joint distribution of the latent variables and model parameters that appears in the first term of the ELBO (11):
that is:
We calculate the expected values of these quantities.
\(\boxed{\text{A1}}\) By definition, \(y_i \vert \beta , z_i \sim \textsf{Multinomial}_p(\beta _{s})\), where s is the index of the only component of the vector \(z_i\) that is equal to 1. It follows that:
and that:
given (14), since the term \({\textsf{E}}_q \left[ \log \beta _{j\ell } \right]\) is a function of the random variable \(z_i\) through the index s. We now observe that the variational distribution of \(\beta _j\) can be written as:
which is a multiparametric exponential family with:
-
\(\log \beta _{j\ell }\): minimal sufficient statistics for \(\ell =1,2,\ldots ,p\).
-
\(u_{j\ell } =\phi _{j\ell } -1\): natural (or canonical) parameters for \(\ell =1,2,\ldots ,p\).
By defining:
it is well known that (in what follows \(\phi _j - 1 \equiv u_j\) componentwise):
Putting everything together, we get the summand (16) of the ELBO. \(\square\)
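For reference, the well-known result invoked here is the standard exponential-family property that the expected sufficient statistics equal the gradient of the log-normalizer with respect to the natural parameters; for the Dirichlet variational factor above, written in the notation of this appendix with \(\Psi\) denoting the Digamma function, it reads:

\[
\textsf{E}_q \left[ \log \beta_{j\ell} \right] = \Psi(\phi_{j\ell}) - \Psi\left( \sum_{\ell'=1}^{p} \phi_{j\ell'} \right), \qquad \ell = 1,2,\ldots ,p.
\]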
\(\boxed {\text {A2}}\) Using the independence between the latent indicator variables \(z_i\) and \(\lambda\) under the variational distribution, and exploiting the representation of the variational distribution of \(\lambda\) as a multiparametric exponential family, we easily obtain the term (17):
\(\square\)
\(\boxed {\text {A3}}\) From \(\beta _j \vert \theta \sim {\textsf{Dirichlet}}_p(\mathbbm {1}_p\theta )\) it readily follows that:
and:
which is the expression in (18). \(\square\)
\(\boxed {\text {A4}}\) As in the previous point, from \(\lambda \vert \alpha \sim {\textsf {Dirichlet}}_k (\mathbbm {1}_k \alpha )\) we have:
from which (19) follows:
\(\square\)
If we consider the second addend of the ELBO, we have the following factorization:
that is:
Computing the expected value of \(\log q(\beta ,z,\lambda \vert \nu )\) with respect to the variational distribution q, and using simple algebra together with the representation of the Dirichlet distribution as a multiparametric exponential family introduced above, we find that the expected values with respect to q of \(\boxed{\text{B1}}, \boxed{\text{B2}}\) and \(\boxed{\text{B3}}\) correspond to (20), (21) and (22) up to the sign, respectively.
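To illustrate how these closed-form expectations can be evaluated in practice, the following R sketch computes, for a single component j, the two Dirichlet-related building blocks of the ELBO, namely \(\textsf{E}_q[\log p(\beta_j \vert \theta)]\) and the entropy of the variational factor \(q(\beta_j \vert \phi_j)\). The function and argument names are illustrative and do not reproduce the authors' code.

# Illustrative sketch (not the authors' code): closed-form ELBO contributions of a
# single Dirichlet variational factor q(beta_j) = Dirichlet_p(phi_j), using
# E_q[log beta_jl] = digamma(phi_jl) - digamma(sum(phi_j)).
elbo_dirichlet_terms <- function(phi_j, theta) {
  p <- length(phi_j)
  e_log_beta <- digamma(phi_j) - digamma(sum(phi_j))
  # E_q[log p(beta_j | theta)], with prior Dirichlet_p(1_p * theta)
  e_log_prior <- lgamma(p * theta) - p * lgamma(theta) + sum((theta - 1) * e_log_beta)
  # Entropy of the variational factor, i.e. -E_q[log q(beta_j | phi_j)]
  entropy_q <- -(lgamma(sum(phi_j)) - sum(lgamma(phi_j)) +
                   sum((phi_j - 1) * e_log_beta))
  c(e_log_prior = e_log_prior, entropy = entropy_q)
}

# Example with p = 4 vocabulary terms and symmetric hyperparameter theta = 0.5
elbo_dirichlet_terms(phi_j = c(2.1, 0.7, 5.3, 1.4), theta = 0.5)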
Appendix C Maximizing the ELBO
Since the ELBO is maximized with respect to each variational parameter in turn, holding all the others fixed, we first isolate the terms of the ELBO that depend on the parameter being updated, and then compute the maximum point.
\(\boxed {\gamma _{ij}}\) (\(i=1,2,\ldots ,n\), \(j=1,2,\ldots , k\)). This parameter appears in (16), (17), and (21). We isolate the terms containing \(\gamma _{ij}\) and add a Lagrange multiplier term to the objective function to enforce the constraint that these Multinomial parameters sum to 1 for each fixed i:
We take the partial derivative with respect to \(\gamma _{ij}\) and set it equal to zero:
from which we obtain:
that is:
which must be normalized to 1 for each fixed i according to (26). \(\square\)
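In practice this normalization is best carried out on the log scale, with the max-shift of the log-sum-exp trick (Blanchard et al. 2021) to avoid numerical overflow. The following R sketch shows the standard form of this responsibility update for a single document with term-count vector \(y_i\); the function and variable names are illustrative, and the expression inside the exponential is the usual CAVI form rather than a verbatim transcription of (26).

# Illustrative sketch of the standard CAVI responsibility update for one document.
# y_i: term-count vector of length p; eta: length-k variational Dirichlet parameters
# for the mixture weights; phi: k x p matrix of variational Dirichlet parameters
# for the topic-word distributions.
update_gamma_i <- function(y_i, eta, phi) {
  e_log_lambda <- digamma(eta) - digamma(sum(eta))
  e_log_beta   <- digamma(phi) - digamma(rowSums(phi))  # row j = E_q[log beta_j]
  log_gamma    <- e_log_lambda + as.vector(e_log_beta %*% y_i)
  w <- exp(log_gamma - max(log_gamma))                  # log-sum-exp max-shift
  w / sum(w)                                            # normalize to 1 over j
}

The max-shift leaves the normalized weights unchanged while keeping the largest exponent at zero, so no term overflows even for long documents.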
\(\boxed {\eta _{j}}\) (\(j=1,2,\ldots , k\)). Isolating \(\eta _j\), which appears in (17), (19) and (22), we have:
As above, taking the partial derivatives with respect to \(\eta _j\) and setting them to 0, we have:
which is equivalent to the following equation in \(\eta _j\):
For positive arguments, the Trigamma function \(\Psi'\) (the derivative of the Digamma function) is strictly positive, so \(\Psi'(\eta _j)\) and \(\Psi'\left( \sum _{j=1}^k \eta _j \right)\) cannot be simultaneously zero. Therefore, this equation admits a unique solution if and only if:
that is if and only if:
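In the standard mean-field treatment of a symmetric \(\textsf{Dirichlet}_k(\mathbbm{1}_k\alpha)\) prior on the mixture weights, the resulting fixed point takes the familiar closed form below; we state it in the notation of this appendix as the standard result, not as a verbatim transcription of the original display:

\[
\eta_j = \alpha + \sum_{i=1}^{n} \gamma_{ij}, \qquad j = 1,2,\ldots ,k.
\]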
\(\boxed {\phi _{j\ell }}\) (\(j=1,2,\ldots ,k\), \(\ell =1,2,\ldots , p\)). Isolating \(\phi _{j\ell }\) in (16), (18) and (20):
Taking the first derivative and setting it to 0:
which, as in the previous case, has a unique solution in \(\phi _{j\ell }\) given by:
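Stated in standard form (again in the notation of this appendix, as the usual closed-form mean-field update rather than the original display), this solution reads:

\[
\phi_{j\ell} = \theta + \sum_{i=1}^{n} \gamma_{ij}\, y_{i\ell}, \qquad j=1,2,\ldots ,k, \quad \ell =1,2,\ldots ,p,
\]
where \(y_{i\ell}\) denotes the count of term \(\ell\) in document \(i\).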
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bilancia, M., Di Nanni, M., Manca, F. et al. Variational Bayes estimation of hierarchical Dirichlet-multinomial mixtures for text clustering. Comput Stat 38, 2015–2051 (2023). https://doi.org/10.1007/s00180-023-01350-8