Abstract
In this paper, we formulate a hierarchical Bayesian version of the Mixture of Unigrams model for text clustering and carry out its posterior inference through variational methods. We compute the explicit expression of the variational objective function for our hierarchical model under a mean-field approximation. We then derive the update equations of a coordinate ascent algorithm that finds local maxima of the variational objective, and estimate the model parameters through the optimized variational hyperparameters. The advantages of variational algorithms over traditional Markov Chain Monte Carlo methods based on iterative posterior sampling are also discussed in detail.
Data availability
Available on the websites referenced in the article.
Code availability
Upon request.
References
Aggarwal CC, Zhai C (2012) Mining text data. Springer, New York. https://doi.org/10.1007/978-1-4614-3223-4
Airoldi EM, Blei D, Erosheva EA et al (2014) Handbook of mixed membership models and their applications. Chapman and Hall, Boca Raton. https://doi.org/10.1201/b17520
Anastasiu DC, Tagarelli A, Karypis G (2014) Document clustering: the next frontier. In: Aggarwal CC, Reddy CK (eds) Data clustering: algorithms and applications. Chapman & Hall, Boca Raton, pp 305–338
Anderlucci L, Viroli C (2020) Mixtures of Dirichlet-multinomial distributions for supervised and unsupervised classification of short text data. Adv Data Anal Classif 14:759–770. https://doi.org/10.1007/s11634-020-00399-3
Andrews N, Fox E (2007) Recent developments in document clustering. http://hdl.handle.net/10919/19473, Virginia Tech computer science technical report, TR-07-35
Apté C, Damerau F, Weiss SM (1994) Automated learning of decision rules for text categorization. ACM Trans Inf Syst 12:233–251. https://doi.org/10.1145/183422.183423
Awasthi P, Risteski A (2015) On some provably correct cases of variational inference for topic models. In: Cortes C, Lawrence N, Lee D et al (eds) Advances in neural information processing systems, vol 28. Curran Associates, Inc., New York
Baudry JP, Celeux G (2015) EM for mixtures: initialization requires special care. Stat Comput 25:713–726. https://doi.org/10.1007/s11222-015-9561-x
Baudry JP, Maugis C, Michel B (2012) Slope heuristics: overview and implementation. Stat Comput 22:455–470. https://doi.org/10.1007/s11222-011-9236-1
Blanchard P, Higham DJ, Higham NJ (2021) Accurately computing the log-sum-exp and softmax functions. IMA J Numer Anal 41:2311–2330. https://doi.org/10.1093/imanum/draa038
Blei DM (2012) Probabilistic topic models. Commun ACM 55:77–84. https://doi.org/10.1145/2133806.2133826
Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat. https://doi.org/10.1214/07-AOAS114
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J Am Stat Assoc 112:859–877. https://doi.org/10.1080/01621459.2017.1285773
Celeux G, Hurn M, Robert CP (2000) Computational and inferential difficulties with mixture posterior distributions. J Am Stat Assoc 95:957–970. https://doi.org/10.1080/01621459.2000.10474285
Celeux G, Frühwirth-Schnatter S, Robert CP (2018a) Model selection for mixture models—perspectives and strategies. In: Frühwirth-Schnatter S, Celeux G, Robert CP (eds) Handbook of mixture analysis. Chapman & Hall, New York, pp 118–154. https://doi.org/10.1201/9780429055911
Celeux G, Kamary K, Malsiner-Walli G et al (2018b) Computational solutions for Bayesian inference in mixture models. In: Frühwirth-Schnatter S, Celeux G, Robert CP (eds) Handbook of mixture analysis. Chapman & Hall, New York, pp 73–115. https://doi.org/10.1201/9780429055911
Chandra NK, Canale A, Dunson DB (2020) Escaping the curse of dimensionality in Bayesian model-based clustering. arXiv:2006.02700
Dayton CM, Macready GB (1988) Concomitant-variable latent-class models. J Am Stat Assoc 83:173. https://doi.org/10.2307/2288938
Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42:143–175. https://doi.org/10.1023/A:1007612920971
Diebolt J, Robert CP (1994) Estimation of finite mixture distributions through Bayesian sampling. J R Stat Soc Ser B (Methodol) 56:363–375. https://doi.org/10.1111/j.2517-6161.1994.tb01985.x
Feinerer I, Hornik K, Meyer D (2008) Text mining infrastructure in R. J Stat Softw. https://doi.org/10.18637/jss.v025.i05
Feinerer I, Hornik K (2020) tm: text mining package. https://CRAN.R-project.org/package=tm, R package version 0.7-8
Frühwirth-Schnatter S (2004) Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. Econom J 7:143–167. https://doi.org/10.1111/j.1368-423X.2004.00125.x
Frühwirth-Schnatter S (2006) Finite mixture and Markov switching models. Springer, New York. https://doi.org/10.1007/978-0-387-35768-3
Gelman A, Carlin J, Stern H et al (2013) Bayesian data analysis, 3rd edn. Chapman and Hall, Boca Raton
Ghahramani Z (2015) Probabilistic machine learning and artificial intelligence. Nature 521:452–459. https://doi.org/10.1038/nature14541
Greene D, Cunningham P (2006) Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd international conference on machine learning (ICML’06). ACM Press, pp 377–384
Harris ZS (1954) Distributional structure. WORD 10:146–162. https://doi.org/10.1080/00437956.1954.11659520
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer, New York. https://doi.org/10.1007/978-0-387-84858-7
Hornik K, Feinerer I, Kober M et al (2012) Spherical \(k\)-means clustering. J Stat Softw. https://doi.org/10.18637/jss.v050.i10
Jordan MI, Ghahramani Z, Jaakkola TS et al (1999) An introduction to variational methods for graphical models. Mach Learn 37:183–233. https://doi.org/10.1023/A:1007665907178
Kaggle (2022) Sports dataset(bbc). https://www.kaggle.com/datasets/maneesh99/sports-datasetbbc. Accessed 04 Nov 2022
Keribin C (2000) Consistent estimation of the order of mixture models. Sankhyā Indian J Stat Ser A (1961–2002) 62:49–66
Kunkel D, Peruggia M (2020) Anchored Bayesian Gaussian mixture models. Electron J Stat. https://doi.org/10.1214/20-EJS1756
Lee SY (2021) Gibbs sampler and coordinate ascent variational inference: a set-theoretical review. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2021.1921214
Li H, Fan X (2016) A pivotal allocation-based algorithm for solving the label-switching problem in Bayesian mixture models. J Comput Graph Stat 25:266–283. https://doi.org/10.1080/10618600.2014.983643
Maechler M (2022) Rmpfr: R mpfr—multiple precision floating-point reliable. https://cran.r-project.org/package=Rmpfr, R package version 0.8-9
Malsiner-Walli G, Frühwirth-Schnatter S, Grün B (2016) Model-based clustering based on sparse finite Gaussian mixtures. Stat Comput 26:303–324. https://doi.org/10.1007/s11222-014-9500-2
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Marin JM, Robert C (2008) Approximating the marginal likelihood in mixture models. Indian Bayesian Soc Newslett 5:2–7
Mosimann JE (1962) On the compound multinomial distribution, the multivariate Beta-distribution, and correlations among proportions. Biometrika 49:65–82. https://doi.org/10.1093/biomet/49.1-2.65
Murphy KP (2012) Machine learning: a probabilistic perspective. The MIT Press, Cambridge
Nielsen F, Garcia V (2009) Statistical exponential families: a digest with flash cards. arXiv:0911.4863
Nigam K, Mccallum AK, Thrun S et al (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39:103–134. https://doi.org/10.1023/A:1007692713085
Nikita M (2020) ldatuning: tuning of the latent Dirichlet allocation models parameters. https://CRAN.R-project.org/package=ldatuning, R package version 1.0.2
Plummer S, Pati D, Bhattacharya A (2020) Dynamics of coordinate ascent variational inference: a case study in 2D Ising models. Entropy 22:1263. https://doi.org/10.3390/e22111263
Pollice A, Bilancia M (2000) A hierarchical finite mixture model for Bayesian classification in the presence of auxiliary information. Metron Int J Stat LVIII:109–131
R Core Team (2022) R: a language and environment for statistical computing. https://www.R-project.org/
Rakib MRH, Zeh N, Jankowska M et al (2020) Enhancement of short text clustering by iterative classification. In: Métais E, Meziane F, Horacek H et al (eds) Natural language processing and information systems. Springer, Berlin, pp 105–117. https://doi.org/10.1007/978-3-030-51310-8_10
Robert CP (2007) The Bayesian choice. Springer, New York. https://doi.org/10.1007/0-387-71599-1
Sankaran K, Holmes SP (2019) Latent variable modeling for the microbiome. Biostatistics 20:599–614. https://doi.org/10.1093/biostatistics/kxy018
Silverman J (2022) RcppHungarian: solves minimum cost bipartite matching problems. https://CRAN.R-project.org/package=RcppHungarian, R package version 0.2
Stephens M (2000) Dealing with label switching in mixture models. J R Stat Soc Ser B (Stat Methodol) 62:795–809. https://doi.org/10.1111/1467-9868.00265
Titterington DM, Wang B (2006) Convergence properties of a general algorithm for calculating variational Bayesian estimates for a Normal mixture model. Bayesian Anal. https://doi.org/10.1214/06-BA121
Tran MN, Nguyen TN, Dao VH (2021) A practical tutorial on variational Bayes. arXiv:2103.01327
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Wainwright MJ, Jordan MI (2007) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1:1–305. https://doi.org/10.1561/2200000001
Wallach H, Mimno D, McCallum A (2009) Rethinking LDA: why priors matter. In: Bengio Y, Schuurmans D, Lafferty J et al (eds) Advances in neural information processing systems, vol 22. Curran Associates Inc., New York
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193. https://doi.org/10.1007/s40745-015-0040-1
Zhang C, Butepage J, Kjellstrom H et al (2019) Advances in variational inference. IEEE Trans Pattern Anal Mach Intell 41:2008–2026. https://doi.org/10.1109/TPAMI.2018.2889774
Zhang C, Kjellström H (2015) How to supervise topic models. In: Agapito L, Bronstein MM, Rother C (eds) Computer vision—ECCV 2014 workshops. Springer, Cham, pp 500–515. https://doi.org/10.1007/978-3-319-16181-5_39
Acknowledgements
We wish to thank the Associate Editor for his help and support and the two anonymous referees for their careful and constructive reviews.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
The authors contributed to the manuscript equally.
Ethics declarations
Conflict of interest
The authors have no conflict of interest to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A The Dirichlet-multinomial distribution
The j-th class-conditional distribution of the proposed hierarchical model can be written in closed form by integrating out the Multinomial parameters (in what follows \(z_i = j\)):
where the inverse of the normalization constant \(c\) has the expression:
Using the standard notation for the multivariate Beta function:
the class-conditional likelihood can be rewritten as:
This probability mass function (pmf) defines the Dirichlet-Multinomial distribution. It was studied, among others, by Mosimann (1962), who showed that the variance of each marginal component of the j-th class-conditional distribution is given by:
Thus, each class-conditional marginal exhibits overdispersion with respect to the corresponding standard Multinomial distribution. The magnitude of this overdispersion, which reflects the semantic heterogeneity of the underlying documents, is controlled by the term \(p\theta\), with larger values corresponding to less overdispersion.
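For the reader's convenience, the standard form of the Dirichlet-Multinomial pmf and of its marginal variance can be stated as follows; the notation used here (term counts \(y_{i\ell}\), document length \(M_i = \sum_{\ell=1}^p y_{i\ell}\), symmetric hyperparameter \(\theta\)) is our own and may differ in minor details from the paper's numbered displays.

\[
p(y_i \vert z_i = j, \theta) = \frac{M_i!}{\prod_{\ell=1}^{p} y_{i\ell}!}\, \frac{B(\mathbbm{1}_p\theta + y_i)}{B(\mathbbm{1}_p\theta)}, \qquad B(a) = \frac{\prod_{\ell=1}^{p}\Gamma(a_\ell)}{\Gamma\left( \sum_{\ell=1}^{p} a_\ell \right)},
\]
\[
\textsf{Var}(y_{i\ell} \vert z_i = j) = \frac{M_i}{p}\left( 1 - \frac{1}{p}\right) \frac{M_i + p\theta}{1 + p\theta}.
\]

The second factor, \((M_i + p\theta)/(1 + p\theta) \ge 1\), is the overdispersion factor relative to the Multinomial variance, and it decreases towards 1 as \(p\theta\) grows.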
Appendix B Calculating the ELBO in explicit form
We begin by writing the joint distribution of the latent variables and model parameters that appears in the first term of the ELBO (11):
that is:
We calculate the expected values of these quantities.
\(\boxed{\text{A1}}\) By definition, \(y_i \vert \beta , z_i \sim \textsf{Multinomial}_p(\beta _{s})\), where s is the index of the only component of the vector \(z_i\) that is equal to 1. It follows that:
and that:
given (14), since the term \({\textsf{E}}_q \left[ \log \beta _{j\ell } \right]\) is a function of the random variable \(z_i\) through the index s. We now observe that the variational distribution of \(\beta _j\) can be written as:
which is a multiparametric exponential family with:
-
\(\log \beta _{j\ell }\): minimal sufficient statistics for \(\ell =1,2,\ldots ,p\).
-
\(u_{j\ell } =\phi _{j\ell } -1\): natural (or canonical) parameters for \(\ell =1,2,\ldots ,p\).
By defining:
it is well known that (in what follows \(\phi _j - 1 \equiv u_j\) componentwise):
Putting everything together, we get the summand (16) of the ELBO. \(\square\)
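For reference, the well-known result invoked here is the standard exponential-family property that the expected sufficient statistics equal the gradient of the log-normalizer with respect to the natural parameters; for the Dirichlet variational factor above, written in the notation of this appendix with \(\Psi\) denoting the Digamma function, it reads:

\[
\textsf{E}_q \left[ \log \beta_{j\ell} \right] = \Psi(\phi_{j\ell}) - \Psi\left( \sum_{\ell'=1}^{p} \phi_{j\ell'} \right), \qquad \ell = 1,2,\ldots ,p.
\]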
\(\boxed {\text {A2}}\) Using the independence between the latent indicator variables \(z_i\) and \(\lambda\) under the variational distribution, and exploiting the representation of the variational distribution of \(\lambda\) as a multiparametric exponential family, we easily obtain the term (17):
\(\square\)
\(\boxed {\text {A3}}\) From \(\beta _j \vert \theta \sim {\textsf{Dirichlet}}_p(\mathbbm {1}_p\theta )\) it readily follows that:
and:
which is the expression in (18). \(\square\)
\(\boxed {\text {A4}}\) As in the previous point, from \(\lambda \vert \alpha \sim {\textsf {Dirichlet}}_k (\mathbbm {1}_k \alpha )\) we have:
from which (19) follows:
\(\square\)
If we consider the second addend of the ELBO, we have the following factorization:
that is:
Computing the expected value of \(\log q(\beta ,z,\lambda \vert \nu )\) with respect to the variational distribution q, and using simple algebra together with the representation of the Dirichlet distribution as a multiparametric exponential family introduced above, we find that the expected values with respect to q of \(\boxed{\text{B1}}, \boxed{\text{B2}}\) and \(\boxed{\text{B3}}\) correspond to (20), (21) and (22) up to the sign, respectively.
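To illustrate how these closed-form expectations can be evaluated in practice, the following R sketch computes, for a single component j, the two Dirichlet-related building blocks of the ELBO, namely \(\textsf{E}_q[\log p(\beta_j \vert \theta)]\) and the entropy of the variational factor \(q(\beta_j \vert \phi_j)\). The function and argument names are illustrative and do not reproduce the authors' code.

# Illustrative sketch (not the authors' code): closed-form ELBO contributions of a
# single Dirichlet variational factor q(beta_j) = Dirichlet_p(phi_j), using
# E_q[log beta_jl] = digamma(phi_jl) - digamma(sum(phi_j)).
elbo_dirichlet_terms <- function(phi_j, theta) {
  p <- length(phi_j)
  e_log_beta <- digamma(phi_j) - digamma(sum(phi_j))
  # E_q[log p(beta_j | theta)], with prior Dirichlet_p(1_p * theta)
  e_log_prior <- lgamma(p * theta) - p * lgamma(theta) + sum((theta - 1) * e_log_beta)
  # Entropy of the variational factor, i.e. -E_q[log q(beta_j | phi_j)]
  entropy_q <- -(lgamma(sum(phi_j)) - sum(lgamma(phi_j)) +
                   sum((phi_j - 1) * e_log_beta))
  c(e_log_prior = e_log_prior, entropy = entropy_q)
}

# Example with p = 4 vocabulary terms and symmetric hyperparameter theta = 0.5
elbo_dirichlet_terms(phi_j = c(2.1, 0.7, 5.3, 1.4), theta = 0.5)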
Appendix C Maximizing the ELBO
Since the ELBO is maximized with respect to each variational parameter in turn, holding all the others fixed, we first isolate the terms of the ELBO that depend on the parameter being updated, and then compute the maximum point.
\(\boxed {\gamma _{ij}}\) (\(i=1,2,\ldots ,n\), \(j=1,2,\ldots , k\)). This parameter appears in (16), (17), and (21). We isolate the terms containing \(\gamma _{ij}\) and add a Lagrange multiplier term to the objective function to enforce the constraint that these Multinomial parameters sum to 1 for each fixed i:
We take the partial derivative with respect to \(\gamma _{ij}\) and set it equal to zero:
from which we obtain:
that is:
which must be normalized to 1 for each fixed i according to (26). \(\square\)
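In practice this normalization is best carried out on the log scale, with the max-shift of the log-sum-exp trick (Blanchard et al. 2021) to avoid numerical overflow. The following R sketch shows the standard form of this responsibility update for a single document with term-count vector \(y_i\); the function and variable names are illustrative, and the expression inside the exponential is the usual CAVI form rather than a verbatim transcription of (26).

# Illustrative sketch of the standard CAVI responsibility update for one document.
# y_i: term-count vector of length p; eta: length-k variational Dirichlet parameters
# for the mixture weights; phi: k x p matrix of variational Dirichlet parameters
# for the topic-word distributions.
update_gamma_i <- function(y_i, eta, phi) {
  e_log_lambda <- digamma(eta) - digamma(sum(eta))
  e_log_beta   <- digamma(phi) - digamma(rowSums(phi))  # row j = E_q[log beta_j]
  log_gamma    <- e_log_lambda + as.vector(e_log_beta %*% y_i)
  w <- exp(log_gamma - max(log_gamma))                  # log-sum-exp max-shift
  w / sum(w)                                            # normalize to 1 over j
}

The max-shift leaves the normalized weights unchanged while keeping the largest exponent at zero, so no term overflows even for long documents.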
\(\boxed {\eta _{j}}\) (\(j=1,2,\ldots , k\)). Isolating \(\eta _j\), which appears in (17), (19) and (22), we have:
As above, taking the partial derivatives with respect to \(\eta _j\) and setting them to 0, we have:
which is equivalent to the following equation in \(\eta _j\):
For positive arguments, the Trigamma function \(\Psi'\) (the derivative of the Digamma function) is strictly positive, so \(\Psi'(\eta _j)\) and \(\Psi'\left( \sum _{j=1}^k \eta _j \right)\) cannot be simultaneously zero. Therefore, this equation admits a unique solution if and only if:
that is if and only if:
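In the standard mean-field treatment of a symmetric \(\textsf{Dirichlet}_k(\mathbbm{1}_k\alpha)\) prior on the mixture weights, the resulting fixed point takes the familiar closed form below; we state it in the notation of this appendix as the standard result, not as a verbatim transcription of the original display:

\[
\eta_j = \alpha + \sum_{i=1}^{n} \gamma_{ij}, \qquad j = 1,2,\ldots ,k.
\]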
\(\boxed {\phi _{j\ell }}\) (\(j=1,2,\ldots ,k\), \(\ell =1,2,\ldots , p\)). Isolating \(\phi _{j\ell }\) in (16), (18) and (20):
Taking the first derivative and setting it to 0:
which, as in the previous case, has a unique solution in \(\phi _{j\ell }\) given by:
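Stated in standard form (again in the notation of this appendix, as the usual closed-form mean-field update rather than the original display), this solution reads:

\[
\phi_{j\ell} = \theta + \sum_{i=1}^{n} \gamma_{ij}\, y_{i\ell}, \qquad j=1,2,\ldots ,k, \quad \ell =1,2,\ldots ,p,
\]
where \(y_{i\ell}\) denotes the count of term \(\ell\) in document \(i\).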
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bilancia, M., Di Nanni, M., Manca, F. et al. Variational Bayes estimation of hierarchical Dirichlet-multinomial mixtures for text clustering. Comput Stat 38, 2015–2051 (2023). https://doi.org/10.1007/s00180-023-01350-8