Abstract
This paper introduces a significant extension of the flexible Dirichlet (FD) distribution, a quite tractable special mixture model for compositional data, i.e. data representing vectors of proportions of a whole. The FD model displays several theoretical properties that make it suitable for inference and fairly easy to handle from a computational viewpoint. However, the rigid mixture structure implied by the FD makes it unsuitable for describing many compositional datasets, and the FD only allows for negative correlations. By considerably relaxing the strict constraints among clusters entailed by the FD, the new extended model allows for a more general dependence structure (including positive correlations) and greatly expands its applicative potential, while retaining its good properties to a large extent. EM-type estimation procedures, including ad hoc reliable initialization methods, can be developed for this more complex model, keeping the computational burden at a manageable level. Accurate evaluation of standard error estimates can be provided as well.
Funding
This study was partially funded by Università degli Studi di Milano-Bicocca (Grant No. FA 2018).
Appendix
1.1 Proof of Proposition 3
The conditional distribution function of \({\varvec{S}}_1\mid {\varvec{X}}_{2}={\varvec{x}}_{2}\) can be derived most easily by conditioning on \({\varvec{Z}}\):
Given that \({\varvec{X}}\mid {\varvec{Z}}={\varvec{e}}_i\sim \mathcal{D}({\dot{\varvec{\alpha }}}_i)\), by using well-known Dirichlet independence properties we have that:
Recalling that the Dirichlet distribution is closed under the operation of subcomposition, it follows that:
and
The probabilities \(P({\varvec{Z}}={\varvec{e}}_i\mid {\varvec{X}}_{2}={\varvec{x}}_{2})\) can be computed by Bayes' theorem. In particular, the distribution of \(({\varvec{X}}_{2},1-X_2^+)^\intercal | {\varvec{Z}}={\varvec{e}}_i\) can be obtained by resorting to the closure of the Dirichlet under marginalization; it takes the form
if \(i\le k\) and
if \(i> k\). From Bayes' formula, some algebraic manipulation shows that the probabilities \(P({\varvec{Z}}={\varvec{e}}_i\mid {\varvec{X}}_{2}={\varvec{x}}_{2})\) are proportional to the \(p_{i}^{'}\)’s provided by (14). Plugging all the computed quantities into (35) yields the result.
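The displayed equations of this proof are not reproduced here; as a minimal sketch of the first step (assuming only the standard finite-mixture argument, not the exact form of (35)), conditioning on the latent label \({\varvec{Z}}\) gives
\[
f_{{\varvec{S}}_1\mid {\varvec{X}}_{2}}({\varvec{s}}_1\mid {\varvec{x}}_{2})
=\sum _{i=1}^{D} P({\varvec{Z}}={\varvec{e}}_i\mid {\varvec{X}}_{2}={\varvec{x}}_{2})\,
f_{{\varvec{S}}_1\mid {\varvec{X}}_{2},\,{\varvec{Z}}={\varvec{e}}_i}({\varvec{s}}_1\mid {\varvec{x}}_{2}),
\]
where each conditional density on the right-hand side is Dirichlet by the closure properties recalled above, and the mixing weights are the posterior probabilities computed via Bayes' theorem.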
1.2 Proof of Proposition 5
It is obvious that if \(\varvec{\theta }=\varvec{\theta }^\prime \), then \(\mathbf{X } \sim \mathbf{X }^\prime \). In order to show the converse, one can focus on the marginal distribution of \(X_i\). By virtue of Proposition 3, we can write its density function \(g(x_i; \varvec{\theta })\) as:
If \(\mathbf{X } \sim \mathbf{X }^\prime \), then \(X_i \sim X_i^\prime \) and therefore \(g(x_i; \varvec{\theta }) = g(x_i; \varvec{\theta }^\prime )\) \(\forall x_i \in (0, 1)\), as these density functions are continuous. It follows that \(\displaystyle \lim \limits _{x_i \rightarrow 0^+} \frac{g(x_i; \varvec{\theta })}{x_i^{\alpha _i - 1}} = \lim \limits _{x_i \rightarrow 0^+} \frac{g(x_i; \varvec{\theta }^\prime )}{x_i^{\alpha _i - 1}}\). We have:
and
In order for these two limits to be equal, the quantity \(\displaystyle \left( \lim _{x_i \rightarrow 0^+} \frac{x_i^{\alpha _i^\prime - 1}}{x_i^{\alpha _i - 1}}\right) \) must be finite and different from 0. Since this limit equals 0 when \(\alpha _i^\prime > \alpha _i\) and diverges when \(\alpha _i^\prime < \alpha _i\), it follows that \(\alpha _i = \alpha _i^\prime \); repeating the argument for each \(i\) shows that \(\varvec{\alpha }=\varvec{\alpha }^\prime \). As a consequence, the equality \(g(x_i; \varvec{\theta }) = g(x_i; \varvec{\theta }^\prime )\) can be rewritten as:
By taking the limits as \(x_i \rightarrow 1^-\) on both sides, one obtains:
Equation (38) implies that \(p_i\) and \( p_i^\prime \) are either both null or both strictly positive. In the former case, because of the parameter space definition, \( \tau _i=\tau _i^\prime =1\). In the latter case, plugging (38) into equality (37) and differentiating both sides, the following equality must hold \(\forall x_i \in (0, 1)\):
Taking the limits as \(x_i \rightarrow 1^-\) on both sides, we have:
It follows that \(\tau _i=\tau _i^\prime \) for any \(i\) such that \(p_i>0\), and hence for all \(i\). Finally, substituting this constraint into (38), one concludes that \(\mathbf{p }= \mathbf{p }^\prime \).
1.3 Proof of Proposition 8
Recall that \(\mathbf{X }| Y^+ = y^+ \sim EFD(\varvec{\alpha },\mathbf{p }^*(y^+),\varvec{\tau }, \beta )\), where \(\mathbf{p }^*(y^+)\) is defined as in (23). Then, if \(\tau _i=\tau \) \(\forall i\), it can be seen immediately that the \(p_i^*(y^+)\)'s do not depend on \(y^+\) (and coincide with the \(p_i\)'s). Conversely, if the basis is compositionally invariant, then \(p_i^*(y^+)\) does not depend on \(y^+\), and therefore neither does the ratio \(p_i^*(y^+)/p_l^*(y^+)\) \(\forall i\ne l\). Since this ratio is proportional to \({(y^+)}^{\tau _i-\tau _l}\), it follows that \(\tau _i=\tau _l\) \(\forall i\ne l\).
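The last implication can be spelled out as a routine verification (the constant \(c_{il}\) below is only the proportionality factor implied by the statement above and is not part of the original display): if
\[
\frac{p_i^*(y^+)}{p_l^*(y^+)}=c_{il}\,(y^+)^{\tau _i-\tau _l}
\]
is constant over the support of \(Y^+\), then taking logarithms shows that \((\tau _i-\tau _l)\log y^+\) is constant in \(y^+\), which is possible only if \(\tau _i=\tau _l\).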
1.4 Partial derivatives
In this section we show the partial derivatives of the complete-data log-likelihood (25). In particular, for \(i=1, \ldots ,D\), the first-order partial derivatives are:
where \(z_{\cdot i}=\sum _{j=1}^n z_{ji}\).
The second-order partial derivatives are:
where \(\mathbb {1}_{i=h}\) is the indicator function that is equal to 1 if \(i = h\) and 0 otherwise.
where \(\psi ^\prime (\cdot )\) is the trigamma function.
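Although the displayed derivatives are not reproduced here, their building blocks are the digamma and trigamma terms that arise whenever a Dirichlet-type log-density is differentiated with respect to its shape parameters. The following sketch is only illustrative: it computes the score and Hessian of a single Dirichlet log-density, not the complete-data log-likelihood (25), in which each such term is weighted by \(z_{ji}\) and summed over observations; the helper names and the toy values of alpha and x are hypothetical.

# Illustrative sketch (not Eq. (25)): score and Hessian of one Dirichlet
# log-density with respect to its shape parameters, showing the
# digamma/trigamma structure mentioned above.
import numpy as np
from scipy.special import digamma, polygamma, gammaln

def dirichlet_loglik(alpha, x):
    # log f(x; alpha) = log Gamma(sum alpha) - sum log Gamma(alpha_j) + sum (alpha_j - 1) log x_j
    return gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1.0) * np.log(x)).sum()

def dirichlet_score(alpha, x):
    # d/d alpha_i = psi(sum alpha) - psi(alpha_i) + log x_i
    return digamma(alpha.sum()) - digamma(alpha) + np.log(x)

def dirichlet_hessian(alpha):
    # d2/(d alpha_i d alpha_h) = psi'(sum alpha) - 1{i = h} psi'(alpha_i)
    trigamma = lambda a: polygamma(1, a)
    D = alpha.size
    return trigamma(alpha.sum()) * np.ones((D, D)) - np.diag(trigamma(alpha))

alpha = np.array([2.0, 3.0, 1.5])   # hypothetical shape parameters
x = np.array([0.2, 0.5, 0.3])       # a toy composition
print(dirichlet_loglik(alpha, x))
print(dirichlet_score(alpha, x))
print(dirichlet_hessian(alpha))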
1.5 Results of the univariate case of the olive oil dataset
In this section we report the values of the AIC and BIC criteria for the models under consideration (Table 7), together with the fitted density curves (Fig. 9).
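For completeness, the two criteria reported in Table 7 are the usual penalized-likelihood quantities; the sketch below shows their standard computation from a maximized log-likelihood (generic formulas only, with purely hypothetical numeric values, not tied to any model fitted above).

import math

def aic(loglik, n_params):
    # Akaike information criterion: 2k - 2 log L
    return 2.0 * n_params - 2.0 * loglik

def bic(loglik, n_params, n_obs):
    # Bayesian information criterion: k log n - 2 log L
    return n_params * math.log(n_obs) - 2.0 * loglik

# Hypothetical values for illustration only
print(aic(loglik=-1234.5, n_params=7))
print(bic(loglik=-1234.5, n_params=7, n_obs=100))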