Abstract
Finite Mixture Regression (FMR) refers to the mixture modeling scheme that learns multiple regression models from the training data set, each of which is responsible for a subset of the samples. FMR is an effective scheme for handling sample heterogeneity, where a single regression model is not enough to capture the complexities of the conditional distribution of the observed samples given the features. In this paper, we propose an FMR model that (1) finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously, (2) achieves shared feature selection among tasks and cluster components, and (3) detects anomaly tasks or clustered structure among tasks, and accommodates outlier samples. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The proposed model is evaluated on both synthetic and real-world data sets. The results show that our model can achieve state-of-the-art performance.
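As a point of reference, a minimal sketch of the generic (single-target, complete-data) FMR specification is given below; the model proposed in this paper extends this form to multiple incomplete mixed-type targets, shared feature selection, and outlier accommodation. The sketch uses an exponential dispersion family density \(f\) (Jorgensen 1987), which covers Gaussian, Bernoulli, and Poisson responses, among others.
$$\begin{aligned} p(y\mid \mathbf {x};\theta ) = \sum _{r=1}^{k}\pi _r\, f\bigl (y;\,\mathbf {x}^{\mathrm{T}}\varvec{\beta }_r,\,\phi _r\bigr ),\qquad \pi _r\ge 0,\quad \sum _{r=1}^{k}\pi _r=1, \end{aligned}$$
where each component r has its own regression coefficients \(\varvec{\beta }_r\) and dispersion \(\phi _r\) and is responsible for the subset of samples it explains best (McLachlan and Peel 2004).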
References
Aho K, Derryberry D, Peterson T (2014) Model selection for ecologists: the worldviews of AIC and BIC. Ecology 95(3):631–636
Alfò M, Salvati N, Ranalli MG (2016) Finite mixtures of quantile and M-quantile regression models. Stat Comput 27:1–24
Argyriou A, Evgeniou T, Pontil M (2007a) Multi-task feature learning. In: Advances in neural information processing systems, pp 41–48
Argyriou A, Pontil M, Ying Y, Micchelli CA (2007b) A spectral regularization framework for multi-task structure learning. In: Advances in neural information processing systems, pp 25–32
Bai X, Chen K, Yao W (2016) Mixture of linear mixed models using multivariate t distribution. J Stat Comput Simul 86(4):771–787
Bartolucci F, Scaccia L (2005) The use of mixtures for dealing with non-normal regression errors. Comput Stat Data Anal 48(4):821–834
Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141–148
Becker SR, Candès EJ, Grant MC (2011) Templates for convex cone problems with applications to sparse signal recovery. Math Program Comput 3(3):165–218
Bhat HS, Kumar N (2010) On the derivation of the Bayesian information criterion. School of Natural Sciences, University of California, Oakland
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends® Mach Learn 3(1):1–122
Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717–772
Chen X, Kim S, Lin Q, Carbonell JG, Xing EP (2010) Graph-structured multi-task regression and an efficient optimization method for general fused lasso. ArXiv preprint arXiv:1005.3579
Chen J, Zhou J, Ye J (2011) Integrating low-rank and group-sparse structures for robust multi-task learning. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 42–50
Chen J, Liu J, Ye J (2012a) Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Trans Knowl Discov Data (TKDD) 5(4):22
Chen K, Chan KS, Stenseth NC (2012b) Reduced rank stochastic regression with a sparse singular value decomposition. J R Stat Soc Ser B (Stat Methodol) 74(2):203–221
Cover TM, Thomas JA (2012) Elements of information theory. Wiley, Hoboken
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39:1–38
Doğru FZ, Arslan O (2016) Robust mixture regression using mixture of different distributions. In: Agostinelli C, Basu A, Filzmoser P, Mukherjee D (eds) Recent advances in robust statistics: theory and applications. Springer, New Delhi, pp 57–79
Doğru FZ, Arslan O (2017) Parameter estimation for mixtures of skew Laplace normal distributions and application in mixture regression modeling. Commun Stat Theory Methods 46(21):10,879–10,896
Fahrmeir L, Kneib T, Lang S, Marx B (2013) Regression: models, methods and applications. Springer, Berlin
Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20(1):101–148
Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 186–193
Gong P, Ye J, Zhang C (2012a) Robust multi-task feature learning. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 895–903
Gong P, Ye J, Zhang C (2012b) Multi-stage multi-task feature learning. In: Advances in neural information processing systems, pp 1988–1996
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
He J, Lawrence R (2011) A graph-based framework for multi-task multi-view learning. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 25–32
Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481–499
Jacob L, Vert J, Bach FR (2009) Clustered multi-task learning: a convex formulation. In: Advances in neural information processing systems, pp 745–752
Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3(1):79–87
Jalali A, Sanghavi S, Ruan C, Ravikumar PK (2010) A dirty model for multi-task learning. In: Advances in neural information processing systems, pp 964–972
Ji S, Ye J (2009) An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 457–464
Jin R, Goswami A, Agrawal G (2006) Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10(1):17–40
Jin X, Zhuang F, Pan SJ, Du C, Luo P, He Q (2015) Heterogeneous multi-task semantic feature learning for classification. In: Proceedings of the 24th ACM international on conference on information and knowledge management. ACM, pp 1847–1850
Jorgensen B (1987) Exponential dispersion models. J R Stat Soc Ser B (Methodol) 49:127–162
Khalili A (2011) An overview of the new feature selection methods in finite mixture of regression models. J Iran Stat Soc 10(2):201–235
Khalili A, Chen J (2007) Variable selection in finite mixture of regression models. J Am Stat Assoc 102(479):1025–1038
Koller D (1996) Toward optimal feature selection. In: Proceedings of the 13th international conference on machine learning, pp 284–292
Kubat M (2015) An introduction to machine learning. Springer, Berlin
Kumar A, Daumé III H (2012) Learning task grouping and overlap in multi-task learning. In: Proceedings of the 29th international conference on machine learning. Omnipress, pp 1723–1730
Law MH, Jain AK, Figueiredo M (2002) Feature selection in mixture-based clustering. In: Advances in neural information processing systems, pp 625–632
Li S, Liu ZQ, Chan AB (2014) Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 482–489
Lim H, Narisetty NN, Cheon S (2016) Robust multivariate mixture regression models with incomplete data. J Stat Comput Simul 87:1–20
Liu J, Ji S, Ye J (2009) Multi-task feature learning via efficient \(\ell _{2,1}\)-norm minimization. In: Proceedings of the 25th conference on uncertainty in artificial intelligence. AUAI Press, pp 339–348
McLachlan G, Peel D (2004) Finite mixture models. Wiley, Hoboken
Neal RM, Hinton GE (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in graphical models. Springer, Dordrecht, pp 355–368
Nelder JA, Baker RJ (1972) Generalized linear models. Encyclopedia of statistical sciences. Wiley, Hoboken
Nesterov Y et al (2007) Gradient methods for minimizing composite objective function. Technical report, UCL
Passos A, Rai P, Wainer J, Daumé III H (2012) Flexible modeling of latent task structures in multitask learning. In: Proceedings of the 29th international conference on machine learning. Omnipress, pp 1283–1290
Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319
She Y, Chen K (2017) Robust reduced-rank regression. Biometrika 104(3):633–647
She Y, Owen AB (2011) Outlier detection using nonconvex penalized regression. J Am Stat Assoc 106(494):626–639
Städler N, Bühlmann P, Van De Geer S (2010) \(\ell _1\)-penalization for mixture regression models. Test 19(2):209–256
Strehl A, Ghosh J (2002a) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3(Dec):583–617
Strehl A, Ghosh J (2002b) Cluster ensembles: a knowledge reuse framework for combining partitionings. In: 18th national conference on artificial intelligence. American Association for Artificial Intelligence, pp 93–98
Tan Z, Kaddoum R, Wang LY, Wang H (2010) Decision-oriented multi-outcome modeling for anesthesia patients. Open Biomed Eng J 4:113
Van de Geer SA (2000) Applications of empirical process theory, vol 91. Cambridge University Press, Cambridge
Van Der Maaten L, Postma E, Van den Herik J (2009) Dimensionality reduction: a comparative review. J Mach Learn Res 10:66–71
Van Der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes: with applications to statistics. Springer, Berlin
Wang HX, Zhang QB, Luo B, Wei S (2004) Robust mixture modelling using multivariate t-distribution with missing information. Pattern Recognit Lett 25(6):701–710
Wedel M, DeSarbo WS (1995) A mixture likelihood approach for generalized linear models. J Classif 12(1):21–55
Weruaga L, Vía J (2015) Sparse multivariate Gaussian mixture regression. IEEE Trans Neural Netw Learn Syst 26(5):1098–1108
Yang X, Kim S, Xing EP (2009) Heterogeneous multitask learning with joint sparsity constraints. In: Advances in neural information processing systems, pp 2151–2159
Yuksel SE, Wilson JN, Gader PD (2012) Twenty years of mixture of experts. IEEE Trans Neural Netw Learn Syst 23(8):1177–1193
Zhang D, Shen D, Initiative ADN et al (2012) Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59(2):895–907
Zhang Y, Yeung DY (2011) Multi-task learning in heterogeneous feature spaces. In: Proceedings of the 25th AAAI conference on artificial intelligence (AAAI-11), San Francisco, CA, p 574
Zhou J, Chen J, Ye J (2011) Clustered multi-task learning via alternating structure optimization. In: Advances in neural information processing systems, pp 702–710
Acknowledgements
The authors would like to thank the editors and reviewers for their valuable suggestions on improving this paper. The work of Jian Liang and Changshui Zhang is (jointly or partly) funded by the National Natural Science Foundation of China under Grant No. 61473167 and the Beijing Natural Science Foundation under Grant No. L172037. Kun Chen’s work is partially supported by the U.S. National Science Foundation under Grants DMS-1613295 and IIS-1718798. The work of Fei Wang is supported by the National Science Foundation under Grants IIS-1650723 and IIS-1716432.
Additional information
Responsible editor: Pauli Miettinen.
Appendices
Appendix A: Definitions
Definition 1
\(Z = (Z_1,\ldots ,Z_{m'})^{\mathrm{T}}\in \mathbb {R}^{m'} \) has a sub-exponential distribution with parameters \((\sigma ,v,t)\) if for \(M>t\), it holds
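A standard two-regime formulation of this tail condition, consistent with the bounds used in the proof of Lemma 1 in the “Appendix E” section, is the following sketch (the exact norm and constants in Definition 1 may differ):
$$\begin{aligned} \mathbb {P}\bigl (\Vert Z\Vert> M\bigr )\le \exp \biggl (-\frac{M^2}{\sigma ^2}\biggr )\ \ \text {for } t< M\le \frac{\sigma ^2}{v},\qquad \mathbb {P}\bigl (\Vert Z\Vert > M\bigr )\le \exp \biggl (-\frac{M}{v}\biggr )\ \ \text {for } M> \frac{\sigma ^2}{v}. \end{aligned}$$
The two regimes coincide at \(M=\sigma ^2/v\), giving a sub-Gaussian tail for moderate deviations and an exponential tail for large deviations.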
Appendix B: The empirical process
In order to prove the first part of Theorem 1, i.e., that the bound in (26) holds with the probability in (25), we first follow Städler et al. (2010) to define the empirical process for fixed data points \(\mathbf {x}_1,\ldots ,\mathbf {x}_n\). For \(\tilde{\mathbf {y}}_i = (y_{ij}, j\in {\varOmega }_i)^{\mathrm{T}}\in \mathbb {R}^{|{\varOmega }_i|}\) and \(X = (X_1,\ldots ,X_d)\), let
Fixing some \(T\ge 1\) and \(\lambda _0\ge 0\), we define an event \(\mathcal {T}\) below, on which the bound in (26) can be proved; the probability of the event \(\mathcal {T}\) is thus the probability in (25).
It can be seen that (21) defines a set of parameters \(\theta \), and the bound in (26) will be proved for \(\hat{\theta }\) in this set.
For the group-lasso type estimator, we define an event similar to that in (21) as follows.
Appendix C: Lemmas
In order to show that the probability of event \(\mathcal {T}\) is large, we first invoke the following lemma.
Lemma 2
Under Condition 2, for model (1) with \(\theta _0 \in \tilde{{\varTheta }}\), \(M_n\) and \(\lambda _0\) defined in (24), some constants \(c_6,c_7\) depending on K, and for \(n\ge c_7\), we have
where \(\mathbb {P}_{\mathbf {X}}\) denotes the conditional probability given \((X_1^{\mathrm{T}},\ldots ,X_n^{\mathrm{T}})^{\mathrm{T}}=(\mathbf {x}_1^{\mathrm{T}},\ldots ,\mathbf {x}_n^{\mathrm{T}})^{\mathrm{T}}= \mathbf {X}\), and \(F(\tilde{\mathbf {y}}_i) = G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\} + \mathbb {E}[G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\}\mid X=\mathbf {x}_i],\forall i\).
A proof is given in “Appendix F” section.
Then we can follow Corollary 1 in Städler et al. (2010) to show below that the probability of event \(\mathcal {T}\) is large.
Lemma 3
Using Lemma 2, for model (1) with \(\theta _0 \in \tilde{{\varTheta }}\), for some constants \(c_7,c_8,c_9,c_{10}\) depending on K, with \(\mathcal {T}\) defined in (21), and for all \(T\ge c_{10}\), we have
A proof is given in “Appendix G” section.
Appendix D: Corollaries for models considering outlier samples
When considering outlier samples and modifying the natural parameter model as in (11), we can establish similar results, as shown in this section.
First, as \(\varvec{\beta }\) and \(\varvec{\zeta }\) are treated in a similar way, we denote them together by \(\varvec{\xi }\in \mathbb {R}^{((d+n)\times m)\times k}\), with \(\xi = vec(\varvec{\xi }) \in \mathbb {R}^{(d+n)mk}\), such that for all \(r = 1,\ldots ,k\),
where \(\mathbf {I}_{n}\in \mathbb {R}^{n\times n}\) is the identity matrix.
Thus the modification only results in a new design matrix and a new regression coefficient matrix; therefore, we can apply Theorems 1–3 to obtain similar results for the modified models.
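Concretely, a minimal sketch of this augmentation in our notation (assuming the mean-shift form implied by (11)): the design matrix gains the identity block \(\mathbf {I}_n\), and the per-sample outlier shifts are absorbed into the coefficient matrix,
$$\begin{aligned} \tilde{\mathbf {X}} = \bigl [\,\mathbf {X}\ \ \mathbf {I}_{n}\,\bigr ]\in \mathbb {R}^{n\times (d+n)},\qquad \varvec{\xi }_r = \bigl (\varvec{\beta }_r^{\mathrm{T}},\ \varvec{\zeta }_r^{\mathrm{T}}\bigr )^{\mathrm{T}}\in \mathbb {R}^{(d+n)\times m},\qquad \tilde{\mathbf {X}}\varvec{\xi }_r = \mathbf {X}\varvec{\beta }_r + \varvec{\zeta }_r, \end{aligned}$$
so the natural parameters of component r are computed from the augmented design \(\tilde{\mathbf {X}}\) exactly as in the original model.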
For lasso-type penalties, denote the set of indices of non-zero entries of \(\beta _0\) by \(S_{\beta }\), and the set of indices of non-zero entries of \(\zeta _0\) by \(S_{\zeta }\), where \(\zeta = \text{ vec }(\varvec{\zeta }_1,\ldots ,\varvec{\zeta }_k)\). Denote by \(s = |S_{\beta }| + |S_{\zeta }|\). Then for entry-wise \(\ell _1\) penalties in (5) (for \(\varvec{\beta }\)) with \(\gamma = 0\) and \(\mathcal {R}(\varvec{\zeta }) = \lambda \Vert \zeta \Vert _1\) (for \(\varvec{\zeta }\)), we need the following modified restricted eigenvalue condition.
Condition 6
For all \( \beta \in \mathbb {R}^{dmk}\) and all \( \zeta \in \mathbb {R}^{nmk}\) satisfying \(\Vert \beta _{S_{\beta }^c}\Vert _1 + \Vert \zeta _{S_{\zeta }^c}\Vert _1 \le 6(\Vert \beta _{S_{\beta }}\Vert _1+\Vert \zeta _{S_{\zeta }}\Vert _1)\), it holds for some constant \(\kappa \ge 1\) that,
Corollary 1
Consider the Hermit model in (1) with \(\theta _0\in \tilde{{\varTheta }}\), and consider the penalized estimator (12) with the \(\ell _1\) penalties in (5) and \(\mathcal {R}(\varvec{\zeta }) = \lambda \Vert \zeta \Vert _1\).
(a)
Assume Conditions 1–3 and 6 hold. Suppose \(\sqrt{mk} \lesssim n/M_n\), and take \(\lambda > 2T\lambda _0\) for some constant \(T>1\). For some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{(\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have
$$\begin{aligned} \bar{\varepsilon }(\hat{\theta }\mid \theta _0) + 2(\lambda -T\lambda _0) \left( \Vert \hat{\beta }_{S_{\beta }^c}\Vert _1 + \Vert \hat{\zeta }_{S_{\zeta }^c}\Vert _1\right) \le 4(\lambda +T\lambda _0)^2\kappa ^2 c_0^2s. \end{aligned}$$
(b)
Assume Conditions 1–3 hold (without Condition 6), and assume
$$\begin{aligned} \Vert \beta _0\Vert _1 + \Vert \zeta _0\Vert _1&= o\left( \sqrt{n/((\log n)^{2+2c_1} \log (d\vee n)mk)}\right) ,\\ \sqrt{mk}&= o\left( \sqrt{n/((\log n)^{2+2c_1}\log (d\vee n))}\right) \end{aligned}$$ as \(n\rightarrow \infty \). If \(\lambda = C\sqrt{(\log n)^{2+2c_1}\log (d\vee n)mk/n}\) for some \(C>0\) sufficiently large, then for some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{ (\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have \(\bar{\varepsilon }(\hat{\theta }\mid \theta _0) = o_P(1)\).
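To interpret part (a): both terms on the left-hand side of the bound are non-negative, so each is controlled by the right-hand side separately; taking \(\lambda \asymp T\lambda _0\) (say \(\lambda = 3T\lambda _0\)) yields the familiar lasso-type oracle rates
$$\begin{aligned} \bar{\varepsilon }(\hat{\theta }\mid \theta _0) \le 4(\lambda +T\lambda _0)^2\kappa ^2 c_0^2 s \lesssim \lambda ^2 s,\qquad \Vert \hat{\beta }_{S_{\beta }^c}\Vert _1 + \Vert \hat{\zeta }_{S_{\zeta }^c}\Vert _1 \le \frac{2(\lambda +T\lambda _0)^2\kappa ^2 c_0^2 s}{\lambda -T\lambda _0}\lesssim \lambda s, \end{aligned}$$
i.e., the average excess risk scales with \(\lambda ^2 s\) and the off-support coefficient mass with \(\lambda s\).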
For group-lasso type penalties, denote
where \(\varvec{\beta }_{0,\mathcal {G}_{\beta ,p}}\) and \(\varvec{\zeta }_{0,\mathcal {G}_{\zeta ,q}}\) denote the pth group of \(\varvec{\beta }_0\) and the qth group of \(\varvec{\zeta }_0\), respectively. Now denote \(s = |\mathcal {I}_{\beta }| + |\mathcal {I}_{\zeta }|\) with some abuse of notation.
Then for group \(\ell _1\) penalties in (27) (for \(\varvec{\beta }\)) and \(\mathcal {R}(\varvec{\zeta }) = \sum _q^Q\Vert \varvec{\zeta }_{\mathcal {G}_{\zeta ,q}}\Vert _F\) (for \(\varvec{\zeta }\)), we need the following modified restricted eigenvalue condition.
Condition 7
For all \( \varvec{\beta }\in \mathbb {R}^{d\times mk}\) and all \( \varvec{\zeta }\in \mathbb {R}^{n\times mk}\) satisfying
it holds that for some constant \(\kappa \ge 1\),
Corollary 2
Consider the Hermit model in (1) with \(\theta _0\in \tilde{{\varTheta }}\), and consider estimator (12) with the group \(\ell _1\) penalties in (27) and \(\mathcal {R}(\varvec{\zeta }) = \sum _q^Q\Vert \varvec{\zeta }_{\mathcal {G}_{\zeta ,q}}\Vert _F\).
(a)
Assume Conditions 1–3 and 7 hold. Suppose \(\sqrt{mk} \lesssim n/M_n\), and take \(\lambda > 2T\lambda _0\) for some constant \(T>1\). For some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{(\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have
$$\begin{aligned} \bar{\varepsilon }(\hat{\theta }\mid \theta _0) + 2(\lambda -T\lambda _0)\biggl (\sum _{p\in \mathcal {I}_{\beta }^c}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_{\beta ,p}}\Vert _F+\sum _{q\in \mathcal {I}_{\zeta }^c}\Vert \hat{\varvec{\zeta }}_{\mathcal {G}_{\zeta ,q}}\Vert _F\biggr ) \le 4(\lambda +T\lambda _0)^2\kappa ^2 c_0^2s. \end{aligned}$$
(b)
Assume Conditions 1–3 hold (without Condition 7), and assume
$$\begin{aligned} \sum _{p=1}^P\Vert \varvec{\beta }_{0,\mathcal {G}_{\beta ,p}}\Vert _F + \sum _{q=1}^Q\Vert \varvec{\zeta }_{0,\mathcal {G}_{\zeta ,q}}\Vert _F&= o\left( \sqrt{n/((\log n)^{2+2c_1}\log (d\vee n)mk)}\right) ,\\ \sqrt{mk}&= o\left( \sqrt{n/((\log n)^{2+2c_1}\log (d\vee n))}\right) \end{aligned}$$ as \(n\rightarrow \infty \). If \(\lambda = C\sqrt{(\log n)^{2+2c_1}\log (d\vee n)mk/n}\) for some \(C>0\) sufficiently large, then for some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{ (\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have \(\bar{\varepsilon }(\hat{\theta }\mid \theta _0) = o_P(1)\).
Appendix E: Proof of Lemma 1
Proof
For a non-negative continuous variable X, we have
$$\begin{aligned} \mathbb {E}[X1\{X>M\}] = M\mathbb {P}(X>M) + \int _M^{\infty }\mathbb {P}(X>x)dx. \end{aligned}$$
Similarly, we have \(\mathbb {E}[X^21\{X>M\}] = M^2\mathbb {P}(X>M) + \int _M^{\infty }2x\mathbb {P}(X>x)dx\).
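Both identities follow from the layer-cake (tail integral) representation; for instance, for the first one,
$$\begin{aligned} \mathbb {E}[X1\{X>M\}] = \int _0^{\infty }\mathbb {P}\bigl (X1\{X>M\}>x\bigr )dx = \int _0^{M}\mathbb {P}(X>M)dx + \int _M^{\infty }\mathbb {P}(X>x)dx, \end{aligned}$$
and the second follows in the same way from \(\mathbb {E}[X^21\{X>M\}] = \int _0^{\infty }2x\,\mathbb {P}\bigl (X1\{X>M\}>x\bigr )dx\).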
For X sub-exponential with parameters \((\sigma ,v ,t) \) such that for \(M>t \)
we have the following.
If \(M\le \frac{\sigma ^2}{v} \), we have
and similarly, \(\mathbb {E}[X^21\{X>M\}] \le \biggl (M^2+ 2v^2+2\sigma ^2\biggr )\exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ).\)
If \(M> \frac{\sigma ^2}{v} \), we have \(\mathbb {E}[X1\{X>M\}] \le (M+v)\exp \biggl (-\frac{M }{v }\biggr )\) and \(\mathbb {E}[X^21\{X>M\}] \le (M^2+2v^2+2vM)\exp \biggl (-\frac{M }{v }\biggr )\).
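As a quick check of the first bound in this regime, combining the tail bound \(\mathbb {P}(X>x)\le \exp (-x/v)\) for \(x\ge M>\sigma ^2/v\) (a sketch of the sub-exponential condition) with the identity above gives
$$\begin{aligned} \mathbb {E}[X1\{X>M\}] \le M\exp \biggl (-\frac{M}{v}\biggr ) + \int _M^{\infty }\exp \biggl (-\frac{x}{v}\biggr )dx = (M+v)\exp \biggl (-\frac{M}{v}\biggr ). \end{aligned}$$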
Then for some constants \(c_1,c_2,c_3,c_4,c_5>0\), for non-negative continuous variable X which is sub-exponential with parameters \((\sigma ,v,t)\), for \(M>c_4>t\) and \(c' = 2+\frac{3}{c_1}\), we have
If \(t \le M\le \frac{\sigma ^2}{v}\), \(c_1 =1/2, c_2 = \sqrt{2}\sigma , c_3 = 16\sigma ^8\). And if \(M\ge \frac{\sigma ^2}{v}\), \(c_1 = 1,c_2 = 2v,c_3 = 32v^5\). And \(c_5 = \sqrt{2}(v + \sigma )\).
For non-negative discrete variables, the result is the same.
The result of Lemma 1 follows from the result above, the fact that \(\tilde{\mathbf {y}}_i\) has a finite mixture distribution for \(i=1,\ldots ,n\), and the following.
When the dispersion parameter \(\phi \) is known, for a constant \(c_K\) depending on K, we have
\(\square \)
Appendix F: Proof of Lemma 2
Proof
Under Condition 2, with \(M_n = c_2(\log n)^{c_1}\) and \(\lambda _0\) defined in (24), for a constant \(c_6\) depending on K and for \(i=1,\ldots ,n\), we have
Then we can get
\(\square \)
Appendix G: Proof of Lemma 3
Proof
We follow Städler et al. (2010) to give an Entropy Lemma and then prove Lemma 3.
We use the norm \(\Vert \cdot \Vert _{P_n}\) introduced in the proof of Lemma 2 in Städler et al. (2010), and use \(H(\cdot ,\mathcal {H},\Vert \cdot \Vert _{P_n})\) for the entropy of the covering number [see Van de Geer (2000)] of a collection \(\mathcal {H}\) of functions on \(\mathcal {X}\times \mathcal {Y}\), equipped with the metric induced by this norm,
Define \(\tilde{{\varTheta }}(\epsilon ) = \{\theta \in \tilde{{\varTheta }}: \Vert \varvec{\beta }-\varvec{\beta }_0\Vert _1 + \Vert \eta - \eta _0\Vert _2 \le \epsilon \}\).
Lemma 4
(Entropy Lemma) For a constant \(c_{12}>0\), for all \(u>0\) and \(M_n>0\), we have
Proof
(of the Entropy Lemma) The difference between this proof and that of the Entropy Lemma in the proof of Lemma 2 of Städler et al. (2010) lies in the notation and the effect of multivariate responses.
For multivariate responses we have for \(i=1,\ldots ,n\),
where \(d_{\psi } = (2m+1)k\) is the maximum dimension of \(\psi _i\) over \(i=1,\ldots ,n\).
Under the definition of the norm \(\Vert \cdot \Vert _{P_n}\) we have
Then by the result of Städler et al. (2010) we have
where \(d_{\eta } = (m+1)k\) is the dimension of \(\eta \).
We then follow Städler et al. (2010) and apply Lemma 2.6.11 of Van Der Vaart and Wellner (1996) to obtain the bound
Thus we can get
\(\square \)
Now we turn to prove Lemma 3.
We follow Städler et al. (2010) to use the truncated version of the empirical process below.
We follow Städler et al. (2010) and apply Lemma 3.2 in Van de Geer (2000) and a conditional version of Lemma 3.3 in Van de Geer (2000) to the class
For some constants \(\{c_{t}\}_{t>12}\) depending on K and \({\varLambda }_{\max }\) in Condition 2 of Städler et al. (2010), using the notation of Lemma 3.2 in Van de Geer (2000), we follow Städler et al. (2010) to choose \(\delta = c_{13} T\epsilon \lambda _0\) and \(R = c_{14}(\sqrt{mk}\epsilon \wedge 1)M_n\).
Thus, by choosing \(M_n = c_2(\log n)^{c_1}\), we can satisfy the condition of Lemma 3.2 of Van de Geer (2000) and obtain
For the rest, we can apply Lemma 3.2 of Van de Geer (2000) to obtain the same result as Lemma 2 of Städler et al. (2010).
So we have
with probability at least \(1 - c_{9}\exp \biggl [- \frac{T^2(\log n)^2\log (d\vee n) }{c_{8}^2}\biggr ]\).
At last, for the case when \(G_1(\tilde{\mathbf {y}}_i)>M_n\), for \(i=1,\ldots ,n\), we have
and
Then the probability of the following inequality under our model is given in Lemma 2.
where \(d_{\psi } = 2(m+1)k\). \(\square \)
Appendix H: Proof of Theorem 1
Proof
This proof mostly follows that of Theorem 3 of Städler et al. (2010); the only difference is in the notation. As such, we omit the details. \(\square \)
Appendix I: Proof of Theorem 2
Proof
This proof also mostly follows that of Theorem 5 of Städler et al. (2010). The differences lie in the notation and the choice of \(M_n\).
If the event \(\mathcal {T}\) happens, with \(M_n = c_2(\log n)^{c_1}\) for some constants \(0\le c_1,c_2<\infty \), where \(c_2\) depends on K,
we have
Under the definition of \(\theta \in \tilde{{\varTheta }}\) in (23) we have \(\Vert \eta -\eta _0\Vert _2\le 2K\). And as \( \bar{\varepsilon }(\psi _0\mid \psi _0) =0\) we have for n sufficiently large.
As \(C>0\) is sufficiently large, we have \(\lambda \ge 2T\lambda _0\).
Using the conditions on \(\Vert \beta _0\Vert _1\) and \(\sqrt{mk}\), we have both \(2KT\lambda _0 = o(1)\) and \((\lambda +T\lambda _0)\Vert \beta _0\Vert _1 = o(1)\), so \(\bar{\varepsilon }(\hat{\psi }\mid \psi _0)\rightarrow 0\) as \(n\rightarrow \infty \).
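For concreteness, a sketch of the second claim under the rate \(\lambda = C\sqrt{(\log n)^{2+2c_1}\log (d\vee n)mk/n}\) and the condition \(\Vert \beta _0\Vert _1 = o\bigl (\sqrt{n/((\log n)^{2+2c_1}\log (d\vee n)mk)}\bigr )\) (cf. the analogous conditions in Corollary 1(b)), using \(T\lambda _0\le \lambda /2\):
$$\begin{aligned} (\lambda +T\lambda _0)\Vert \beta _0\Vert _1 \lesssim \sqrt{\frac{(\log n)^{2+2c_1}\log (d\vee n)mk}{n}}\cdot o\left( \sqrt{\frac{n}{(\log n)^{2+2c_1}\log (d\vee n)mk}}\right) = o(1). \end{aligned}$$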
At last, as the event \(\mathcal {T}\) has large probability, we have \(\bar{\varepsilon }(\hat{\theta }_{\lambda }\mid \theta _0) = o_P(1) \ (n\rightarrow \infty )\). \(\square \)
Appendix J: Proof of Theorem 3
Proof
First we discuss the bound for the probability of \(\mathcal {T}_{group}\) in (22).
The difference between \(\mathcal {T}_{group}\) and \(\mathcal {T}\) in (21) relates only to the following entropy term from the Entropy Lemma in the proof of Lemma 3.
where \(\sum _p\Vert \varvec{\beta }_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F\le \epsilon \) still defines a convex set for \(\varvec{\beta }\) in the metric space equipped with the metric induced by the norm \(\Vert \cdot \Vert _{P_n}\) defined in the proof of Lemma 3. Thus it still satisfies the condition of Lemma 2.6.11 of Van Der Vaart and Wellner (1996), which can be applied to give
So the bound on the probability of event \(\mathcal {T}_{group}\) takes the same form as that in Lemma 3.
Then we discuss the bound for the average excess risk and feature selection.
If the event \(\mathcal {T}_{group}\) happens, we have
Using Condition 3 we have \( \bar{\varepsilon }(\psi _0\mid \psi _0) =0\) and \(\bar{\varepsilon }(\hat{\psi }\mid \psi _0) \ge {\Vert \hat{\psi }-\psi _0\Vert _{Q_n}^2}/{c_0^2}\).
Case 1 When the following is true:
we have
Case 2 When the following is true:
As \(\sum _{p\in \mathcal {I}^c}\Vert \varvec{\beta }_{0,\mathcal {G}_p}\Vert _F=0\), we have
Then we get
Case 3 When the following is true:
we have
Thus we have
so we can use Condition 5 for \(\hat{\varvec{\beta }} -\varvec{\beta }_0\) to obtain
So we have
Without the restricted eigenvalue Condition 5, we can argue similarly as in the “Appendix I” section, assuming event \(\mathcal {T}_{group}\) happens and using the conditions on \(\sum _p\Vert \varvec{\beta }_{0,\mathcal {G}_p}\Vert _F\) and \(\sqrt{mk}\). \(\square \)