Abstract
Finite Mixture Regression (FMR) refers to the mixture modeling scheme that learns multiple regression models from the training data set, each of which is responsible for a subset of the samples. FMR is an effective scheme for handling sample heterogeneity, where a single regression model is not enough to capture the complexities of the conditional distribution of the observed samples given the features. In this paper, we propose an FMR model that (1) finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously, (2) achieves shared feature selection among tasks and cluster components, and (3) detects anomaly tasks or clustered structure among tasks, and accommodates outlier samples. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The proposed model is evaluated on both synthetic and real-world data sets. The results show that our model can achieve state-of-the-art performance.
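As a point of reference, a minimal sketch of the generic (single-target, complete-data) FMR specification is given below; the model proposed in this paper extends this form to multiple incomplete mixed-type targets, shared feature selection, and outlier accommodation. The sketch uses an exponential dispersion family density \(f\) (Jorgensen 1987), which covers Gaussian, Bernoulli, and Poisson responses, among others.
$$\begin{aligned} p(y\mid \mathbf {x};\theta ) = \sum _{r=1}^{k}\pi _r\, f\bigl (y;\,\mathbf {x}^{\mathrm{T}}\varvec{\beta }_r,\,\phi _r\bigr ),\qquad \pi _r\ge 0,\quad \sum _{r=1}^{k}\pi _r=1, \end{aligned}$$
where each component r has its own regression coefficients \(\varvec{\beta }_r\) and dispersion \(\phi _r\) and is responsible for the subset of samples it explains best (McLachlan and Peel 2004).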
References
Aho K, Derryberry D, Peterson T (2014) Model selection for ecologists: the worldviews of AIC and BIC. Ecology 95(3):631–636
Alfò M, Salvati N, Ranalli MG (2016) Finite mixtures of quantile and M-quantile regression models. Stat Comput 27:1–24
Argyriou A, Evgeniou T, Pontil M (2007a) Multi-task feature learning. In: Advances in neural information processing systems, pp 41–48
Argyriou A, Pontil M, Ying Y, Micchelli CA (2007b) A spectral regularization framework for multi-task structure learning. In: Advances in neural information processing systems, pp 25–32
Bai X, Chen K, Yao W (2016) Mixture of linear mixed models using multivariate t distribution. J Stat Comput Simul 86(4):771–787
Bartolucci F, Scaccia L (2005) The use of mixtures for dealing with non-normal regression errors. Comput Stat Data Anal 48(4):821–834
Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141–148
Becker SR, Candès EJ, Grant MC (2011) Templates for convex cone problems with applications to sparse signal recovery. Math Program Comput 3(3):165–218
Bhat HS, Kumar N (2010) On the derivation of the Bayesian information criterion. School of Natural Sciences, University of California, Oakland
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends® Mach Learn 3(1):1–122
Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717–772
Chen X, Kim S, Lin Q, Carbonell JG, Xing EP (2010) Graph-structured multi-task regression and an efficient optimization method for general fused lasso. ArXiv preprint arXiv:1005.3579
Chen J, Zhou J, Ye J (2011) Integrating low-rank and group-sparse structures for robust multi-task learning. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 42–50
Chen J, Liu J, Ye J (2012a) Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Trans Knowl Discov Data (TKDD) 5(4):22
Chen K, Chan KS, Stenseth NC (2012b) Reduced rank stochastic regression with a sparse singular value decomposition. J R Stat Soc Ser B (Stat Methodol) 74(2):203–221
Cover TM, Thomas JA (2012) Elements of information theory. Wiley, Hoboken
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39:1–38
Doğru FZ, Arslan O (2016) Robust mixture regression using mixture of different distributions. In: Agostinelli C, Basu A, Filzmoser P, Mukherjee D (eds) Recent advances in robust statistics: theory and applications. Springer, New Delhi, pp 57–79
Doğru FZ, Arslan O (2017) Parameter estimation for mixtures of skew Laplace normal distributions and application in mixture regression modeling. Commun Stat Theory Methods 46(21):10,879–10,896
Fahrmeir L, Kneib T, Lang S, Marx B (2013) Regression: models, methods and applications. Springer, Berlin
Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20(1):101–148
Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 186–193
Gong P, Ye J, Zhang C (2012a) Robust multi-task feature learning. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 895–903
Gong P, Ye J, Zhang C (2012b) Multi-stage multi-task feature learning. In: Advances in neural information processing systems, pp 1988–1996
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
He J, Lawrence R (2011) A graph-based framework for multi-task multi-view learning. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 25–32
Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481–499
Jacob L, Vert J, Bach FR (2009) Clustered multi-task learning: a convex formulation. In: Advances in neural information processing systems, pp 745–752
Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3(1):79–87
Jalali A, Sanghavi S, Ruan C, Ravikumar PK (2010) A dirty model for multi-task learning. In: Advances in neural information processing systems, pp 964–972
Ji S, Ye J (2009) An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 457–464
Jin R, Goswami A, Agrawal G (2006) Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10(1):17–40
Jin X, Zhuang F, Pan SJ, Du C, Luo P, He Q (2015) Heterogeneous multi-task semantic feature learning for classification. In: Proceedings of the 24th ACM international on conference on information and knowledge management. ACM, pp 1847–1850
Jorgensen B (1987) Exponential dispersion models. J R Stat Soc Ser B (Methodol) 49:127–162
Khalili A (2011) An overview of the new feature selection methods in finite mixture of regression models. J Iran Stat Soc 10(2):201–235
Khalili A, Chen J (2007) Variable selection in finite mixture of regression models. J Am Stat Assoc 102(479):1025–1038
Koller D (1996) Toward optimal feature selection. In: Proceedings of the 13th international conference on machine learning, pp 284–292
Kubat M (2015) An introduction to machine learning. Springer, Berlin
Kumar A, Daumé III H (2012) Learning task grouping and overlap in multi-task learning. In: Proceedings of the 29th international conference on machine learning. Omnipress, pp 1723–1730
Law MH, Jain AK, Figueiredo M (2002) Feature selection in mixture-based clustering. In: Advances in neural information processing systems, pp 625–632
Li S, Liu ZQ, Chan AB (2014) Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 482–489
Lim H, Narisetty NN, Cheon S (2016) Robust multivariate mixture regression models with incomplete data. J Stat Comput Simul 87:1–20
Liu J, Ji S, Ye J (2009) Multi-task feature learning via efficient \(\ell _{2,1}\)-norm minimization. In: Proceedings of the 25th conference on uncertainty in artificial intelligence. AUAI Press, pp 339–348
McLachlan G, Peel D (2004) Finite mixture models. Wiley, Hoboken
Neal RM, Hinton GE (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in graphical models. Springer, Dordrecht, pp 355–368
Nelder JA, Baker RJ (1972) Generalized linear models. Encyclopedia of statistical sciences. Wiley, Hoboken
Nesterov Y et al (2007) Gradient methods for minimizing composite objective function. Technical report, UCL
Passos A, Rai P, Wainer J, Daumé III H (2012) Flexible modeling of latent task structures in multitask learning. In: Proceedings of the 29th international conference on machine learning. Omnipress, pp 1283–1290
Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319
She Y, Chen K (2017) Robust reduced-rank regression. Biometrika 104(3):633–647
She Y, Owen AB (2011) Outlier detection using nonconvex penalized regression. J Am Stat Assoc 106(494):626–639
Städler N, Bühlmann P, Van De Geer S (2010) \(\ell _1\)-penalization for mixture regression models. Test 19(2):209–256
Strehl A, Ghosh J (2002a) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3(Dec):583–617
Strehl A, Ghosh J (2002b) Cluster ensembles: a knowledge reuse framework for combining partitionings. In: 18th national conference on artificial intelligence. American Association for Artificial Intelligence, pp 93–98
Tan Z, Kaddoum R, Wang LY, Wang H (2010) Decision-oriented multi-outcome modeling for anesthesia patients. Open Biomed Eng J 4:113
Van de Geer SA (2000) Applications of empirical process theory, vol 91. Cambridge University Press, Cambridge
Van Der Maaten L, Postma E, Van den Herik J (2009) Dimensionality reduction: a comparative review. J Mach Learn Res 10:66–71
Van Der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes: with applications to statistics. Springer, Berlin
Wang HX, Zhang QB, Luo B, Wei S (2004) Robust mixture modelling using multivariate t-distribution with missing information. Pattern Recognit Lett 25(6):701–710
Wedel M, DeSarbo WS (1995) A mixture likelihood approach for generalized linear models. J Classif 12(1):21–55
Weruaga L, Vía J (2015) Sparse multivariate Gaussian mixture regression. IEEE Trans Neural Netw Learn Syst 26(5):1098–1108
Yang X, Kim S, Xing EP (2009) Heterogeneous multitask learning with joint sparsity constraints. In: Advances in neural information processing systems, pp 2151–2159
Yuksel SE, Wilson JN, Gader PD (2012) Twenty years of mixture of experts. IEEE Trans Neural Netw Learn Syst 23(8):1177–1193
Zhang D, Shen D, Initiative ADN et al (2012) Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59(2):895–907
Zhang Y, Yeung DY (2011) Multi-task learning in heterogeneous feature spaces. In: Proceedings of the 25th AAAI conference on artificial intelligence (AAAI-11), San Francisco, CA, p 574
Zhou J, Chen J, Ye J (2011) Clustered multi-task learning via alternating structure optimization. In: Advances in neural information processing systems, pp 702–710
Acknowledgements
The authors would like to thank the editors and reviewers for their valuable suggestions on improving this paper. The work of Jian Liang and Changshui Zhang is (jointly or partly) funded by the National Natural Science Foundation of China under Grant No. 61473167 and the Beijing Natural Science Foundation under Grant No. L172037. Kun Chen’s work is partially supported by the U.S. National Science Foundation under Grants DMS-1613295 and IIS-1718798. The work of Fei Wang is supported by the National Science Foundation under Grants IIS-1650723 and IIS-1716432.
Additional information
Responsible editor: Pauli Miettinen.
Appendices
Appendix A: Definitions
Definition 1
\(Z = (Z_1,\ldots ,Z_{m'})^{\mathrm{T}}\in \mathbb {R}^{m'} \) has a sub-exponential distribution with parameters \((\sigma ,v,t)\) if for \(M>t\), it holds
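A standard two-regime formulation of this tail condition, consistent with the bounds used in the proof of Lemma 1 in the “Appendix E” section, is the following sketch (the exact norm and constants in Definition 1 may differ):
$$\begin{aligned} \mathbb {P}\bigl (\Vert Z\Vert> M\bigr )\le \exp \biggl (-\frac{M^2}{\sigma ^2}\biggr )\ \ \text {for } t< M\le \frac{\sigma ^2}{v},\qquad \mathbb {P}\bigl (\Vert Z\Vert > M\bigr )\le \exp \biggl (-\frac{M}{v}\biggr )\ \ \text {for } M> \frac{\sigma ^2}{v}. \end{aligned}$$
The two regimes coincide at \(M=\sigma ^2/v\), giving a sub-Gaussian tail for moderate deviations and an exponential tail for large deviations.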
Appendix B: The empirical process
In order to prove the first part of Theorem 1, i.e., that the bound in (26) holds with the probability in (25), we first follow Städler et al. (2010) to define the empirical process for fixed data points \(\mathbf {x}_1,\ldots ,\mathbf {x}_n\). For \(\tilde{\mathbf {y}}_i = (y_{ij}, j\in {\varOmega }_i)^{\mathrm{T}}\in \mathbb {R}^{|{\varOmega }_i|}\) and \(X = (X_1,\ldots ,X_d)\), let
Fixing some \(T\ge 1\) and \(\lambda _0\ge 0\), we define an event \(\mathcal {T}\) below, on which the bound in (26) can be proved; the probability of the event \(\mathcal {T}\) is thus the probability in (25).
It can be seen that (21) defines a set of parameters \(\theta \), and the bound in (26) will be proved for \(\hat{\theta }\) in this set.
For the group-lasso type estimator, we define an event similar to that in (21) as follows.
Appendix C: Lemmas
In order to show that the probability of event \(\mathcal {T}\) is large, we first invoke the following lemma.
Lemma 2
Under Condition 2, for model (1) with \(\theta _0 \in \tilde{{\varTheta }}\), \(M_n\) and \(\lambda _0\) defined in (24), some constants \(c_6,c_7\) depending on K, and for \(n\ge c_7\), we have
where \(\mathbb {P}_{\mathbf {X}}\) denotes the conditional probability given \((X_1^{\mathrm{T}},\ldots ,X_n^{\mathrm{T}})^{\mathrm{T}}=(\mathbf {x}_1^{\mathrm{T}},\ldots ,\mathbf {x}_n^{\mathrm{T}})^{\mathrm{T}}= \mathbf {X}\), and \(F(\tilde{\mathbf {y}}_i) = G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\} + \mathbb {E}[G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\}\mid X=\mathbf {x}_i],\forall i\).
A proof is given in “Appendix F” section.
Then we can follow Corollary 1 in Städler et al. (2010) to show below that the probability of event \(\mathcal {T}\) is large.
Lemma 3
Using Lemma 2, for model (1) with \(\theta _0 \in \tilde{{\varTheta }}\), for some constants \(c_7,c_8,c_9,c_{10}\) depending on K, with \(\mathcal {T}\) defined in (21), and for all \(T\ge c_{10}\), we have
A proof is given in “Appendix G” section.
Appendix D: Corollaries for models considering outlier samples
When considering outlier samples and modifying the natural parameter model as in (11), we can establish similar results, as shown in this section.
First, as \(\varvec{\beta }\) and \(\varvec{\zeta }\) are treated in a similar way, we denote them together by \(\varvec{\xi }\in \mathbb {R}^{((d+n)\times m)\times k}\), with \(\xi = vec(\varvec{\xi }) \in \mathbb {R}^{(d+n)mk}\), such that for all \(r = 1,\ldots ,k\),
where \(\mathbf {I}_{n}\in \mathbb {R}^{n\times n}\) is the identity matrix.
Thus the modification only results in a new design matrix and a new regression coefficient matrix; therefore, we can apply Theorems 1–3 to obtain similar results for the modified models.
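Concretely, a minimal sketch of this augmentation in our notation (assuming the mean-shift form implied by (11)): the design matrix gains the identity block \(\mathbf {I}_n\), and the per-sample outlier shifts are absorbed into the coefficient matrix,
$$\begin{aligned} \tilde{\mathbf {X}} = \bigl [\,\mathbf {X}\ \ \mathbf {I}_{n}\,\bigr ]\in \mathbb {R}^{n\times (d+n)},\qquad \varvec{\xi }_r = \bigl (\varvec{\beta }_r^{\mathrm{T}},\ \varvec{\zeta }_r^{\mathrm{T}}\bigr )^{\mathrm{T}}\in \mathbb {R}^{(d+n)\times m},\qquad \tilde{\mathbf {X}}\varvec{\xi }_r = \mathbf {X}\varvec{\beta }_r + \varvec{\zeta }_r, \end{aligned}$$
so the natural parameters of component r are computed from the augmented design \(\tilde{\mathbf {X}}\) exactly as in the original model.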
For lasso-type penalties, denote the set of indices of non-zero entries of \(\beta _0\) by \(S_{\beta }\), and the set of indices of non-zero entries of \(\zeta _0\) by \(S_{\zeta }\), where \(\zeta = \text{ vec }(\varvec{\zeta }_1,\ldots ,\varvec{\zeta }_k)\). Denote by \(s = |S_{\beta }| + |S_{\zeta }|\). Then for entry-wise \(\ell _1\) penalties in (5) (for \(\varvec{\beta }\)) with \(\gamma = 0\) and \(\mathcal {R}(\varvec{\zeta }) = \lambda \Vert \zeta \Vert _1\) (for \(\varvec{\zeta }\)), we need the following modified restricted eigenvalue condition.
Condition 6
For all \( \beta \in \mathbb {R}^{dmk}\) and all \( \zeta \in \mathbb {R}^{nmk}\) satisfying \(\Vert \beta _{S_{\beta }^c}\Vert _1 + \Vert \zeta _{S_{\zeta }^c}\Vert _1 \le 6(\Vert \beta _{S_{\beta }}\Vert _1+\Vert \zeta _{S_{\zeta }}\Vert _1)\), it holds for some constant \(\kappa \ge 1\) that,
Corollary 1
Consider the Hermit model in (1) with \(\theta _0\in \tilde{{\varTheta }}\), and consider the penalized estimator (12) with the \(\ell _1\) penalties in (5) and \(\mathcal {R}(\varvec{\zeta }) = \lambda \Vert \zeta \Vert _1\).
(a)
Assume Conditions 1–3 and 6 hold. Suppose \(\sqrt{mk} \lesssim n/M_n\), and take \(\lambda > 2T\lambda _0\) for some constant \(T>1\). For some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{(\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have
$$\begin{aligned} \bar{\varepsilon }(\hat{\theta }\mid \theta _0) + 2(\lambda -T\lambda _0) \left( \Vert \hat{\beta }_{S_{\beta }^c}\Vert _1 + \Vert \hat{\zeta }_{S_{\zeta }^c}\Vert _1\right) \le 4(\lambda +T\lambda _0)^2\kappa ^2 c_0^2s. \end{aligned}$$
(b)
Assume Conditions 1–3 hold (without Condition 6), and assume
$$\begin{aligned} \Vert \beta _0\Vert _1 + \Vert \zeta _0\Vert _1&= o\left( \sqrt{n/((\log n)^{2+2c_1} \log (d\vee n)mk)}\right) ,\\ \sqrt{mk}&= o\left( \sqrt{n/((\log n)^{2+2c_1}\log (d\vee n))}\right) \end{aligned}$$ as \(n\rightarrow \infty \). If \(\lambda = C\sqrt{(\log n)^{2+2c_1}\log (d\vee n)mk/n}\) for some \(C>0\) sufficiently large, then for some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{ (\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have \(\bar{\varepsilon }(\hat{\theta }\mid \theta _0) = o_P(1)\).
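To interpret part (a): both terms on the left-hand side of the bound are non-negative, so each is controlled by the right-hand side separately; taking \(\lambda \asymp T\lambda _0\) (say \(\lambda = 3T\lambda _0\)) yields the familiar lasso-type oracle rates
$$\begin{aligned} \bar{\varepsilon }(\hat{\theta }\mid \theta _0) \le 4(\lambda +T\lambda _0)^2\kappa ^2 c_0^2 s \lesssim \lambda ^2 s,\qquad \Vert \hat{\beta }_{S_{\beta }^c}\Vert _1 + \Vert \hat{\zeta }_{S_{\zeta }^c}\Vert _1 \le \frac{2(\lambda +T\lambda _0)^2\kappa ^2 c_0^2 s}{\lambda -T\lambda _0}\lesssim \lambda s, \end{aligned}$$
i.e., the average excess risk scales with \(\lambda ^2 s\) and the off-support coefficient mass with \(\lambda s\).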
For group-lasso type penalties, denote
where \(\varvec{\beta }_{0,\mathcal {G}_{\beta ,p}}\) and \(\varvec{\zeta }_{0,\mathcal {G}_{\zeta ,q}}\) denote the pth group of \(\varvec{\beta }_0\) and the qth group of \(\varvec{\zeta }_0\), respectively. Now denote \(s = |\mathcal {I}_{\beta }| + |\mathcal {I}_{\zeta }|\) with some abuse of notation.
Then for group \(\ell _1\) penalties in (27) (for \(\varvec{\beta }\)) and \(\mathcal {R}(\varvec{\zeta }) = \sum _q^Q\Vert \varvec{\zeta }_{\mathcal {G}_{\zeta ,q}}\Vert _F\) (for \(\varvec{\zeta }\)), we need the following modified restricted eigenvalue condition.
Condition 7
For all \( \varvec{\beta }\in \mathbb {R}^{d\times mk}\) and all \( \varvec{\zeta }\in \mathbb {R}^{n\times mk}\) satisfying
it holds that for some constant \(\kappa \ge 1\),
Corollary 2
Consider the Hermit model in (1) with \(\theta _0\in \tilde{{\varTheta }}\), and consider estimator (12) with the group \(\ell _1\) penalties in (27) and \(\mathcal {R}(\varvec{\zeta }) = \sum _q^Q\Vert \varvec{\zeta }_{\mathcal {G}_{\zeta ,q}}\Vert _F\).
(a)
Assume Conditions 1–3 and 7 hold. Suppose \(\sqrt{mk} \lesssim n/M_n\), and take \(\lambda > 2T\lambda _0\) for some constant \(T>1\). For some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{(\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have
$$\begin{aligned} \bar{\varepsilon }(\hat{\theta }\mid \theta _0) + 2(\lambda -T\lambda _0)\biggl (\sum _{p\in \mathcal {I}_{\beta }^c}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_{\beta ,p}}\Vert _F+\sum _{q\in \mathcal {I}_{\zeta }^c}\Vert \hat{\varvec{\zeta }}_{\mathcal {G}_{\zeta ,q}}\Vert _F\biggr ) \le 4(\lambda +T\lambda _0)^2\kappa ^2 c_0^2s. \end{aligned}$$
(b)
Assume Conditions 1–3 hold (without Condition 7), and assume
$$\begin{aligned} \sum _{p=1}^P\Vert \varvec{\beta }_{0,\mathcal {G}_{\beta ,p}}\Vert _F + \sum _{q=1}^Q\Vert \varvec{\zeta }_{0,\mathcal {G}_{\zeta ,q}}\Vert _F&= o\left( \sqrt{n/((\log n)^{2+2c_1}\log (d\vee n)mk)}\right) ,\\ \sqrt{mk}&= o\left( \sqrt{n/((\log n)^{2+2c_1}\log (d\vee n))}\right) \end{aligned}$$ as \(n\rightarrow \infty \). If \(\lambda = C\sqrt{(\log n)^{2+2c_1}\log (d\vee n)mk/n}\) for some \(C>0\) sufficiently large, then for some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{ (\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have \(\bar{\varepsilon }(\hat{\theta }\mid \theta _0) = o_P(1)\).
Appendix E: Proof of Lemma 1
Proof
For a non-negative continuous variable X, we have
$$\begin{aligned} \mathbb {E}[X1\{X>M\}] = M\mathbb {P}(X>M) + \int _M^{\infty }\mathbb {P}(X>x)dx. \end{aligned}$$
Similarly, we have \(\mathbb {E}[X^21\{X>M\}] = M^2\mathbb {P}(X>M) + \int _M^{\infty }2x\mathbb {P}(X>x)dx\).
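Both identities follow from the layer-cake (tail integral) representation; for instance, for the first one,
$$\begin{aligned} \mathbb {E}[X1\{X>M\}] = \int _0^{\infty }\mathbb {P}\bigl (X1\{X>M\}>x\bigr )dx = \int _0^{M}\mathbb {P}(X>M)dx + \int _M^{\infty }\mathbb {P}(X>x)dx, \end{aligned}$$
and the second follows in the same way from \(\mathbb {E}[X^21\{X>M\}] = \int _0^{\infty }2x\,\mathbb {P}\bigl (X1\{X>M\}>x\bigr )dx\).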
For X sub-exponential with parameters \((\sigma ,v ,t) \) such that for \(M>t \)
we have the following.
If \(M\le \frac{\sigma ^2}{v} \), we have
and similarly, \(\mathbb {E}[X^21\{X>M\}] \le \biggl (M^2+ 2v^2+2\sigma ^2\biggr )\exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ).\)
If \(M> \frac{\sigma ^2}{v} \), we have \(\mathbb {E}[X1\{X>M\}] \le (M+v)\exp \biggl (-\frac{M }{v }\biggr )\) and \(\mathbb {E}[X^21\{X>M\}] \le (M^2+2v^2+2vM)\exp \biggl (-\frac{M }{v }\biggr )\).
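As a quick check of the first bound in this regime, combining the tail bound \(\mathbb {P}(X>x)\le \exp (-x/v)\) for \(x\ge M>\sigma ^2/v\) (a sketch of the sub-exponential condition) with the identity above gives
$$\begin{aligned} \mathbb {E}[X1\{X>M\}] \le M\exp \biggl (-\frac{M}{v}\biggr ) + \int _M^{\infty }\exp \biggl (-\frac{x}{v}\biggr )dx = (M+v)\exp \biggl (-\frac{M}{v}\biggr ). \end{aligned}$$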
Then for some constants \(c_1,c_2,c_3,c_4,c_5>0\), for non-negative continuous variable X which is sub-exponential with parameters \((\sigma ,v,t)\), for \(M>c_4>t\) and \(c' = 2+\frac{3}{c_1}\), we have
If \(t \le M\le \frac{\sigma ^2}{v}\), \(c_1 =1/2, c_2 = \sqrt{2}\sigma , c_3 = 16\sigma ^8\). And if \(M\ge \frac{\sigma ^2}{v}\), \(c_1 = 1,c_2 = 2v,c_3 = 32v^5\). And \(c_5 = \sqrt{2}(v + \sigma )\).
For non-negative discrete variables, the result is the same.
The result of Lemma 1 follows from the result above, the fact that \(\tilde{\mathbf {y}}_i\) has a finite mixture distribution for \(i=1,\ldots ,n\), and the following.
When the dispersion parameter \(\phi \) is known, for a constant \(c_K\) depending on K, we have
\(\square \)
Appendix F: Proof of Lemma 2
Proof
Under Condition 2, with \(M_n = c_2(\log n)^{c_1}\) and \(\lambda _0\) defined in (24), for a constant \(c_6\) depending on K and for \(i=1,\ldots ,n\), we have
Then we can get
\(\square \)
Appendix G: Proof of Lemma 3
Proof
We follow Städler et al. (2010) to give an Entropy Lemma and then prove Lemma 3.
We use the norm \(\Vert \cdot \Vert _{P_n}\) introduced in the proof of Lemma 2 in Städler et al. (2010), and use \(H(\cdot ,\mathcal {H},\Vert \cdot \Vert _{P_n})\) for the entropy of the covering number [see Van de Geer (2000)] of a collection \(\mathcal {H}\) of functions on \(\mathcal {X}\times \mathcal {Y}\), equipped with the metric induced by this norm,
Define \(\tilde{{\varTheta }}(\epsilon ) = \{\theta \in \tilde{{\varTheta }}: \Vert \varvec{\beta }-\varvec{\beta }_0\Vert _1 + \Vert \eta - \eta _0\Vert _2 \le \epsilon \}\).
Lemma 4
(Entropy Lemma) For a constant \(c_{12}>0\), for all \(u>0\) and \(M_n>0\), we have
Proof
(of the Entropy Lemma) The difference between this proof and that of the Entropy Lemma in the proof of Lemma 2 of Städler et al. (2010) lies in the notation and the effect of multivariate responses.
For multivariate responses we have for \(i=1,\ldots ,n\),
where \(d_{\psi } = (2m+1)k\) is the maximum dimension of \(\psi _i\) over \(i=1,\ldots ,n\).
Under the definition of the norm \(\Vert \cdot \Vert _{P_n}\) we have
Then by the result of Städler et al. (2010) we have
where \(d_{\eta } = (m+1)k\) is the dimension of \(\eta \).
We then follow Städler et al. (2010) and apply Lemma 2.6.11 of Van Der Vaart and Wellner (1996) to obtain the bound
Thus we can get
\(\square \)
Now we turn to prove Lemma 3.
We follow Städler et al. (2010) to use the truncated version of the empirical process below.
We follow Städler et al. (2010) and apply Lemma 3.2 in Van de Geer (2000) and a conditional version of Lemma 3.3 in Van de Geer (2000) to the class
For some constants \(\{c_{t}\}_{t>12}\) depending on K and \({\varLambda }_{\max }\) in Condition 2 of Städler et al. (2010), using the notation of Lemma 3.2 in Van de Geer (2000), we follow Städler et al. (2010) to choose \(\delta = c_{13} T\epsilon \lambda _0\) and \(R = c_{14}(\sqrt{mk}\epsilon \wedge 1)M_n\).
Thus, by choosing \(M_n = c_2(\log n)^{c_1}\), we can satisfy the condition of Lemma 3.2 of Van de Geer (2000) and obtain
For the rest, we can apply Lemma 3.2 of Van de Geer (2000) to obtain the same result as Lemma 2 of Städler et al. (2010).
So we have
with probability at least \(1 - c_{9}\exp \biggl [- \frac{T^2(\log n)^2\log (d\vee n) }{c_{8}^2}\biggr ]\).
At last, for the case when \(G_1(\tilde{\mathbf {y}}_i)>M_n\), for \(i=1,\ldots ,n\), we have
and
Then the probability of the following inequality under our model is given in Lemma 2.
where \(d_{\psi } = 2(m+1)k\). \(\square \)
Appendix H: Proof of Theorem 1
Proof
This proof mostly follows that of Theorem 3 of Städler et al. (2010); the only difference is in the notation. As such, we omit the details. \(\square \)
Appendix I: Proof of Theorem 2
Proof
This proof also mostly follows that of Theorem 5 of Städler et al. (2010). The differences lie in the notation and the choice of \(M_n\).
If the event \(\mathcal {T}\) happens, with \(M_n = c_2(\log n)^{c_1}\) for some constants \(0\le c_1,c_2<\infty \), where \(c_2\) depends on K,
we have
Under the definition of \(\theta \in \tilde{{\varTheta }}\) in (23) we have \(\Vert \eta -\eta _0\Vert _2\le 2K\). And as \( \bar{\varepsilon }(\psi _0\mid \psi _0) =0\) we have for n sufficiently large.
As \(C>0\) is sufficiently large, we have \(\lambda \ge 2T\lambda _0\).
Using the conditions on \(\Vert \beta _0\Vert _1\) and \(\sqrt{mk}\), we have both \(2KT\lambda _0 = o(1)\) and \((\lambda +T\lambda _0)\Vert \beta _0\Vert _1 = o(1)\), so \(\bar{\varepsilon }(\hat{\psi }\mid \psi _0)\rightarrow 0\) as \(n\rightarrow \infty \).
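For concreteness, a sketch of the second claim under the rate \(\lambda = C\sqrt{(\log n)^{2+2c_1}\log (d\vee n)mk/n}\) and the condition \(\Vert \beta _0\Vert _1 = o\bigl (\sqrt{n/((\log n)^{2+2c_1}\log (d\vee n)mk)}\bigr )\) (cf. the analogous conditions in Corollary 1(b)), using \(T\lambda _0\le \lambda /2\):
$$\begin{aligned} (\lambda +T\lambda _0)\Vert \beta _0\Vert _1 \lesssim \sqrt{\frac{(\log n)^{2+2c_1}\log (d\vee n)mk}{n}}\cdot o\left( \sqrt{\frac{n}{(\log n)^{2+2c_1}\log (d\vee n)mk}}\right) = o(1). \end{aligned}$$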
At last, as the event \(\mathcal {T}\) has large probability, we have \(\bar{\varepsilon }(\hat{\theta }_{\lambda }\mid \theta _0) = o_P(1) \ (n\rightarrow \infty )\). \(\square \)
Appendix J: Proof of Theorem 3
Proof
First we discuss the bound for the probability of \(\mathcal {T}_{group}\) in (22).
The difference between \(\mathcal {T}_{group}\) and \(\mathcal {T}\) in (21) relates only to the following entropy term from the Entropy Lemma in the proof of Lemma 3.
where \(\sum _p\Vert \varvec{\beta }_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F\le \epsilon \) still defines a convex set for \(\varvec{\beta }\) in the metric space equipped with the metric induced by the norm \(\Vert \cdot \Vert _{P_n}\) defined in the proof of Lemma 3. Thus it still satisfies the condition of Lemma 2.6.11 of Van Der Vaart and Wellner (1996), which can be applied to give
So the bound on the probability of event \(\mathcal {T}_{group}\) takes the same form as that in Lemma 3.
Then we discuss the bound for the average excess risk and feature selection.
If the event \(\mathcal {T}_{group}\) happens, we have
Using Condition 3 we have \( \bar{\varepsilon }(\psi _0\mid \psi _0) =0\) and \(\bar{\varepsilon }(\hat{\psi }\mid \psi _0) \ge {\Vert \hat{\psi }-\psi _0\Vert _{Q_n}^2}/{c_0^2}\).
Case 1 When the following is true:
we have
Case 2 When the following is true:
As \(\sum _{p\in \mathcal {I}^c}\Vert \varvec{\beta }_{0,\mathcal {G}_p}\Vert _F=0\), we have
Then we get
Case 3 When the following is true:
we have
Thus we have
so we can use Condition 5 for \(\hat{\varvec{\beta }} -\varvec{\beta }_0\) to obtain
So we have
Without the restricted eigenvalue Condition 5, we can argue similarly as in the “Appendix I” section, assuming event \(\mathcal {T}_{group}\) happens and using the conditions on \(\sum _p\Vert \varvec{\beta }_{0,\mathcal {G}_p}\Vert _F\) and \(\sqrt{mk}\). \(\square \)