Abstract
We propose an adaptive \(\ell _1\)-penalized estimator in the framework of Generalized Linear Models with identity-link and Poisson data, by taking advantage of a globally quadratic approximation of the Kullback-Leibler divergence. We prove that this approximation is asymptotically unbiased and that the proposed estimator has the variable selection consistency property in a deterministic matrix design framework. Moreover, we present a numerically efficient strategy for the computation of the proposed estimator, making it suitable for the analysis of massive counts datasets. We show with two numerical experiments that the method can be applied both to statistical learning and signal recovery problems.
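To make the approach concrete, the following Python sketch fits an adaptive \(\ell _1\)-penalized weighted least-squares surrogate to Poisson counts with identity link. The data-dependent weights \(1/(y_i+1)\), the adaptive penalties \(\hat{w}_j=1/|\hat{\beta }^{\mathrm{init}}_j|^{\gamma }\) and the plain proximal-gradient (ISTA) solver are illustrative assumptions consistent with the quantities appearing in the appendix; they are not the paper's exact definitions, and the nonnegativity constraint \((\mathbf X \beta )_i\ge 0\) used in the paper is omitted for simplicity.

```python
import numpy as np

def adaptive_l1_poisson(X, y, lam=1.0, gamma=1.0, n_iter=500):
    """Sketch: adaptive l1-penalized weighted least-squares surrogate for
    Poisson data with identity link. The weights diag(1/(y_i+1)), the adaptive
    penalties w_j = 1/|beta_init_j|^gamma and the ISTA solver are illustrative
    choices, not the exact estimator defined in the paper."""
    n, p = X.shape
    lam2 = 1.0 / (y + 1.0)                          # diagonal of Lambda^2 (assumed form)
    # pilot estimate: ridge-stabilized weighted least squares
    XtWX = X.T @ (lam2[:, None] * X)
    XtWy = X.T @ (lam2 * y)
    beta_init = np.linalg.solve(XtWX + 1e-6 * np.eye(p), XtWy)
    w = 1.0 / (np.abs(beta_init) ** gamma + 1e-12)  # adaptive weights
    # ISTA on 0.5*(y - X beta)^T Lambda^2 (y - X beta) + lam * sum_j w_j |beta_j|
    L = np.linalg.norm(np.sqrt(lam2)[:, None] * X, 2) ** 2  # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (lam2 * (y - X @ beta))
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)  # soft-thresholding
    return beta

# toy usage: sparse nonnegative coefficients, Poisson counts
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(200, 50)))
beta_true = np.zeros(50)
beta_true[:3] = [5.0, 3.0, 2.0]
y = rng.poisson(X @ beta_true).astype(float)
beta_hat = adaptive_l1_poisson(X, y, lam=2.0)
print("selected support:", np.nonzero(beta_hat)[0])
```

The soft-thresholding step is the proximal operator of the weighted \(\ell _1\) penalty, so each iteration costs one matrix-vector product with \(\mathbf X \) and one with \(\mathbf X ^T\).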
References
Anscombe, F.J.: The transformation of Poisson binomial and negative-binomial data. Biometrika 35(3/4), 246–254 (1948). https://doi.org/10.2307/2332343
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542
Bertero, M., Lanteri, H., Zanni, L.: Iterative image reconstruction: a point of view, pp. 37–63 (2008)
Bogdan, M., van den Berg, E., Sabatti, C., Su, W., Candès, E.J.: SLOPE-adaptive variable selection via convex optimization. Ann. Appl. Stat. 9(3), 1103 (2015)
Bonettini, S., Benvenuto, F., Zanella, R., Zanni, L., Bertero, M.: Gradient projection approaches for optimization problems in image deblurring and denoising. In: 2009 17th European Signal Processing Conference, pp. 1384–1388 (2009)
Candès, E.J., Wakin, M.B., Boyd, S.P.: Enhancing sparsity by reweighted \(\ell _1\) minimization. J. Fourier Anal. Appl. 14(5–6), 877–905 (2008). https://doi.org/10.1007/s00041-008-9045-x
De Vito, E., Rosasco, L., Caponnetto, A., Giovannini, U.D., Odone, F.: Learning from examples as an inverse problem. J. Mach. Learn. Res. 6(May), 883–904 (2005)
Dobson, A.J., Barnett, A.: An Introduction to Generalized Linear Models. CRC Press, Boca Raton (2008)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
Figueiredo, M.A.T., Bioucas-Dias, J.M.: Restoration of Poissonian images using alternating direction optimization. IEEE Trans. Image Process. 19(12), 3133–3145 (2010). https://doi.org/10.1109/TIP.2010.2053941
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning, vol. 1. Springer, Berlin (2001)
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R., et al.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. https://doi.org/10.18637/jss.v033.i01 (2010)
Gu, R., Dogandžić, A.: A fast proximal gradient algorithm for reconstructing nonnegative signals with sparse transform coefficients. In: 2014 48th Asilomar Conference on Signals, Systems and Computers, IEEE, pp. 1662–1667 (2014)
Hansen, N.R., Reynaud-Bouret, P., Rivoirard, V., et al.: Lasso and probabilistic inequalities for multivariate point processes. Bernoulli 21(1), 83–143 (2015)
Harmany, Z.T., Marcia, R.F., Willett, R.M.: SPIRAL out of convexity: sparsity-regularized algorithms for photon-limited imaging. In: IS&T/SPIE Electronic Imaging, pp. 75330R–75330R. International Society for Optics and Photonics (2010)
Ivanoff, S., Picard, F., Rivoirard, V.: Adaptive Lasso and group-Lasso for functional Poisson regression. J. Mach. Learn. Res. 17(55), 1–46 (2016)
Jiang, X., Reynaud-Bouret, P., Rivoirard, V., Sansonnet, L., Willett, R.: A data-dependent weighted LASSO under Poisson noise. arXiv preprint arXiv:1509.08892 (2015)
Marschner, I.C., et al.: glm2: fitting generalized linear models with convergence problems. The R Journal 3(2), 12–15 (2011)
Martinez, J.G., Carroll, R.J., Müller, S., Sampson, J.N., Chatterjee, N.: Empirical performance of cross-validation with oracle methods in a genomics context. Am. Stat. 65(4), 223–228 (2011)
McCullagh, P., Nelder, J.A.: Generalized linear models, 2nd edn. Chapman & Hall, New York (1989)
Peyré, G.: The numerical tours of signal processing. Comput. Sci. Eng. 13(4), 94–97 (2011). https://doi.org/10.1109/MCSE.2011.71
Prince, J.L., Links, J.M.: Medical Imaging Signals and Systems. Pearson Prentice Hall Upper Saddle River, New Jersey (2006)
Silva, J., Tenreyro, S.: Poisson: some convergence issues. Stata Journal 11(2), 207–212 (2011)
Starck, J.L., Murtagh, F.: Astronomical Image and Data Analysis. Springer, New York (2007)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58, 267–288 (1996)
Zanella, R., Boccacci, P., Zanni, L., Bertero, M.: Corrigendum: efficient gradient projection methods for edge-preserving removal of Poisson noise. Inverse Probl. 29(11), 119501 (2013)
Zhao, P., Yu, B.: On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)
Zou, H., Zhang, H.H.: On the adaptive elastic-net with a diverging number of parameters. Ann. Stat. 37(4), 1733 (2009)
Additional information
The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 640216.
Appendix: proofs
In order to prove Theorem 1, we start by proving the following
Lemma 1
Let y be a Poisson random variable with mean \(\theta \). Let \(z>0\) be such that \(|z-\theta |\le c\sqrt{\theta }\), where c is a positive constant smaller than \(\sqrt{\theta }\). Then
Proof
Following Zanella et al. (2013) we obtain
where \(k\in \mathbb {N}\) and \(R_3\) is defined as follows
where \(\xi \ge -1\). By computing the moments of the Poisson random variable y centered in z we obtain
where \(r:=z-\theta \) and
To conclude we now prove that \(\mathscr {E}(\theta )=O(\frac{1}{\theta })\). Following the idea of the proof given in Zanella et al. (2013), we split the series into two parts: the first with k ranging from 0 to \([\frac{z}{2}]\), and the second with \(k \ge [\frac{z}{2}]+1\), where \([\chi ]\) denotes the integer part of \(\chi \). We observe that for k from 1 to \([\frac{z}{2}]\), or equivalently \(\xi \in (-1,-\frac{1}{2}]\), then
Since \(\frac{\theta ^s}{s!}=\frac{\theta ^s}{\varGamma (s+1)}\) is monotonically increasing for \(0\le s\le [\frac{\theta +c\sqrt{\theta }}{2}]\), using Eq. (24) and the Stirling formula we obtain
Since \(\frac{\theta +c\sqrt{\theta }}{2} - 1 \le \left[ \frac{\theta +c\sqrt{\theta }}{2} \right] \le \frac{\theta +c\sqrt{\theta }}{2}\) and \(\frac{\theta }{ \left[ \frac{\theta +c\sqrt{\theta }}{2} \right] } > 1\) for \(\theta \) large enough, the upper bound in inequality (25) can be bounded by
where \(v:=1+c\theta ^{-\frac{1}{2}}\). Then \(M(\theta )\rightarrow 0\) exponentially as \(\theta \rightarrow \infty \). Now we consider \(k\ge [\frac{z}{2}]+1\), or equivalently \(\xi >-\frac{1}{2}\). Since
we obtain
where
are the 5th and 4th moments of the Poisson random variable y centered in z. \(\square \)
Proof of Theorem 1
By the triangle inequality we have
Then, to get the thesis, thanks to Lemma 1, it is sufficient to prove that
By writing the left hand side of Eq. (29) as the difference between the second moments of a Poisson variable centered in z and in \(z+1\), we obtain
By some manipulations and by using that \(|z-\theta |\le c\sqrt{\theta }\) we get
\(\square \)
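For completeness, here is a worked instance of the elementary Poisson moment identities invoked in the last two steps; this is only a sketch, assuming \(y\sim \mathrm{Poisson}(\theta )\) and nothing else. Since \(\mathbb {E}[(y-z)^2]=\mathrm{Var}(y)+(\mathbb {E}y-z)^2\),

$$\begin{aligned} \mathbb {E}\big [(y-z)^2\big ]=\theta +(\theta -z)^2, \qquad \mathbb {E}\big [(y-z)^2\big ]-\mathbb {E}\big [(y-z-1)^2\big ]=2(\theta -z)-1, \end{aligned}$$

and the latter difference is bounded in absolute value by \(2c\sqrt{\theta }+1\) whenever \(|z-\theta |\le c\sqrt{\theta }\).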
To prove Theorem 2 we need some preliminary results. We start by defining
We observe that the components \(\epsilon _i\) are independent random variables with zero mean and \(\mathrm{Var}(\epsilon _i)=(\mathbf X \beta ^*)_i\), for all \(i\in \{1,\ldots ,n\}\). Hereafter, for ease of notation, we suppress the superscript (n) from the estimators.
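A quick Monte Carlo check of these properties of \(\epsilon \) is the following sketch; the three mean values are arbitrary toy stand-ins for \((\mathbf X \beta ^*)_i\).

```python
import numpy as np

# Monte Carlo check: eps_i = Y_i - (X beta*)_i has mean 0 and variance (X beta*)_i
# when Y_i ~ Poisson((X beta*)_i); the three mean values below are arbitrary toys.
rng = np.random.default_rng(1)
mu = np.array([0.5, 2.0, 10.0])        # stands in for (X beta*)_i
Y = rng.poisson(mu, size=(200_000, 3))
eps = Y - mu
print(eps.mean(axis=0))                # close to [0, 0, 0]
print(eps.var(axis=0))                 # close to [0.5, 2.0, 10.0]
```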
Lemma 2
There exists a constant \(G<+\infty \) such that
Proof of Lemma 2
We compute the term on the l.h.s. of Eq. (31). We have
with \(\mathbf D :=\varLambda ^2\epsilon \). By the Singular Value Decomposition we can write
where \(\mathbf U \) is an orthogonal matrix and \(\varSigma \) is a diagonal matrix containing the eigenvalues of \(\mathbf X ^T \mathbf X \). We define \(\mathbf H :=\mathbf U \mathbf D = \mathbf U \varLambda ^2 \epsilon \). The ith component of \(\mathbf H \) is given by
where \(u_{il}\) represents the (i, l)-entry of the matrix U. Since \(\frac{\epsilon _l}{Y_l+1}\) takes values between \(-\,(\mathbf X \beta ^*)_l\) and 1, we have that each component \(H_i\) takes values in a compact subset [R, S]. Therefore, as \(\mathbf H \in [R,S]^n\), the quadratic form \((\mathbf H ^T\varSigma \mathbf H )^2\) admits a maximum, i.e. there exists an \(\mathbf H ^*\) such that \((\mathbf H ^T\varSigma \mathbf H )^2\le ((\mathbf H ^*)^T\varSigma \mathbf H ^*)^2\). Then
with
\(\square \)
Corollary 2
Under assumption (H1) there exists a constant \(G<+\infty \) such that we have the following bound
where \(\hat{\beta }(\mathrm{PRLS})\) is the reweighted least squares estimator defined as follows
Proof of Corollary 2
By using the optimality conditions of the problem in Eq. (36) and the definition of \(\epsilon \) we have
Then, by using the Cauchy-Schwarz inequality we obtain
where
By assumption (H1) and Lemma 2 we have the thesis. \(\square \)
Proof of Theorem 2
We want to prove the bound in Eq. (12). From Corollary 2, since
we have to establish a bound for the first term of the r.h.s. of (40). In order to do so, we follow arguments similar to those used in the proof of Theorem 3.1 by Zou and Zhang (2009). By definition of \(\hat{\beta }_{(\mathbf {w} ,\lambda )}\), the following inequality applies
From the optimality conditions of the optimization problem in Eq. (36), we have
and we notice that
and
Using (41), (42), (43) and (44) we obtain
and finally, the Cauchy-Schwarz inequality and assumption (H1) lead to
The thesis follows from Eqs. (40), (46) and Corollary 2. \(\square \)
Proof of Theorem 3
For brevity we denote the APRiL estimator by \(\hat{\beta }\). To prove model selection consistency we show that, for \(n\rightarrow +\infty \),
and
We now prove Eq. (47). The functional defined in Eq. (8) is convex but not differentiable, and \(\mathscr {C}\) is a convex set. Then the solution \(\hat{\beta }\) is characterized by the Karush-Kuhn-Tucker (KKT) optimality conditions:
- \((\mathbf X \hat{\beta })_i\ge 0\) \(\forall i\in \{1,\ldots ,n\}\);
- \(\nu _i\ge 0\) \(\forall i\in \{1,\ldots ,n\}\);
- \(\nu _i (\mathbf X \hat{\beta })_i=0\) \(\forall i\in \{1,\ldots ,n\}\);
- if \(\hat{\beta }_j\ne 0\),
  $$\begin{aligned} -\mathbf x ^T_j\varLambda ^2(\mathbf Y -\mathbf X \hat{\beta })+\lambda \hat{w}_j \,\mathrm{sgn}(\hat{\beta }_j)-\mathbf x _j^T\nu =0; \end{aligned}$$ (49)
- if \(\hat{\beta }_j=0\),
  $$\begin{aligned} -\mathbf x _j^T\varLambda ^2(\mathbf Y -\mathbf X \hat{\beta })+\lambda \hat{w}_j s_j-\mathbf x _j^T\nu =0, \end{aligned}$$ (50)
  with \(s_j\in [-1,1]\).
Here \(\nu \) is the n-dimensional vector whose components are the Lagrange multipliers. Thanks to the KKT conditions, the event \(\{ \forall j\in (\mathscr {A^*})^C, \hat{\beta }_j=0 \}\) can be written as
where \(|s_j|\le 1\) [see Eq. (50)], \(\mathbf X _{\mathscr {A^*}}\) is the matrix formed by the columns \(\mathbf x _j\) and \(\hat{\beta }_{\mathscr {A^*}}\) is the vector formed by the components \(\hat{\beta }_j\) with \(j\in \mathscr {A^*}\). By taking the absolute value of each equation in (51), the event takes the form
This implies that Eq. (47) is equivalent to \(\mathbb {P}\left( \exists j\in (\mathscr {A^*})^C, \left| \mathbf x ^T_j(\varLambda ^2(\mathbf Y -\mathbf X _{\mathscr {A^*}}\hat{\beta }_{\mathscr {A^*}})+\nu )\right| >\lambda \hat{w}_j\right) \rightarrow 0\) for \(n\rightarrow +\infty \). We set
and
Then
The idea is to determine three bounds \(M_1\), \(M_2\) and \(M_3\) depending on n, such that
and \(M_1\), \(M_2\) and \(M_3\) go to 0 for \(n\rightarrow +\infty \). Let us start with the determination of bound \(M_1\). Using Corollary 1, it follows that
For the determination of bound \(M_2\), we use again Corollary 1. We have that
Finally, for the determination of bound \(M_3\), we write
By definition of \(\epsilon \), we have
By using assumption (H4), we get
and
where we used that \(\mathbb {E}\left( \frac{\epsilon _i^2}{Y_i+1}\right) \le 2\) for all \(i\in \{1,\ldots ,n\}\). Following the idea of the proof of Lemma 2, in particular the computation leading to (35), we have
Thanks to the Cauchy-Schwarz inequality, Eqs. (57), (58), (59), (60) and hypothesis (H1) we obtain the following bound
From optimality conditions in Eqs. (49) and (50) it follows that
so we have
where we have used the following bound
Then, we obtain
Now we prove that \(M_1\), \(M_2\) and \(M_3\) go to 0 for \(n\rightarrow +\infty \).
because \(\frac{\sqrt{n} \lambda \left( \frac{2}{\eta }\right) ^{\gamma }}{n^2} \longrightarrow 0\), \(\frac{\sqrt{n}\lambda n^{1+\delta \gamma }}{n^{2}}\longrightarrow 0\) and \(\frac{\lambda }{n}\left( \frac{2}{\eta }\right) ^{\gamma }\longrightarrow 0\) for \(n\rightarrow +\infty \), by assumption (H3 c) and by the positivity of the constants \(\gamma \) and \(\delta \);
because \(\frac{\lambda _1}{\left( \frac{\lambda }{n}\right) ^{\frac{1}{\gamma }} n}\longrightarrow 0\) for \(n\rightarrow +\infty \) by assumptions (H2) and (H3 a), \(\frac{\sqrt{n}}{n\left( \frac{\lambda }{n}\right) ^{\frac{1}{\gamma }}}=\frac{1}{\left( \lambda n^{\frac{\gamma }{2}-1}\right) ^{\frac{1}{\gamma }}}\longrightarrow 0\) for \(n\rightarrow +\infty \) by assumption (H3 a), and finally \(\frac{1}{\left( \frac{\lambda }{n}\right) ^{\frac{1}{\gamma }} n^{\frac{1}{\gamma }+\delta }}\longrightarrow 0\) for \(n\rightarrow +\infty \) by assumption (H3 b);
because \(\frac{\lambda _1}{n \eta } \longrightarrow 0\) for \(n\rightarrow +\infty \), by assumption (H2) and the definition of \(\eta \), and \(\frac{\sqrt{n}}{n \eta }=O(\frac{1}{\sqrt{n}})\) for \(n\rightarrow +\infty \).
Now we prove Eq. (48). It is sufficient to show that
By Eq. (60) we have
where
Since \(\min _{j\in \mathscr {A}^*}|\beta _j^*|>0\), to conclude we show that \(\Vert \beta ^*_{\mathscr {A^*}}-\hat{\beta }(\mathrm{PRLS})_{\mathscr {A^*}}\Vert _2\) and \(\frac{2\lambda \sqrt{p}\hat{\eta }^{-\gamma }}{\tau _{\min }(\mathbf X ^T\varLambda ^2 \mathbf X )}\) go to 0 in probability. Equation (35) implies that the second term on the r.h.s. of (64) goes to zero. Moreover, for the second term in Eq. (63) we have that, given \(M>0\)
as \(M_1\rightarrow 0\) for \(n\rightarrow +\infty \) and \(\frac{\lambda }{n }\left( \frac{2}{\eta }\right) ^{\gamma }\rightarrow 0\) for \(n\rightarrow +\infty \) thanks to assumption (H3 c). This proves Eq. (48) and concludes the proof. \(\square \)
Proof of Proposition 1
The proof is analogous to that of Theorem 3. We start by proving that Eq. (47) holds. It is sufficient to show that the bounds \(M_1\), \(M_2\) and \(M_3\) go to 0 as \(n\rightarrow \infty \) under assumptions (H5), (H6), (H7) and (H8). We have that \(M_1 \rightarrow 0\) since \(\frac{\lambda _1 \sqrt{p}}{\eta n}\rightarrow 0\) by assumption (H7 a) and \(\frac{\sqrt{p}}{\sqrt{n}\eta }\rightarrow 0\) by assumption (H6). Moreover, \(M_2 \rightarrow 0\) since \(\left( \frac{\lambda }{n}\right) ^{-\frac{1}{\gamma }}\frac{\lambda _1 \sqrt{p}}{n}=\left( \frac{\lambda n^{\delta \gamma }}{p^{\gamma }}\right) ^{-\frac{1}{\gamma }} \frac{\lambda _1 n^{\delta +\frac{1}{\gamma }-1}}{\sqrt{p}}\rightarrow 0\) by assumptions (H8 b) and (H7 b), \(\left( \frac{\lambda }{n}\right) ^{-\frac{1}{\gamma }} \frac{\sqrt{p}}{\sqrt{n}}=\left( \frac{\lambda n^{\delta \gamma }}{p^{\gamma }}\right) ^{-\frac{1}{\gamma }} \frac{n^{-\frac{1}{2}+\delta +\frac{1}{\gamma }}}{\sqrt{p} }\rightarrow 0\) by assumption (H8 b) and by the hypothesis \(\frac{1}{\gamma }+\delta <\frac{c+1}{2}\), and \(p \left( \frac{\lambda }{n}\right) ^{-\frac{1}{\gamma }} n^{-\frac{1}{\gamma }-\delta }\rightarrow 0\) by assumption (H8 b). Finally, \(M_3 \rightarrow 0\) since the first two terms in the definition of \(M_3\) go to 0 by assumption (H5), \(\frac{\lambda p}{n \eta ^{\gamma }}\frac{\sqrt{p}}{\sqrt{n}}\rightarrow 0\) by assumptions (H5) and (H8 c), \(\frac{p}{n}\frac{\lambda n^{1+\delta \gamma }}{\sqrt{n}}\sqrt{p}= \lambda n^{\delta \gamma -\frac{1}{2}} p \sqrt{p}\rightarrow 0\) by assumption (H8 a) and \(\frac{p}{n}\frac{\lambda }{\eta ^{\gamma }}\rightarrow 0\) by assumption (H8 c).
Now we prove that Eq. (48) holds. It is sufficient to observe that, as \(n\rightarrow \infty \), the bound in Eq. (35) goes to 0 by assumption (H5) and the bound in Eq. (65) goes to 0 by assumption (H8 c). \(\square \)
Guastavino, S., Benvenuto, F. A consistent and numerically efficient variable selection method for sparse Poisson regression with applications to learning and signal recovery. Stat Comput 29, 501–516 (2019). https://doi.org/10.1007/s11222-018-9819-1
Keywords
- Adaptive regularization
- Lasso
- Model selection
- Sparse Poisson regression
- Statistical learning
- Image processing