On the interplay between acceleration and identification for the proximal gradient algorithm

Abstract

In this paper, we study the interplay between acceleration and structure identification for the proximal gradient algorithm. While acceleration is generally beneficial in terms of functional decrease, we report and analyze several cases where its interplay with identification has negative effects on the algorithm behavior (iterates oscillation, loss of structure, etc.). Then, we present a generic method that tames acceleration when structure identification may be at stake; it benefits from a convergence rate that matches the one of the accelerated proximal gradient under some qualifying condition. We show empirically that the proposed method is much more stable in terms of subspace identification compared to the accelerated proximal gradient method while keeping a similar functional decrease.

Notes

  1. For instance, if g is the absolute value on \(\mathbb {R}\), the proximity operator of g is equal to 0 for all \(x\in [-1,1]\) (and \(x-\mathrm {sign}(x)\) elsewhere). Thus, if some sequence \(x_k\) converges to a point within \((-1,1)\), then the output of the proximity operator of g at \(x_k\) will be equal to 0 for all k greater than some finite, but unknown, K (a code sketch of this operator is given after these notes).

  2. That is, for all \(k>0\), \(\Vert x_{k+1}-x^\star \Vert \le \Vert x_k-x^\star \Vert\) for any minimizer \(x^\star \in {{\,\mathrm{argmin}\,}}_x F(x)\).

  3. Other examples include problems regularized with nuclear norm, for which the target structure is the rank of the current (matrix) iterate \(x\in \mathbb {R}^{m_1\times m_2}\). Manifolds of interest are defined as \(\mathsf {M}_i =\{x\in \mathbb {R}^{m_1\times m_2} : {\text {rank}}(x)=i\}\), and the collection is \(\mathsf {M}=\{\mathsf {M}_0, \ldots , \mathsf {M}_{\min (m_1,m_2)}\}\).

  4. Note that we are only interested here in the transient phase of identification when the iterates are in the process of reaching the optimal manifold. The behavior of the accelerated proximal gradient is usually better than the vanilla version before this phase and after identification; see e.g. the recent analyses of [2, 29] and references therein.

  5. \(g = \max (\Vert \cdot \Vert _{p}-1, 0)\) for p in \(\{1.3, 2.6\}\), whose proximity operator is \(\mathbf {prox}_{\gamma g}(u) = u (1-\gamma / \Vert u\Vert _p)\) if \(\Vert u\Vert _p > 1+\gamma\), \(u/\Vert u\Vert _p\) if \(1 \le \Vert u\Vert _p \le 1+\gamma\), and u otherwise (this formula is implemented in the code sketch after these notes).

  6. This functional evaluation is actually only needed if the function is sharp with \(p=1\) in Assumption 3. For instance, this is the case when \(f\equiv 0\) and \(g(x) = \Vert x\Vert _1\), in which case \(F(x)=\Vert x\Vert _1\) verifies Assumption 3 with \(\beta =1, p=1\). In these rather degenerate cases, the proximal gradient converges in a finite number of steps.

  7. These properties are often supposed to hold only locally but can be globalized easily [11, Sec. 2.4, Cor. 6].

  8. An example of such a computation is performed for the lasso problem \(F(x) = \frac{1}{2} \left\| Ax-b\right\| _2^2 + \lambda _1 \Vert x\Vert _1\) in [11, Lem. 10].
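
The two proximity operators appearing in notes 1 and 5 are straightforward to implement. The sketch below is an illustration only (Python/NumPy; the language choice and the function names are ours, not the paper's): soft-thresholding for the absolute value, whose output is exactly 0 once the input lies in \([-\gamma , \gamma ]\), and the radial formula of note 5, which snaps any point of the annulus \(1\le \Vert u\Vert _p\le 1+\gamma\) exactly onto the unit p-sphere (the formula is exact for \(p=2\)).

```python
import numpy as np

def prox_abs(x, gamma=1.0):
    """Proximity operator of gamma*|.| (soft-thresholding); with gamma = 1 it
    returns 0 on [-1, 1] and x - sign(x) elsewhere, as in note 1."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_ball_penalty(u, gamma, p=2.0):
    """Radial formula of note 5 for g = max(||.||_p - 1, 0); exact for p = 2."""
    nrm = np.linalg.norm(u, ord=p)
    if nrm > 1.0 + gamma:
        return u * (1.0 - gamma / nrm)
    if nrm >= 1.0:
        return u / nrm   # snapped exactly onto the unit p-sphere
    return u             # already inside the unit p-ball: left untouched

# identification at 0: once the input enters [-1, 1], the output is exactly 0
print([float(prox_abs(x)) for x in (1.75, 1.25, 0.75, -0.25)])   # [0.75, 0.25, 0.0, -0.0]

# identification of the sphere: a point of the annulus is projected onto it
u = np.array([0.7, 0.8])                                 # ||u||_2 ~ 1.063, in [1, 1.2]
print(np.linalg.norm(prox_ball_penalty(u, gamma=0.2)))   # = 1 (up to rounding)
```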

References

  1. Alvarez, F., Attouch, H.: An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set Valued Anal. 9(1–2), 3–11 (2001)

  2. Apidopoulos, V., Aujol, J.F., Dossal, C.: Convergence rate of inertial forward-backward algorithm beyond Nesterov's rule. Math. Prog. 180(1), 137–156 (2020)

  3. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than \(1/k^2\). SIAM J. Optim. 26(3), 1824–1834 (2016)

  4. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)

  5. Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM, Philadelphia (2017)

  6. Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Process. 18(11), 2419–2434 (2009)

  7. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  8. Bertsekas, D.: On the Goldstein–Levitin–Polyak gradient projection method. IEEE Trans. Automat. Control 21(2), 174–184 (1976)

  9. Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)

  10. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)

  11. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Prog. 165(2), 471–507 (2017)

  12. Burke, J.V., Moré, J.J.: On the identification of active constraints. SIAM J. Numer. Anal. 25(5), 1197–1211 (1988)

  13. Candes, E.J., Romberg, J.K., Tao, T.: Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8), 1207–1223 (2006)

  14. Catalina, A., Alaíz, C.M., Dorronsoro, J.R.: Revisiting FISTA for lasso: acceleration strategies over the regularization path. In: ESANN (2018)

  15. Chambolle, A., Dossal, C.: On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. J. Optim. Theory Appl. 166(3), 968–982 (2015)

  16. Daniilidis, A., Hare, W., Malick, J.: Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization 55(5–6), 481–503 (2006)

  17. Donoho, D.L.: De-noising by soft-thresholding. IEEE Trans. Inf. Theory 41(3), 613–627 (1995)

  18. Drusvyatskiy, D., Lewis, A.S.: Optimality, identifiability, and sensitivity. Math. Prog. 147(1–2), 467–498 (2014)

  19. Fadili, J., Malick, J., Peyré, G.: Sensitivity analysis for mirror-stratifiable convex functions. SIAM J. Optim. 28(4), 2975–3000 (2018)

  20. Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: 53rd IEEE Conference on Decision and Control, pp. 5058–5063. IEEE (2014)

  21. Hare, W.L., Lewis, A.S.: Identifying active constraints via partial smoothness and prox-regularity. J. Convex Anal. 11(2), 251–266 (2004)

  22. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)

  23. Ito, N., Takeda, A., Toh, K.C.: A unified formulation and fast accelerated proximal gradient method for classification. J. Mach. Learn. Res. 18(1), 510–558 (2017)

  24. Iutzeler, F., Hendrickx, J.M.: A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optim. Methods Softw. 34(2), 383–405 (2019)

  25. Iutzeler, F., Malick, J.: On the proximal gradient algorithm with alternated inertia. J. Optim. Theory Appl. 176(3), 688–710 (2018)

  26. Lewis, A.S.: Active sets, nonsmoothness, and sensitivity. SIAM J. Optim. 13(3), 702–725 (2002)

  27. Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. In: Advances in Neural Information Processing Systems, pp. 379–387 (2015)

  28. Liang, J., Fadili, J., Peyré, G., Luke, R.: Activity identification and local linear convergence of Douglas–Rachford/ADMM under partial smoothness. In: International Conference on Scale Space and Variational Methods in Computer Vision, pp. 642–653. Springer (2015)

  29. Liang, J., Fadili, J., Peyré, G.: Activity identification and local linear convergence of forward-backward-type methods. SIAM J. Optim. 27(1), 408–437 (2017)

  30. Liang, J., Schönlieb, C.B.: Improving “fast iterative shrinkage-thresholding algorithm”: faster, smarter and greedier. arXiv preprint arXiv:1811.01430 (2018)

  31. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(\cal{O} (1/k^{2})\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)

  32. Nguyen, T.P.: Kurdyka–Łojasiewicz and convexity: algorithms and applications. Ph.D. thesis, Toulouse University (2017)

  33. O'Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)

  34. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  35. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)

  36. Poon, C., Liang, J.: Trajectory of alternating direction method of multipliers and adaptive acceleration. In: Advances in Neural Information Processing Systems, pp. 7355–7363 (2019)

  37. Poon, C., Liang, J., Schoenlieb, C.: Local convergence properties of SAGA/Prox-SVRG and acceleration. In: International Conference on Machine Learning, pp. 4124–4132 (2018)

  38. Scheinberg, K., Goldfarb, D., Bai, X.: Fast first-order methods for composite convex optimization with backtracking. Found. Comput. Math. 14(3), 389–417 (2014)

  39. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58(1), 267–288 (1996)

  40. Vaiter, S., Peyré, G., Fadili, J.: Low complexity regularization of linear inverse problems. In: Sampling Theory, a Renaissance, pp. 103–153. Springer-Birkhäuser (2015)

  41. Vaiter, S., Peyré, G., Fadili, J.: Model consistency of partly smooth regularizers. IEEE Trans. Inf. Theory 64(3), 1725–1737 (2017)

Acknowledgements

The authors would like to thank the AE and the two anonymous reviewers for their comments on the tone of the article and their suggestions for future work. FI benefited from the support of the ANR JCJC project STROLL (ANR-19-CE23-0008).

Corresponding author

Correspondence to Franck Iutzeler.

Appendices

Appendix 1: Obtaining the qualifying condition

The general qualifying condition (QC) can be recovered under the setting of partial smoothness of g and a usual non-degeneracy condition at the optimal point \(x^\star\) and associated point \(u^\star\) such that \(x^\star = \mathbf {prox}_{\gamma g}(u^\star )\).

Definition 1

(Relative interior) The relative interior of a subset S, denoted \(\mathrm {ri}~S\), is defined as

$$\begin{aligned} \mathrm {ri}~S = \{x\in S : \exists \varepsilon > 0 , {\mathcal {B}}(x, \varepsilon )\cap {\text {Aff}}S \subseteq S \} \end{aligned}$$

where \({\text {Aff}}\) denotes the affine hull of a set.

Definition 2

(Parallel space) For a connected subset A, the parallel space of A, denoted \(\mathrm {par}~A\), is defined as the vector space parallel to the affine space generated by A.

Definition 3

(Manifold) A subset \(\mathsf {M}\) of \(\mathbb {R}^n\) is said to be a p-dimensional \(\mathcal C^k\)-submanifold of \(\mathbb {R}^n\) around \(x \in \mathsf {M}\) (\(1 \le k \le +\infty\)) if there exists a local parameterization of \(\mathsf {M}\) around x, that is, a \({\mathcal {C}}^k\) function \(\varphi : \mathbb {R}^p \rightarrow \mathbb {R}^n\) such that \(\varphi\) realizes a local homeomorphism between a neighborhood of \(0 \in \mathbb {R}^p\) and a neighborhood of \(x \in \mathsf {M}\) and the derivative of \(\varphi\) at \(\varphi ^{-1}(x)=0\) is injective.

A p-dimensional \({\mathcal {C}}^k\)-submanifold of \(\mathbb {R}^n\) can alternatively be defined via a local equation, that is, a \(\mathcal C^k\) function \(\varPhi : \mathbb {R}^n \rightarrow \mathbb {R}^{n-p}\) with a surjective derivative at \({\bar{x}} \in \mathsf {M}\), that satisfies for all x close enough to \({\bar{x}}\): \(x \in \mathsf {M}\Leftrightarrow \varPhi (x) = 0\).

To lighten notations, we will only indicate the dimensionality and the smoothness degree of a manifold when it is relevant to the discussion.

Definition 4

(Tangent space) Given a point x living in a manifold \(\mathsf {M}\), defined either by a local parametrization \(\varphi\) or a local equation \(\varPhi\), the tangent space of \(\mathsf {M}\) at x, denoted \(T_\mathsf {M}(x)\), is defined as:

$$\begin{aligned} T_\mathsf {M}(x) = \text {Im}~ D_\varphi (0) = \text {Ker}~ D_\varPhi (x) . \end{aligned}$$
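
As an elementary illustration of Definitions 3 and 4 (added here as an example): the unit Euclidean sphere \(\mathsf {M}= \{x\in \mathbb {R}^n : \Vert x\Vert _2 = 1\}\) is an \((n-1)\)-dimensional \({\mathcal {C}}^\infty\)-submanifold with local equation \(\varPhi (x) = \Vert x\Vert _2^2 - 1\), whose derivative \(D_\varPhi (x) = 2x^\top\) is surjective at any \(x\ne 0\); hence

$$\begin{aligned} T_\mathsf {M}(x) = \text {Ker}~ D_\varPhi (x) = \{v\in \mathbb {R}^n : \langle x, v\rangle = 0\} . \end{aligned}$$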

Definition 5

(Partly smooth function) Let \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be a proper, convex, lower semi-continuous function. g is said to be partly smooth at x relative to a manifold \(\mathsf {M}\) containing x if \(\partial g(x)\ne \emptyset\), and moreover

  • Smoothness: \(\mathsf {M}\) is a \({\mathcal {C}}^2\)-manifold around x, g restricted to \(\mathsf {M}\) is \({\mathcal {C}}^2\) around x;

  • Sharpness: The tangent space \(T_\mathsf {M}(x)\) coincides with \(\mathrm {par} (\partial g(x))^\perp\);

  • Continuity: The set-valued mapping \(\partial g\) is continuous at x relative to \(\mathsf {M}\).
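
As a classical example (recalled here for illustration): \(g = \Vert \cdot \Vert _1\) is partly smooth at any \(x\) relative to the subspace \(\mathsf {M}= \{y\in \mathbb {R}^n : y_i = 0 \text { for all } i \text { with } x_i = 0\}\). Indeed, \(\mathsf {M}\) is a \({\mathcal {C}}^2\)-manifold on which g is linear around x (smoothness); \(\partial g(x) = \{u : u_i = \mathrm {sign}(x_i) \text { if } x_i\ne 0, |u_i|\le 1 \text { otherwise}\}\), so that \(\mathrm {par}~\partial g(x) = \mathsf {M}^\perp\) and \(T_\mathsf {M}(x) = \mathsf {M}= (\mathrm {par}~\partial g(x))^\perp\) (sharpness); finally, \(\partial g(y) = \partial g(x)\) for all \(y\in \mathsf {M}\) close enough to x, which gives the continuity.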

Lemma 4

Let \(\mathsf {M}\) be a manifold and \(u^\star\) a point such that \(x^\star = \mathbf {prox}_{\gamma g}(u^\star )\in \mathsf {M}\). If g is partly smooth at \(x^\star\) relative to \(\mathsf {M}\) and \(\frac{u^\star -x^\star }{\gamma } \in \mathrm {ri}~\partial g(x^\star )\), then (QC) holds.

Proof

The proof follows the reasoning of [16, Lem. 27, Th. 28]. Let us define the function \(\rho (u, x) = g(x) + \frac{1}{2\gamma }\Vert x-u\Vert ^2\). Since g is partly smooth at \(x^\star\) relative to \(\mathsf {M}\), the function \((u, x) \mapsto g(x)\) is also partly smooth at \((u^\star , x^\star )\) relative to \(\mathbb {R}^n\times \mathsf {M}\) [26, Prop. 4.5]. Since the addition of a smooth function does not change this property [26, Cor. 4.6], \(\rho\) is partly smooth at \((u^\star , x^\star )\) relative to \(\mathbb {R}^n\times \mathsf {M}\).

Besides, defining the parametrized function \(\rho _{u^\star }(\cdot ) = \rho (u^\star , \cdot )\), we have that

  (i)

    \(x^\star = \mathbf {prox}_{\gamma g}(u^\star ) = {{\,\mathrm{argmin}\,}}_{x\in \mathbb {R}^n} \{ g(x) + \frac{1}{2\gamma } \Vert x - u^\star \Vert ^2 \} = {{\,\mathrm{argmin}\,}}_{x\in \mathbb {R}^n} \rho _{u^\star } ( x )\) and as \(\rho _{u^\star }\) is \(1/\gamma\)-strongly convex, we have for all \(x\in \mathbb {R}^n\), \(\rho _{u^\star }(x) \ge \rho _{u^\star }(x^\star ) + \frac{1}{2\gamma }\Vert x-x^\star \Vert ^2\);

  (ii)

    the qualifying constraint \(\frac{u^\star -x^\star }{\gamma } \in \mathrm {ri}~\partial g(x^\star )\) implies that \(0\in \mathrm {ri}~\left( \partial g(x^\star ) + \frac{x^\star -u^\star }{\gamma } \right) = \mathrm {ri}~\partial \rho _{u^\star }(x^\star )\);

which enable us to use [16, Lem. 27] (or equivalently [21, Th. 3.2 and 6.2i]).

Thus, there exists a neighborhood \({\mathcal {B}}(u^\star , \varepsilon )\) of \(u^\star\) and a function \(\varPhi\) such that for all u in \({\mathcal {B}}(u^\star , \varepsilon )\), \(\varPhi (u)\in \mathsf {M}\) and is a critical point of \(\rho _u\) restricted to a neighborhood of \(x^\star\). Since \(\rho _u\) is convex, \(\varPhi (u)\) is actually the global optimum of \(\rho _u\). Therefore, we have that for all u in \(\mathcal B(u^\star , \varepsilon )\), \(\mathbf {prox}_{\gamma g}(u)\in \mathsf {M}\) which is exactly (QC). \(\square\)
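
The conclusion of Lemma 4 is easy to check numerically in the \(\ell _1\) case (a Python/NumPy sketch added for illustration; the point \(u^\star\), the radius and the helper names are arbitrary choices of ours). When \(\frac{u^\star -x^\star }{\gamma }\in \mathrm {ri}~\partial g(x^\star )\), which for \(g=\Vert \cdot \Vert _1\) amounts to \(|u^\star _i|<\gamma\) on the zero entries of \(x^\star\), the proximity operator maps a whole ball around \(u^\star\) onto the manifold of vectors sharing the support of \(x^\star\):

```python
import numpy as np

def soft_threshold(u, gamma):
    """prox of gamma * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

gamma = 1.0
u_star = np.array([2.5, -1.8, 0.4, -0.2, 0.0])   # |u_i| < gamma on the future zeros
x_star = soft_threshold(u_star, gamma)
support = x_star != 0
print("supp(x*):", np.flatnonzero(support))      # [0 1]

# (QC): the prox maps a whole ball around u* onto M = {x : supp(x) = supp(x*)}
rng = np.random.default_rng(0)
eps = 0.1   # small enough: off-support entries stay below gamma, on-support ones keep their sign
for _ in range(1000):
    u = u_star + eps * rng.uniform(-1.0, 1.0, size=u_star.shape)
    assert np.array_equal(soft_threshold(u, gamma) != 0, support)
print("the support is identified on the whole ball B(u*, eps)")
```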

Appendix 2: Reminders on the (accelerated) proximal gradient descent

We recall here the descent lemma for the composite objective function F, which is central to the analysis of any first-order method.

Lemma 5

(Descent lemma) Let \(\gamma >0\); then the following inequalities hold for any \(x, y \in {\mathbb {R}}^n\):

$$\begin{aligned} F\left( \mathcal {T}_\gamma (x)\right) + \frac{(1-\gamma L)}{2 \gamma }\left\| \mathcal {T}_\gamma (x)-x\right\| ^{2}+\frac{1}{2 \gamma }\left\| \mathcal {T}_\gamma (x)-y\right\| ^{2}&\le F(y) + \frac{1}{2 \gamma }\Vert x-y\Vert ^{2}\\ F\left( \mathcal {T}_\gamma (x)\right) + \frac{(2-\gamma L)}{2 \gamma }\left\| \mathcal {T}_\gamma (x)-x\right\| ^{2} + \frac{1}{ \gamma } \langle x-y, \mathcal {T}_\gamma (x)-x \rangle&\le F(y) \end{aligned}$$

Proof

The first inequality is directly equivalent to [15, Lem. 1]. The second is derived from the identity \(\Vert \mathcal {T}_\gamma (x)-x\Vert ^2 + \Vert y-x\Vert ^2 = 2\langle \mathcal {T}_\gamma (x)-x, y-x \rangle + \Vert y-\mathcal {T}_\gamma (x)\Vert ^2\). \(\square\)
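
These two inequalities are easy to check numerically; the sketch below (Python/NumPy, added as an illustration; the lasso instance \(f(x)=\frac{1}{2}\Vert Ax-b\Vert _2^2\), \(g=\lambda \Vert \cdot \Vert _1\) and the problem sizes are arbitrary choices of ours, not data from the paper) evaluates both sides of Lemma 5 on random pairs \((x, y)\):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50)); b = rng.standard_normal(20); lam = 0.1
L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient of f
gamma = 1.0 / L

def F(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def T(x):                              # proximal gradient operator T_gamma
    z = x - gamma * A.T @ (A @ x - b)
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

for _ in range(1000):
    x, y = rng.standard_normal(50), rng.standard_normal(50)
    Tx = T(x)
    lhs1 = F(Tx) + (1 - gamma * L) / (2 * gamma) * np.sum((Tx - x) ** 2) \
                 + np.sum((Tx - y) ** 2) / (2 * gamma)
    rhs1 = F(y) + np.sum((x - y) ** 2) / (2 * gamma)
    lhs2 = F(Tx) + (2 - gamma * L) / (2 * gamma) * np.sum((Tx - x) ** 2) \
                 + np.dot(x - y, Tx - x) / gamma
    assert lhs1 <= rhs1 + 1e-6 and lhs2 <= F(y) + 1e-6
print("both inequalities of Lemma 5 hold on 1000 random pairs")
```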

We now give an accelerated descent lemma for the composite objective F, which is fundamental in the analysis of our provisional algorithm in Sect. 3.1.

Lemma 6

(Accelerated descent lemma) Let Assumptions 1 and 2 hold. For any \(\gamma >0\), any pair of points \((x_k, y_k)\) and for \(x^\star\) a fixed point of \({\mathcal {T}}_{\gamma }\) (i.e. a minimizer of F), we have (independently of the acceleration step)

$$\begin{aligned} t_{k}^2 v_k - t_{k-1}^2 v_{k-1} \le&-\frac{1-\gamma L}{2 \gamma } \Vert t_k x_{k+1}- t_k y_k\Vert ^2 \\&- \frac{1}{2\gamma }\Vert t_k x_{k+1} - (t_k-1)x_k -x^\star \Vert ^2 + \frac{1}{2\gamma } \Vert t_k y_k - (t_k-1)x_k -x^\star \Vert ^2 , \end{aligned}$$

where \(x_{k+1} = {\mathcal {T}}_{\gamma }(y_k)\), and \(v_k:=F(x_{k+1})-F^\star\).

Proof

The proof follows the same global layout as that of FISTA [7]. Using Lemma 5 at \((x=y_k, y=x_k)\) and \((x=y_k, y=x^\star )\),

$$\begin{aligned} v_k-v_{k-1}&= F(x_{k+1}) - F(x_k) \le -\frac{2-\gamma L}{2 \gamma }\Vert x_{k+1}-y_k\Vert ^2 - \frac{1}{ \gamma } \langle y_k-x_k, x_{k+1}-y_k \rangle \\ v_k&= F(x_{k+1}) - F^\star \le -\frac{2-\gamma L}{2 \gamma }\Vert x_{k+1}-y_k\Vert ^2 - \frac{1}{ \gamma } \langle y_k-x^\star , x_{k+1}-y_k \rangle \end{aligned}$$

The first inequality multiplied by \((t_k-1)\) and added to the second yields:

$$\begin{aligned} t_k v_k - (t_k-1)v_{k-1}&\le -\frac{2-\gamma L}{2 \gamma } t_k \Vert x_{k+1}-y_k\Vert ^2 - \frac{1}{ \gamma } \langle t_k y_k - (t_k-1)x_k -x^\star , x_{k+1}-y_k \rangle \end{aligned}$$

Multiplying by \(t_k\), using the relation \(t_k^2-t_k \le t_{k-1}^2\) from Assumption 2(i), we get:

$$\begin{aligned} t_{k}^2 v_k - t_{k-1}^2 v_{k-1}&\le -\frac{2-\gamma L}{2 \gamma } \Vert t_k x_{k+1}- t_k y_k\Vert ^2 - \frac{1}{ \gamma } \langle t_k y_k - (t_k-1)x_k -x^\star , t_k x_{k+1}- t_k y_k \rangle \end{aligned}$$

Applying the identity \(\Vert c-a\Vert ^2+2\langle a-b,c-a\rangle = \Vert b-c\Vert ^2 - \Vert b-a\Vert ^2\) to the inner product of the last identity yields the result. \(\square\)
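
To illustrate how Lemma 6 is typically used (a standard argument, sketched here under the additional assumptions that \(\gamma \le 1/L\), that the extrapolation is the classical FISTA one \(y_k = x_k + \frac{t_{k-1}-1}{t_k}(x_k-x_{k-1})\), and that \(t_k\ge (k+1)/2\), which holds for the usual sequence \(t_k = (1+\sqrt{1+4t_{k-1}^2})/2\)): this choice gives \(t_k y_k - (t_k-1)x_k = t_{k-1}x_k - (t_{k-1}-1)x_{k-1}\), so that, with \(u_k := t_{k-1}x_k - (t_{k-1}-1)x_{k-1} - x^\star\), the first term of the right-hand side is non-positive and the lemma telescopes:

$$\begin{aligned} t_k^2 v_k - t_{k-1}^2 v_{k-1} \le \frac{1}{2\gamma }\left( \Vert u_k\Vert ^2 - \Vert u_{k+1}\Vert ^2\right) \quad \text {hence} \quad t_K^2 v_K \le t_0^2 v_0 + \frac{1}{2\gamma }\Vert u_1\Vert ^2 , \end{aligned}$$

which yields \(F(x_{K+1}) - F^\star = v_K \le \left( t_0^2 v_0 + \Vert u_1\Vert ^2/(2\gamma )\right) /t_K^2 = \mathcal {O}(1/K^2)\).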

Appendix 3: About error bounds

In order to connect the functional suboptimality with the behavior of the iterates, the geometry of the function can be used. For a proper convex lower-semicontinuous function F achieving its minimum \(F^\star\), two common variants of geometric inequalities are Error Bounds and Kurdyka-Łojasiewicz inequalities [10, 11, 32]. For any \(x \notin {{\,\mathrm{argmin}\,}}F\) (see footnote 7), they respectively write

  • Error Bound: \(\varphi (F(x)-F^\star )\ge {{\,\mathrm{dist}\,}}(x,{{\,\mathrm{argmin}\,}}F)\)

  • Kurdyka-Łojasiewicz (KL): \({{\,\mathrm{dist}\,}}(0,\partial F(x))\ge 1/\varphi '(F(x)-F^\star )\)

where \(\varphi (t) = Ct^\theta /\theta\) with \(C>0\) and \(\theta \in (0,1]\) is called a desingularizing function. These properties are widely satisfied (e.g. by any semi-algebraic function [10]) but the desingularizing function (more particularly \(\theta\)) is often hard to estimate (see footnote 8).

It is easy to see that Assumption 3 is a (global) error bound with \(\varphi (t) = (1/\beta )^{1/p} t^{1/p}\) (matching the definition of \(\varphi\) with \(C = 1/p (1/\beta )^{1/p}>0\) and \(\theta =1/p\in (0,1]\)) but the knowledge of the constants is not necessary.
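
As a simple worked instance (added here for illustration): for \(F(x) = \frac{1}{2}\Vert x\Vert _2^2\), we have \(F^\star = 0\), \({{\,\mathrm{argmin}\,}}F = \{0\}\) and \(F(x)-F^\star = \frac{1}{2}{{\,\mathrm{dist}\,}}^2(x,{{\,\mathrm{argmin}\,}}F)\), so Assumption 3 holds with \(\beta = 1/2\) and \(p=2\). The desingularizing function is then \(\varphi (t) = \sqrt{2t}\) (i.e. \(C=1/\sqrt{2}\), \(\theta = 1/2\)) and both inequalities hold with equality:

$$\begin{aligned} \varphi (F(x)-F^\star ) = \Vert x\Vert = {{\,\mathrm{dist}\,}}(x,{{\,\mathrm{argmin}\,}}F) \quad \text {and}\quad {{\,\mathrm{dist}\,}}(0,\partial F(x)) = \Vert x\Vert = \frac{1}{\varphi '(F(x)-F^\star )} . \end{aligned}$$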

Then, an important result about Error Bounds and KL inequalities is that they are equivalent with the same desingularizing function [11, Th. 5], which allows us to obtain the following result for the proximal gradient operator.

Lemma 7

Let Assumptions 1 and 3 hold. Then, there exists a constant \(B>0\) such that for all x

$$\begin{aligned} \Vert x-\mathsf {T}_\gamma (x) \Vert \ge B ~ {{\,\mathrm{dist}\,}}^{p-1} ~(x,{{\,\mathrm{argmin}\,}}F) . \end{aligned}$$

Proof

First, let us notice that the KL and error bound properties can be combined as

$$\begin{aligned} {{\,\mathrm{dist}\,}}(0,\partial F(x)) \ge \frac{1}{C} ( F(x)-F^\star )^{1-\theta } \ge \frac{1}{C} \left( \frac{\theta }{C}\right) ^{\frac{1-\theta }{\theta }} \left( {{\,\mathrm{dist}\,}}(x,{{\,\mathrm{argmin}\,}}F)\right) ^{\frac{1-\theta }{\theta }} . \end{aligned}$$

Furthermore, using that \({{\,\mathrm{dist}\,}}(0,\partial F(\mathsf {T}_\gamma (x) )) \le \frac{L\gamma + 1}{\gamma } \Vert x-\mathsf {T}_\gamma (x) \Vert\) for all \(x\in \mathbb {R}^n\) (see e.g. [11, Sec. 4.1]), we get that

$$\begin{aligned} \Vert x-\mathsf {T}_\gamma (x) \Vert \ge \frac{\gamma }{C(L\gamma +1)} \left( \frac{\theta }{C}\right) ^{\frac{1-\theta }{\theta }} \left( {{\,\mathrm{dist}\,}}(x,{{\,\mathrm{argmin}\,}}F)\right) ^{\frac{1-\theta }{\theta }} , \end{aligned}$$

which, combined with the fact that \(\theta =1/p\), gives the claimed result. \(\square\)

Appendix 4: Proof of Corollary 1

Proof

First, Theorem 3.1 or Theorem 3.2 gives the convergence of the sequence \((x_k)\) to the unique minimizer \(x^\star\) and finite-time identification, since the partial smoothness assumption of Lemma 1 implies, by Theorem 4, the qualifying constraint (QC) of both theorems.

In order to show the linear convergence behavior, we follow [29] and [35, Chap. 2.1.2] and apply their first-order analysis of the accelerated proximal gradient to (Prov. Alg.). We are in the same setting and have compatible assumptions, so that we can apply [29, Prop. 4.5] to (Prov. Alg.) as soon as identification has occurred, since both tests return acceleration after that moment. We have, for \(r_k=x_k-x^\star\), \(d_k = [r_{k-1} \quad r_k]^T\):

$$\begin{aligned} d_{k+1} = M d_k + e_k, \qquad \Vert e_k\Vert = o(\Vert d_k\Vert ) \end{aligned}$$
(4.1)

with M a matrix of spectral radius \(\rho (M)<1\) since \(\gamma \in (0, 1/L]\); see [29, Cor. 4.9, Rem. 4.10].

Take any \(\varepsilon \in (0,(1-\rho (M))/2)\) and let \(K_2>0\) be the smallest time from which i) identification holds; and ii) \(\Vert e_k\Vert / \Vert d_k\Vert \le \varepsilon\). \(K_2\) is finite since identification happens in finite time and \(\Vert e_k\Vert = o(\Vert d_k\Vert )\).

Unrolling Eq. (4.1) yields, for \(k\ge 0\),

$$\begin{aligned} d_{K_2+k+1} = M^k d_{K_2+1} + \sum _{l=1}^k M^{k-l}e_{K_2+l}. \end{aligned}$$

Taking norms and applying the triangle inequality yields

$$\begin{aligned} \Vert d_{K_2+k+1}\Vert&\le \Vert M^k\Vert \Vert d_{K_2+1}\Vert + \sum _{l=1}^k \Vert M^{k-l}\Vert \Vert e_{K_2+l}\Vert \le C_M \rho ^k \Vert d_{K_2+1}\Vert + C_M \sum _{l=1}^k \rho ^{k-l}\Vert e_{K_2+l}\Vert \end{aligned}$$

where \(\rho :=\rho (M) +\varepsilon\) and \(C_M>0\) is a constant such that \(\Vert M^k\Vert \le C_M \rho ^k\) for all k (see [22, Cor. 5.6.13]).

Let us denote \(w_{k+1} = C_M^{-1} (\rho + \varepsilon )^{-k}\Vert d_{K_2+k+1}\Vert\) (note that \(\rho + \varepsilon = \rho (M) + 2\varepsilon \in (0,1)\)) and show that this sequence is uniformly bounded by a constant \(C_w\). The initialization is clear. Suppose that \(w_l\le C_w\) for all \(l\le k\); multiplying the previous inequality by \(C_M^{-1} (\rho +\varepsilon )^{-k}\) yields

$$\begin{aligned} w_{k+1}&\le \left( \frac{\rho }{\rho +\varepsilon }\right) ^k \Vert d_{K_2+1}\Vert + \frac{1}{\rho +\varepsilon }\sum _{l=1}^k \left( \frac{\rho }{\rho +\varepsilon }\right) ^{k-l} \frac{\Vert d_{K_2+l}\Vert }{(\rho +\varepsilon )^{l-1}}\frac{\Vert e_{K_2+l}\Vert }{\Vert d_{K_2+l}\Vert } \\&= \left( \frac{\rho }{\rho +\varepsilon }\right) ^k w_1 + \frac{1}{\rho +\varepsilon }\sum _{l=1}^k \left( \frac{\rho }{\rho +\varepsilon }\right) ^{k-l} w_l \frac{\Vert e_{K_2+l}\Vert }{\Vert d_{K_2+l}\Vert } \end{aligned}$$

By the induction hypothesis and since \(\Vert e_{K_2+l}\Vert /\Vert d_{K_2+l}\Vert \le \varepsilon\) for \(l\ge 1\), we have

$$\begin{aligned} w_{k+1}&\le C_w \left( \frac{\rho }{\rho +\varepsilon }\right) ^k + C_w \frac{\varepsilon }{\rho +\varepsilon } \sum _{l=1}^k \left( \frac{\rho }{\rho +\varepsilon }\right) ^{k-l} \\&= C_w \left( \frac{\rho }{\rho +\varepsilon }\right) ^k + C_w \frac{\varepsilon }{\rho +\varepsilon } \frac{1-\left( \frac{\rho }{\rho +\varepsilon }\right) ^k}{1-\frac{\rho }{\rho +\varepsilon }} = C_w \end{aligned}$$

Therefore, for all \(k\ge 0\), \(\Vert d_{K_2+k+1}\Vert \le C_M C_w (\rho (M) + 2\varepsilon )^k \Vert d_{K_2+1}\Vert\) for any \(\varepsilon \in (0,(1-\rho (M))/2)\), which gives the claimed linear convergence. \(\square\)
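
The backbone of this argument, namely that a recursion \(d_{k+1} = Md_k + e_k\) with \(\rho (M)<1\) keeps a linear rate under small relative perturbations, can be illustrated numerically (a Python/NumPy sketch; the matrix M, the 1% perturbation level and the horizon are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
M = np.array([[0.85, 0.30],           # a non-symmetric matrix with
              [0.00, 0.80]])          # spectral radius rho(M) = 0.85 < 1
rho = max(abs(np.linalg.eigvals(M)))

d = np.array([1.0, -1.0])
norms = [np.linalg.norm(d)]
for k in range(200):
    e = 0.01 * np.linalg.norm(d) * rng.standard_normal(2)   # ||e_k|| proportional to ||d_k||
    d = M @ d + e
    norms.append(np.linalg.norm(d))

# the average contraction factor over the tail stays close to rho(M)
avg_rate = (norms[200] / norms[100]) ** (1.0 / 100)
print(f"rho(M) = {rho:.3f},  empirical rate over the last 100 steps = {avg_rate:.3f}")
```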

Additional illustrations

We provide here additional illustrations, Figs. 7 and 8, which consist of the same experiments as in Figs. 5a and 6 except for the starting point of the algorithms, which is taken here as the null vector (or matrix). The conclusions are quite similar to those of Sect. 4.2; the identification stability of the proposed methods is even more apparent.

Besides, we also report one further experiment where the solution is non-qualified and the structure-encoding manifolds have nonzero curvature. The smooth part f is again a least-squares function, while the non-smooth part is defined as \(g(x) = \max (0, \Vert x_{1:5}\Vert _{1.3}-1) + \ldots + \max (0, \Vert x_{45:50}\Vert _{1.3}-1)\) for \(x\in {\mathbb {R}}^{50}\). The structure-encoding manifolds of g are any Cartesian product \(\mathsf {M}= \times _{i=1}^{10} \mathsf {M}_i \subset {\mathbb {R}}^{50}\), where each manifold \(\mathsf {M}_i\) is either \({\mathbb {R}}^5\) or \(\{x\in {\mathbb {R}}^5 : \Vert x\Vert _{1.3}=1\}\) (i.e. the 1.3-norm unit sphere). Figure 9 displays the performance of the four algorithms of interest (stopped as soon as suboptimality gets below \(10^{-12}\)). None is able to recover the complete structure of the solution, which lives in \(\mathsf {M}^\star = \times _{i=1}^{10} \{x\in {\mathbb {R}}^5 : \Vert x\Vert _{1.3}=1\}\); this is not surprising for a non-qualified problem. The vanilla proximal gradient recovers half of the elementary manifolds, and \(\mathsf {T}^2\) recovers 4 of them but much faster. Interestingly, the accelerated proximal gradient and \(\mathsf {T}^1\) iterates are not able to retain even one of the manifolds they encountered. This behavior is consistent with the one noticed in Fig. 1b and suggests that the proximal gradient may have more robust identification properties than the accelerated proximal gradient on non-qualified problems. The method based on the second test (\(\mathsf {T}^2\)) thus behaves comparably to the proximal gradient in terms of identification, while converging in the same number of iterations as the accelerated proximal gradient.
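
To make this last experiment concrete, here is a sketch of its non-smooth part and of the identification readout (Python/NumPy, added for illustration; the block prox uses the radial formula of note 5, which is exact only for \(p=2\), and the random least-squares data are of course not those of Fig. 9):

```python
import numpy as np

P, BLOCKS, BSIZE = 1.3, 10, 5          # ten blocks of five coordinates, p = 1.3

def prox_block(u, gamma, p=P):
    """Radial formula of note 5 for max(||.||_p - 1, 0) on a single block."""
    nrm = np.linalg.norm(u, ord=p)
    if nrm > 1.0 + gamma:
        return u * (1.0 - gamma / nrm)
    if nrm >= 1.0:
        return u / nrm                  # lands exactly on the unit p-sphere
    return u

def prox_g(x, gamma):
    """Block-separable prox of g(x) = sum_i max(||x_{5i+1:5(i+1)}||_p - 1, 0)."""
    return np.concatenate([prox_block(x[i * BSIZE:(i + 1) * BSIZE], gamma)
                           for i in range(BLOCKS)])

def identified_blocks(x, tol=1e-10):
    """Blocks whose current iterate lies (numerically) on its unit p-sphere."""
    return [i for i in range(BLOCKS)
            if abs(np.linalg.norm(x[i * BSIZE:(i + 1) * BSIZE], ord=P) - 1.0) <= tol]

# a tiny proximal-gradient run on a random least-squares fit (illustration only,
# with the approximate radial prox for p = 1.3)
rng = np.random.default_rng(3)
A = rng.standard_normal((60, BLOCKS * BSIZE)); b = rng.standard_normal(60)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2
x = rng.standard_normal(BLOCKS * BSIZE)
for _ in range(3000):
    x = prox_g(x - gamma * A.T @ (A @ x - b), gamma)
print("blocks currently on their p-sphere:", identified_blocks(x))
```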

Fig. 7

\(g=\Vert \cdot \Vert _1\)—5 runs w/ initialization at 0—120 manifolds to identify

Fig. 8

\(g=\Vert \cdot \Vert _*\)—1 run w/ null initialization—17 manifolds to identify

Fig. 9

\(g(x) = \max (0, \Vert x_{1:5}\Vert _{1.3}-1) + \ldots + \max (0, \Vert x_{45:50}\Vert _{1.3}-1)\)—1 run w/ random initialization—10 manifolds to identify

Cite this article

Bareilles, G., Iutzeler, F. On the interplay between acceleration and identification for the proximal gradient algorithm. Comput Optim Appl 77, 351–378 (2020). https://doi.org/10.1007/s10589-020-00218-7
