On the interplay between acceleration and identification for the proximal gradient algorithm

Abstract

In this paper, we study the interplay between acceleration and structure identification for the proximal gradient algorithm. While acceleration is generally beneficial in terms of functional decrease, we report and analyze several cases where its interplay with identification has negative effects on the algorithm behavior (iterates oscillation, loss of structure, etc.). Then, we present a generic method that tames acceleration when structure identification may be at stake; it benefits from a convergence rate that matches the one of the accelerated proximal gradient under some qualifying condition. We show empirically that the proposed method is much more stable in terms of subspace identification compared to the accelerated proximal gradient method while keeping a similar functional decrease.

Notes

  1. For instance, if g is the absolute value on \(\mathbb {R}\), the proximity operator of g is equal to 0 for all \(x\in [-1,1]\) (and \(x-\mathrm {sign}(x)\) elsewhere). Thus, if some sequence \(x_k\) converges to a point within \((-1,1)\), then the output of the proximity operator of g at \(x_k\) will be equal to 0 for all k greater than some finite, but unknown, K (a code sketch of this operator is given after these notes).

  2. That is, for all \(k>0\), \(\Vert x_{k+1}-x^\star \Vert \le \Vert x_k-x^\star \Vert\) for any minimizer \(x^\star \in {{\,\mathrm{argmin}\,}}_x F(x)\).

  3. Other examples include problems regularized with nuclear norm, for which the target structure is the rank of the current (matrix) iterate \(x\in \mathbb {R}^{m_1\times m_2}\). Manifolds of interest are defined as \(\mathsf {M}_i =\{x\in \mathbb {R}^{m_1\times m_2} : {\text {rank}}(x)=i\}\), and the collection is \(\mathsf {M}=\{\mathsf {M}_0, \ldots , \mathsf {M}_{\min (m_1,m_2)}\}\).

  4. Note that we are only interested here in the transient phase of identification when the iterates are in the process of reaching the optimal manifold. The behavior of the accelerated proximal gradient is usually better than the vanilla version before this phase and after identification; see e.g. the recent analyses of [2, 29] and references therein.

  5. \(g = \max (\Vert \cdot \Vert _{p}-1, 0)\) for p in \(\{1.3, 2.6\}\), whose proximity operator is \(\mathbf {prox}_{\gamma g}(u) = u (1-\gamma / \Vert u\Vert _p)\) if \(\Vert u\Vert _p > 1+\gamma\), \(u/\Vert u\Vert _p\) if \(1 \le \Vert u\Vert _p \le 1+\gamma\), and u otherwise (this formula is implemented in the code sketch after these notes).

  6. This functional evaluation is actually only needed if the function is sharp with \(p=1\) in Assumption 3. For instance, this is the case when \(f\equiv 0\) and \(g(x) = \Vert x\Vert _1\), in which case \(F(x)=\Vert x\Vert _1\) verifies Assumption 3 with \(\beta =1, p=1\). In these rather degenerate cases, the proximal gradient converges in a finite number of steps.

  7. These properties are often supposed to hold only locally but can be globalized easily [11, Sec. 2.4, Cor. 6].

  8. An example of such a computation is performed for the lasso problem \(F(x) = \frac{1}{2} \left\| Ax-b\right\| _2^2 + \lambda _1 \Vert x\Vert _1\) in [11, Lem. 10].
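
The two proximity operators appearing in notes 1 and 5 are straightforward to implement. The sketch below is an illustration only (Python/NumPy; the language choice and the function names are ours, not the paper's): soft-thresholding for the absolute value, whose output is exactly 0 once the input lies in \([-\gamma , \gamma ]\), and the radial formula of note 5, which snaps any point of the annulus \(1\le \Vert u\Vert _p\le 1+\gamma\) exactly onto the unit p-sphere (the formula is exact for \(p=2\)).

```python
import numpy as np

def prox_abs(x, gamma=1.0):
    """Proximity operator of gamma*|.| (soft-thresholding); with gamma = 1 it
    returns 0 on [-1, 1] and x - sign(x) elsewhere, as in note 1."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_ball_penalty(u, gamma, p=2.0):
    """Radial formula of note 5 for g = max(||.||_p - 1, 0); exact for p = 2."""
    nrm = np.linalg.norm(u, ord=p)
    if nrm > 1.0 + gamma:
        return u * (1.0 - gamma / nrm)
    if nrm >= 1.0:
        return u / nrm   # snapped exactly onto the unit p-sphere
    return u             # already inside the unit p-ball: left untouched

# identification at 0: once the input enters [-1, 1], the output is exactly 0
print([float(prox_abs(x)) for x in (1.75, 1.25, 0.75, -0.25)])   # [0.75, 0.25, 0.0, -0.0]

# identification of the sphere: a point of the annulus is projected onto it
u = np.array([0.7, 0.8])                                 # ||u||_2 ~ 1.063, in [1, 1.2]
print(np.linalg.norm(prox_ball_penalty(u, gamma=0.2)))   # = 1 (up to rounding)
```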

References

  1. Alvarez, F., Attouch, H.: An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set Valued Anal. 9(1–2), 3–11 (2001)

  2. Apidopoulos, V., Aujol, J.F., Dossal, C.: Convergence rate of inertial forward-backward algorithm beyond Nesterov's rule. Math. Prog. 180(1), 137–156 (2020)

  3. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than \(1/k^2\). SIAM J. Optim. 26(3), 1824–1834 (2016)

  4. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)

  5. Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM, Philadelphia (2017)

  6. Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Process. 18(11), 2419–2434 (2009)

  7. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  8. Bertsekas, D.: On the Goldstein–Levitin–Polyak gradient projection method. IEEE Trans. Automat. Control 21(2), 174–184 (1976)

  9. Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)

  10. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)

  11. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Prog. 165(2), 471–507 (2017)

  12. Burke, J.V., Moré, J.J.: On the identification of active constraints. SIAM J. Numer. Anal. 25(5), 1197–1211 (1988)

  13. Candes, E.J., Romberg, J.K., Tao, T.: Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8), 1207–1223 (2006)

  14. Catalina, A., Alaíz, C.M., Dorronsoro, J.R.: Revisiting FISTA for lasso: acceleration strategies over the regularization path. In: ESANN (2018)

  15. Chambolle, A., Dossal, C.: On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. J. Optim. Theory Appl. 166(3), 968–982 (2015)

  16. Daniilidis, A., Hare, W., Malick, J.: Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization 55(5–6), 481–503 (2006)

  17. Donoho, D.L.: De-noising by soft-thresholding. IEEE Trans. Inf. Theory 41(3), 613–627 (1995)

  18. Drusvyatskiy, D., Lewis, A.S.: Optimality, identifiability, and sensitivity. Math. Prog. 147(1–2), 467–498 (2014)

  19. Fadili, J., Malick, J., Peyré, G.: Sensitivity analysis for mirror-stratifiable convex functions. SIAM J. Optim. 28(4), 2975–3000 (2018)

  20. Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: 53rd IEEE Conference on Decision and Control, pp. 5058–5063. IEEE (2014)

  21. Hare, W.L., Lewis, A.S.: Identifying active constraints via partial smoothness and prox-regularity. J. Convex Anal. 11(2), 251–266 (2004)

  22. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)

  23. Ito, N., Takeda, A., Toh, K.C.: A unified formulation and fast accelerated proximal gradient method for classification. J. Mach. Learn. Res. 18(1), 510–558 (2017)

  24. Iutzeler, F., Hendrickx, J.M.: A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optim. Methods Softw. 34(2), 383–405 (2019)

  25. Iutzeler, F., Malick, J.: On the proximal gradient algorithm with alternated inertia. J. Optim. Theory Appl. 176(3), 688–710 (2018)

  26. Lewis, A.S.: Active sets, nonsmoothness, and sensitivity. SIAM J. Optim. 13(3), 702–725 (2002)

  27. Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. In: Advances in Neural Information Processing Systems, pp. 379–387 (2015)

  28. Liang, J., Fadili, J., Peyré, G., Luke, R.: Activity identification and local linear convergence of Douglas–Rachford/ADMM under partial smoothness. In: International Conference on Scale Space and Variational Methods in Computer Vision, pp. 642–653. Springer (2015)

  29. Liang, J., Fadili, J., Peyré, G.: Activity identification and local linear convergence of forward-backward-type methods. SIAM J. Optim. 27(1), 408–437 (2017)

  30. Liang, J., Schönlieb, C.B.: Improving “fast iterative shrinkage-thresholding algorithm”: faster, smarter and greedier. arXiv preprint arXiv:1811.01430 (2018)

  31. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(\cal{O} (1/k^{2})\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)

  32. Nguyen, T.P.: Kurdyka–Łojasiewicz and convexity: algorithms and applications. Ph.D. thesis, Toulouse University (2017)

  33. O'Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)

  34. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  35. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)

  36. Poon, C., Liang, J.: Trajectory of alternating direction method of multipliers and adaptive acceleration. In: Advances in Neural Information Processing Systems, pp. 7355–7363 (2019)

  37. Poon, C., Liang, J., Schoenlieb, C.: Local convergence properties of SAGA/Prox-SVRG and acceleration. In: International Conference on Machine Learning, pp. 4124–4132 (2018)

  38. Scheinberg, K., Goldfarb, D., Bai, X.: Fast first-order methods for composite convex optimization with backtracking. Found. Comput. Math. 14(3), 389–417 (2014)

  39. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58(1), 267–288 (1996)

  40. Vaiter, S., Peyré, G., Fadili, J.: Low complexity regularization of linear inverse problems. In: Sampling Theory, a Renaissance, pp. 103–153. Springer-Birkhäuser (2015)

  41. Vaiter, S., Peyré, G., Fadili, J.: Model consistency of partly smooth regularizers. IEEE Trans. Inf. Theory 64(3), 1725–1737 (2017)

Acknowledgements

The authors would like to thank the AE and the two anonymous reviewers for their comments on the tone of the article and their suggestions for future work. FI benefited from the support of the ANR JCJC project STROLL (ANR-19-CE23-0008).

Corresponding author

Correspondence to Franck Iutzeler.

Appendices

Appendix 1: Obtaining the qualifying condition

The general qualifying condition (QC) can be recovered under the setting of partial smoothness of g and a usual non-degeneracy condition at the optimal point \(x^\star\) and associated point \(u^\star\) such that \(x^\star = \mathbf {prox}_{\gamma g}(u^\star )\).

Definition 1

(Relative interior) The relative interior of a subset S, denoted \(\mathrm {ri}~S\), is defined as

$$\begin{aligned} \mathrm {ri}~S = \{x\in S : \exists \varepsilon > 0 , {\mathcal {B}}(x, \varepsilon )\cap {\text {Aff}}S \subseteq S \} \end{aligned}$$

where \({\text {Aff}}\) denotes the affine hull of a set.

Definition 2

(Parallel space) For a connected subset A, the parallel space of A, denoted \(\mathrm {par}~A\), is defined as the vector space parallel to the affine space generated by A.

Definition 3

(Manifold) A subset \(\mathsf {M}\) of \(\mathbb {R}^n\) is said to be a p-dimensional \(\mathcal C^k\)-submanifold of \(\mathbb {R}^n\) around \(x \in \mathsf {M}\) (\(1 \le k \le +\infty\)) if there exists a local parameterization of \(\mathsf {M}\) around x, that is, a \({\mathcal {C}}^k\) function \(\varphi : \mathbb {R}^p \rightarrow \mathbb {R}^n\) such that \(\varphi\) realizes a local homeomorphism between a neighborhood of \(0 \in \mathbb {R}^p\) and a neighborhood of \(x \in \mathsf {M}\) and the derivative of \(\varphi\) at \(\varphi ^{-1}(x)=0\) is injective.

A p-dimensional \({\mathcal {C}}^k\)-submanifold of \(\mathbb {R}^n\) can alternatively be defined via a local equation, that is, a \(\mathcal C^k\) function \(\varPhi : \mathbb {R}^n \rightarrow \mathbb {R}^{n-p}\) with a surjective derivative at \({\bar{x}} \in \mathsf {M}\), that satisfies for all x close enough to \({\bar{x}}\): \(x \in \mathsf {M}\Leftrightarrow \varPhi (x) = 0\).

To lighten notations, we will only indicate the dimensionality and the smoothness degree of a manifold when it is relevant to the discussion.

Definition 4

(Tangent space) Given a point x living in a manifold \(\mathsf {M}\), defined either by a local parametrization \(\varphi\) or a local equation \(\varPhi\), the tangent space of \(\mathsf {M}\) at x, denoted \(T_\mathsf {M}(x)\), is defined as:

$$\begin{aligned} T_\mathsf {M}(x) = \text {Im}~ D_\varphi (0) = \text {Ker}~ D_\varPhi (x) . \end{aligned}$$
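
As an elementary illustration of Definitions 3 and 4 (added here as an example): the unit Euclidean sphere \(\mathsf {M}= \{x\in \mathbb {R}^n : \Vert x\Vert _2 = 1\}\) is an \((n-1)\)-dimensional \({\mathcal {C}}^\infty\)-submanifold with local equation \(\varPhi (x) = \Vert x\Vert _2^2 - 1\), whose derivative \(D_\varPhi (x) = 2x^\top\) is surjective at any \(x\ne 0\); hence

$$\begin{aligned} T_\mathsf {M}(x) = \text {Ker}~ D_\varPhi (x) = \{v\in \mathbb {R}^n : \langle x, v\rangle = 0\} . \end{aligned}$$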

Definition 5

(Partly smooth function) Let \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be a proper, convex, lower semi-continuous function. g is said to be partly smooth at x relative to a manifold \(\mathsf {M}\) containing x if \(\partial g(x)\ne \emptyset\), and moreover

  • Smoothness: \(\mathsf {M}\) is a \({\mathcal {C}}^2\)-manifold around x, g restricted to \(\mathsf {M}\) is \({\mathcal {C}}^2\) around x;

  • Sharpness: The tangent space \(T_\mathsf {M}(x)\) coincides with \(\mathrm {par} (\partial g(x))^\perp\);

  • Continuity: The set-valued mapping \(\partial g\) is continuous at x relative to \(\mathsf {M}\).
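
As a classical example (recalled here for illustration): \(g = \Vert \cdot \Vert _1\) is partly smooth at any \(x\) relative to the subspace \(\mathsf {M}= \{y\in \mathbb {R}^n : y_i = 0 \text { for all } i \text { with } x_i = 0\}\). Indeed, \(\mathsf {M}\) is a \({\mathcal {C}}^2\)-manifold on which g is linear around x (smoothness); \(\partial g(x) = \{u : u_i = \mathrm {sign}(x_i) \text { if } x_i\ne 0, |u_i|\le 1 \text { otherwise}\}\), so that \(\mathrm {par}~\partial g(x) = \mathsf {M}^\perp\) and \(T_\mathsf {M}(x) = \mathsf {M}= (\mathrm {par}~\partial g(x))^\perp\) (sharpness); finally, \(\partial g(y) = \partial g(x)\) for all \(y\in \mathsf {M}\) close enough to x, which gives the continuity.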

Lemma 4

Let \(\mathsf {M}\) be a manifold and \(u^\star\) a point such that \(x^\star = \mathbf {prox}_{\gamma g}(u^\star )\in \mathsf {M}\). If g is partly smooth at \(x^\star\) relative to \(\mathsf {M}\) and \(\frac{u^\star -x^\star }{\gamma } \in \mathrm {ri}~\partial g(x^\star )\), then (QC) holds.

Proof

The proof follows the reasoning of [16, Lem. 27, Th. 28]. Let us define the function \(\rho (u, x) = g(x) + \frac{1}{2\gamma }\Vert x-u\Vert ^2\). Since g is partly smooth at \(x^\star\) relative to \(\mathsf {M}\), the function \((u, x) \mapsto g(x)\) is also partly smooth at \((u^\star , x^\star )\) relative to \(\mathbb {R}^n\times \mathsf {M}\) [26, Prop. 4.5]. Since the addition of a smooth function does not change this property [26, Cor. 4.6], \(\rho\) is partly smooth at \((u^\star , x^\star )\) relative to \(\mathbb {R}^n\times \mathsf {M}\).

Besides, defining the parametrized function \(\rho _{u^\star }(\cdot ) = \rho (u^\star , \cdot )\), we have that

  (i)

    \(x^\star = \mathbf {prox}_{\gamma g}(u^\star ) = {{\,\mathrm{argmin}\,}}_{x\in \mathbb {R}^n} \{ g(x) + \frac{1}{2\gamma } \Vert x - u^\star \Vert ^2 \} = {{\,\mathrm{argmin}\,}}_{x\in \mathbb {R}^n} \rho _{u^\star } ( x )\) and as \(\rho _{u^\star }\) is \(1/\gamma\)-strongly convex, we have for all \(x\in \mathbb {R}^n\), \(\rho _{u^\star }(x) \ge \rho _{u^\star }(x^\star ) + \frac{1}{2\gamma }\Vert x-x^\star \Vert ^2\);

  (ii)

    the qualifying constraint \(\frac{u^\star -x^\star }{\gamma } \in \mathrm {ri}~\partial g(x^\star )\) implies that \(0\in \mathrm {ri}~\left( \partial g(x^\star ) + \frac{x^\star -u^\star }{\gamma } \right) = \mathrm {ri}~\partial \rho _{u^\star }(x^\star )\);

which enable us to use [16, Lem. 27] (or equivalently [21, Th. 3.2 and 6.2i]).

Thus, there exists a neighborhood \({\mathcal {B}}(u^\star , \varepsilon )\) of \(u^\star\) and a function \(\varPhi\) such that for all u in \({\mathcal {B}}(u^\star , \varepsilon )\), \(\varPhi (u)\in \mathsf {M}\) and is a critical point of \(\rho _u\) restricted to a neighborhood of \(x^\star\). Since \(\rho _u\) is convex, \(\varPhi (u)\) is actually the global optimum of \(\rho _u\). Therefore, we have that for all u in \(\mathcal B(u^\star , \varepsilon )\), \(\mathbf {prox}_{\gamma g}(u)\in \mathsf {M}\) which is exactly (QC). \(\square\)
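
The conclusion of Lemma 4 is easy to check numerically in the \(\ell _1\) case (a Python/NumPy sketch added for illustration; the point \(u^\star\), the radius and the helper names are arbitrary choices of ours). When \(\frac{u^\star -x^\star }{\gamma }\in \mathrm {ri}~\partial g(x^\star )\), which for \(g=\Vert \cdot \Vert _1\) amounts to \(|u^\star _i|<\gamma\) on the zero entries of \(x^\star\), the proximity operator maps a whole ball around \(u^\star\) onto the manifold of vectors sharing the support of \(x^\star\):

```python
import numpy as np

def soft_threshold(u, gamma):
    """prox of gamma * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

gamma = 1.0
u_star = np.array([2.5, -1.8, 0.4, -0.2, 0.0])   # |u_i| < gamma on the future zeros
x_star = soft_threshold(u_star, gamma)
support = x_star != 0
print("supp(x*):", np.flatnonzero(support))      # [0 1]

# (QC): the prox maps a whole ball around u* onto M = {x : supp(x) = supp(x*)}
rng = np.random.default_rng(0)
eps = 0.1   # small enough: off-support entries stay below gamma, on-support ones keep their sign
for _ in range(1000):
    u = u_star + eps * rng.uniform(-1.0, 1.0, size=u_star.shape)
    assert np.array_equal(soft_threshold(u, gamma) != 0, support)
print("the support is identified on the whole ball B(u*, eps)")
```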

Appendix 2: Reminders on the (accelerated) proximal gradient descent

We recall here the descent lemma for the composite objective function F, which is central to the analysis of any first-order method.

Lemma 5

(Descent lemma) Let \(\gamma >0\); then the following inequalities hold for any \(x, y \in {\mathbb {R}}^n\):

$$\begin{aligned} F\left( \mathcal {T}_\gamma (x)\right) + \frac{(1-\gamma L)}{2 \gamma }\left\| \mathcal {T}_\gamma (x)-x\right\| ^{2}+\frac{1}{2 \gamma }\left\| \mathcal {T}_\gamma (x)-y\right\| ^{2}&\le F(y) + \frac{1}{2 \gamma }\Vert x-y\Vert ^{2}\\ F\left( \mathcal {T}_\gamma (x)\right) + \frac{(2-\gamma L)}{2 \gamma }\left\| \mathcal {T}_\gamma (x)-x\right\| ^{2} + \frac{1}{ \gamma } \langle x-y, \mathcal {T}_\gamma (x)-x \rangle&\le F(y) \end{aligned}$$

Proof

The first inequality is directly equivalent to [15, Lem. 1]. The second is derived from the identity \(\Vert \mathcal {T}_\gamma (x)-x\Vert ^2 + \Vert y-x\Vert ^2 = 2\langle \mathcal {T}_\gamma (x)-x, y-x \rangle + \Vert y-\mathcal {T}_\gamma (x)\Vert ^2\). \(\square\)
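
These two inequalities are easy to check numerically; the sketch below (Python/NumPy, added as an illustration; the lasso instance \(f(x)=\frac{1}{2}\Vert Ax-b\Vert _2^2\), \(g=\lambda \Vert \cdot \Vert _1\) and the problem sizes are arbitrary choices of ours, not data from the paper) evaluates both sides of Lemma 5 on random pairs \((x, y)\):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50)); b = rng.standard_normal(20); lam = 0.1
L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient of f
gamma = 1.0 / L

def F(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def T(x):                              # proximal gradient operator T_gamma
    z = x - gamma * A.T @ (A @ x - b)
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

for _ in range(1000):
    x, y = rng.standard_normal(50), rng.standard_normal(50)
    Tx = T(x)
    lhs1 = F(Tx) + (1 - gamma * L) / (2 * gamma) * np.sum((Tx - x) ** 2) \
                 + np.sum((Tx - y) ** 2) / (2 * gamma)
    rhs1 = F(y) + np.sum((x - y) ** 2) / (2 * gamma)
    lhs2 = F(Tx) + (2 - gamma * L) / (2 * gamma) * np.sum((Tx - x) ** 2) \
                 + np.dot(x - y, Tx - x) / gamma
    assert lhs1 <= rhs1 + 1e-6 and lhs2 <= F(y) + 1e-6
print("both inequalities of Lemma 5 hold on 1000 random pairs")
```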

We now give an accelerated descent lemma for the composite objective F, which is fundamental in the analysis of our provisional algorithm in Sect. 3.1.

Lemma 6

(Accelerated descent lemma) Let Assumptions 1 and 2 hold. For any \(\gamma >0\), any pair of points \((x_k, y_k)\) and for \(x^\star\) a fixed point of \({\mathcal {T}}_{\gamma }\) (i.e. a minimizer of F), we have (independently of the acceleration step)

$$\begin{aligned} t_{k}^2 v_k - t_{k-1}^2 v_{k-1} \le&-\frac{1-\gamma L}{2 \gamma } \Vert t_k x_{k+1}- t_k y_k\Vert ^2 \\&- \frac{1}{2\gamma }\Vert t_k x_{k+1} - (t_k-1)x_k -x^\star \Vert ^2 + \frac{1}{2\gamma } \Vert t_k y_k - (t_k-1)x_k -x^\star \Vert ^2 , \end{aligned}$$

where \(x_{k+1} = {\mathcal {T}}_{\gamma }(y_k)\), and \(v_k:=F(x_{k+1})-F^\star\).

Proof

The proof follows the same global layout as that of FISTA [7]. Using Lemma 5 at \((x=y_k, y=x_k)\) and \((x=y_k, y=x^\star )\),

$$\begin{aligned} v_k-v_{k-1}&= F(x_{k+1}) - F(x_k) \le -\frac{2-\gamma L}{2 \gamma }\Vert x_{k+1}-y_k\Vert ^2 - \frac{1}{ \gamma } \langle y_k-x_k, x_{k+1}-y_k \rangle \\ v_k&= F(x_{k+1}) - F^\star \le -\frac{2-\gamma L}{2 \gamma }\Vert x_{k+1}-y_k\Vert ^2 - \frac{1}{ \gamma } \langle y_k-x^\star , x_{k+1}-y_k \rangle \end{aligned}$$

The first inequality multiplied by \((t_k-1)\) and added to the second yields:

$$\begin{aligned} t_k v_k - (t_k-1)v_{k-1}&\le -\frac{2-\gamma L}{2 \gamma } t_k \Vert x_{k+1}-y_k\Vert ^2 - \frac{1}{ \gamma } \langle t_k y_k - (t_k-1)x_k -x^\star , x_{k+1}-y_k \rangle \end{aligned}$$

Multiplying by \(t_k\), using the relation \(t_k^2-t_k \le t_{k-1}^2\) from Assumption 2(i), we get:

$$\begin{aligned} t_{k}^2 v_k - t_{k-1}^2 v_{k-1}&\le -\frac{2-\gamma L}{2 \gamma } \Vert t_k x_{k+1}- t_k y_k\Vert ^2 - \frac{1}{ \gamma } \langle t_k y_k - (t_k-1)x_k -x^\star , t_k x_{k+1}- t_k y_k \rangle \end{aligned}$$

Applying the identity \(\Vert c-a\Vert ^2+2\langle a-b,c-a\rangle = \Vert b-c\Vert ^2 - \Vert b-a\Vert ^2\) to the inner product of the last identity yields the result. \(\square\)
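
To illustrate how Lemma 6 is typically used (a standard argument, sketched here under the additional assumptions that \(\gamma \le 1/L\), that the extrapolation is the classical FISTA one \(y_k = x_k + \frac{t_{k-1}-1}{t_k}(x_k-x_{k-1})\), and that \(t_k\ge (k+1)/2\), which holds for the usual sequence \(t_k = (1+\sqrt{1+4t_{k-1}^2})/2\)): this choice gives \(t_k y_k - (t_k-1)x_k = t_{k-1}x_k - (t_{k-1}-1)x_{k-1}\), so that, with \(u_k := t_{k-1}x_k - (t_{k-1}-1)x_{k-1} - x^\star\), the first term of the right-hand side is non-positive and the lemma telescopes:

$$\begin{aligned} t_k^2 v_k - t_{k-1}^2 v_{k-1} \le \frac{1}{2\gamma }\left( \Vert u_k\Vert ^2 - \Vert u_{k+1}\Vert ^2\right) \quad \text {hence} \quad t_K^2 v_K \le t_0^2 v_0 + \frac{1}{2\gamma }\Vert u_1\Vert ^2 , \end{aligned}$$

which yields \(F(x_{K+1}) - F^\star = v_K \le \left( t_0^2 v_0 + \Vert u_1\Vert ^2/(2\gamma )\right) /t_K^2 = \mathcal {O}(1/K^2)\).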

Appendix 3: About error bounds

In order to connect the functional suboptimality with the behavior of the iterates, the geometry of the function can be used. For a proper convex lower-semicontinuous function F achieving its minimum \(F^\star\), two common variants of geometric inequalities are Error Bounds and Kurdyka-Łojasiewicz inequalities [10, 11, 32]. For any \(x \notin {{\,\mathrm{argmin}\,}}F\) (see footnote 7), they respectively write

  • Error Bound: \(\varphi (F(x)-F^\star )\ge {{\,\mathrm{dist}\,}}(x,{{\,\mathrm{argmin}\,}}F)\)

  • Kurdyka-Łojasiewicz (KL): \({{\,\mathrm{dist}\,}}(0,\partial F(x))\ge 1/\varphi '(F(x)-F^\star )\)

where \(\varphi (t) = Ct^\theta /\theta\) with \(C>0\) and \(\theta \in (0,1]\) is called a desingularizing function. These properties are widely satisfied (e.g. by any semi-algebraic function [10]) but the desingularizing function (more particularly \(\theta\)) is often hard to estimate (see footnote 8).

It is easy to see that Assumption 3 is a (global) error bound with \(\varphi (t) = (1/\beta )^{1/p} t^{1/p}\) (matching the definition of \(\varphi\) with \(C = 1/p (1/\beta )^{1/p}>0\) and \(\theta =1/p\in (0,1]\)) but the knowledge of the constants is not necessary.
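
As a simple worked instance (added here for illustration): for \(F(x) = \frac{1}{2}\Vert x\Vert _2^2\), we have \(F^\star = 0\), \({{\,\mathrm{argmin}\,}}F = \{0\}\) and \(F(x)-F^\star = \frac{1}{2}{{\,\mathrm{dist}\,}}^2(x,{{\,\mathrm{argmin}\,}}F)\), so Assumption 3 holds with \(\beta = 1/2\) and \(p=2\). The desingularizing function is then \(\varphi (t) = \sqrt{2t}\) (i.e. \(C=1/\sqrt{2}\), \(\theta = 1/2\)) and both inequalities hold with equality:

$$\begin{aligned} \varphi (F(x)-F^\star ) = \Vert x\Vert = {{\,\mathrm{dist}\,}}(x,{{\,\mathrm{argmin}\,}}F) \quad \text {and}\quad {{\,\mathrm{dist}\,}}(0,\partial F(x)) = \Vert x\Vert = \frac{1}{\varphi '(F(x)-F^\star )} . \end{aligned}$$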

Then, an important result about Error Bounds and KL inequalities is that they are equivalent with the same desingularizing function [11, Th. 5], which allows us to obtain the following result for the proximal gradient operator.

Lemma 7

Let Assumptions 1 and 3 hold. Then, there exists a constant \(B>0\) such that for all x

$$\begin{aligned} \Vert x-\mathsf {T}_\gamma (x) \Vert \ge B ~ {{\,\mathrm{dist}\,}}^{p-1} ~(x,{{\,\mathrm{argmin}\,}}F) . \end{aligned}$$

Proof

First, let us notice that the KL and error bound properties can be combined as

$$\begin{aligned} {{\,\mathrm{dist}\,}}(0,\partial F(x)) \ge \frac{1}{C} ( F(x)-F^\star )^{1-\theta } \ge \frac{1}{C} \left( \frac{\theta }{C}\right) ^{\frac{1-\theta }{\theta }} \left( {{\,\mathrm{dist}\,}}(x,{{\,\mathrm{argmin}\,}}F)\right) ^{\frac{1-\theta }{\theta }} . \end{aligned}$$

Furthermore, using that \({{\,\mathrm{dist}\,}}(0,\partial F(\mathsf {T}_\gamma (x) )) \le \frac{L\gamma + 1}{\gamma } \Vert x-\mathsf {T}_\gamma (x) \Vert\) for all \(x\in \mathbb {R}^n\) (see e.g. [11, Sec. 4.1]), we get that

$$\begin{aligned} \Vert x-\mathsf {T}_\gamma (x) \Vert \ge \frac{\gamma }{C(L\gamma +1)} \left( \frac{\theta }{C}\right) ^{\frac{1-\theta }{\theta }} \left( {{\,\mathrm{dist}\,}}(x,{{\,\mathrm{argmin}\,}}F)\right) ^{\frac{1-\theta }{\theta }} , \end{aligned}$$

which, combined with the fact that \(\theta =1/p\), gives the claimed result. \(\square\)

Appendix 4: Proof of Corollary 1

Proof

First, Theorem 3.1 or Theorem 3.2 gives the convergence of the sequence \((x_k)\) to the unique minimizer \(x^\star\) and finite-time identification, since the partial smoothness assumption of Lemma 1 implies, by Theorem 4, the qualifying constraint (QC) of both theorems.

In order to show the linear convergence behavior, we follow [29] and [35, Chap. 2.1.2] and apply their first-order analysis of the accelerated proximal gradient to (Prov. Alg.). We are in the same setting and have compatible assumptions, so that we can apply [29, Prop. 4.5] to (Prov. Alg.) as soon as identification has occurred, since both tests return acceleration after that moment. We have, for \(r_k=x_k-x^\star\), \(d_k = [r_{k-1} \quad r_k]^T\):

$$\begin{aligned} d_{k+1} = M d_k + e_k, \qquad \Vert e_k\Vert = o(\Vert d_k\Vert ) \end{aligned}$$
(4.1)

with M a matrix of spectral radius \(\rho (M)<1\) since \(\gamma \in (0, 1/L]\); see [29, Cor. 4.9, Rem. 4.10].

Take any \(\varepsilon \in (0,(1-\rho (M))/2)\) and let \(K_2>0\) be the smallest time from which i) identification holds; and ii) \(\Vert e_k\Vert / \Vert d_k\Vert \le \varepsilon\). \(K_2\) is finite since identification happens in finite time and \(\Vert e_k\Vert = o(\Vert d_k\Vert )\).

Unrolling Eq. (4.1) yields, for \(k\ge 0\),

$$\begin{aligned} d_{K_2+k+1} = M^k d_{K_2+1} + \sum _{l=1}^k M^{k-l}e_{K_2+l}. \end{aligned}$$

Taking norms and applying the triangle inequality yields

$$\begin{aligned} \Vert d_{K_2+k+1}\Vert&\le \Vert M^k\Vert \Vert d_{K_2+1}\Vert + \sum _{l=1}^k \Vert M^{k-l}\Vert \Vert e_{K_2+l}\Vert \le C_M \rho ^k \Vert d_{K_2+1}\Vert + C_M \sum _{l=1}^k \rho ^{k-l}\Vert e_{K_2+l}\Vert \end{aligned}$$

where \(\rho :=\rho (M) +\varepsilon\) and \(C_M>0\) is a constant such that \(\Vert M^k\Vert \le C_M \rho ^k\) for all k (see [22, Cor. 5.6.13]).

Let us denote \(w_{k+1} = C_M^{-1} (\rho + \varepsilon )^{-k}\Vert d_{K_2+k+1}\Vert\) (note that \(\rho + \varepsilon = \rho (M) + 2\varepsilon \in (0,1)\)) and show that this sequence is uniformly bounded by a constant \(C_w\). The initialization is clear. Suppose that \(w_l\le C_w\) for all \(l\le k\); multiplying the previous inequality by \(C_M^{-1} (\rho +\varepsilon )^{-k}\) yields

$$\begin{aligned} w_{k+1}&\le \left( \frac{\rho }{\rho +\varepsilon }\right) ^k \Vert d_{K_2+1}\Vert + \frac{1}{\rho +\varepsilon }\sum _{l=1}^k \left( \frac{\rho }{\rho +\varepsilon }\right) ^{k-l} \frac{\Vert d_{K_2+l}\Vert }{(\rho +\varepsilon )^{l-1}}\frac{\Vert e_{K_2+l}\Vert }{\Vert d_{K_2+l}\Vert } \\&= \left( \frac{\rho }{\rho +\varepsilon }\right) ^k w_1 + \frac{1}{\rho +\varepsilon }\sum _{l=1}^k \left( \frac{\rho }{\rho +\varepsilon }\right) ^{k-l} w_l \frac{\Vert e_{K_2+l}\Vert }{\Vert d_{K_2+l}\Vert } \end{aligned}$$

By the induction hypothesis and since \(\Vert e_{K_2+l}\Vert /\Vert d_{K_2+l}\Vert \le \varepsilon\) for \(l\ge 1\), we have

$$\begin{aligned} w_{k+1}&\le C_w \left( \frac{\rho }{\rho +\varepsilon }\right) ^k + C_w \frac{\varepsilon }{\rho +\varepsilon } \sum _{l=1}^k \left( \frac{\rho }{\rho +\varepsilon }\right) ^{k-l} \\&= C_w \left( \frac{\rho }{\rho +\varepsilon }\right) ^k + C_w \frac{\varepsilon }{\rho +\varepsilon } \frac{1-\left( \frac{\rho }{\rho +\varepsilon }\right) ^k}{1-\frac{\rho }{\rho +\varepsilon }} = C_w \end{aligned}$$

Therefore, for all \(k\ge 0\), \(\Vert d_{K_2+k+1}\Vert \le C_M C_w (\rho (M) + 2\varepsilon )^k \Vert d_{K_2+1}\Vert\) for any \(\varepsilon \in (0,(1-\rho (M))/2)\), which gives the claimed linear convergence. \(\square\)
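
The backbone of this argument, namely that a recursion \(d_{k+1} = Md_k + e_k\) with \(\rho (M)<1\) keeps a linear rate under small relative perturbations, can be illustrated numerically (a Python/NumPy sketch; the matrix M, the 1% perturbation level and the horizon are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
M = np.array([[0.85, 0.30],           # a non-symmetric matrix with
              [0.00, 0.80]])          # spectral radius rho(M) = 0.85 < 1
rho = max(abs(np.linalg.eigvals(M)))

d = np.array([1.0, -1.0])
norms = [np.linalg.norm(d)]
for k in range(200):
    e = 0.01 * np.linalg.norm(d) * rng.standard_normal(2)   # ||e_k|| proportional to ||d_k||
    d = M @ d + e
    norms.append(np.linalg.norm(d))

# the average contraction factor over the tail stays close to rho(M)
avg_rate = (norms[200] / norms[100]) ** (1.0 / 100)
print(f"rho(M) = {rho:.3f},  empirical rate over the last 100 steps = {avg_rate:.3f}")
```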

Additional illustrations

We provide here additional illustrations, Figs. 7 and 8, which consist of the same experiments as in Figs. 5a and 6 except for the starting point of the algorithms, which is taken here as the null vector (or matrix). The conclusions are quite similar to those of Sect. 4.2; the identification stability of the proposed methods is even more apparent.

Besides, we also report one further experiment where the solution is non-qualified and the structure-encoding manifolds have nonzero curvature. The smooth part f is again a least-squares function, while the non-smooth part is defined as \(g(x) = \max (0, \Vert x_{1:5}\Vert _{1.3}-1) + \ldots + \max (0, \Vert x_{45:50}\Vert _{1.3}-1)\) for \(x\in {\mathbb {R}}^{50}\). The structure-encoding manifolds of g are any Cartesian product \(\mathsf {M}= \times _{i=1}^{10} \mathsf {M}_i \subset {\mathbb {R}}^{50}\), where each manifold \(\mathsf {M}_i\) is either \({\mathbb {R}}^5\) or \(\{x\in {\mathbb {R}}^5 : \Vert x\Vert _{1.3}=1\}\) (i.e. the 1.3-norm unit sphere). Figure 9 displays the performance of the four algorithms of interest (stopped as soon as suboptimality gets below \(10^{-12}\)). None is able to recover the complete structure of the solution, which lives in \(\mathsf {M}^\star = \times _{i=1}^{10} \{x\in {\mathbb {R}}^5 : \Vert x\Vert _{1.3}=1\}\); this is not surprising for a non-qualified problem. The vanilla proximal gradient recovers half of the elementary manifolds, and \(\mathsf {T}^2\) recovers 4 of them but much faster. Interestingly, the accelerated proximal gradient and \(\mathsf {T}^1\) iterates are not able to retain even one of the manifolds they encountered. This behavior is consistent with the one noticed in Fig. 1b and suggests that the proximal gradient may have more robust identification properties than the accelerated proximal gradient on non-qualified problems. The method based on the second test (\(\mathsf {T}^2\)) thus behaves comparably to the proximal gradient in terms of identification, while converging in the same number of iterations as the accelerated proximal gradient.
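
To make this last experiment concrete, here is a sketch of its non-smooth part and of the identification readout (Python/NumPy, added for illustration; the block prox uses the radial formula of note 5, which is exact only for \(p=2\), and the random least-squares data are of course not those of Fig. 9):

```python
import numpy as np

P, BLOCKS, BSIZE = 1.3, 10, 5          # ten blocks of five coordinates, p = 1.3

def prox_block(u, gamma, p=P):
    """Radial formula of note 5 for max(||.||_p - 1, 0) on a single block."""
    nrm = np.linalg.norm(u, ord=p)
    if nrm > 1.0 + gamma:
        return u * (1.0 - gamma / nrm)
    if nrm >= 1.0:
        return u / nrm                  # lands exactly on the unit p-sphere
    return u

def prox_g(x, gamma):
    """Block-separable prox of g(x) = sum_i max(||x_{5i+1:5(i+1)}||_p - 1, 0)."""
    return np.concatenate([prox_block(x[i * BSIZE:(i + 1) * BSIZE], gamma)
                           for i in range(BLOCKS)])

def identified_blocks(x, tol=1e-10):
    """Blocks whose current iterate lies (numerically) on its unit p-sphere."""
    return [i for i in range(BLOCKS)
            if abs(np.linalg.norm(x[i * BSIZE:(i + 1) * BSIZE], ord=P) - 1.0) <= tol]

# a tiny proximal-gradient run on a random least-squares fit (illustration only,
# with the approximate radial prox for p = 1.3)
rng = np.random.default_rng(3)
A = rng.standard_normal((60, BLOCKS * BSIZE)); b = rng.standard_normal(60)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2
x = rng.standard_normal(BLOCKS * BSIZE)
for _ in range(3000):
    x = prox_g(x - gamma * A.T @ (A @ x - b), gamma)
print("blocks currently on their p-sphere:", identified_blocks(x))
```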

Fig. 7

\(g=\Vert \cdot \Vert _1\)—5 runs w/ initialization at 0—120 manifolds to identify

Fig. 8

\(g=\Vert \cdot \Vert _*\)—1 run w/ null initialization—17 manifolds to identify

Fig. 9

\(g(x) = \max (0, \Vert x_{1:5}\Vert _{1.3}-1) + \ldots + \max (0, \Vert x_{45:50}\Vert _{1.3}-1)\)—1 run w/ random initialization—10 manifolds to identify

Cite this article

Bareilles, G., Iutzeler, F. On the interplay between acceleration and identification for the proximal gradient algorithm. Comput Optim Appl 77, 351–378 (2020). https://doi.org/10.1007/s10589-020-00218-7
