Abstract
Regularization techniques are widely employed in optimization-based approaches for solving ill-posed inverse problems in data analysis and scientific computing. These methods are based on augmenting the objective with a penalty function, which is specified based on prior domain-specific expertise to induce a desired structure in the solution. We consider the problem of learning suitable regularization functions from data in settings in which precise domain knowledge is not directly available. Previous work under the title of ‘dictionary learning’ or ‘sparse coding’ may be viewed as learning a regularization function that can be computed via linear programming. We describe generalizations of these methods to learn regularizers that can be computed and optimized via semidefinite programming. Our framework for learning such semidefinite regularizers is based on obtaining structured factorizations of data matrices, and our algorithmic approach for computing these factorizations combines recent techniques for rank minimization problems along with an operator analog of Sinkhorn scaling. Under suitable conditions on the input data, our algorithm provides a locally linearly convergent method for identifying the correct regularizer that promotes the type of structure contained in the data. Our analysis is based on the stability properties of Operator Sinkhorn scaling and their relation to geometric aspects of determinantal varieties (in particular tangent spaces with respect to these varieties). The regularizers obtained using our framework can be employed effectively in semidefinite programming relaxations for solving inverse problems.
Notes
A rank-r matrix \(X \in \mathbb {R}^{q \times q}\) is a smooth point with respect to the variety of \(q \times q\) matrices of rank at most r.
Note that any affine variety over the reals may be defined by polynomials of degree at most two by suitably adding extra variables; in our discussion here on normalization, we consider varieties defined without additional variables.
Algorithm 1 requires the computation of a matrix square root at every iteration. By virtue of the fact that the operator \({\mathbf {\mathsf{{T}}}}_\mathcal {L}\) which we wish to rescale is completely positive, it is possible to normalize \(\mathcal {L}\) using only rational matrix operations via a modified scheme known as the Rational Operator Sinkhorn Iteration [34].
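The following is a minimal sketch (not the paper's Algorithm 1, and with illustrative names and parameters) of one way to carry out Operator Sinkhorn scaling when the completely positive map is specified in Kraus form \(X \mapsto \sum _i A_i X A_i^{\prime }\); each half-step requires an inverse matrix square root of a positive-definite matrix, which is the computation the remark above refers to.

```python
# Illustrative sketch of Operator Sinkhorn scaling for a completely positive map given in
# Kraus form T(X) = sum_i A_i X A_i'.  The stopping rule, iteration count, and random test
# map below are assumptions for illustration; this is not the paper's Algorithm 1.
import numpy as np

def inv_sqrtm(S):
    """Inverse square root of a symmetric positive-definite matrix (via eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T

def operator_sinkhorn(kraus, iters=200, tol=1e-12):
    A = [K.copy() for K in kraus]
    q = A[0].shape[0]
    for _ in range(iters):
        R = sum(K @ K.T for K in A)        # T(I); needs a square root to normalize
        A = [inv_sqrtm(R) @ K for K in A]  # after this step, T(I) = I
        C = sum(K.T @ K for K in A)        # T'(I)
        A = [K @ inv_sqrtm(C) for K in A]  # after this step, T'(I) = I
        err = max(np.linalg.norm(sum(K @ K.T for K in A) - np.eye(q)),
                  np.linalg.norm(sum(K.T @ K for K in A) - np.eye(q)))
        if err < tol:
            break
    return A, err

rng = np.random.default_rng(0)
A, err = operator_sinkhorn([rng.standard_normal((4, 4)) for _ in range(6)])
print("deviation from doubly stochastic:", err)
```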
References
Agarwal, A., Anandkumar, A., Jain, P., Netrapalli, P.: Learning Sparsely Used Overcomplete Dictionaries via Alternating Minimization. SIAM Journal on Optimization 26(4), 2775–2799 (2016). https://doi.org/10.1137/140979861
Agarwal, A., Anandkumar, A., Netrapalli, P.: A Clustering Approach to Learning Sparsely Used Overcomplete Dictionaries. IEEE Transactions on Information Theory 63(1), 575–592 (2017). https://doi.org/10.1109/TIT.2016.2614684
Aharon, M., Elad, M., Bruckstein, A.: K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing 54(11), 4311–4322 (2006). https://doi.org/10.1109/TSP.2006.881199
Arora, S., Ge, R., Ma, T., Moitra, A.: Simple, Efficient, and Neural Algorithms for Sparse Coding. In: Conference on Learning Theory (2015)
Arora, S., Ge, R., Moitra, A.: New Algorithms for Learning Incoherent and Overcomplete Dictionaries. Journal of Machine Learning Research: Workshop and Conference Proceedings 35, 1–28 (2014)
Barak, B., Kelner, J.A., Steurer, D.: Dictionary Learning and Tensor Decomposition via the Sum-of-Squares Method. In: Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing. ACM (2015). https://doi.org/10.1145/2746539.2746605
Barron, A.R.: Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Transactions on Information Theory 39(3), 930–945 (1993). https://doi.org/10.1109/18.256500
Bhaskar, B.N., Tang, G., Recht, B.: Atomic Norm Denoising with Applications to Line Spectral Estimation. IEEE Transactions on Signal Processing 61(23), 5987–5999 (2013)
Blumensath, T., Davies, M.E.: Iterative Hard Thresholding for Compressed Sensing. Applied and Computational Harmonic Analysis 27, 265–274 (2009). https://doi.org/10.1016/j.acha.2009.04.002
Bruckstein, A.M., Donoho, D.L., Elad, M.: From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images. SIAM Review 51(1), 34–81 (2009). https://doi.org/10.1137/060657704
Candès, E.J., Plan, Y.: Tight Oracle Inequalities for Low-Rank Matrix Recovery From a Minimal Number of Noisy Random Measurements. IEEE Transactions on Information Theory 57(4), 2342–2359 (2011). https://doi.org/10.1109/TIT.2011.2111771
Candès, E.J., Recht, B.: Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics 9(6), 717–772 (2009). https://doi.org/10.1007/s10208-009-9045-5
Candès, E.J., Romberg, J., Tao, T.: Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information. IEEE Transactions on Information Theory 52(2), 489–509 (2006). https://doi.org/10.1109/TIT.2005.862083
Candès, E.J., Tao, T.: Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies? IEEE Transactions on Information Theory 52(12), 5406–5425 (2006). https://doi.org/10.1109/TIT.2006.885507
Chandrasekaran, V., Parrilo, P.A., Willsky, A.S.: Latent Variable Graphical Model Selection via Convex Optimization. The Annals of Statistics 40(4), 1935–1967 (2012). https://doi.org/10.1214/11-AOS949
Chandrasekaran, V., Recht, B., Parrilo, P.A., Willsky, A.S.: The Convex Geometry of Linear Inverse Problems. Foundations of Computational Mathematics 12(6), 805–849 (2012). https://doi.org/10.1007/s10208-012-9135-7
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic Decomposition by Basis Pursuit. SIAM Journal on Scientific Computing 20(1), 33–61 (1998). https://doi.org/10.1137/S1064827596304010
Cho, E.: Inner Products of Random Vectors on \({S}^n\). Journal of Pure and Applied Mathematics: Advances and Applications 9(1), 63–68 (2013)
Cuturi, M.: Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. In: Advances in Neural Information Processing Systems (2013)
Davidson, K.R., Szarek, S.J.: Local Operator Theory, Random Matrices and Banach Spaces. In: W.B. Johnson, J. Lindenstrauss (eds.) Handbook of the Geometry of Banach Spaces, chap. 8, pp. 317–366. Elsevier B. V. (2001)
DeVore, R.A., Temlyakov, V.N.: Some Remarks on Greedy Algorithms. Advances in Computational Mathematics 5(1), 173–187 (1996). https://doi.org/10.1007/BF02124742
Donoho, D.L.: Compressed Sensing. IEEE Transactions on Information Theory 52(4), 1289–1306 (2006). https://doi.org/10.1109/TIT.2006.871582
Donoho, D.L.: For Most Large Underdetermined Systems of Linear Equations the Minimal \(\ell _1\)-norm Solution Is Also the Sparsest Solution. Communications on Pure and Applied Mathematics 59(6), 797–829 (2006). https://doi.org/10.1002/cpa.20132
Donoho, D.L., Huo, X.: Uncertainty Principles and Ideal Atomic Decomposition. IEEE Transactions on Information Theory 47(7), 2845–2862 (2001)
Elad, M.: Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer (2010). https://doi.org/10.1007/978-1-4419-7011-4
Fazel, M.: Matrix Rank Minimization with Applications. Ph.D. thesis, Department of Electrical Engineering, Stanford University (2002)
Fazel, M., Candès, E., Recht, B., Parrilo, P.: Compressed Sensing and Robust Recovery of Low Rank Matrices. In: 42nd IEEE Asilomar Conference on Signals, Systems and Computers (2008)
Garg, A., Gurvits, L., Oliveira, R., Wigderson, A.: A Deterministic Polynomial Time Algorithm for Non-Commutative Rational Identity Testing with Applications. In: IEEE 57th Annual Symposium on Foundations of Computer Science (2016). https://doi.org/10.1109/FOCS.2016.95
Ge, R., Lee, J.D., Ma, T.: Matrix Completion has No Spurious Local Minimum. In: Advances in Neural Information Processing Systems (2016)
Goldfarb, D., Ma, S.: Convergence of Fixed-Point Continuation Algorithms for Matrix Rank Minimization. Foundations of Computational Mathematics 11, 183–210 (2011). https://doi.org/10.1007/s10208-011-9084-6
Gorman, W.M.: Estimating Trends in Leontief Matrices. Unpublished note, referenced in Bacharach (1970) (1963)
Gouveia, J., Parrilo, P.A., Thomas, R.R.: Lifts of Convex Sets and Cone Factorizations. Mathematics of Operations Research 38(2), 248–264 (2013). https://doi.org/10.1287/moor.1120.0575
Gribonval, R., Jenatton, R., Bach, F., Kleinsteuber, M., Seibert, M.: Sample Complexity of Dictionary Learning and Other Matrix Factorizations. IEEE Transactions on Information Theory 61(6), 3469–3486 (2015). https://doi.org/10.1109/TIT.2015.2424238
Gurvits, L.: Classical Complexity and Quantum Entanglement. Journal of Computer and Systems Sciences 69(3), 448–484 (2004). https://doi.org/10.1016/j.jcss.2004.06.003
Idel, M.: A Review of Matrix Scaling and Sinkhorn’s Normal Form for Matrices and Positive Maps. CoRR arXiv:1609.06349 (2016)
Jain, P., Meka, R., Dhillon, I.S.: Guaranteed Rank Minimization via Singular Value Projection. In: Advances in Neural Information Processing Systems (2009)
Jones, L.K.: A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training. The Annals of Statistics 20(1), 608–613 (1992). https://doi.org/10.1214/aos/1176348546
Kato, T.: Perturbation Theory for Linear Operators. Springer-Verlag (1966)
Khachiyan, L., Kalantari, B.: Diagonal Matrix Scaling and Linear Programming. SIAM Journal on Optimization 2(4), 668–672 (1991). https://doi.org/10.1137/0802034
Linial, N., Samorodnitsky, A., Wigderson, A.: A Deterministic Strongly Polynomial Algorithm for Matrix Scaling and Approximate Permanents. Combinatorica 20(4), 545–568 (2000). https://doi.org/10.1007/s004930070007
Mairal, J., Bach, F., Ponce, J.: Sparse Modeling for Image and Vision Processing. Foundations and Trends in Computer Graphics and Vision 8(2–3), 85–283 (2014). https://doi.org/10.1561/0600000058
Marcus, M., Moyls, B.N.: Transformations on Tensor Product Spaces. Pacific Journal of Mathematics 9(4), 1215–1221 (1959)
Meinshausen, N., Bühlmann, P.: High-Dimensional Graphs and Variable Selection with the Lasso. The Annals of Statistics 34(3), 1436–1462 (2006). https://doi.org/10.1214/009053606000000281
Natarajan, B.K.: Sparse Approximate Solutions to Linear Systems. SIAM Journal on Computing 24(2), 227–234 (1995). https://doi.org/10.1137/S0097539792240406
Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM Studies in Applied and Numerical Mathematics (1994). https://doi.org/10.1137/1.9781611970791
Olshausen, B.A., Field, D.J.: Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature 381, 607–609 (1996). https://doi.org/10.1038/381607a0
Oymak, S., Hassibi, B.: Sharp MSE Bounds for Proximal Denoising. Foundations of Computational Mathematics 16(4), 965–1029 (2016). https://doi.org/10.1007/s10208-015-9278-4
Parikh, N., Boyd, S.: Proximal Algorithms. Foundations and Trends in Optimization 1(3), 127–239 (2014). https://doi.org/10.1561/2400000003
Pisier, G.: Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit “Maurey-Schwartz”) pp. 1–12 (1981)
Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review 52(3), 471–501 (2010). https://doi.org/10.1137/070697835
Renegar, J.: A Mathematical View of Interior-Point Methods in Convex Optimization. MOS-SIAM Series on Optimization (2001). https://doi.org/10.1137/1.9780898718812
Schnass, K.: On the Identifiability of Overcomplete Dictionaries via the Minimisation Principle Underlying K-SVD. Applied and Computational Harmonic Analysis 37(3), 464–491 (2014). https://doi.org/10.1016/j.acha.2014.01.005
Schnass, K.: Convergence Radius and Sample Complexity of ITKM Algorithms for Dictionary Learning. Applied and Computational Harmonic Analysis (2016). https://doi.org/10.1016/j.acha.2016.08.002
Shah, P., Bhaskar, B.N., Tang, G., Recht, B.: Linear System Identification via Atomic Norm Regularization. In: 51st IEEE Conference on Decisions and Control (2012)
Sinkhorn, R.: A Relationship Between Arbitrary Positive Matrices and Doubly Stochastic Matrices. The Annals of Mathematical Statistics 35(2), 876–879 (1964). https://doi.org/10.1214/aoms/1177703591
Spielman, D.A., Wang, H., Wright, J.: Exact Recovery of Sparsely-Used Dictionaries. Journal of Machine Learning Research: Workshop and Conference Proceedings 23(37), 1–18 (2012)
Stewart, G., Sun, J.: Matrix Perturbation Theory. Academic Press (1990)
Sun, J., Qu, Q., Wright, J.: A Geometric Analysis of Phase Retrieval. Foundations of Computational Mathematics (2017). https://doi.org/10.1007/s10208-017-9365-9
Sun, J., Qu, Q., Wright, J.: Complete Dictionary Recovery over the Sphere I: Overview and the Geometric Picture. IEEE Transactions on Information Theory 63(2), 853–884 (2017). https://doi.org/10.1109/TIT.2016.2632162
Sun, J., Qu, Q., Wright, J.: Complete Dictionary Recovery over the Sphere II: Recovery by Riemannian Trust-region Method. IEEE Transactions on Information Theory 63(2), 885–914 (2017). https://doi.org/10.1109/TIT.2016.2632149
Tibshirani, R.: Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267–288 (1996)
Toh, K.C., Todd, M.J., Tütüncü, R.H.: SDPT3 – a MATLAB Software Package for Semidefinite Programming. Optimization Methods and Software 11, 545–581 (1999). https://doi.org/10.1080/10556789908805762
Tropp, J.A.: User-Friendly Tail Bounds for Sums of Random Matrices. Foundations of Computational Mathematics 12(4), 389–434 (2012). https://doi.org/10.1007/s10208-011-9099-z
Tunçel, L.: Potential Reduction and Primal-Dual Methods. In: H. Wolkowicz, R. Saigal, L. Vandenberghe (eds.) Handbook of Semidefinite Programming – Theory, Algorithms, and Applications, chap. 9. Kluwer’s International Series in Operations Research and Management Science (2000). https://doi.org/10.1007/978-1-4615-4381-7
Vainsencher, D., Mannor, S., Bruckstein, A.M.: The Sample Complexity of Dictionary Learning. Journal of Machine Learning Research 12 (2011)
Yannakakis, M.: Expressing Combinatorial Optimization Problems by Linear Programs. Journal of Computer and System Sciences 43, 441–466 (1991). https://doi.org/10.1016/0022-0000(91)90024-Y
Acknowledgements
The authors were supported in part by NSF Career award CCF-1350590, by Air Force Office of Scientific Research Grants FA9550-14-1-0098 and FA9550-16-1-0210, by a Sloan research fellowship, and an A*STAR (Agency for Science, Technology, and Research, Singapore) fellowship. The authors thank Joel Tropp for a helpful remark that improved the result in Proposition 17.
Additional information
Communicated by James Renegar.
Appendices
Appendix A: Proofs of Lemmas 1 and 2
Proof
(Lemma 1) Note that if \(Z \in \mathcal {T}\), then Z has rank at most 2r. As a consequence of the restricted isometry property, we have \((1-\delta _{2r}) \Vert Z \Vert _{\ell _2}^2 \le \Vert [\mathcal {L}\circ \mathcal {P}_{\mathcal {T}}] (Z)\Vert _{\ell _2}^2 \le (1+\delta _{2r}) \Vert Z \Vert _{\ell _2}^2\). Since \(Z \in \mathcal {T}\) is arbitrary, we have \(1 - \delta _{2r} \le \lambda (\mathcal {L}_{\mathcal {T}}^{\prime } \mathcal {L}_{\mathcal {T}}) \le 1+\delta _{2r}\), which proves (i). This immediately implies the bound in (ii). Moreover since \(\Vert \mathcal {L}\circ \mathcal {P}_{\mathcal {T}}\Vert _{2} = \Vert \mathcal {P}_{\mathcal {T}} \circ \mathcal {L}^{\prime } \mathcal {L}\circ \mathcal {P}_{\mathcal {T}} \Vert _{2}^{1/2} \le \sqrt{1+\delta _{2r}}\), we have \(\Vert \mathcal {P}_{\mathcal {T}} \circ \mathcal {L}^{\prime } \mathcal {L}\Vert _{2} \le \sqrt{1+\delta _{2r}}\Vert \mathcal {L}\Vert _{2} \), which is (iii). Last we have \(\Vert [(\mathcal {L}_{\mathcal {T}}^{\prime } \mathcal {L}_{\mathcal {T}})^{-1}]_{\mathbb {R}^{q \times q}} \circ \mathcal {L}^{\prime } \mathcal {L}\Vert _{2} \le \Vert [(\mathcal {L}_{\mathcal {T}}^{\prime } \mathcal {L}_{\mathcal {T}})^{-1}]_{\mathbb {R}^{q \times q}} \Vert _{2} \Vert \mathcal {P}_{\mathcal {T}} \circ \mathcal {L}^{\prime } \mathcal {L}\Vert _{2} \le \frac{\sqrt{1+\delta _{2r}}}{1-\delta _{2r}}\Vert \mathcal {L}\Vert _{2}\), which proves (iv). \(\square \)
Proof
(Lemma 2) To simplify notation, we omit \((\mathfrak {X})\). Since \(\mathrm {trace}(\mathsf {\Sigma }) = \frac{1}{n}\sum _{j=1}^{n}\Vert X^{(j)}\Vert _{\ell _2}^2\), we have \(s_{\min } \le \mathrm {trace}(\mathsf {\Sigma }) \le s_{\max }\). Next we have the inequalities \((\varLambda - \varDelta ) \mathsf {I}\preceq \mathsf {\Sigma } \preceq (\varLambda + \varDelta ) \mathsf {I}\). The result follows by applying trace. \(\square \)
Appendix B: Proof of Proposition 8
In this section, we prove that the ensemble of random matrices \(\mathfrak {X}\) described in Proposition 8 satisfies the deterministic conditions in Theorem 3 with high probability. We begin by computing \(\mathbb {E}_{\mathcal {D}} [X^{(j)} \boxtimes X^{(j)}]\) and \(\mathbb {E}_{\mathcal {D}} [(X^{(j)} \boxtimes X^{(j)}) {\otimes }\mathcal {P}_{\mathcal {T}(X^{(j)})}]\). Note that the random matrices \(\{X^{(j)} \boxtimes X^{(j)}\}_{j=1}^{n}\) and the random operators \(\{(X^{(j)} \boxtimes X^{(j)}) {\otimes }\mathcal {P}_{\mathcal {T}(X^{(j)})}\}_{j=1}^{n}\) are almost surely bounded in spectral norm by construction. This allows us to conclude Proposition 8 with an application of the Matrix Hoeffding Inequality [63].
To simplify notation, we adopt the following. In the first two results, we omit the superscript j from \(X^{(j)}\). In the remainder of the section, we let \(\mathbb {E}=\mathbb {E}_{\mathcal {D}}\), \(\bar{s}^2 := \mathbb {E} [s^2]\), \(\{\mathbf {e}_i\}_{i=1}^{q} \subset \mathbb {R}^q\) be the set of standard basis vectors, and \(\{E_{ij}\}_{i,j=1}^{q} \subset \mathbb {R}^{q\times q}\) be the set of matrices whose (i, j)th entry is 1 and is 0 everywhere else.
Proposition 9
Suppose \(X\sim \mathcal {D}\) as described in Proposition 8. Then, \(\mathbb {E}[X \boxtimes X] = \bar{s}^2 (r/q^2) \mathsf {I}\).
Proof
It suffices to show that \(\mathbb {E} \langle X \boxtimes X,\mathbf {e}_w \mathbf {e}_x^{\prime } \boxtimes \mathbf {e}_y \mathbf {e}_z^{\prime } \rangle = \mathbb {E} \langle X, \mathbf {e}_w \mathbf {e}_x^{\prime }\rangle \langle X, \mathbf {e}_y \mathbf {e}_z^{\prime }\rangle = \delta _{wy} \delta _{xz} \bar{s}^2 (r/q^2)\). Let \(X = \sum _{i=1}^{r} s_{i} \mathbf {u}_{i} \mathbf {v}_{i}^{\prime }\) as described in the statement of Proposition 8. Suppose we denote \(\mathbf {u}_{i} = (u_{i1},\ldots ,u_{iq})^{\prime }\), and \(\mathbf {v}_{i} = (v_{i1},\ldots ,v_{iq})^{\prime }\). By applying independence, we have \(\mathbb {E} \langle X, \mathbf {e}_w \mathbf {e}_x^{\prime }\rangle \langle X, \mathbf {e}_y \mathbf {e}_z^{\prime }\rangle = \mathbb {E}[(\sum _{i=1}^{r} s_i u_{iw} v_{ix})(\sum _{k=1}^{r} s_k u_{ky} v_{kz})] = \sum _{i,k=1}^{r} \mathbb {E}[s_{i}s_{k}] \mathbb {E}[u_{iw}u_{ky}] \mathbb {E}[v_{ix}v_{kz}]\). There are two cases we need to consider.
[Case \(w \ne y\) or \(x \ne z\)]: Without loss of generality, suppose that \(w \ne y\). Then, \(\mathbb {E}[u_{iw}u_{ky}]=0\) for all \(1\le i,k \le q\), and hence \(\mathbb {E} \langle X \boxtimes X,\mathbf {e}_w \mathbf {e}_x^{\prime } \boxtimes \mathbf {e}_y \mathbf {e}_z^{\prime } \rangle =0\).
[Case \(w=y\) and \(x=z\)]: Note that if \(i \ne k\), then \(\mathbb {E}[u_{iw}u_{ky}]=\mathbb {E}[u_{iw}] \mathbb {E}[u_{ky}]=0\). Since \(\mathbf {u}_i\) is a unit-norm vector distributed u.a.r., we have \(\mathbb {E}[u_{ix}^2] = 1/q\). Hence \(\mathbb {E} \langle X \boxtimes X,\mathbf {e}_w \mathbf {e}_x^{\prime } \boxtimes \mathbf {e}_y \mathbf {e}_z^{\prime } \rangle = \sum _{i=1}^{r} \mathbb {E}[s_{i}^2] \mathbb {E}[u_{iw}^2] \mathbb {E}[v_{ix}^2] = \bar{s}^2 r /q^2\). \(\square \)
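As a quick sanity check of Proposition 9 (an illustrative Monte Carlo sketch, not part of the proof), one can identify \(X \boxtimes X\) with the outer product of \(\mathrm {vec}(X)\) with itself, consistent with the inner-product identity used above, draw U, V Haar-uniform, and take the nonzero singular values i.i.d. from an interval; the empirical mean then approaches \(\bar{s}^2 (r/q^2) \mathsf {I}\). The sampling model and parameters below are assumptions made only for this illustration.

```python
# Monte Carlo check that E[vec(X) vec(X)'] ≈ (s̄² r / q²) I for X = sum_i s_i u_i v_i' with
# u_i, v_i columns of Haar-distributed orthogonal matrices; s_i ~ Uniform[1, 2] is an
# assumed choice made only for this illustration.
import numpy as np

rng = np.random.default_rng(1)
q, r, n = 5, 2, 50000

def haar(q):
    Z = rng.standard_normal((q, q))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))   # sign correction gives a Haar-distributed orthogonal matrix

acc, s_sq = np.zeros((q * q, q * q)), 0.0
for _ in range(n):
    U, V = haar(q), haar(q)
    s = rng.uniform(1.0, 2.0, size=r)
    s_sq += np.mean(s ** 2)
    v = ((U[:, :r] * s) @ V[:, :r].T).reshape(-1)   # vec(X)
    acc += np.outer(v, v)
pred = (s_sq / n) * r / q ** 2                      # \bar{s}^2 r / q^2
print(np.linalg.norm(acc / n - pred * np.eye(q * q)))   # shrinks like 1/sqrt(n)
```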
Our next result requires the definition of certain subspaces of \(\mathbb {R}^{q\times q}\) and \(\mathrm {End}(\mathbb {R}^{q\times q})\).
We define the following subspaces in \(\mathbb {R}^{q\times q}\): Let \(\mathcal {G}:=\{W : W = W^{\prime }, W \in I^{\perp } \}\) be the subspace of symmetric matrices that are orthogonal to the identity, \(\mathcal {H}:=\{W : W = -W^{\prime }\}\) be the subspace of skew-symmetric matrices, and \(\mathcal {I}= \mathrm {Span}(I)\). It is clear that \(\mathbb {R}^{q\times q} = \mathcal {G}\oplus \mathcal {H}\oplus \mathcal {I}\).
In addition to the subspace \(\mathcal {W}\) defined in (12), we define the following subspaces in \(\mathrm {End}(\mathbb {R}^{q\times q})\):
1. \(\mathcal {W}_{SS} := \mathrm {Span}(\{A {\otimes }B : A,B \in \mathcal {G}\})\),
2. \(\mathcal {W}_{AA} := \mathrm {Span}(\{A {\otimes }B : A,B \in \mathcal {H}\})\),
3. \(\mathcal {W}_{SA} := \mathrm {Span}(\{A {\otimes }B : A \in \mathcal {G}, B \in \mathcal {H}\})\),
4. \(\mathcal {W}_{AS} := \mathrm {Span}(\{A {\otimes }B : A \in \mathcal {H}, B \in \mathcal {G}\})\).
Note that \(\mathrm {End}(\mathbb {R}^{q\times q}) = \mathcal {W}\oplus \mathcal {W}_{SS} \oplus \mathcal {W}_{AA} \oplus \mathcal {W}_{SA} \oplus \mathcal {W}_{AS}\). To verify this, first express an arbitrary linear map \(\mathsf {E}\in \mathrm {End}(\mathbb {R}^{q\times q})\) as a sum of Kronecker products \(\mathsf {E}= \sum _{i} A_i {\otimes }B_i\), second decompose each matrix \(A_i,B_i\) into components in the subspaces \(\{ \mathcal {G}, \mathcal {H}, \mathcal {I}\}\), and third expand the expression. The orthogonality between subspaces is immediate from the identity \(\langle A_i {\otimes }B_i , A_j {\otimes }B_j \rangle = \langle A_i , A_j \rangle \langle B_i, B_j \rangle \).
Proposition 10
Suppose \(X\sim \mathcal {D}\) as described in Proposition 8. Then,
where (i) \(c_{\mathcal {W}} = \bar{s}^2 r(\frac{1}{q^2})\), (ii) \(c_{\mathcal {W}_{SS}} = \bar{s}^2 r(\frac{1}{q^2} - \frac{(q-r)^2}{(q-1)^2(q+2)^2}) \), (iii) \(c_{\mathcal {W}_{AA}} = \bar{s}^2 r(\frac{1}{q^2}- \frac{(q-r)^2}{q^2 (q-1)^2} )\), and (iv) \(c_{\mathcal {W}_{SA}} = c_{\mathcal {W}_{AS}} = \bar{s}^2 r(\frac{1}{q^2} - \frac{(q-r)^2}{q(q-1)^2(q+2)})\).
Proof
The proof consists of two parts, namely (i) to prove that the mean, when restricted to the respective subspaces described above, has diagonal entries as specified, and (ii) to prove that the off-diagonal elements are zero with respect to any basis that obeys the specified decomposition of \(\mathrm {End}(\mathbb {R}^{q\times q})\). In addition, it suffices to only consider linear maps that are Kronecker products since these maps generate the respective subspaces. The following identity for all matrices \(A_i,B_i,A_j,B_j\) is particularly useful
One may equivalently describe the distribution of X as follows—let \(X = U \varSigma _R V^{\prime }\), where U, V are \(q\times q\) matrices drawn from the Haar measure and \(\varSigma _R\) is a diagonal matrix whose first r entries are drawn from \(\mathcal {D}\), and the remaining entries are 0 (to simplify notation, we omit the dependence on X in the matrices U, V). Let \(I_N = \mathrm {diag}(0,\ldots , 0, 1,\ldots ,1)\) be a diagonal matrix consisting of \(q-r\) ones. Under this notation, the projector is simply the map \(\mathcal {P}_{\mathcal {T}(X)} (Z) = Z - U I_N U^{\prime }ZV I_N V^{\prime } \). The remainder of the proof is divided into the two parts outlined above.
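The formula for \(\mathcal {P}_{\mathcal {T}(X)}\) above can be checked numerically; the following short sketch (with assumed dimensions, and with the r nonzero singular values placed in the first r diagonal positions so that \(I_N\) selects the trailing \(q-r\) coordinates) verifies that it agrees with the usual tangent-space projector \(Z \mapsto P_U Z + Z P_V - P_U Z P_V\) and that its output lies in \(\mathcal {T}(X)\).

```python
# Check that Z - U I_N U' Z V I_N V' equals the standard tangent-space projector at a
# rank-r point X = U Σ_R V' (top-r singular subspaces spanned by the first r columns of U, V).
import numpy as np

rng = np.random.default_rng(2)
q, r = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((q, q)))
V, _ = np.linalg.qr(rng.standard_normal((q, q)))
I_N = np.diag(np.r_[np.zeros(r), np.ones(q - r)])   # q - r trailing ones

Z = rng.standard_normal((q, q))
proj_claim = Z - U @ I_N @ U.T @ Z @ V @ I_N @ V.T

P_U = U[:, :r] @ U[:, :r].T                          # projector onto top-r left subspace
P_V = V[:, :r] @ V[:, :r].T                          # projector onto top-r right subspace
proj_std = P_U @ Z + Z @ P_V - P_U @ Z @ P_V

print(np.linalg.norm(proj_claim - proj_std))                                # ≈ 1e-15
print(np.linalg.norm((np.eye(q) - P_U) @ proj_claim @ (np.eye(q) - P_V)))   # output lies in T(X)
```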
[Part (i)]: The restriction to diagonal entries corresponds to the case \(i=j\), and hence equation (30) simplifies to \(\mathbb {E} [\Vert \mathcal {P}_{\mathcal {T}(X)} (B X A) \Vert ^2_{\ell _2} ]\). Consequently, we have
First, we compute \(\mathbb {E} [ \Vert I_N U^{\prime } A U \varSigma _R V^{\prime } B V I_N \Vert _{\ell _2}^2 ]\). By the cyclicity of trace and iterated expectations, we have
It suffices to compute \(\mathbb {E} [\varSigma _R^{1/2} V^{\prime } B V I_N V^{\prime } B^{\prime } V \varSigma _R^{1/2}]=\varSigma _R^{1/2} \mathbb {E} [ V^{\prime } B V I_N V^{\prime } B^{\prime } V] \varSigma _R^{1/2}\) in the three cases corresponding to \(B \in \{ \mathcal {G}, \mathcal {H}, \mathcal {I}\}\), respectively. Using linearity and symmetry, it suffices to compute \(\mathbb {E} [V^{\prime } B V E_{11} V^{\prime } B^{\prime } V]\). We split this computation into the following three separate cases.
[Case \(B \in \mathcal {I}\)]: We have \(I_N \varSigma _R^{1/2} = 0\), and hence the mean is the zero matrix.
[Case \(B \in \mathcal {H}\)]: Claim: If \(B \in \mathcal {H}\), and \(\Vert B\Vert _{\ell _2} = 1\), then \(\mathbb {E} [V^{\prime } B V E_{11} V^{\prime } B^{\prime } V] = (I- E_{11})/(q(q-1))\).
Proof Denote \(V=[\mathbf {v}_1 | \ldots | \mathbf {v}_q]\). The off-diagonal entries vanish as \(\mathbb {E} \langle E_{ij}, V^{\prime } B V E_{11} V^{\prime } B^{\prime } V\rangle = \mathbb {E} (\mathbf {v}_1^{\prime } B \mathbf {v}_i)(\mathbf {v}_1^{\prime } B \mathbf {v}_j) = 0\) whenever \(i \ne j\), as one of the indices i, j appears exactly once. By a symmetry argument, we have \(\mathbb {E} [V^{\prime } B V E_{11} V^{\prime } B^{\prime } V] = \alpha I + \beta E_{11}\) for some \(\alpha ,\beta \). First \(\mathbb {E} [ \mathrm {trace}(V^{\prime } B V E_{11} V^{\prime } B^{\prime } V )] = \mathbb {E} [ \mathrm {trace}(B V E_{11} V^{\prime } B^{\prime })] = \mathrm {trace}(B \mathbb {E} [ V E_{11} V^{\prime }] B^{\prime }) = \mathrm {trace} (B (I/q) B^{\prime }) = 1/q\), which gives \(\alpha q + \beta = 1/q\). Second, since B is skew-symmetric, \(V^{\prime } B V\) is also skew-symmetric and hence has zero diagonal entries. Thus, \(\langle V^{\prime } B V E_{11} V^{\prime } B^{\prime } V, E_{11} \rangle = 0\), which gives \(\alpha + \beta = 0\). The two equations yield the values of \(\alpha \) and \(\beta \).
[Case: \(B \in \mathcal {G}\)]: Claim: If \(B \in \mathcal {G}\), and \(\Vert B\Vert _{\ell _2} = 1\), then \(\mathbb {E} [V^{\prime } B V E_{11} V^{\prime } B^{\prime } V] = (I + (1-2/q) E_{11})/((q-1)(q+2))\).
Proof With an identical argument as the previous claim, one has \(\mathbb {E} [V^{\prime } B V E_{11} V^{\prime } B^{\prime } V] = \alpha I + \beta E_{11}\), where \(\alpha q + \beta = 1/q\). Next \(\mathbb {E} [ \langle V^{\prime } B V E_{11} V^{\prime } B^{\prime } V, E_{11} \rangle ] = \mathbb {E} [(\mathbf {v}_1^{\prime } B \mathbf {v}_1)^2]\), where \(\mathbf {v}_1\) is a unit-norm vector distributed u.a.r. Since conjugation by orthogonal matrices preserves trace, and \(\mathbf {v}_1\) has the same distribution as \(Q\mathbf {v}_1\) for any orthogonal Q, we may assume that \(B = \mathrm {diag}(b_{11},\ldots , b_{qq})\) is diagonal without loss of generality. Suppose we let \(\mathbf {v}_1 = (v_1,\ldots ,v_q)^{\prime }\). Then, \(\mathbb {E} [(\mathbf {v}_1^{\prime } B \mathbf {v}_1)^2] = \mathbb {E} [ \sum b^2_{ii} v_{i}^4 + \sum _{i\ne j} b_{ii} b_{jj} v_{i}^2 v_{j}^2 ] = \mu _1 (\sum b_{ii}^2) + \mu _2 (\sum _{i \ne j} b_{ii} b_{jj} ) \), where \(\mu _1 = \mathbb {E} [v_1^4]\), and \(\mu _2 = \mathbb {E} [v_1^2 v_2^2]\). Since \(\mathrm {trace}(B) = 0\), we have \(\sum b_{ii}^2 = - \sum _{i \ne j} b_{ii} b_{jj}\). Last from Theorem 2 of [18], we have \(\mu _1 = 3 / (q (q+2))\), and \(\mu _2 = 1/(q(q+2))\), which gives \(\mathbb {E} [(\mathbf {v}_1^{\prime } B \mathbf {v}_1)^2] = 2/ (q(q+2))\), and hence \(\alpha + \beta = 2/(q(q+2))\). The two equations yield the values of \(\alpha \) and \(\beta \).
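Both claims are easy to verify by simulation; the following Monte Carlo sketch (with assumed dimension and sample size, purely for illustration) compares the empirical mean of \(V^{\prime } B V E_{11} V^{\prime } B^{\prime } V\) against the two closed forms above.

```python
# Monte Carlo check of the two claims: for Haar-distributed V and unit-Frobenius-norm B,
#   skew-symmetric B:       E[V'BV E11 V'B'V] = (I - E11) / (q(q-1)),
#   traceless symmetric B:  E[V'BV E11 V'B'V] = (I + (1 - 2/q) E11) / ((q-1)(q+2)).
import numpy as np

rng = np.random.default_rng(3)
q, n = 5, 50000
E11 = np.zeros((q, q)); E11[0, 0] = 1.0

def haar(q):
    Z = rng.standard_normal((q, q))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))

def empirical_mean(B):
    M = np.zeros((q, q))
    for _ in range(n):
        V = haar(q)
        W = V.T @ B @ V
        M += W @ E11 @ W.T       # equals V'BV E11 V'B'V
    return M / n

B = rng.standard_normal((q, q))
B_skew = B - B.T; B_skew /= np.linalg.norm(B_skew)
B_sym = B + B.T; B_sym -= (np.trace(B_sym) / q) * np.eye(q); B_sym /= np.linalg.norm(B_sym)

pred_skew = (np.eye(q) - E11) / (q * (q - 1))
pred_sym = (np.eye(q) + (1 - 2.0 / q) * E11) / ((q - 1) * (q + 2))
print(np.linalg.norm(empirical_mean(B_skew) - pred_skew))   # small, shrinks with n
print(np.linalg.norm(empirical_mean(B_sym) - pred_sym))
```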
With a similar set of computations, one can show that \(\mathbb {E} [\Vert B U \varSigma _R V^{\prime } A \Vert _{\ell _2}^2 ] = \bar{s}^2 r/q^2\) for arbitrary unit-norm A, B. An additional set of computations yields the diagonal entries, which completes the proof. We omit these computations.
[Part (ii)]: We claim that it suffices to show that \(\mathbb {E} [ V^{\prime } A_i V E_{11} V^{\prime } A_j^{\prime } V] \) is the zero matrix whenever \(A_i,A_j \in \{ \mathcal {G}, \mathcal {H}, \mathcal {I}\}\), and satisfy \(\langle A_i, A_j \rangle = 0\). We show how this proves the result. Suppose \(A_i {\otimes }B_i, A_j {\otimes }B_j\) satisfy \(\langle A_i {\otimes }B_i , A_j {\otimes }B_j \rangle = \langle A_i, A_j \rangle \langle B_i, B_j \rangle = 0\). Without loss of generality, we may assume that \(\langle A_i, A_j \rangle = 0\). From equation (30), we have
By cyclicity of trace and iterated expectations, we have
which proves part (ii). It remains to prove the claim. We do so by verifying that the matrix \(\mathbb {E} [ V^{\prime } A_i V E_{11} V^{\prime } A_j^{\prime } V] \) is 0 in every coordinate, which is equivalent to showing that \(\mathbb {E} (\mathbf {v}^{\prime }_{m} A_i \mathbf {v}_1) (\mathbf {v}^{\prime }_{n} A_j \mathbf {v}_1) =0\) for all m, n. There are three cases.
[Case \(m \ne n\)]: Without loss of generality, suppose that \(m \ne 1\). Then, \(\mathbb {E} (\mathbf {v}^{\prime }_{m} A_i \mathbf {v}_1) (\mathbf {v}^{\prime }_{n} A_j \mathbf {v}_1) = \mathbb {E}[\mathbb {E} [ (\mathbf {v}^{\prime }_{m} A_i \mathbf {v}_1) (\mathbf {v}^{\prime }_{n} A_j \mathbf {v}_1)|\mathbf {v}_1,\mathbf {v}_n]] = 0\).
[Case \(m = n = 1\)]: We divide into further subcases depending on the subspaces \(A_i,A_j\) belong to. If \(A_i \in \mathcal {H}\), then \(\mathbf {v}_1^{\prime }A_i \mathbf {v}_1 = 0\) since it is a scalar. Hence, we eliminate the case where either matrix is in \(\mathcal {H}\). Since \(\langle A_i,A_j \rangle = 0\) it cannot be that both \(A_i,A_j \in \mathcal {I}\). Suppose that \(A_i = I / \sqrt{q}\) and \(A_j \in \mathcal {G}\). Then, \(\mathbb {E} [ (\mathbf {v}_1^{\prime }A_i \mathbf {v}_1) (\mathbf {v}_1^{\prime }A_j \mathbf {v}_1)] = \mathbb {E} [ (\mathbf {v}_1^{\prime }A_j \mathbf {v}_1)]/\sqrt{q} = \mathbb {E} [ \mathrm {trace}(A_j \mathbf {v}_1 \mathbf {v}^{\prime }_1)]/\sqrt{q}=0\). Our remaining case is when \(A_i,A_j \in \mathcal {G}\), and \(\langle A_i,A_j \rangle = 0\). As before, we let \(\mathbf {v}_1 = (v_1,\ldots ,v_q)^{\prime }\). Then,
where in the second equality we used the fact that \(A_i,A_j\) are symmetric to obtain a factor of 2 in the last term. Next we apply the relations \(\mathbb {E}[v_p^4] = 3/(q(q+2))\), \( \mathbb {E}[v_p^2 v_r^2] = 1/(q(q+2))\), as well as the relations \(0=\langle A_i, I \rangle \langle A_j, I\rangle =\sum _{p} A_{i,pp}A_{j,pp} + \sum _{p\ne r} A_{i,pp}A_{j,rr}\), and \(0=\langle A_i, A_j \rangle = \sum _{p} A_{i,pp}A_{j,pp} + \sum _{p\ne q} A_{i,pq}A_{j,pq}\) to conclude that the mean is zero.
[Case \(m = n \ne 1\)]: We have
where the first equality applies the fact that conditioned on \(\mathbf {v}_1\), \(\mathbb {E}[\mathbf {v}_m \mathbf {v}^{\prime }_m]\) is the identity matrix in the subspace \(\mathcal {T}(\mathbf {v}_1 \mathbf {v}_1^{\prime })^{\perp }\) suitably scaled, and the second inequality applies the previous case. \(\square \)
Proof
(Proposition 8) First, we have \(\mathsf {0} \preceq X^{(j)} \boxtimes X^{(j)} \preceq s^2 r \mathsf {I}\). By Proposition 9, we have \(\mathbb {E}[X^{(j)} \boxtimes X^{(j)}] = (\bar{s}^2 r/q^2) \mathsf {I}\). Since \((X^{(j)} \boxtimes X^{(j)} - (\bar{s}^2 r/q^2) \mathsf {I})^2 \preceq s^4 r^2 \mathsf {I}\), we have \( \mathbb {P} ( \Vert (1/n)\sum _{j=1}^{n} X^{(j)} \boxtimes X^{(j)} - (\bar{s}^2 r/q^2)\mathsf {I}\Vert > t r s^2 ) \le 2q \exp (- t^2 n / 8)\) via an application of the Matrix Hoeffding Inequality (Theorem 1.3 in [63]).
Second, we have \(\Vert X^{(j)} \boxtimes X^{(j)}\Vert _{2} \le s^2 r\) and \(\Vert \mathcal {P}_{\mathcal {T}(X^{(j)})}\Vert _{2} = 1\), and hence \(\Vert (X^{(j)} \boxtimes X^{(j)}) {\otimes }\mathcal {P}_{\mathcal {T}(X^{(j)})}\Vert _{2} \le s^2 r\). From Proposition 10, we have
Since , we have
by an application of the Matrix Hoeffding Inequality.
Let \(t = t_1 /(5q^2)\) in the first concentration bound, and \(t= t_2 / (5q^2)\) in the second concentration bound. Then, \(\varDelta ( \mathfrak {X}) \le t_1 s^2 r / (5q^2)\), and \(\varOmega ( \mathfrak {X}) \le 16s^2 r^2/q^3 + t_2 s^2 r / (5q^2)\), with probability greater than \(1- 2q\exp (-n t_1^2 / (200q^4)) - q \exp (- n t_2^2 / (200q^4))\). We condition on the event that both inequalities hold. Since \(\varDelta ( \mathfrak {X}) \le t_1 s^2 r / (5q^2) \le s^2 r/(20q^2)\), by Lemma 2, we have \(\varLambda ( \mathfrak {X}) \ge s^2 r/(5q^2)\), and hence \(\varDelta ( \mathfrak {X}) / \varLambda ( \mathfrak {X}) \le t_1\), and \(\varOmega ( \mathfrak {X}) / \varLambda ( \mathfrak {X}) \le 80r/q + t_2\). \(\square \)
Appendix C: Stability of Matrix and Operator Scaling
In this section, we prove a stability property of Sinkhorn scaling and Operator Sinkhorn scaling. For Sinkhorn scaling, we show that if a matrix is close to being doubly stochastic and has entries that are suitably bounded away from 0, then the resulting row and column scalings are close to \(\mathbf {1}:= (1,\ldots ,1)^{\prime }\). We also prove the operator analog of this result. These results are subsequently used to prove Propositions 2 and 7. We note that there is an extensive literature on the stability of matrix scaling, with results of a similar flavor to ours. However, Proposition 11 in this section is stated in a manner that is directly suited to our analysis, and we include it for completeness.
1.1 Main Results
Proposition 11
(Local stability of Matrix Scaling) Let \(T \in \mathbb {R}^{q\times q}\) be a matrix such that
1. \(|\langle \mathbf {e}_i, T(\mathbf {e}_j)\rangle - 1/q| \le 1/(2q)\) for all standard basis vectors \(\mathbf {e}_i,\mathbf {e}_j\); and
2. \(\epsilon := \max \{\Vert T \mathbf {1}- \mathbf {1}\Vert _{\infty },\Vert T^{\prime } \mathbf {1}- \mathbf {1}\Vert _{\infty } \} \le 1/(48 \sqrt{q})\).
Let \(D_1, D_2\) be diagonal matrices such that \(D_2 T D_1\) is doubly stochastic. Then,
Proposition 12
(Local stability of Operator Scaling) Let \({\mathbf {\mathsf{{T}}}}: \mathbb {S}^{q} \rightarrow \mathbb {S}^{q}\) be a rank-indecomposable linear operator such that
1. \(|\langle \mathbf {v}\mathbf {v}^{\prime }, {\mathbf {\mathsf{{T}}}}(\mathbf {u}\mathbf {u}^{\prime })\rangle -1/q| \le 1/(2q)\) for all unit-norm vectors \(\mathbf {u},\mathbf {v}\in \mathbb {R}^{q}\); and
2. \(\epsilon := \max \{ \Vert {\mathbf {\mathsf{{T}}}}(I)-I \Vert _{2} , \Vert {\mathbf {\mathsf{{T}}}}^{\prime }(I)-I \Vert _{2}\} \le 1/(48\sqrt{q})\).
Let \(N_1, N_2 \in \mathbb {S}^{q}\) be positive-definite matrices such that \( (N_2 {\otimes }N_2) \circ {\mathbf {\mathsf{{T}}}}\circ (N_1 {\otimes }N_1) \) is doubly stochastic. Then, \(\Vert N_2^2 {\otimes }N_1^2 - \mathsf {I}\Vert _{2} \le 96 \sqrt{q} \epsilon \). Furthermore, we have \(\Vert N_2 {\otimes }N_1 - \mathsf {I}\Vert _{2} \le 96 \sqrt{q} \epsilon \).
1.2 Proofs
The proof of Proposition 11 relies on the fact that matrix scaling can be cast as the solution of a convex program; specifically, we utilize the correspondence between diagonal matrices \(D_1, D_2\) such that \(D_2 T D_1\) is doubly stochastic, and the vectors \(\varvec{\varepsilon }:= (\varepsilon _1,\ldots , \varepsilon _q)^{\prime }, \varvec{\eta }:= (\eta _1,\ldots , \eta _q)^{\prime }\) that minimize the following convex function
\(F(\varvec{\varepsilon }, \varvec{\eta }) := \sum _{i,j=1}^{q} T_{ij} \exp (\varepsilon _i + \eta _j) - \sum _{i=1}^{q} \varepsilon _i - \sum _{j=1}^{q} \eta _j\)
via the maps \((D_2)_{ii} = \exp (\varepsilon _i)\) and \((D_1)_{jj} = \exp (\eta _j)\) [31] (see also [39])—this holds for all matrices T with positive entries. We remark that one can derive the above relationship from first-order optimality. In the following, we prove bounds on the minima of F (see Lemma 5).
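A small numerical sketch of this correspondence (with an assumed random positive matrix T and assumed solver settings, purely for illustration): minimizing F over \((\varvec{\varepsilon },\varvec{\eta })\) and setting \(D_2 = \mathrm {diag}(\exp (\varepsilon _i))\), \(D_1 = \mathrm {diag}(\exp (\eta _j))\) produces a doubly stochastic \(D_2 T D_1\).

```python
# Minimize F(ε, η) = sum_ij T_ij exp(ε_i + η_j) - sum_i ε_i - sum_j η_j for a generic
# positive T and read off the Sinkhorn scalings from the minimizer.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
q = 6
T = rng.uniform(0.5, 1.5, size=(q, q)) / q     # positive matrix with entries of size ~ 1/q

def F(x):
    eps, eta = x[:q], x[q:]
    return np.sum(T * np.exp(eps[:, None] + eta[None, :])) - eps.sum() - eta.sum()

def gradF(x):
    eps, eta = x[:q], x[q:]
    S = T * np.exp(eps[:, None] + eta[None, :])
    return np.r_[S.sum(axis=1) - 1.0, S.sum(axis=0) - 1.0]   # first-order conditions

res = minimize(F, np.zeros(2 * q), jac=gradF, method="BFGS", tol=1e-12)
eps, eta = res.x[:q], res.x[q:]
S = np.diag(np.exp(eps)) @ T @ np.diag(np.exp(eta))          # D2 T D1
print(np.abs(S.sum(axis=1) - 1).max(), np.abs(S.sum(axis=0) - 1).max())   # both ≈ 0
```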
The proof of Proposition 12 relies on a reduction to the setup in Proposition 11.
We begin with a lower estimate of the sum of exponential functions. We use the estimate to prove Proposition 11.
Definition 3
Let \(\alpha \ge 0\). Define the function \(c_{\alpha }: \mathbb {R} \rightarrow \mathbb {R}\)
Remark
Note that the function \(c_{\alpha }(\cdot )\) is continuous.
Lemma 3
For all x
Proof
(Lemma 3) The second derivative of \(\exp (x)\) is \(\exp (x)\), and it is greater than \(\exp (-\alpha )\) over all x such that \(|x| \le \alpha \). Hence, by strong convexity of \(\exp (x)\), we have \(\exp (x) \ge 1 + x + (1/2)\exp (-\alpha ) x^2\) over the interval \([-\alpha ,\alpha ]\).
It follows that \(\exp (\alpha ) \ge 1 + \alpha + c_{\alpha } (\alpha )\), and \(\exp (-\alpha ) \ge 1 - \alpha + c_{\alpha } (-\alpha )\). Since the function \(\exp (x)\) is convex, and \(c_{\alpha }\) is linear in the intervals \((-\infty ,-\alpha ]\) and \([\alpha ,\infty )\), respectively, it suffices to check that (i) the gradient of \(\exp (x)\) at \(x=\alpha \), which is \(\exp (\alpha )\), exceeds that of \(c_{\alpha }(\cdot )\), and (ii) the gradient of \(c_{\alpha }(\cdot )\) exceeds that of \(\exp (x)\) at \(x = -\alpha \), which is \(\exp (-\alpha )\).
First, we prove (i). Since \(\alpha \ge 0\), we have \(1+2\alpha \ge \sqrt{1+2\alpha }\). Hence, \(2 \exp (\alpha ) \ge 2 + 2 \alpha \ge 1 + \sqrt{1 + 2\alpha }\). By noting that the quadratic \(2z^2- 2z - \alpha =0\) has roots \((1/2) \pm (1/2)\sqrt{1 + 2\alpha }\), we have the inequality \(\exp (\alpha ) \ge 1 + (1/2)\exp (-\alpha ) \alpha \), from which (i) follows.
Next we prove (ii). Since \(\alpha \ge 0\), we have \(\exp (\alpha ) \ge 1+\alpha \ge 1 + \alpha /2\), and hence \(1 - (1/2)\exp (-\alpha ) \alpha \ge \exp (-\alpha )\) from which (ii) follows.\(\square \)
Lemma 4
Let \(\{\varepsilon _i\}_{i=1}^{q}\) and \(\{\eta _j\}_{j=1}^{q}\) be a collection of reals satisfying \((\sum _{i} \varepsilon _i) + (\sum _{j} \eta _j) \ge -2q\). Then, there is a constant \(d \in \mathbb {R}\) for which
Proof
Consider the function
Then, \(f(\cdot )\) is continuous in d, and \(f(d) \rightarrow \pm \infty \) as \(d \rightarrow \pm \infty \). By the Intermediate Value Theorem, there is a \(d^{\star }\) for which \(f(d^{\star }) = 0\). Then,
By summing both sides and noting that \(c_{\alpha }(\cdot ) \ge 0\), we have that each side of the above equation is nonnegative. It follows that
\(\square \)
Lemma 5
Given vectors \(\varvec{\varepsilon }:= (\varepsilon _1,\ldots ,\varepsilon _q)\) and \(\varvec{\eta }:= (\eta _1,\ldots , \eta _q)\) define
and \(\epsilon _{ij} := T_{ij}- 1/q \). Suppose (i) \(|\epsilon _{ij} | \le 1/(2q)\), and (ii) \( \epsilon := \max \{ |\sum _i \epsilon _{ij} |, |\sum _j \epsilon _{ij}| \}\le 1/(24 \sqrt{q})\). Let \(\varvec{\varepsilon }^{\star }, \varvec{\eta }^{\star }\) be a minimizer of F. Then, \(| \varepsilon _i^{\star } + \eta _j^{\star } | \le 48 \sqrt{q} \epsilon \), for all i, j.
Proof
Suppose \(| \varepsilon _i + \eta _j | > 48 \sqrt{q} \epsilon \) for some (i, j). We show that \(\varvec{\varepsilon },\varvec{\eta }\) cannot be a minimizer. We split the analysis into two cases.
[\((\sum _{i} \varepsilon _i) + (\sum _{j} \eta _j) < -2q\)]: Since \(T_{ij} > 0\), we have \(F(\varvec{\varepsilon },\varvec{\eta }) > - (\sum _{i} \varepsilon _i) - (\sum _{j} \eta _j) \ge 2q\). Then, \(F(\varvec{0},\varvec{0}) = \sum _{i} (\sum _{j} T_{ij} ) = \sum _{i} (1+ \sum _{j} \epsilon _{ij}) \le q(1+ 1/(24\sqrt{q})) \le 2q < F(\varvec{\varepsilon },\varvec{\eta })\).
[\((\sum _{i} \varepsilon _i) + (\sum _{j} \eta _j) \ge -2q\)]: Let \(\alpha = 24 \sqrt{q}\epsilon \), and define the sets
1. \(\mathfrak {S}(\varvec{\varepsilon }) = \{i: | \varepsilon _i | \ge \alpha \}\);
2. \(\mathfrak {T}(\varvec{\varepsilon }) = \{i: \alpha > | \varepsilon _i | \ge 4 \epsilon \exp (\alpha ) \}\); and
3. \(\mathfrak {U}(\varvec{\varepsilon }) = \{i: 4 \epsilon \exp (\alpha ) > | \varepsilon _i | \}\).
Similarly define the sets \(\mathfrak {S}(\varvec{\eta }), \mathfrak {T}(\varvec{\eta }), \mathfrak {U}(\varvec{\eta })\).
First since \(\alpha \le 1\), we have \(\alpha \ge \alpha \exp (\alpha )/3 \ge 8 \sqrt{q} \epsilon \exp (\alpha ) \ge 8 \epsilon \exp (\alpha )\), and hence
Second
Third since there is an index (i, j) such that \(| \varepsilon _i + \eta _j | > 48 \sqrt{q} \epsilon \), one of the sets \(\mathfrak {S}(\varvec{\varepsilon }), \mathfrak {S}(\varvec{\eta })\) is nonempty. By noting that \(\alpha \exp (-\alpha ) \ge 8 \sqrt{q} \epsilon \), we have
We have \(\epsilon (\sum _{i}|\varepsilon _i| + \sum _{j}|\eta _j|) \ge \sum _i (\varepsilon _i (\sum _j \epsilon _{ij})) + \sum _j (\eta _j (\sum _i \epsilon _{ij})) = \sum _{ij} \epsilon _{ij}(\varepsilon _i+\eta _j)\). By combining the above inequalities with Lemma 4, we have
Also, since \( \exp (\varepsilon _i + \eta _j ) -(\varepsilon _i + \eta _j) - 1 \ge 0\) for all i, j, and \(|\epsilon _{ij}| \le 1/(2q)\), we have
By combining equations (32) and (33), we have
which implies \(F(\varvec{\varepsilon }, \varvec{\eta }) > F(\varvec{0}, \varvec{0})\). \(\square \)
Proof
(Proposition 11) By Lemma 5 any minimum \(\varvec{\varepsilon }^{\star }, \varvec{\eta }^{\star }\) satisfies \(| \varepsilon _i^{\star } + \eta _j^{\star } | \le 48 \sqrt{q} \epsilon \). Hence by the one-to-one correspondence between the minima of F and the diagonal scalings \(D_1,D_2\) [31], we have \(\Vert D_2 {\otimes }D_1 - \mathsf {I}\Vert _{2} \le \exp (48 \sqrt{q} \epsilon ) -1 \le 96 \sqrt{q} \epsilon \). \(\square \)
Proof
(Proposition 12) Without loss of generality, we may assume that \(N_1,N_2\) are diagonal matrices, say \(D_1,D_2\), respectively. Define the matrix \(T_{ij} = \langle \mathbf {e}_i \mathbf {e}_i^{\prime }, {\mathbf {\mathsf{{T}}}}(\mathbf {e}_j \mathbf {e}_j^{\prime }) \rangle \). It is straightforward to check that T satisfies the conditions of Proposition 11; moreover, the condition that \( (N_2 {\otimes }N_2) \circ {\mathbf {\mathsf{{T}}}}\circ (N_1 {\otimes }N_1) \) is a doubly stochastic operator implies that \(D_2^2 T D_1^2\) is a doubly stochastic matrix. By Proposition 11, we have \(\Vert D_1^2 {\otimes }D_2^2 - \mathsf {I}\Vert _{2} \le 96 \sqrt{q} \epsilon \), and hence \(\Vert N_1^2 {\otimes }N_2^2 - \mathsf {I}\Vert _{2} \le 96 \sqrt{q} \epsilon \). Since \(N_1,N_2\) are self-adjoint, we also have \(\Vert N_1 {\otimes }N_2 - \mathsf {I}\Vert _{2} \le 96 \sqrt{q} \epsilon \). \(\square \)
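As a small illustration of the reduction in this proof (an assumed construction, not from the paper): for a doubly stochastic completely positive map, here taken to be a convex combination of orthogonal conjugations, the matrix \(T_{ij} = \langle \mathbf {e}_i \mathbf {e}_i^{\prime }, {\mathbf {\mathsf{{T}}}}(\mathbf {e}_j \mathbf {e}_j^{\prime }) \rangle \) is itself doubly stochastic.

```python
# T(X) = sum_k p_k Q_k X Q_k' with orthogonal Q_k and sum_k p_k = 1 satisfies T(I) = I and
# T'(I) = I; its diagonal restriction T_ij = <e_i e_i', T(e_j e_j')> has unit row/column sums.
import numpy as np

rng = np.random.default_rng(6)
q, m = 5, 3
Qs = [np.linalg.qr(rng.standard_normal((q, q)))[0] for _ in range(m)]
p = rng.dirichlet(np.ones(m))

def T(X):
    return sum(pk * Qk @ X @ Qk.T for pk, Qk in zip(p, Qs))

E = np.eye(q)
M = np.array([[T(np.outer(E[:, j], E[:, j]))[i, i] for j in range(q)] for i in range(q)])
print(np.abs(M.sum(axis=0) - 1).max(), np.abs(M.sum(axis=1) - 1).max())   # both ≈ 0
```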
Appendix D: Proof of Proposition 7
In this section, we prove that Gaussian linear maps that are subsequently normalized satisfy the deterministic conditions in Theorem 3 concerning the linear map \(\mathcal {L}^{\star }\) with high probability. There are two steps to our proof. First, we state sufficient conditions on a linear map under which its normalization satisfies the deterministic conditions. Second, we show that Gaussian maps satisfy these sufficient conditions with high probability.
We introduce the following parameter that measures how close a linear map \(\mathcal {L}\) is to being normalized.
Definition 4
Let \(\mathcal {L}: \mathbb {R}^{q\times q} \rightarrow \mathbb {R}^d\) be a linear map. The nearly normalized parameter of \(\mathcal {L}\) is defined as \(\epsilon (\mathcal {L}) := \max \{ \Vert {\mathbf {\mathsf{{T}}}}_{\mathcal {L}}(I)-I \Vert _{2} , \Vert {\mathbf {\mathsf{{T}}}}_{\mathcal {L}}^{\prime }(I)-I \Vert _{2}\}\).
Proposition 13
Let \(\mathcal {L}: \mathbb {R}^{q\times q} \rightarrow \mathbb {R}^d\) be a linear map that (i) satisfies the restricted isometry condition \(\delta _r(\mathcal {L})\le 1/2\), and (ii) has nearly normalized parameter \(\epsilon (\mathcal {L}) \le 1/ (650 \sqrt{q})\). Let \(\mathcal {L}\circ {{\textsf {\textit{N}}}}_{\mathcal {L}}\) be the normalized linear map where \({{\textsf {\textit{N}}}}_{\mathcal {L}}\) is a positive-definite rank-preserver. Then, \(\mathcal {L}\circ {{\textsf {\textit{N}}}}_{\mathcal {L}}\) satisfies the restricted isometry condition \(\delta _r(\mathcal {L}\circ {{\textsf {\textit{N}}}}_{\mathcal {L}}) \le \bar{\delta _r}:= (1+\delta _r(\mathcal {L})) (1+ 96\sqrt{q} \epsilon (\mathcal {L}))^2 - 1 < 1\). Moreover, \(\Vert \mathcal {L}\circ {{\textsf {\textit{N}}}}_{\mathcal {L}} \Vert _{2} \le (1+ 96\sqrt{q} \epsilon (\mathcal {L})) \Vert \mathcal {L}\Vert _{2}\).
Proof
(Proposition 13) Since \(\mathcal {L}\) satisfies the restricted isometry condition \(\delta _1(\mathcal {L}) \le 1/2\), we have \(|\langle \mathbf {v}\mathbf {v}^{\prime }, {\mathbf {\mathsf{{T}}}}_{\mathcal {L}} (\mathbf {u}\mathbf {u}^{\prime })\rangle -1/q| \le 1/(2q)\) for all unit-norm vectors \(\mathbf {u},\mathbf {v}\in \mathbb {R}^{q}\). In addition, the linear map \(\mathcal {L}\) has nearly normalized parameter \(\epsilon (\mathcal {L}) \le 1/ (650 \sqrt{q})\). Hence by applying Proposition 12 to the linear map \({\mathbf {\mathsf{{T}}}}_{\mathcal {L}}\), any pair of positive-definite matrices \(Q_2,Q_1\) such that \(Q_2 {\otimes }Q_2 \circ {\mathbf {\mathsf{{T}}}}_{\mathcal {L}} \circ Q_1 {\otimes }Q_1\) is doubly stochastic satisfies \(\Vert Q_2 {\otimes }Q_1 - \mathsf {I}\Vert _{2} \le 96 \sqrt{q} \epsilon (\mathcal {L})\). By noting the correspondence between such matrices with the positive-definite rank-preserver \({{\textsf {\textit{N}}}}_{\mathcal {L}}\) such that \(\mathcal {L}\circ {{\textsf {\textit{N}}}}_{\mathcal {L}}\) is normalized via the relation \({{\textsf {\textit{N}}}}_{\mathcal {L}} = Q_2 {\otimes }Q_1\) (see Corollary 2), we have \(\Vert {{\textsf {\textit{N}}}}_{\mathcal {L}}\Vert _{2} \le 1+96 \sqrt{q} \epsilon (\mathcal {L})\).
Let X be a matrix with rank at most r. Then,
and hence \(\Vert \mathcal {L}({{\textsf {\textit{N}}}}_{\mathcal {L}} (X)) \Vert _{\ell _2}^2 \le (1+\bar{\delta _r}) \Vert X\Vert _{\ell _2}^2\). A similar set of steps show that \(\Vert \mathcal {L}({{\textsf {\textit{N}}}}_{\mathcal {L}} (X)) \Vert _{\ell _2}^2 \ge (1-\bar{\delta _r}) \Vert X\Vert _{\ell _2}^2\). Last \(\Vert \mathcal {L}\circ {{\textsf {\textit{N}}}}_{\mathcal {L}} \Vert _{2} \le \Vert \mathcal {L}\Vert _{2}\Vert {{\textsf {\textit{N}}}}_{\mathcal {L}} \Vert _{2} \le (1+96\sqrt{q}\epsilon )\Vert \mathcal {L}\Vert _{2}\). \(\square \)
Proposition 14
[20, Theorem II.13] Let \(t>0\) be fixed. Suppose \(\mathcal {L}\sim \mathcal {N} (0,1/d)\). Then, with probability greater than \(1 - \exp (- t^2d/2) \), we have \(\Vert \mathcal {L}\Vert _{2} \le \sqrt{q^2/d} + 1 + t\).
Proposition 15
[11, Theorem 2.3] Let \(0<\delta <1\) be fixed. There exist constants \(c_1,c_2\) such that for \(d \ge c_1 qr\), if \(\mathcal {L}\sim \mathcal {N} (0,1/d)\), then with probability greater than \(1 - 2 \exp (- c_2 d)\) the linear map \(\mathcal {L}\) satisfies the restricted isometry condition \(\delta _r(\mathcal {L}) \le \delta \).
Proposition 16
(Gaussian linear maps are nearly normalized) Suppose \(3/\sqrt{d} \le \epsilon \le 3\). Suppose \(\mathcal {L}\sim \mathcal {N}(0, 1/d)\). Then with probability greater than \(1-4 \exp ( - q(-1 + \sqrt{d} \epsilon / 3)^2/2)\), the nearly normalized parameter of \(\mathcal {L}\) is smaller than \(\epsilon \).
Bounding the nearly normalized parameter of a Gaussian linear map exactly corresponds to computing the deviation of the sum of independent Wishart matrices from its mean in spectral norm. To do so, we appeal to the following concentration bound.
Proposition 17
(Concentration of sum of Wishart matrices) Suppose \(3/\sqrt{d} \le t \le 3\). Let \(\{X^{(j)} \}_{j=1}^{d}, X^{(j)} = G^{(j)} G^{(j)\prime }\), where \(G^{(j)} \in \mathbb {R}^{q\times q}, G^{(j)} \sim \mathcal {N}(0,1/q)\), be a collection of independent Wishart matrices. Then, \(\mathbb {P} ( \Vert \frac{1}{d}\sum _{j=1}^{d} X^{(j)} - I \Vert _{2} \ge t) \le 2 \exp ( - q(-1 + \sqrt{d} t / 3)^2/2)\).
Proof
(Proposition 17) Consider the linear map \(G = [G^{(1)}|\ldots |G^{(d)}]\). Then, \(\sum _{j=1}^{d} X^{(j)} = G G^{\prime }\), and \(\Vert \frac{1}{d}\sum _{j=1}^{d} X^{(j)} - I \Vert _{2} \le t\) if and only if \(\sigma (G) \in [\sqrt{d(1-t)},\sqrt{d(1+t)}] \). By [20, Theorem II.13], we have \(\sigma (G) \in [\sqrt{d} - 1 - \tilde{t},\sqrt{d} + 1 + \tilde{t}]\) with probability greater than \(1-2\exp (-q\tilde{t}^2/2)\). The result follows with the choice of \(\tilde{t} = -1 + \sqrt{d} t / 3\). \(\square \)
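A quick empirical illustration of Proposition 17 (with assumed values of q and d, not part of the proof): the spectral deviation of the averaged Wishart matrices from the identity behaves like a small multiple of \(1/\sqrt{d}\).

```python
# Spectral deviation of (1/d) sum_j G_j G_j' from I for G_j with i.i.d. N(0, 1/q) entries.
import numpy as np

rng = np.random.default_rng(5)
q, d, trials = 20, 400, 20
devs = []
for _ in range(trials):
    S = sum(G @ G.T for G in (rng.standard_normal((q, q)) / np.sqrt(q) for _ in range(d)))
    devs.append(np.linalg.norm(S / d - np.eye(q), ord=2))
print(max(devs))   # typically on the order of 2 / sqrt(d)
```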
Proof
(Proposition 16) This is a direct application of Proposition 17, once with \(G^{(j)} =\sqrt{q/d} \mathcal {L}^{(j)}\) and once with \(G^{(j)} =\sqrt{q/d} \mathcal {L}^{(j)\prime }\), followed by a union bound. \(\square \)
Proof
(Proposition 7) We choose \(t = 1/50\) in Proposition 14, \(\delta = \delta _{4r}/2\) in Proposition 15, and \(\epsilon = \delta / (960\sqrt{q})\) in Proposition 16. Then, there are constants \(c_1, c_2, c_3\) such that if \(d \ge c_1 r q\), then (i) \(\Vert \tilde{\mathcal {L}}\Vert _{2}\le \sqrt{q^2/d}+ 51/50 \le (101/50) \sqrt{q^2/d}\), (ii) \(\tilde{\mathcal {L}}\) satisfies the restricted isometry condition \(\delta _{4r}(\tilde{\mathcal {L}}) \le \delta _{4r}/2\), and (iii) \(\tilde{\mathcal {L}}\) is nearly normalized with parameter \(\epsilon (\tilde{\mathcal {L}}) \le \delta _{4r}/ 960 \sqrt{q}\), with probability greater than \(1 - c_2 \exp ( - c_3 d)\).
By applying Proposition 13, we conclude that the linear map \(\mathcal {L}\) satisfies the restricted isometry condition \(\delta _{4r}(\mathcal {L})\le (1+\delta _{4r}/2) (1 + \delta _{4r}/10)^2 -1 \le \delta _{4r}\), and \(\Vert \mathcal {L}\Vert _{2} \le \sqrt{5q^2/d}\).
\(\square \)
Appendix E: Proof of Proposition 2
Proof
(Proposition 2) First, we check that the linear map \(\mathcal {L}^{\star } \circ (\mathsf {I}+ \mathsf {E})\) satisfies the restricted isometry condition \(\delta _1 (\mathcal {L}^{\star } \circ (\mathsf {I}+ \mathsf {E})) \le 1/2\). For any rank-one unit-norm matrix X, we have \(\Vert [\mathcal {L}^{\star } \circ (\mathsf {I}+ \mathsf {E})] (X) \Vert _{\ell _2} \le \Vert \mathcal {L}^{\star } (X) \Vert _{\ell _2} + \Vert \mathcal {L}^{\star } (\mathsf {E}(X)) \Vert _{\ell _2} \le \sqrt{1+ 1/10} + 1/150 \le \sqrt{1+1/2}\). A similar set of inequalities show that \(\Vert [\mathcal {L}^{\star } \circ (\mathsf {I}+ \mathsf {E})] (X) \Vert _{\ell _2} \ge \sqrt{1 - 1/2}\).
Second, we check that the nearly normalized parameter of \(\mathcal {L}^{\star } \circ (\mathsf {I}+\mathsf {E})\) satisfies \(\epsilon (\mathcal {L}^{\star } \circ (\mathsf {I}+\mathsf {E})) \le 1/(48\sqrt{q})\). Denote \(\mathcal {E} := \mathcal {L}^{\star } \circ \mathsf {E}\). For all unit-norm rank-one matrices E, we have \(\Vert \mathcal {E}(E)\Vert _{2}^2\le \Vert \mathcal {L}^{\star }\Vert _{2}^2\Vert \mathsf {E}\Vert _{\ell _2}^2\). Hence for any unit-norm \(\mathbf {u}\in \mathbb {R}^{q}\), we have
Using the fact that \(\mathcal {L}^{\star }\) is normalized, we have
By combining the previous inequalities with an application of Cauchy–Schwarz, we have
Furthermore, since \(\mathbf {u}\) is arbitrary, it follows that
Using a similar sequence of steps, one can show that \(\Vert {\mathbf {\mathsf{{T}}}}_{\mathcal {L}^{\star } + \mathcal {E}}^{\prime } (I) - I \Vert _{2} \le 3 \Vert \mathcal {L}^{\star }\Vert _{2} \Vert \mathsf {E}\Vert _{\ell _2}\). Thus, \(\epsilon (\mathcal {L}^{\star } \circ (\mathsf {I}+\mathsf {E})) \le 3 \Vert \mathcal {L}^{\star }\Vert _{2} \Vert \mathsf {E}\Vert _{\ell _2} \le 1/( 48\sqrt{q})\).
The result follows by applying Proposition 12 to the linear map \({\mathbf {\mathsf{{T}}}}_{\mathcal {L}^{\star } \circ (\mathsf {I}+\mathsf {E})}\). \(\square \)
Appendix F: Proof of Proposition 3
The proof of Proposition 3 is based on the following result concerning affine rank minimization, which may be of independent interest.
Proposition 18
Suppose \(X^{\star }\) is a \(q\times q\) rank-r matrix satisfying \(\sigma _r (X^{\star }) \ge 1/2\). Let \(\mathbf {y}= \mathcal {L}(X^{\star }) + \mathbf {z}\), where the linear map \(\mathcal {L}\) satisfies the restricted isometry condition \(\delta _{4r}(\mathcal {L}) \le 1/10\), and \(\Vert \mathcal {L}^{\prime } \mathbf {z}\Vert _{2} =: \epsilon \le 1 / (80 r^{3/2})\). Let \(\hat{X}\) be the optimal solution to
Then, (i) \(\Vert \hat{X} - X^{\star } \Vert _{2} \le 4 \sqrt{r} \epsilon \), and (ii) \( \hat{X} - X^{\star } = [(\mathcal {L}^{\prime }_{\mathcal {T}(X^{\star })}\mathcal {L}_{\mathcal {T}(X^{\star })})^{-1}]_{\mathbb {R}^{q \times q}} (\mathcal {L}_{\mathcal {T}(X^{\star })}^{\prime } \mathbf {z}) + G\), where \(\Vert G\Vert _{\ell _2} \le 340 r^{3/2} \epsilon ^2 \).
The proof of Proposition 18 requires two preliminary results which we state and prove first. Our development relies on results from matrix perturbation theory; we refer the reader to [38, 57] for detailed expositions. Several of our results are minor modifications of analogous results in [15].
The following result and the accompanying proof is a minor modification of Proposition 2.2 in supplementary material (s.m.) of [15] and its proof. The modification allows us to provide a bound that does not scale with the ambient dimension.
Proposition 19
Let \(X_1,X_2 \in \mathbb {R}^{q\times q}\) be rank-r matrices. Let \(\sigma \) be the smallest nonzero singular value of \(X_1\), and suppose that \(\Vert X_1 - X_2 \Vert _{2} \le \sigma / 8\). Then, \(\Vert \mathcal {P}_{\mathcal {T}(X_1)^{\perp }} (X_2) \Vert _{\ell _2} \le \sqrt{r} \Vert X_1 - X_2 \Vert _{2}^2 / (3\sigma )\), and \(\Vert \mathcal {P}_{\mathcal {T}(X_1)^{\perp }} (X_2) \Vert _{2} \le \Vert X_1 - X_2 \Vert _{2}^2 / (5\sigma )\).
In the following proof, given a matrix \(X \in \mathbb {R}^{q \times q}\), we denote \(\tilde{X} := \left( \begin{array}{cc} 0 & X^{\prime } \\ X & 0 \end{array} \right) \).
Proof
(Proposition 19) Let \(\tilde{\varDelta } = \tilde{X_2} - \tilde{X_1}\), and let \(\kappa = \sigma /4\). By combining equation (1.5) in the s.m. of [15] with the proofs of Propositions 1.2 and 2.2 in the s.m. of [15] it can be shown that \( \mathcal {P}_{\mathcal {T}(\tilde{X_1})^{\perp }} (\tilde{X_2}) = (1/(2\pi i))\oint _{\mathcal {C}_{\kappa }} \zeta [\tilde{X_1}-\zeta I]^{-1} \tilde{\varDelta } [\tilde{X_1} - \zeta I]^{-1} \tilde{\varDelta } [\tilde{X_2}-\zeta I]^{-1} d \zeta \), where the contour integral is taken along \(\mathcal {C}_{\kappa }\) defined as the circle centered at the origin with radius \(\kappa \).
By a careful use of the inequality \(\Vert AB\Vert _{\ell _2} \le \Vert A\Vert _{2} \Vert B\Vert _{\ell _2}\), we have \(\Vert [\tilde{X_1}-\zeta I]^{-1} \tilde{\varDelta } [\tilde{X_1}-\zeta I]^{-1} \tilde{\varDelta } [\tilde{X_2}-\zeta I]^{-1}\Vert _{\ell _2} \le \Vert [\tilde{X_1}-\zeta I]^{-1}\Vert _{2} \Vert \tilde{\varDelta } \Vert _{\ell _2} \Vert [\tilde{X_1}-\zeta I]^{-1}\Vert _{2} \Vert \tilde{\varDelta }\Vert _{2} \Vert [\tilde{X_2}-\zeta I]^{-1}\Vert _{2}\). Since \(\tilde{\varDelta }\) is a matrix with rank at most 4r, we have \(\Vert \tilde{\varDelta }\Vert _{\ell _2} \le \sqrt{4r}\Vert \tilde{\varDelta }\Vert _{2}\). We proceed to apply the same bounds as those used in the proof of Proposition 1.2 in the s.m. of [15] to obtain \(\Vert \mathcal {P}_{\mathcal {T}(\tilde{X_1})^{\perp }}(\tilde{X_2})\Vert _{\ell _2} \le 2 \sqrt{r}\kappa ^2 \Vert \tilde{\varDelta }\Vert _{2}^2 / ((\sigma -\kappa )^2(\sigma - 3\kappa /2)) \le \sqrt{2r} \Vert \tilde{X_1} - \tilde{X_2} \Vert _{2}^2 / (3\sigma )\). The first inequality follows by noting that \(\sqrt{2} \Vert \mathcal {P}_{\mathcal {T}(X_1)^{\perp }}(X_2)\Vert _{\ell _2} = \Vert \mathcal {P}_{\mathcal {T}(\tilde{X_1})^{\perp }}(\tilde{X_2})\Vert _{\ell _2}\) and that \(\Vert X_1-X_2\Vert _{2} = \Vert \tilde{X_1} - \tilde{X_2} \Vert _{2}\).
The proof of the second inequality follows from a similar argument. \(\square \)
We define the following distance measure between two subspaces \(\mathcal {T}_1\) and \(\mathcal {T}_2\) [15]: \(\rho (\mathcal {T}_1,\mathcal {T}_2) := \max _{\Vert N\Vert _{2} \le 1} \Vert [\mathcal {P}_{\mathcal {T}_1} - \mathcal {P}_{\mathcal {T}_2}](N)\Vert _{2}\).
This definition is useful for quantifying the distance between tangent spaces with respect to the variety of low-rank matrices for pairs of nearby matrices.
Lemma 6
Let \(X_1, X_2 \in \mathbb {R}^{q \times q}\) be matrices with rank at most r, and satisfy \(\Vert X_1 - X_2 \Vert _{2} \le \sigma / 8\), where \(\sigma \) is the smallest nonzero singular value of \(X_2\). Let \(\mathcal {T}_1:= \mathcal {T}(X_1)\) and \(\mathcal {T}_2: = \mathcal {T}(X_2)\) be tangent spaces on the variety of matrices with rank at most r at the points \(X_1\) and \(X_2\), respectively. Let \(\mathcal {L}\) be a linear map satisfying the restricted isometry condition \(\delta _{4r}(\mathcal {L}) \le 1/10\). If \(Z_i \in \mathcal {T}_i\), \(i \in \{1,2\}\), then \(\Vert [(\mathcal {L}^{\prime }_{\mathcal {T}_1}\mathcal {L}_{\mathcal {T}_1})^{-1}]_{\mathbb {R}^{q \times q}} (Z_1) - [(\mathcal {L}^{\prime }_{\mathcal {T}_2}\mathcal {L}_{\mathcal {T}_2})^{-1}]_{\mathbb {R}^{q \times q}} (Z_2) \Vert _{\ell _2} \le (43/10) \sqrt{r} \Vert Z_1 - Z_2\Vert _{2} + 16 r\Vert X_1 - X_2 \Vert _{2} \Vert Z_2 \Vert _{2} / \sigma \).
Proof
(Lemma 6) To simplify notation, we denote \(Y_i = [(\mathcal {L}^{\prime }_{\mathcal {T}_i}\mathcal {L}_{\mathcal {T}_i})^{-1}]_{\mathbb {R}^{q \times q}} (Z_i)\), \(i \in \{1,2\}\). From the triangle inequality, we have \(\Vert Y_1 - Y_2 \Vert _{\ell _2} \le \Vert \mathcal {P}_{\mathcal {T}_1^{\perp }} (Y_1 - Y_2) \Vert _{\ell _2} + \Vert \mathcal {P}_{\mathcal {T}_1} (Y_1 - Y_2) \Vert _{\ell _2} \). We bound both components separately.
[\( \Vert \mathcal {P}_{\mathcal {T}_1^{\perp }} (Y_1 - Y_2) \Vert _{\ell _2} \)]: From Proposition 2.1 of the s.m. of [15], we have \(\rho (\mathcal {T}_1,\mathcal {T}_2) \le \frac{2}{\sigma }\Vert X_1-X_2\Vert _{2}\). From Lemma 1, we have \(\Vert Y_2 - Z_2 \Vert _{\ell _2} \le \delta _{4r} \Vert Y_2 \Vert _{\ell _2} \le \frac{\delta _{4r}}{1-\delta _{4r}} \Vert Z_2 \Vert _{\ell _2} \le \frac{\sqrt{2r}\delta _{4r}}{1-\delta _{4r}} \Vert Z_2 \Vert _{2}\). Hence
Here the first inequality follows by noting that \([\mathsf {I}-\mathcal {P}_{\mathcal {T}_1}]([\mathcal {P}_{\mathcal {T}_1} - \mathcal {P}_{\mathcal {T}_2}](Y_2 - Z_2))\) has rank at most 4r. Next
By combining both bounds with the triangle inequality, we obtain
[\( \Vert \mathcal {P}_{\mathcal {T}_1} (Y_1 - Y_2) \Vert _{\ell _2} \)]: Define the linear map \(\mathsf {G} = \mathcal {L}^{\prime }_{\mathcal {T}_1 \cup \mathcal {T}_2} \mathcal {L}_{\mathcal {T}_1 \cup \mathcal {T}_2} \). First \( \Vert [\mathcal {P}_{\mathcal {T}_2} \circ \mathsf {G} \circ \mathcal {P}_{\mathcal {T}_2}] (Y_2) - [\mathcal {P}_{\mathcal {T}_1} \circ \mathsf {G} \circ \mathcal {P}_{\mathcal {T}_2}] (Y_2) \Vert _{\ell _2} \le 2 \sqrt{r} \Vert [\mathcal {P}_{\mathcal {T}_2} \circ \mathsf {G} \circ \mathcal {P}_{\mathcal {T}_2}] (Y_2) - [\mathcal {P}_{\mathcal {T}_1} \circ \mathsf {G} \circ \mathcal {P}_{\mathcal {T}_2}] (Y_2) \Vert _{2} \le 2 \sqrt{r} \rho (\mathcal {T}_1,\mathcal {T}_2) \Vert \mathsf {G} (Y_2)\Vert _{2} \), where \( \Vert \mathsf {G}(Y_2) \Vert _{2} \le \Vert \mathsf {G}(Y_2) \Vert _{\ell _2} \le (1+\delta _{4r}) \Vert Y_2 \Vert _{\ell _2} \le \frac{1+ \delta _{4r}}{1 - \delta _{4r}} \Vert Z_2\Vert _{\ell _2} \le \sqrt{2r}\frac{1+ \delta _{4r}}{1 - \delta _{4r}} \Vert Z_2 \Vert _{2}\). Second \(\Vert [\mathcal {P}_{\mathcal {T}_1} \circ \mathsf {G} \circ \mathcal {P}_{\mathcal {T}_2}] (Y_2) - [\mathcal {P}_{\mathcal {T}_1} \circ \mathsf {G} \circ \mathcal {P}_{\mathcal {T}_1}] (Y_2) \Vert _{\ell _2} = \Vert [\mathcal {P}_{\mathcal {T}_1} \circ \mathsf {G} \circ (\mathcal {P}_{\mathcal {T}_1} - \mathcal {P}_{\mathcal {T}_2})](Y_2)\Vert _{\ell _2} \le \Vert [\mathsf {G} \circ (\mathcal {P}_{\mathcal {T}_1} - \mathcal {P}_{\mathcal {T}_2})] (Y_2)\Vert _{\ell _2} \le (1+\delta _{4r})\Vert [\mathcal {P}_{\mathcal {T}_1} - \mathcal {P}_{\mathcal {T}_2}](Y_2)\Vert _{\ell _2} \le 2 \sqrt{r} (1+\delta _{4r})\Vert [\mathcal {P}_{\mathcal {T}_1} - \mathcal {P}_{\mathcal {T}_2}] (Y_2)\Vert _{2} \le 2 \sqrt{r} (1+\delta _{4r}) \rho (\mathcal {T}_1,\mathcal {T}_2) \Vert Y_2\Vert _{2}\), where \(\Vert Y_2 \Vert _{2} \le \Vert Y_2 \Vert _{\ell _2} \le \frac{\sqrt{2r}}{1 - \delta _{4r}} \Vert Z_2 \Vert _{2}\). Third by combining these bounds with an application of Lemma 1 and the triangle inequality, we obtain
\(\square \)
Proof
(Proposition 18) We prove (i) and (ii) in sequence.
[(i)]: Let \(\hat{X}_o\) be the optimal solution to the following
Since \(4 \sqrt{r} \epsilon < 1/2 \le \sigma _r (X^{\star })\), \(\hat{X}_o\) has rank exactly r, and hence is a smooth point with respect to the variety of matrices with rank at most r. Define the tangent space \(\hat{\mathcal {T}}:= \mathcal {T}(\hat{X}_o)\), and the matrix \(\hat{X}_c\) as the solution to the following optimization instance
Here, \(\hat{X}_c\) is the solution to the optimization instance in which the convex constraint \(X \in \hat{\mathcal {T}}\) replaces the only non-convex constraint of the previous optimization instance; hence \(\hat{X}_{c} = \hat{X}_o\). Define \(\hat{X}_{\hat{\mathcal {T}}}\) as the solution to the following optimization instance
The first-order condition is given by \(\mathcal {L}^{\prime } \mathcal {L}( \hat{X}_{\hat{\mathcal {T}}} - X^{\star }) - \mathcal {L}^{\prime } \mathbf {z}+ Q_{\hat{\mathcal {T}}^{\perp }} = 0\), where \(Q_{\hat{\mathcal {T}}^{\perp }} \in \hat{\mathcal {T}}^{\perp }\) is the Lagrange multiplier associated with the constraint \(X \in \hat{\mathcal {T}}\). Project the above equation onto the subspace \(\hat{\mathcal {T}}\) to obtain \([\mathcal {P}_{\hat{\mathcal {T}}} \circ \mathcal {L}^{\prime } \mathcal {L}\circ \mathcal {P}_{\hat{\mathcal {T}}}] ( \hat{X}_{\hat{\mathcal {T}}} - X^{\star }) = [\mathcal {P}_{\hat{\mathcal {T}}} \circ \mathcal {L}^{\prime } \mathcal {L}\circ \mathcal {P}_{\hat{\mathcal {T}}^{\perp }}] (X^{\star }) + \mathcal {P}_{\hat{\mathcal {T}}} (\mathcal {L}^{\prime }\mathbf {z})\), and hence
We proceed to bound \(\Vert \hat{X}_{\hat{\mathcal {T}}} - X^{\star }\Vert _{2}\). First, we have \(\Vert \hat{X}_{c} - X^{\star }\Vert _{2} \le 4 \sqrt{r} \epsilon \le 1/20\), and hence \(\sigma _r (\hat{X}_c) \ge 9/20\). Second by applying Proposition 19, we have \(\Vert \mathcal {P}_{\hat{\mathcal {T}}^{\perp }} (X^{\star }) \Vert _{2} = \Vert \mathcal {P}_{\hat{\mathcal {T}}^{\perp }} (\hat{X}_c - X^{\star }) \Vert _{2} \le (4 \sqrt{r} \epsilon )^2 / (5 \sigma _r (\hat{X}_c)) \le (64/9) r \epsilon ^2\), and \(\Vert \mathcal {P}_{\hat{\mathcal {T}}^{\perp }} (X^{\star })\Vert _{\ell _2} \le (320/27) r^{3/2} \epsilon ^2 \). Third by Lemma 1 and noting the inequality \(\Vert {\cdot }\Vert _{2} \le \Vert {\cdot }\Vert _{\ell _2}\), we have
Fourth by Proposition 2.7 in [30], we have
Last, we combine the bounds to obtain \(\Vert \hat{X}_{\hat{\mathcal {T}}} - X^{\star } \Vert _{2} \le 8 r \epsilon ^2 + (16/5) \sqrt{r} \epsilon + 2 r^{3/2} \epsilon ^2 < 4 \sqrt{r} \epsilon \). This implies that the constraint \(\Vert X - X^{\star } \Vert _{2} \le 4 \sqrt{r} \epsilon \) is inactive for both \(\hat{X}_c\) and \(\hat{X}_o\), and hence \(\hat{X} = \hat{X}_o = \hat{X}_c = \hat{X}_{\hat{\mathcal {T}}}\).
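As a quick numerical confirmation of the arithmetic in the last step (an illustration only; the admissible range of \(\epsilon \) is governed by the hypotheses of Proposition 18, and the choice \(\epsilon = 1/(80 r^{3/2})\) below is the value appearing later in the proof of Proposition 3):

import numpy as np

for r in [1, 5, 50]:
    eps = 1.0 / (80 * r ** 1.5)
    lhs = 8 * r * eps ** 2 + (16 / 5) * np.sqrt(r) * eps + 2 * r ** 1.5 * eps ** 2
    rhs = 4 * np.sqrt(r) * eps
    assert lhs < rhs   # 8 r eps^2 + (16/5) sqrt(r) eps + 2 r^{3/2} eps^2 < 4 sqrt(r) eps
    print(r, lhs, rhs)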
[(ii)]: We have
We deal with the contributions of each term separately.
First \(\Vert [\mathcal {P}_{\mathcal {T}^{\star }} - \mathcal {P}_{\hat{\mathcal {T}}}] (\mathcal {L}^{\prime } \mathbf {z}) \Vert _{2} \le \rho (\hat{\mathcal {T}},\mathcal {T}^{\star }) \Vert \mathcal {L}^{\prime } \mathbf {z}\Vert _{2} \le (2\epsilon /\sigma _{r}(X^{\star })) \Vert \hat{X} - X^{\star } \Vert _{2} \le 16 \sqrt{r} \epsilon ^2\), where the second inequality applies Proposition 2.1 of the s.m. of [15]. Second \(\Vert \mathcal {P}_{\mathcal {T}^{\star }} (\mathcal {L}^{\prime } \mathbf {z})\Vert _{2} \le 2\Vert \mathcal {L}^{\prime } \mathbf {z}\Vert _{2} = 2 \epsilon \). Hence by applying Lemma 6 with the choice of \(Z_1 = \mathcal {P}_{\hat{\mathcal {T}}}(\mathcal {L}^{\prime } \mathbf {z})\) and \(Z_2 = \mathcal {P}_{\mathcal {T}^{\star }} (\mathcal {L}^{\prime } \mathbf {z})\), we obtain \(\Vert [(\mathcal {L}^{\prime }_{\mathcal {T}^{\star }}\mathcal {L}_{\mathcal {T}^{\star }})^{-1}]_{\mathbb {R}^{q \times q}} (\mathcal {L}^{\prime } \mathbf {z}) - [(\mathcal {L}^{\prime }_{\hat{\mathcal {T}}}\mathcal {L}_{\hat{\mathcal {T}}})^{-1}]_{\mathbb {R}^{q \times q}} (\mathcal {L}^{\prime } \mathbf {z})\Vert _{\ell _2} \le 70 r \epsilon ^2 + 256 r^{3/2} \epsilon ^2\). Third, we have \(\Vert [[(\mathcal {L}^{\prime }_{\hat{\mathcal {T}}}\mathcal {L}_{\hat{\mathcal {T}}})^{-1}]_{\mathbb {R}^{q \times q}} \circ \mathcal {L}^{\prime } \mathcal {L}\circ \mathcal {P}_{\hat{\mathcal {T}}^{\perp }}] (X^{\star }) \Vert _{\ell _2} \le (320/243) r^{3/2} \epsilon ^2\), and \(\Vert \mathcal {P}_{\hat{\mathcal {T}}^{\perp }}(X^{\star })\Vert _{\ell _2} \le (320/27) r^{3/2} \epsilon ^2\).
The bound follows by summing up these bounds. \(\square \)
The proof of Proposition 3 requires two additional preliminary results; the first establishes that a linear map that is sufficiently close to a linear map satisfying the restricted isometry condition itself satisfies a (slightly weaker) restricted isometry condition.
Proposition 20
Suppose \(\mathcal {L}^{\star }\) is a linear map that satisfies the restricted isometry condition \(\delta _{r}(\mathcal {L}^{\star }) \le 1/ 20\). Let \(\mathsf {E}\) be a linear operator such that \(\Vert \mathsf {E}\Vert _{2} \le 1/(50\Vert \mathcal {L}^{\star }\Vert _{2})\). Then, \(\mathcal {L}= \mathcal {L}^{\star } \circ (\mathsf {I}+ \mathsf {E})\) satisfies the restricted isometry condition \(\delta _{r}(\mathcal {L}) \le 1/10\).
Proof
(Proposition 20) Let X be a matrix with rank at most r. Then,
A similar argument also proves the lower bound \(\Vert \mathcal {L}(X) \Vert _{\ell _2} \ge \sqrt{1 - 1/10} \Vert X\Vert _{\ell _2}\). \(\square \)
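The argument rests on two elementary facts: the perturbation contributes at most \(\Vert \mathcal {L}^{\star }\Vert _{2} \Vert \mathsf {E}\Vert _{2} \Vert X\Vert _{\ell _2} \le (1/50) \Vert X\Vert _{\ell _2}\), and \(\sqrt{1+1/20} + 1/50 \le \sqrt{1+1/10}\). The sketch below checks both numerically; the random measurement map, its normalization, and the dimensions are illustrative assumptions and do not certify any restricted isometry constant.

import numpy as np

rng = np.random.default_rng(2)
q, r, m = 12, 2, 600
A = rng.standard_normal((m, q * q)) / np.sqrt(m)         # matrix representing L* on vec(X)
G = rng.standard_normal((q * q, q * q))
op_norm = np.linalg.norm(A, 2)                           # ||L*||_2
E = (1.0 / (50.0 * op_norm)) * G / np.linalg.norm(G, 2)  # ||E||_2 = 1/(50 ||L*||_2)

X = rng.standard_normal((q, r)) @ rng.standard_normal((r, q))   # a rank-r test matrix
x = X.reshape(-1)
perturbation = np.linalg.norm(A @ (E @ x))               # ||L*(E(X))||_{ell_2}
assert perturbation <= np.linalg.norm(x) / 50 + 1e-12
assert np.sqrt(1 + 1 / 20) + 1 / 50 <= np.sqrt(1 + 1 / 10)
print(perturbation, np.linalg.norm(x) / 50)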
Lemma 7
Suppose \(\mathcal {L}\) satisfies the restricted isometry condition \(\delta _1(\mathcal {L}) < 1\). Then, \(\Vert \mathcal {L}^{\prime } \mathcal {L}\Vert _{\ell _2,2}\le \sqrt{2(1+\delta _1(\mathcal {L}))} \Vert \mathcal {L}\Vert _{2} \).
Proof
Let \(Z \in \mathrm {argmax}_{X:\Vert X\Vert _{\ell _2}\le 1} \Vert \mathcal {L}^{\prime } \mathcal {L}(X)\Vert _{2}\), and let \(\mathcal {T}\) be the tangent space of the rank-one matrix corresponding to the largest singular value of Z. Then, \(\sup _{X:\Vert X\Vert _{\ell _2} \le 1} \Vert \mathcal {L}^{\prime } \mathcal {L}(X)\Vert _{2} \le \sup _{X:\Vert X\Vert _{\ell _2} \le 1} \Vert [\mathcal {P}_{\mathcal {T}} \circ \mathcal {L}^{\prime } \mathcal {L}] (X)\Vert _{2} \le \sqrt{2} \sup _{X:\Vert X\Vert _{\ell _2} \le 1} \Vert [\mathcal {P}_{\mathcal {T}} \circ \mathcal {L}^{\prime } \mathcal {L}] (X)\Vert _{\ell _2} \le \sqrt{2} \Vert \mathcal {P}_{\mathcal {T}} \circ \mathcal {L}^{\prime } \mathcal {L}\Vert _{2} \). By Lemma 1, we have \(\sqrt{2} \Vert \mathcal {P}_{\mathcal {T}} \circ \mathcal {L}^{\prime } \mathcal {L}\Vert _{2} \le \sqrt{2(1+\delta _1(\mathcal {L}))} \Vert \mathcal {L}\Vert _{2}\). \(\square \)
Proof
(Proposition 3) To simplify notation, we denote \(\mathcal {T}:=\mathcal {T}(X^{\star })\). Without loss of generality, we may assume that \(\Vert X^{\star }\Vert _{2}=1\). By the triangle inequality, we have
We bound each term separately.
[First term]: Let \(\tilde{\mathbf {z}}:=[\mathcal {L}^{\star } \circ \mathsf {E}] (X^{\star })\). First by Proposition 20, the linear map \(\mathcal {L}^{\star } \circ (\mathsf {I}+\mathsf {E})\) satisfies the restricted isometry condition \(\delta _{4r}(\mathcal {L}^{\star } \circ (\mathsf {I}+\mathsf {E})) \le 1/10\). Second, we have \(\Vert \mathsf {I}+\mathsf {E}\Vert _{2,2}\le 1+\sqrt{q} \Vert \mathsf {E}\Vert _{\ell _2} \le 51/50\). Third from Lemma 7, we have \(\Vert \mathcal {L}^{\star \prime }\mathcal {L}^{\star }\Vert _{\ell _2,2} \le \sqrt{2(1+\delta _{4r}(\mathcal {L}^{\star }))} \Vert \mathcal {L}^{\star }\Vert _{2}\). Fourth \(\Vert \mathsf {E}(X^{\star })\Vert _{\ell _2} \le \sqrt{r} \Vert \mathsf {E}\Vert _{\ell _2}\). Hence
By the initial conditions, we have that the above quantity is at most \(1/(80r^{3/2})\). Consequently, by applying Proposition 18 to optimization instance (9) with the choice of linear map \(\mathcal {L}^{\star }\circ (\mathsf {I}+\mathsf {E})\) and error term \(\tilde{\mathbf {z}}\) we have
[Second term]: First by Lemma 1, we have \(\Vert [(\mathcal {L}^{\star \prime }_{\mathcal {T}} \mathcal {L}^{\star }_{\mathcal {T}} )^{-1}]_{\mathbb {R}^{q \times q}} \Vert _{2} \le 20/19\). Second by the triangle inequality, we have \(\Vert \mathcal {P}_{\mathcal {T}} \circ (\mathsf {I}+\mathsf {E}^{\prime }) \circ \mathcal {L}^{\star \prime } \mathcal {L}^{\star } \circ (\mathsf {I}+\mathsf {E}) \circ \mathcal {P}_{\mathcal {T}} - \mathcal {P}_{\mathcal {T}} \circ \mathcal {L}^{\star \prime } \mathcal {L}^{\star } \circ \mathcal {P}_{\mathcal {T}} \Vert _{2} \le 3 \Vert \mathcal {L}^{\star }\Vert _{2} \Vert \mathsf {E}\Vert _{\ell _2}\). Third by utilizing the identity \((\mathsf {A} + \mathsf {B})^{-1} = \mathsf {A}^{-1} - \mathsf {A}^{-1} \circ \mathsf {B} \circ \mathsf {A}^{-1} + \mathsf {A}^{-1} \circ \mathsf {B} \circ \mathsf {A}^{-1} \circ \mathsf {B} \circ \mathsf {A}^{-1} - \ldots \) with the choice of \(\mathsf {A} = \mathcal {L}^{\star \prime }_{\mathcal {T}} \mathcal {L}^{\star }_{\mathcal {T}}\) and \(\mathsf {B} = \mathcal {P}_{\mathcal {T}} \circ (\mathsf {I}+\mathsf {E})^{\prime } \circ \mathcal {L}^{\star \prime } \mathcal {L}^{\star } \circ (\mathsf {I}+\mathsf {E}) \circ \mathcal {P}_{\mathcal {T}} - \mathsf {A}\), we obtain
Fourth \(\Vert \mathcal {P}_{\mathcal {T}} \circ (\mathsf {I}+\mathsf {E}^{\prime }) \circ \mathcal {L}^{\star \prime } \mathcal {L}^{\star } \circ \mathsf {E}(X^{\star }) \Vert _{\ell _2} \le (11/10) \sqrt{r} \Vert \mathcal {L}^{\star }\Vert _{2} \Vert \mathsf {E}\Vert _{\ell _2}\). Hence
[Third term]: We have
[Conclude]: The result follows by summing each bound and applying Lemma 7. \(\square \)
Appendix G: Proof of Proposition 4
Proof
(Proposition 4) To simplify notation, we let \(\varLambda := \varLambda ( \{ {A^{(j)}}\}_{j=1}^{n} )\), \(\varDelta := \varDelta ( \{ {A^{(j)}}\}_{j=1}^{n} ) \), and be the linear map defined as . In addition we define . Note that by the Cauchy–Schwarz inequality, we have \(\tau \le \omega / \sqrt{\varLambda } \le 1/20\).
We begin by noting that since , and .
Next we compute the following bounds. First . Second . Third . By applying these bounds to the following expansion, we obtain
where , and .
We apply the above expansion to derive the following approximation of
where \(\mathsf {E}_3\) satisfies \(\Vert \mathsf {E}_3\Vert _{2} \le 2(\tau (6/5)^{3/2})(2(\tau (6/5)^{3/2}) + \Vert \mathsf {E}_2 \Vert _{2}) +\Vert \mathsf {E}_2 \Vert _{2} \le 20 \tau ^2\).
Next we write . Then,
where . The result follows by noting that \(\Vert \mathsf {F}\Vert _{\ell _2} \le q \Vert \mathsf {F}\Vert _{2}\), \(\tau \le \omega / \sqrt{\varLambda }\) and that . \(\square \)
Appendix H: Proof of Proposition 5
Proposition 21
Given an operator \(\mathsf {E}: \mathbb {R}^{q\times q} \rightarrow \mathbb {R}^{q \times q}\), there exist matrices \(E_L, E_R \in \mathbb {R}^{q \times q}\) such that \(\mathcal {P}_{\mathcal {W}} (\mathsf {E}) = I {\otimes }E_L + E_R {\otimes }I\), and \(\Vert E_L\Vert _{\ell _2},\Vert E_R\Vert _{\ell _2} \le \Vert \mathsf {E}\Vert _{\ell _2}/\sqrt{q}\).
Proof
(Proposition 21) Define the subspaces \(\mathcal {W}_R := \{ S {\otimes }I : S \in \mathbb {R}^{q \times q} \}\) and \(\mathcal {W}_L := \{ I {\otimes }S : S \in \mathbb {R}^{q \times q} \}\). Note that \(\mathcal {W}_R \cap \mathcal {W}_L = \mathrm {Span}(\mathsf {I})\), and hence \(\mathcal {P}_{\mathcal {W}} = \mathcal {P}_{\mathcal {W}_R \cap \mathrm {Span}(\mathsf {I})^{\perp }} + \mathcal {P}_{\mathcal {W}_L \cap \mathrm {Span}(\mathsf {I})^{\perp }} + \mathcal {P}_{\mathrm {Span}(\mathsf {I})}\).
Define \(E_L\) and \(E_R\) to be matrices such that \(E_R {\otimes }I = \mathcal {P}_{\mathcal {W}_R \cap \mathrm {Span}(\mathsf {I})^{\perp }} (\mathsf {E}) + (1/2) \mathcal {P}_{\mathrm {Span}(\mathsf {I})} (\mathsf {E})\), and \(I {\otimes }E_L = \mathcal {P}_{\mathcal {W}_L \cap \mathrm {Span}(\mathsf {I})^{\perp }} (\mathsf {E}) + (1/2) \mathcal {P}_{\mathrm {Span}(\mathsf {I})} (\mathsf {E})\). For \(i \in \{L,R\}\), we argue as follows. Since \(\mathcal {P}_{\mathcal {W}_i \cap \mathrm {Span}(\mathsf {I})^{\perp }}\) and \((1/2) \mathcal {P}_{\mathrm {Span}(\mathsf {I})}\) map onto mutually orthogonal subspaces and have spectral norms 1 and 1/2, respectively, we have \(\Vert E_i {\otimes }I \Vert _{\ell _2} \le \Vert \mathsf {E}\Vert _{\ell _2}\). Moreover, since \(\Vert E_i {\otimes }I \Vert _{\ell _2} = \Vert E_i\Vert _{\ell _2} \Vert I\Vert _{\ell _2}\), we have \(\Vert E_i\Vert _{\ell _2} \le \Vert \mathsf {E}\Vert _{\ell _2}/\sqrt{q}\). \(\square \)
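The construction in this proof is explicit and can be verified directly. The sketch below (a minimal numerical check; the dimensions, the random operator, and the vec/Kronecker representation of operators on \(\mathbb {R}^{q \times q}\) via np.kron are illustrative assumptions) builds \(E_L\) and \(E_R\), verifies the norm bound \(\Vert E_L\Vert _{\ell _2}, \Vert E_R\Vert _{\ell _2} \le \Vert \mathsf {E}\Vert _{\ell _2}/\sqrt{q}\), and checks that the residual is orthogonal to \(\mathcal {W}\).

import numpy as np

rng = np.random.default_rng(3)
q = 6
I = np.eye(q)
E = rng.standard_normal((q * q, q * q))      # a generic operator on R^{q x q}

def proj_WR(Op):
    # Orthogonal projection onto W_R = {S kron I}: S[i, j] = trace of block (i, j) / q.
    S = np.zeros((q, q))
    for i in range(q):
        for j in range(q):
            S[i, j] = np.trace(Op[i * q:(i + 1) * q, j * q:(j + 1) * q]) / q
    return S

def proj_WL(Op):
    # Orthogonal projection onto W_L = {I kron S}: average of the diagonal blocks.
    S = np.zeros((q, q))
    for i in range(q):
        S += Op[i * q:(i + 1) * q, i * q:(i + 1) * q]
    return S / q

c = np.trace(E) / (q * q)                    # coefficient of the Span(I kron I) component
E_R = proj_WR(E) - (c / 2) * I               # E_R kron I = P_{W_R}(E) - (1/2) P_{Span(I)}(E), as in the proof
E_L = proj_WL(E) - (c / 2) * I
assert np.linalg.norm(E_L, 'fro') <= np.linalg.norm(E, 'fro') / np.sqrt(q) + 1e-10
assert np.linalg.norm(E_R, 'fro') <= np.linalg.norm(E, 'fro') / np.sqrt(q) + 1e-10

# The residual is orthogonal to W: test against a random element I kron S + T kron I.
P_W = np.kron(I, E_L) + np.kron(E_R, I)
S, T = rng.standard_normal((q, q)), rng.standard_normal((q, q))
assert abs(np.sum((E - P_W) * (np.kron(I, S) + np.kron(T, I)))) < 1e-6
print(np.linalg.norm(E_L, 'fro'), np.linalg.norm(E, 'fro') / np.sqrt(q))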
Proof
(Proposition 5) By applying Proposition 21 to the operator \(\mathsf {D}\), we have \(\mathcal {P}_{\mathcal {W}}(\mathsf {D}) = I{\otimes }E_L + E_R{\otimes }I \) for a pair of matrices \(E_L,E_R \in \mathbb {R}^{q\times q}\) satisfying \(\Vert E_L\Vert _{\ell _2}, \Vert E_R\Vert _{\ell _2} \le \Vert \mathsf {D}\Vert _{\ell _2} / \sqrt{q}\). Moreover, since \(\Vert E_L\Vert _{2}, \Vert E_R\Vert _{2} <1\), it follows that the matrices \(I+E_R\) and \(I+E_L\) are invertible. Consider the following identity
We define \(\mathsf {H}= (\mathcal {P}_{\mathcal {W}^{\perp }}(\mathsf {D}) - E_R {\otimes }E_L) \circ (I+E_R)^{-1} {\otimes }(I+E_L)^{-1} - \mathcal {P}_{\mathcal {W}^{\perp }}(\mathsf {D})\), and we define \({{\textsf {\textit{W}}}}= (I+E_R){\otimes }(I+E_L)\). By the triangle inequality, we have \(\Vert {{\textsf {\textit{W}}}}- \mathsf {I}\Vert _{2} \le 3\Vert \mathsf {D}\Vert _{\ell _2}/\sqrt{q}\).
Next we note that \(\Vert (I+E_i)^{-1}\Vert _{2} \le 10/9\), \(i\in \{L,R\}\) and that \(\Vert (I+E_R)^{-1} {\otimes }(I+E_L)^{-1} \Vert _{2} \le 100/81\). We also have \(\Vert E_R{\otimes }E_L\Vert _{\ell _2} = \Vert E_R \Vert _{\ell _2} \Vert E_L\Vert _{\ell _2} \le \Vert \mathsf {D}\Vert ^2_{\ell _2} /q\). By noting that \(\Vert (I+E_i)^{-1} - I \Vert _{2} \le (10/9)\Vert E_i\Vert _{2}\), \(i\in \{L,R\}\), we have \(\Vert (I+E_R)^{-1} {\otimes }(I+E_L)^{-1} - I {\otimes }I\Vert _{2} \le 3\Vert \mathsf {D}\Vert _{\ell _2} / \sqrt{q}\). By combining these bounds, we obtain \(\Vert \mathsf {H}\Vert _{\ell _2} \le \Vert \mathcal {P}_{\mathcal {W}^{\perp }}(\mathsf {D}) \Vert _{\ell _2} \Vert (I+E_R)^{-1} {\otimes }(I+E_L)^{-1} - I {\otimes }I\Vert _{2} + \Vert E_R {\otimes }E_L \Vert _{\ell _2} \Vert (I+E_R)^{-1} {\otimes }(I+E_L)^{-1} \Vert _{2} \le 5 \Vert \mathsf {D}\Vert _{\ell _2}^2 / \sqrt{q}\). \(\square \)
Appendix I: Proof of Proposition 6
Proof
(Proposition 6) To simplify notation in the proof, we denote \(\alpha _8:=\alpha _8(q,\mathcal {L}^{\star }) = 96 \sqrt{q} \Vert \mathcal {L}^{\star }\Vert _{2} \). We show that
for some function \(\alpha _9:=\alpha _9(q,r,\mathcal {L}^{\star })\) that we specify later. In the proof of Theorem 3, we showed that \(\xi _{\mathcal {L}^{\star }}(\mathcal {L}^{(t)}) \le \gamma ^{t} \xi _{\mathcal {L}^{\star }}(\mathcal {L}^{(0)})\) for some \(\gamma <1\). Hence, establishing (34) immediately implies that the sequence \(\{\mathcal {L}^{(t)} \}_{t=1}^{\infty }\) is Cauchy.
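To spell out this implication (a brief sketch, under the assumption that (34), whose display is not reproduced here, bounds consecutive iterates as \(\Vert \mathcal {L}^{(t+1)} - \mathcal {L}^{(t)}\Vert _{2} \le \alpha _9\, \xi _{\mathcal {L}^{\star }}(\mathcal {L}^{(t)})\)): for any \(k \ge 1\),
\[ \Vert \mathcal {L}^{(t+k)} - \mathcal {L}^{(t)}\Vert _{2} \le \sum _{s=t}^{t+k-1} \Vert \mathcal {L}^{(s+1)} - \mathcal {L}^{(s)}\Vert _{2} \le \alpha _9 \sum _{s=t}^{t+k-1} \gamma ^{s}\, \xi _{\mathcal {L}^{\star }}(\mathcal {L}^{(0)}) \le \frac{\alpha _9\, \gamma ^{t}}{1-\gamma }\, \xi _{\mathcal {L}^{\star }}(\mathcal {L}^{(0)}), \]
which tends to zero as \(t \rightarrow \infty \); hence the iterates form a Cauchy sequence.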
Our proof builds on the proof of Theorem 3. Let
where \(\mathsf {E}^{(t)}\) is a linear map that satisfies \(\Vert \mathsf {E}^{(t)}\Vert _{\ell _2} < 1/\alpha _0\). In the proof of Theorem 3, we show that
where \(\Vert \mathsf {E}^{(t+1)}\Vert _{\ell _2} \le \Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\), \({{\textsf {\textit{W}}}}\) is a rank-preserver and \({{\textsf {\textit{N}}}}\) is a positive-definite rank-preserver. Moreover, as a consequence of applying Proposition 5 to establish (23) in the proof, we obtain the bound \(\Vert {{\textsf {\textit{W}}}}- \mathsf {I}\Vert _{2} \le 3 \alpha _7\Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\). We use these bounds and relations to prove (34).
By the triangle inequality, we have
By Proposition 2 applied to the pairs of linear maps \(\mathcal {L}^{(t)},\mathcal {L}^{\star }\) and \(\mathcal {L}^{(t+1)},\mathcal {L}^{\star }\), we have \(\Vert {{\textsf {\textit{M}}}}- {{\textsf {\textit{Q}}}}_1 \Vert _{2}, \Vert {{\textsf {\textit{W}}}}\circ {{\textsf {\textit{M}}}}\circ {{\textsf {\textit{N}}}}- {{\textsf {\textit{Q}}}}_2\Vert _{2} \le \alpha _8 \Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\), for some pair of orthogonal rank-preservers \({{\textsf {\textit{Q}}}}_1, {{\textsf {\textit{Q}}}}_2\). Since \(\alpha _8/\alpha _0\le 1\), we have \(\Vert {{\textsf {\textit{M}}}}\Vert _{2} \le 2\) and \(\Vert {{\textsf {\textit{W}}}}\circ {{\textsf {\textit{M}}}}\circ {{\textsf {\textit{N}}}}\Vert _{2} \le 2\). Consequently, \(\Vert \mathcal {L}^{\star } \circ \mathsf {E}^{(t)} \circ {{\textsf {\textit{M}}}}\Vert _{2}, \Vert \mathcal {L}^{\star } \circ \mathsf {E}^{(t+1)} \circ {{\textsf {\textit{W}}}}\circ {{\textsf {\textit{M}}}}\circ {{\textsf {\textit{N}}}}\Vert _{2} \le 2\Vert \mathcal {L}^{\star }\Vert _{2}\Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\).
Next we bound \(\Vert {{\textsf {\textit{N}}}}- \mathsf {I}\Vert _{2}\). By utilizing \(\Vert {{\textsf {\textit{W}}}}\circ {{\textsf {\textit{M}}}}\circ {{\textsf {\textit{N}}}}- {{\textsf {\textit{Q}}}}_2\Vert _{2} \le \alpha _8 / \alpha _0\), \(\Vert {{\textsf {\textit{M}}}}- {{\textsf {\textit{Q}}}}_1 \Vert _{2} \le \alpha _8 \Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\), and \(\Vert {{\textsf {\textit{W}}}}- \mathsf {I}\Vert _{2} \le 3 \alpha _7\Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\), one can show that \(\Vert {{\textsf {\textit{N}}}}-{{\textsf {\textit{Q}}}}_3\Vert _{2} \le (6 \alpha _7 + 2\alpha _8 + 2) \Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\), where \({{\textsf {\textit{Q}}}}_3 = {{\textsf {\textit{Q}}}}_1^{\prime } \circ {{\textsf {\textit{Q}}}}_2\) is an orthogonal rank-preserver. Since \({{\textsf {\textit{N}}}}\) is self-adjoint, we have \(\Vert {{\textsf {\textit{N}}}}^2-\mathsf {I}\Vert _{2} \le 3 (6 \alpha _7 + 2\alpha _8 + 2) \Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\), and hence \(\Vert {{\textsf {\textit{N}}}}-\mathsf {I}\Vert _{2} \le 3(6 \alpha _7 + 2\alpha _8 + 2)\Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\). This also implies the bound \(\Vert {{\textsf {\textit{N}}}}\Vert _{2} \le 3\).
We apply these bounds to obtain \(\Vert \mathcal {L}^{\star } \circ {{\textsf {\textit{M}}}}\circ ({{\textsf {\textit{N}}}}- \mathsf {I}) \Vert _{2} \le 6 (6 \alpha _7 + 2\alpha _8 + 2) \Vert \mathcal {L}^{\star }\Vert _{2} \Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\), and \(\Vert \mathcal {L}^{\star } \circ ({{\textsf {\textit{W}}}}- \mathsf {I}) \circ {{\textsf {\textit{M}}}}\circ {{\textsf {\textit{N}}}}\Vert _{2} \le 9 \alpha _7 \Vert \mathcal {L}^{\star }\Vert _{2} \Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\).
We define \(\alpha _9 := (4 + 6 (6 \alpha _7 + 2\alpha _8 + 2) + 9 \alpha _7 )\Vert \mathcal {L}^{\star }\Vert _{2}\) (this is exactly the sum of the coefficients of \(\Vert \mathsf {E}^{(t)}\Vert _{\ell _2}\) in the above bounds). The result follows by adding these bounds and subsequently taking the infimum over \(\mathsf {E}^{(t)}\) in (35). \(\square \)