
Efficient Convex Optimization Requires Superlinear Memory

Published: 08 November 2024

Abstract

We show that any memory-constrained, first-order algorithm which minimizes d-dimensional, 1-Lipschitz convex functions over the unit ball to 1/poly(d) accuracy using at most \(d^{1.25 - \delta }\) bits of memory must make at least \(\tilde{\Omega }(d^{1 + (4/3)\delta })\) first-order queries (for any constant \(\delta \in [0, 1/4]\)). Consequently, the performance of such memory-constrained algorithms is at least a polynomial factor worse than the optimal \(\tilde{\mathcal {O}}(d)\) query bound for this problem obtained by cutting plane methods that use \(\tilde{\mathcal {O}}(d^2)\) memory. This resolves one of the open problems in the COLT 2019 open problem publication of Woodworth and Srebro.

1 Introduction

Minimizing a convex objective function \(f: \mathbb {R}^d \rightarrow \mathbb {R}\) given access to a first-order oracle—that returns the function evaluation and (sub)gradient \((f(\mathbf {x}),\nabla f(\mathbf {x}))\) when queried for point \(\mathbf {x}\)—is a canonical problem and fundamental primitive in machine learning and optimization. There are methods that, given any 1-Lipschitz, convex \(f : \mathbb {R}^d \rightarrow \mathbb {R}\) accessible via a first-order oracle, compute an \(\epsilon\)-approximate minimizer over the unit ball with just \(\mathcal {O}(\min \lbrace \epsilon ^{-2}, d \log (1/\epsilon)\rbrace)\) queries to the oracle. This query complexity is worst-case optimal [48] and foundational in optimization theory. \(\mathcal {O}(\epsilon ^{-2})\) queries are achievable using subgradient descent; this is a simple, widely-used, eminently practical method that solves the problem using a total of \(\mathcal {O}(d \epsilon ^{-2})\) arithmetic operations on \(\mathcal {O}(\log (d/\epsilon))\)-bit numbers. On the other hand, building on the \(\mathcal {O}(d^2 \log (1/\epsilon))\) query complexity of the well-known ellipsoid method [61, 77], different cutting plane methods achieve a query complexity of \(\mathcal {O}(d \log (1/\epsilon))\), e.g., center of mass with sampling-based techniques [10, 42], volumetric center [5, 68], inscribed ellipsoid [39, 49]; these methods are perhaps less frequently used in practice and large-scale learning and all use at least \(\Omega (d^3 \log (1/\epsilon))\)-time, even with recent improvements [37, 41].1
Though state-of-the-art cutting plane methods have larger computational overhead and are sometimes regarded as impractical, for small enough \(\epsilon\), they give the state-of-the-art query bounds. Furthermore, in different theoretical settings, e.g., semidefinite programming [3, 41, 64], submodular optimization [30, 36, 41, 44], and equilibrium computation [35, 52], cutting plane methods have yielded state-of-the-art runtimes at various points in time. This leads to a natural question: what is needed for a method to significantly outperform gradient descent and take advantage of the improved query complexity enjoyed by cutting plane methods? Can we design methods that obtain optimal query complexities while maintaining the practicality of gradient descent methods?
The COLT 2019 open problem [72] proposed tackling this question through the lens of memory: to what extent is large memory necessary to achieve improved query complexity? Any algorithm for the problem must use \(\Omega (d \log (1/\epsilon))\) memory irrespective of its query complexity since that much memory is needed to represent the answer [72]. Subgradient descent matches this lower bound and can be implemented with only \(\mathcal {O}(d \log (1/\epsilon))\) bits of memory [72]. However, all known methods that achieve a query complexity significantly better than gradient descent, e.g., cutting plane methods, use \(\Omega (d^2 \log (1/\epsilon))\) bits of memory. Understanding the tradeoffs in memory and query complexity could inform the design of future efficient optimization methods.
In this article, we show that memory does play a critical role in attaining an optimal query complexity for convex optimization. Our main result is the following theorem which shows that any algorithm whose memory usage is sufficiently small (though potentially, still superlinear) must make polynomially more queries to a first-order oracle than cutting plane methods. Specifically, any algorithm that uses significantly less than \(d^{1.25}\) bits of memory requires a polynomial factor more first-order queries than the optimal \(\mathcal {O}(d \log (d))\) queries achieved by quadratic memory cutting plane methods. This establishes the first non-trivial tradeoff between memory and query complexity for convex optimization with access to a first-order oracle.
Fig. 1.
Fig. 1. Tradeoffs between available memory and first-order oracle complexity for minimizing 1-Lipschitz convex functions over the unit ball (adapted from [72]). The dashed red region corresponds to information-theoretic lower bounds on the memory and query-complexity. The dashed green region corresponds to known upper bounds. This work shows that the solid red region is not achievable for any algorithm.
Theorem 1.1.
For some \(\epsilon \ge 1/d^7\) and any \(\delta \in [0,1/4]\), any (potentially randomized) algorithm which outputs an \(\epsilon\)-optimal point with probability at least \(2/3\) given first-order oracle access to any 1-Lipschitz convex function must use either at least \(d^{1.25 - \delta }\) bits of memory or make \(\tilde{\Omega }(d^{1+ \frac{4}{3}\delta })\) first-order queries (where the \(\tilde{\Omega }\) notation hides factors poly-logarithmic in d).
Beyond shedding light on the complexity of a fundamental memory-constrained optimization problem, we provide several tools for establishing such lower bounds. In particular, we introduce a set of properties which are sufficient for an optimization problem to exhibit a memory lower bound and provide an information-theoretic framework to prove these lower bounds. We hope these tools are an aid to future work on the role of memory in optimization.
This work fits within the broader context of understanding fundamental resource tradeoffs for optimization and learning. For many settings, establishing (unconditional) query/time or memory/time tradeoffs is notoriously hard—perhaps akin to P vs. NP (e.g., providing time lower bounds for cutting plane methods). Questions of memory/query and memory/data tradeoffs, however, have a more information-theoretic nature and hence seem more approachable. Together with the increasing importance of memory considerations in large-scale optimization and learning, there is a strong case for pinning down the landscape of such tradeoffs, which may offer a new perspective on the current suite of algorithms and inform the effort to develop new ones.

1.1 Technical Overview and Contributions

To prove Theorem 1.1, we provide an explicit distribution over functions that is hard for any memory-constrained randomized algorithm to optimize. Though the proof requires care and we introduce a variety of machinery to obtain it, this lower-bounding family of functions is simple to state. The function is a variant of the so-called “Nemirovski function,” which has been used to show lower bounds for highly parallel non-smooth convex optimization [7, 18, 47].
Formally, our difficult class of functions for memory size M is constructed as follows: for some \(\gamma \gt 0\) and \(N= \tilde{\mathcal {O}}((d^2/M)^{1/3})\) let \(\mathbf {v}_1,\ldots ,\mathbf {v}_N\) be unit vectors drawn i.i.d. from the d dimensional scaled hypercube \(\mathbf {v}_i \overset{\text{i.i.d.}}{\sim } \mathsf {Unif} (d^{-1/2} \mathcal {H}_d)\) and let \(\mathbf {a}_1,\ldots ,\mathbf {a}_{\lfloor d/2 \rfloor }\) be drawn i.i.d. from the hypercube, \(\mathbf {a}_j \sim \mathsf {Unif} (\mathcal {H}_d)\) where \(\alpha \mathcal {H}_d := \lbrace \pm \alpha \rbrace ^d\). Let \(\mathbf {A} = (\mathbf {a}_1, \dots , \mathbf {a}_{\lfloor d/2 \rfloor })\) and
\begin{equation} F(\mathbf {x}) = (1/d^6) \max \left\lbrace d^5 \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - 1, \max _{i \in [N]} \mathbf {v}_i^{\top } \mathbf {x}- i \gamma \right\rbrace . \end{equation}
(1.1)
We describe the key components of this function and the intuition behind their role in Section 2.1. At a high level, we show that any algorithm that optimizes this function must, with high probability, observe each of the \(\mathbf {v}_1,\ldots ,\mathbf {v}_N\) and will observe \(\mathbf {v}_i\) before \(\mathbf {v}_{i+1}\) due to the \(i \gamma\) term. The intuition for the memory-query tradeoff is that after seeing \(\mathbf {v}_{i}\), in order to see \(\mathbf {v}_{i+1}\), the algorithm must query a vector near the nullspace of \(\mathbf {A}\) but with a significant component in the direction of \(\mathbf {v}_{i}\). Since \(\mathbf {v}_{i}\) is random, this requires finding a “new” vector in the nullspace of \(\mathbf {A}\). This implies that for some appropriately chosen k and a block of k Nemirovski vectors \(\lbrace \mathbf {v}_i,\mathbf {v}_{i+1},\ldots ,\mathbf {v}_{i+k-1}\rbrace\), to see \(\lbrace \mathbf {v}_{i+1},\mathbf {v}_{i+2},\ldots ,\mathbf {v}_{i+k}\rbrace\) after seeing \(\mathbf {v}_{i}\) the algorithm must query k sufficiently independent vectors in the null space of \(\mathbf {A}\). We show that an algorithm with insufficient memory to store \(\mathbf {A}\) cannot find such a large set of sufficiently independent vectors in the nullspace of \(\mathbf {A}\), and must essentially “re-learn” \(\mathbf {A}\) for every such block of k Nemirovski vectors.
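For concreteness, the following is a minimal numerical sketch of sampling an instance of Equation (1.1) and evaluating it (Python/NumPy; the values of d, N, and \(\gamma\) here are illustrative placeholders rather than the settings required by the proofs).

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64          # dimension (illustrative; the proofs take d large)
N = 8           # number of Nemirovski vectors (illustrative)
gamma = 0.05    # offset parameter gamma (illustrative)

# Nemirovski vectors v_1,...,v_N: uniform on the scaled hypercube d^{-1/2}{±1}^d.
V = rng.choice([-1.0, 1.0], size=(N, d)) / np.sqrt(d)
# Rows of A: uniform on the hypercube {±1}^d.
A = rng.choice([-1.0, 1.0], size=(d // 2, d))

def F(x):
    """Evaluate the hard function of Equation (1.1)."""
    wall = d**5 * np.max(np.abs(A @ x)) - 1                    # d^5 ||Ax||_inf - 1
    nemirovski = np.max(V @ x - gamma * np.arange(1, N + 1))   # max_i v_i^T x - i*gamma
    return (1 / d**6) * max(wall, nemirovski)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)          # a query point in the unit ball
print(F(x))
```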
Rather than give a direct proof of Theorem 1.1 using the explicit function in Equation (1.1), we provide a more abstract framework which gives broader insight into which kinds of functions could lead to non-trivial memory-constrained lower bounds, and which might lead to tighter lower bounds in the future. To that end, we introduce the notion of a memory-sensitive class which delineates the key properties of a distribution over functions that lead to memory-constrained lower bounds. We show that for such functions, the problem of memory-constrained optimization is at least as hard as the following problem of finding a set of vectors which are approximately orthogonal to another set of vectors. The problem is motivated by the proof intuition discussed above, where we argued that minimizing Equation (1.1) requires finding a large number of sufficiently independent vectors in the null space of \(\mathbf {A}\).
Definition 1.2 (Informal version of the Orthogonal Vector Game).
Given \(\mathbf {A}\in \lbrace \pm 1\rbrace ^{d/2 \times d}\), the Player’s objective is to return a set of k vectors \(\lbrace \mathbf {y}_1,\dots , \mathbf {y}_{k}\rbrace\) which satisfy
(1)
\(\forall i \in [k], \mathbf {y}_i\) is approximately orthogonal to all the rows of \(\mathbf {A}: \left\Vert {\mathbf {A} \mathbf {y}_i}\right\Vert _\infty /\left\Vert {\mathbf {y}_i}\right\Vert _2 \le d^{-4}\).
(2)
The set of vectors \(\lbrace \mathbf {y}_1,\dots , \mathbf {y}_{k}\rbrace\) is robustly linearly independent: denoting \(S_0=\varnothing , S_i=\mathsf {span}(\mathbf {y}_1,\dots , \mathbf {y}_{i})\), for every \(i \in [k]\),
\({\left\Vert {\mathrm{Proj}_{S_{i-1}}(\mathbf {y}_i)}\right\Vert _2/\left\Vert {\mathbf {y}_i}\right\Vert _2 \le 1-1/d^2}\),
where the notation \(\mathrm{Proj}_S(\mathbf {x})\) denotes the vector in the subspace S which is closest in \(\left\Vert {\cdot }\right\Vert _2\) to \(\mathbf {x}\). The game proceeds as follows: The Player first observes \(\mathbf {A}\) and stores an M-bit \(\mathsf {Message}\) about \(\mathbf {A}\). She does not subsequently have free access to \(\mathbf {A}\), but can adaptively make up to m queries as follows: for \(i \in [m]\), based on \(\mathsf {Message}\) and all previous queries and their results, she can request any row \(j \in [d/2]\) of the matrix \(\mathbf {A}\). Finally, she outputs a set of k vectors as a function of \(\mathsf {Message}\) and all m queries and their results.
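As a concrete illustration of the two winning conditions, here is a minimal sketch (Python/NumPy; the helper name wins_orthogonal_vector_game and the incremental orthonormal basis are ours, introduced only for illustration) that checks whether a candidate output \(\lbrace \mathbf {y}_1,\dots ,\mathbf {y}_k\rbrace\) wins the game for a given \(\mathbf {A}\).

```python
import numpy as np

def wins_orthogonal_vector_game(A, Y):
    """Check the two winning conditions of (the informal) Definition 1.2.

    A : (d/2, d) matrix with entries in {+1, -1}.
    Y : list of k candidate vectors y_1, ..., y_k (in order).
    """
    d = A.shape[1]
    basis = np.zeros((d, 0))                  # orthonormal basis of span(y_1,...,y_{i-1})
    for y in Y:
        norm_y = np.linalg.norm(y)
        # (1) approximate orthogonality to all rows of A.
        if np.max(np.abs(A @ y)) / norm_y > d**-4:
            return False
        # (2) robust linear independence from the previously output vectors.
        proj = basis @ (basis.T @ y)          # Proj_{S_{i-1}}(y)
        if np.linalg.norm(proj) / norm_y > 1 - 1 / d**2:
            return False
        residual = y - proj                   # extend the basis to span(y_1,...,y_i)
        basis = np.hstack([basis, (residual / np.linalg.norm(residual))[:, None]])
    return True
```

Maintaining an orthonormal basis of the previously output vectors makes the projection \(\mathrm{Proj}_{S_{i-1}}(\mathbf {y}_i)\) a single matrix product.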
Note that the Player can trivially win the game for \(M\ge {\Omega }(dk), m=0\) (by just storing a satisfactory set of k vectors in the \(\mathsf {Message}\)) and for \(M=0, m=d/2\) (by querying all rows of \(\mathbf {A}\)). We prove a lower bound showing that this is essentially all that is possible: for \(\mathbf {A}\) sampled uniformly at random from \(\left\lbrace \pm 1 \right\rbrace ^{d/2 \times d}\), if M is a constant factor smaller than dk, then the Player must make at least \(d/5\) queries to win with probability at least \(2/3\). Our analysis proceeds via an intuitive information-theoretic framework, which could have applications for showing query lower bounds for memory-constrained algorithms in other optimization and learning settings. We sketch the analysis in Section 3.2.

1.2 Related Work

Memory-sample tradeoffs for learning. There is a recent line of work to understand learning under information constraints such as limited memory or communication constraints [4, 6, 15, 23, 24, 26, 28, 58, 65, 66, 73, 78]. Most of these results obtain lower bounds for the regime when the available memory is less than that required to store a single datapoint (with the notable exception of [24] and [23]). However, the breakthrough article [54] showed an exponential lower bound on the number of random examples needed for learning parities with memory as large as quadratic. Subsequent work extended and refined this result to multiple learning problems over finite fields [9, 29, 40, 45, 46, 55].
Most related to our line of work is [60], which considers the continuous-valued learning/optimization problem of performing linear regression given access to randomly drawn examples from an isotropic Gaussian. They show that any sub-quadratic memory algorithm for the problem needs \(\Omega (d\log \log (1/\epsilon))\) samples to find an \(\epsilon\)-optimal solution for \(\epsilon \le 1/d^{\Omega (\log d)}\), whereas in this regime an algorithm with memory \(\tilde {\mathcal {O}}(d^2)\) can find an \(\epsilon\)-optimal solution with only d examples. Since each example provides an unbiased estimate of the expected regression loss, this translates to a lower bound for convex optimization given access to a stochastic gradient oracle. However, the upper bound of d examples is not a generic convex optimization algorithm/convergence rate but comes from the fact that the linear systems can be solved to the required accuracy using d examples.
There is also significant work on memory lower bounds for streaming algorithms, e.g., [2, 8, 21, 23], where the setup is that the algorithm gets only a single pass over a data stream. The work of [23] mentioned earlier uses an Approximate Null-Vector Problem (ANVP) to show memory lower bounds, which shares some similarities with the Orthogonal Vector Game used in our proof (informally presented in Definition 1.2). In the ANVP, the objective is to find a single vector approximately orthogonal to a stream of vectors. The goal of the ANVP is similar to the goal of the Orthogonal Vector Game, which is to find a set of such vectors. However, the key difference is that the ANVP is in the (one-pass) streaming setting, whereas the Orthogonal Vector Game allows for stronger access to the input in two senses. First, in the Orthogonal Vector Game the algorithm sees the entire input and can store an M-bit \(\mathsf {Message}\) about it, and second, the algorithm can adaptively query the input. The main challenge in our analysis (discussed later in Section 3.2) is to bound the power of these adaptively issued queries in conjunction with the \(\mathsf {Message}\).
Lower bounds for convex optimization. Starting with the early work of [48], there is extensive literature on lower bounds for convex optimization. Some of the key results in this area include classic lower bounds for finding approximate minimizers of Lipschitz functions with access to a subgradient oracle [14, 48, 50], including recent progress on lower bounds for randomized algorithms [16, 62, 63, 67, 71, 74]. There is also work on the effect of parallelism on these lower bounds [7, 18, 25, 47]. For a broader introduction to the oracle complexity of optimization, we refer the reader to surveys such as [50] and [17].
Memory-limited optimization algorithms. While the focus of this work is lower bounds, there is a long line of work on developing memory-efficient optimization algorithms, including methods that leverage second-order structure via first-order methods. Such methods include Limited-memory-BFGS [43, 51] and the conjugate gradient (CG) method for solving linear systems [33] and various non-linear extensions of CG [27, 31] and methods based on subsampling and sketching the Hessian [53, 56, 75]. A related line of work is on communication-efficient optimization algorithms for distributed settings (see [1, 34, 38, 59, 76] and references therein).

1.3 Subsequent and Future Work

Our work is a first step towards establishing optimal memory/query tradeoffs for optimization. An immediate open problem suggested by our work is that of improving our bounds: matching the quadratic memory upper bound of cutting plane methods, establishing tight bounds for all values of the desired accuracy \(\epsilon\), and fully resolving the open problems of Woodworth and Srebro [72]. Since the initial publication of this work, there have been several advances in this direction.
First, recent work of Blanchard, Zhang, and Jaillet [13] built upon our proof framework to show that quadratic memory is indeed necessary for deterministic algorithms to obtain near-optimal query complexities. In our construction in Equation (1.1), we have \(N= \tilde{\mathcal {O}}((d^2/M)^{1/3})\) Nemirovski vectors \(\mathbf {v}_1,\ldots ,\mathbf {v}_N\); however, packing more Nemirovski vectors would lead to a stronger lower bound. We discuss this further in Remark 2.12. In the case of deterministic algorithms, [13] provide a clever adaptive strategy which packs \(N=\Omega (d)\) Nemirovski vectors and obtains a quadratic memory lower bound.
A second recent article, Chen and Peng [20], established that quadratic memory is necessary to achieve the optimal convergence rate even for randomized algorithms, at least in the parameter regime where \(\epsilon\) is quasipolynomially small in d. Their proof leverages the same hard distribution of functions as we describe (Equation (1.1)), but introduces a variant of our Orthogonal Vector Game and develops an elegant analysis for it which allows them to show that a significant number of queries must be expended to observe each new Nemirovski vector \(\mathbf {v}_i\). In contrast, our proof blocks together groups of such vectors.
Another recent work of Blanchard [11] considers the problem of finding a feasible point in a given set with access to a separation oracle. They construct an oracle which requires the algorithm to find robustly linearly independent vectors which are orthogonal to a sequence of subspaces. They build upon our proof framework and develop a novel recursive lower bound technique which allows them to show that gradient descent is almost optimal for the feasibility problem among linear memory algorithms. Their result also implies an exponential lower bound for sub-quadratic memory algorithms to solve the feasibility problem to sufficiently small error.
On the upper bound side, [12] develops a family of recursive cutting-plane algorithms whose memory usage can vary between \(\tilde{O}(d)\) and \(\tilde{O}(d^2)\). Previously, algorithms with established query complexities used either \(\tilde{O}(d)\) or \(\tilde{\Omega }(d^2)\) memory. This work establishes new upper bounds on memory-constrained query complexity for any memory constraint that falls between linear and quadratic. In the regime when \(\epsilon \le d^{-\Omega (d)}\), their algorithm matches the memory usage of gradient descent while improving on gradient descent’s query complexity.
Even in light of these advances, multiple open problems remain. For example, relaxing the type of functions and oracles for which we can prove memory-constrained lower bounds is a natural further direction. Interestingly, our proofs (and those of [13, 20]) rely on the assumption that we are optimizing a Lipschitz but non-smooth function and that at a point queried we only observe the function’s value and a single subgradient (rather than the set of all subgradients or all information about the function in a neighborhood) and consequently standard smoothing techniques do not readily apply [19, 25]. Proving lower bounds for smooth functions and beyond could illuminate the necessity of larger memory footprints for prominent memory-intensive optimization methods with practical appeal (e.g., interior point methods and quasi-Newton methods). Additionally, finding algorithms with new tradeoffs between space usage and query complexity would be very interesting. In particular, improving the memory versus query tradeoffs for more structured functions, e.g., piecewise linear functions, is an interesting question for future work.

2 Setup and Overview of Results

We consider optimization methods for minimizing convex, Lipschitz-continuous functions \(F: \mathbb {R}^d \rightarrow \mathbb {R}\) over the unit ball2 with access to a first-order oracle. Our goal is to understand how the (oracle) query complexity of algorithms is affected by restrictions on the memory available to the algorithm. Our results apply to a rather general definition of memory-constrained first-order algorithms, given in Definition 2.1 below. Such algorithms include those that use arbitrarily large memory at query time but can save at most M bits of information in between interactions with the first-order oracle. More formally we have:
Definition 2.1 (M-bit memory-constrained deterministic algorithm)
An M-bit (memory-constrained) deterministic algorithm with first-order oracle access, \(\mathcal {A}_{\textrm {det}}\), is the iterative execution of a sequence of functions \(\lbrace \phi _{\textrm {query},t}, \phi _{\textrm {update},t} \rbrace _{t \ge 1}\), where t denotes the iteration number. The function \(\phi _{\textrm {query},t}\) maps the \({(t-1)}^{\textrm {th}}\) memory state of size at most M bits to the \({t}^{\textrm {th}}\) query vector, \(\phi _{\textrm {query},t}(\mathsf {Memory}_{t-1}) = \mathbf {x}_t\). The algorithm is allowed to use an arbitrarily large amount of memory to execute \(\phi _{\textrm {query},t}\). Upon querying \(\mathbf {x}_t\), some first-order oracle returns \(F(\mathbf {x}_t)\) and a subgradient \(\mathbf {g}_{\mathbf {x}_t} \in \partial F(\mathbf {x}_t)\). The second function maps the first-order information and the old memory state to a new memory state of at most M bits, \(\mathsf {Memory}_{t} = \phi _{\textrm {update},t}(\mathbf {x}_t, F(\mathbf {x}_t), \mathbf {g}_{\mathbf {x}_t}, \mathsf {Memory}_{t-1})\). Again, the algorithm may use unlimited memory to execute \(\phi _{\textrm {update},t}\).
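To make the interaction pattern concrete, here is a minimal sketch of the protocol in Definition 2.1 (Python; time-invariant \(\phi\) maps for brevity, and the bit-level encoding of the M-bit memory state is elided), instantiated purely for illustration with projected subgradient descent, whose between-query state is a single iterate and hence fits in \(\mathcal {O}(d \log (d/\epsilon))\) bits.

```python
import numpy as np

def run_memory_constrained(oracle, phi_query, phi_update, memory0, T):
    """Iterate the protocol of Definition 2.1: only `memory` persists between queries."""
    memory = memory0
    for t in range(T):
        x_t = phi_query(memory)                       # phi_query,t : Memory_{t-1} -> x_t
        F_x, g_x = oracle(x_t)                        # first-order oracle response
        memory = phi_update(x_t, F_x, g_x, memory)    # Memory_t, at most M bits
    return memory

def make_subgradient_descent(step):
    """Illustrative instantiation: projected subgradient descent over the unit ball."""
    def phi_query(memory):
        return memory                                 # query the current iterate
    def phi_update(x, F_x, g_x, memory):
        y = x - step * g_x
        return y / max(1.0, np.linalg.norm(y))        # project back onto the unit ball
    return phi_query, phi_update
```

The point of the abstraction is that only the memory state survives between oracle calls; the scratch space used inside \(\phi _{\textrm {query},t}\) and \(\phi _{\textrm {update},t}\) is unconstrained.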
Note that our definition of a memory-constrained algorithm is equivalent to the definition given by [72]. Our analysis also allows for randomized algorithms which will often be denoted as \(\mathcal {A}_{\textrm {rand}}\).
Definition 2.2 (M-bit memory-constrained randomized algorithm)
An M-bit (memory-constrained) randomized algorithm with first-order oracle access, \(\mathcal {A}_{\textrm {rand}}\), is a deterministic algorithm with access to a string R of uniformly random bits which has length \(2^d\).3
In what follows, we use the notation \(\mathcal {O}_F(\mathbf {x})\) to denote the first-order oracle which, when queried at some vector \(\mathbf {x}\), returns the pair \((F(\mathbf {x}), \mathbf {g}_{\mathbf {x}})\) where \(\mathbf {g}_{\mathbf {x}} \in \partial F(\mathbf {x})\) is a subgradient of F at \(\mathbf {x}\). We will also refer to a sub-gradient oracle and use the overloaded notation \(\mathbf {g}_F(\mathbf {x})\) to denote the oracle which simply returns \(\mathbf {g}_{\mathbf {x}}\). It will be useful to refer to the sequence of vectors \(\mathbf {x}\) queried by some algorithm \(\mathcal {A}\) paired with the subgradient \(\mathbf {g}\) returned by the oracle.
Definition 2.3 (Query sequence).
Given an algorithm \(\mathcal {A}\) with access to some first-order oracle \(\mathcal {O}\) we let \(\mathsf {Seq}(\mathcal {A}, \mathcal {O}) = \left\lbrace (\mathbf {x}_i, \mathbf {g}_i) \right\rbrace _{i}\) denote the query sequence of vectors \(\mathbf {x}_i\) queried by the algorithm paired with the subgradient \(\mathbf {g}_i\) returned by the oracle \(\mathcal {O}\).

2.1 Proof Strategy

We describe a broad family of optimization problems which may be sensitive to memory constraints (also see Figure 2 for an overview). As suggested by the Orthogonal Vector Game (Definition 1.2), the primitive we leverage is that finding vectors orthogonal to the rows of a given matrix requires either large memory or many queries to observe the rows of the matrix. With that intuition in mind, let \(f : \mathbb {R}^d \rightarrow \mathbb {R}\) be a convex function, let \(\mathbf {A}\in \mathbb {R}^{n \times d}\), let \(\eta\) be a scaling parameter, and let \(\rho\) be a shift parameter; define \(F_{f, \mathbf {A}, \eta , \rho }(\mathbf {x})\) as the maximum of \(f(\mathbf {x})\) and \(\eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\):
\begin{equation} F_{f, \mathbf {A}, \eta , \rho }(\mathbf {x}) := \max \left\lbrace f(\mathbf {x}), \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho \right\rbrace . \end{equation}
(2.1)
We often drop the dependence on \(\eta\) and \(\rho\) and write \(F_{f, \mathbf {A}, \eta , \rho }\) simply as \(F_{f, \mathbf {A}}\). Intuitively, for large enough scaling \(\eta\) and appropriate shift \(\rho\), minimizing the function \(F_{f, \mathbf {A}}(\mathbf {x})\) requires minimizing \(f(\mathbf {x})\) close to the null space of the matrix \(\mathbf {A}\). Any algorithm which uses memory \(\Omega (nd)\) can learn and store \(\mathbf {A}\) in \(\mathcal {O}(d)\) queries so that all future queries are sufficiently orthogonal to \(\mathbf {A}\); thus this memory-rich algorithm can achieve the information-theoretic lower bound for minimizing \(f(\mathbf {x})\) roughly constrained to the nullspace of \(\mathbf {A}\).
Fig. 2.
Fig. 2. A high-level overview of our proof approach. The rows of \(\mathbf {A}\) and the Nemirovski vectors \(\lbrace \mathbf {v}_1,\dots , \mathbf {v}_N\rbrace\) are sampled uniformly at random from the hypercube. \(\mathbf {x}_i\) is the first query such that the oracle returns \(\mathbf {v}_i\) as the response. We show that this function class is “memory-sensitive” (Definition 2.8) and has the following properties: (1) to successfully minimize the function, the algorithm must see a sufficiently large number of Nemirovski vectors, (2) to reveal new Nemirovski vectors, an algorithm must make queries which are robustly linearly independent and orthogonal to \(\mathbf {A}\). Using these properties, we show that minimizing the function is at least as hard as winning the Orthogonal Vector Game (Definition 1.2) \(\approx N/k\) times. Specifically, we show memory-query tradeoffs for the Orthogonal Vector Game where the goal is to obtain k vectors. This tradeoff is then leveraged \(\approx N/k\) times to obtain the memory-query tradeoff for the optimization problem.
However, if \(\mathbf {A}\) is a random matrix with sufficiently large entropy then \(\mathbf {A}\) cannot be compressed to fewer than \(\Omega (n d)\) bits. Thus, for \(n = \Omega (d)\), an algorithm which uses only memory \(o(d^{2-\delta })\) bits for some constant \(0 \lt \delta \le 1\) cannot remember all the information about \(\mathbf {A}\). Suppose the function f is such that in order to continue to observe new information about the function, it is insufficient to submit queries that belong to some small dimensional subspace of the null space of \(\mathbf {A}\). Then a memory-constrained algorithm must relearn enough information about \(\mathbf {A}\) in order to find new vectors in the null space of \(\mathbf {A}\) and make queries which return new information about f.
To show our lower bound, we set f in (2.1) to be the Nemirovski function:
\begin{align*} f(\mathbf {x}) = \max _{i \in [N]} \left(\mathbf {v}_i^{\top } \mathbf {x}- i \gamma \right), \end{align*}
where the vectors \(\mathbf {v}_1,\ldots ,\mathbf {v}_N\) are unit vectors drawn i.i.d. from the d dimensional scaled hypercube \(\mathbf {v}_i \overset{\text{i.i.d.}}{\sim } \mathsf {Unif} (d^{-1/2} \mathcal {H}_d)\) and we refer to them as “Nemirovski vectors.” With this choice of f, the overall function \(F_{f, \mathbf {A}}(\mathbf {x})\) has certain “memory-sensitive” properties. In particular, to reveal new Nemirovski vectors an algorithm cannot make queries which lie in a low-dimensional subspace. In addition, because of the \(\left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty\) term in the definition of \(F_{f, \mathbf {A}}(\mathbf {x})\), queries which reveal new Nemirovski vectors must also be sufficiently orthogonal to \(\mathbf {A}\). Together, these properties imply that a sequence of queries which reveals a new set of Nemirovski vectors must also be winning queries for the Orthogonal Vector Game (Definition 1.2). This allows us to leverage our information-theoretic memory-query tradeoffs for the Orthogonal Vector Game. We show that the algorithm must reveal a sufficiently large number of Nemirovski vectors to optimize \(F_{f, \mathbf {A}}(\mathbf {x})\), therefore, we can repeatedly apply the lower bound for the Orthogonal Vector Game to show our final lower bound (also see Figure 2 for a visual overview of the approach).

2.2 Proof Components

To summarize Section 2.1, there are two critical features of \(F_{f,\mathbf {A}}\) which lead to a difficult optimization problem for memory-constrained algorithms. First, the distribution underlying random instances of \(\mathbf {A}\) must ensure that \(\mathbf {A}\) cannot be compressed. Second, the function f must be such that queries which belong to a small dimensional subspace cannot return too many informative subgradients. We formalize these features by defining a memory-sensitive base and a memory-sensitive class. We show that if \(\mathbf {A}\) is drawn uniformly at random from a memory-sensitive base and if f and a corresponding first-order oracle are drawn according to a memory-sensitive class then the resulting distribution over \(F_{f, \mathbf {A}}(\mathbf {x})\) and its first-order oracle provides a difficult instance for memory-constrained algorithms to optimize.
Definition 2.4 (\((k, c_{B})\)-memory-sensitive base)
Consider a sequence of sets \(\left\lbrace B_d \right\rbrace _{d\gt 0}\) where \(B_d \subset \mathbb {R}^d\). We say \(\left\lbrace B_d \right\rbrace\) is a \((k, c_{B})\)-memory-sensitive base if for any d, \(\left| B_d \right| \lt \infty\) and for \(\mathbf {h}\sim \mathsf {Unif}(B_d)\) the following hold: for any matrix \(\mathbf {Z} = \left(\mathbf {z}_1, \dots , \mathbf {z}_{k} \right) \in \mathbb {R}^{d \times k}\) with orthonormal columns,
\begin{equation*} \mathbb {P}\left(\left\Vert {\mathbf {h}}\right\Vert _2 \gt d \right) \le 2^{-d} \qquad \textrm {and} \qquad \mathbb {P} \left(\left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _\infty \le 1/2 \right) \le 2^{- c_{B}k}. \end{equation*}
Setting \(n = \lfloor d/2 \rfloor\), we use \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\) to mean \(\mathbf {A}= \left(\mathbf {a}_1, \dots , \mathbf {a}_n \right)^{\top }\) where each \(\mathbf {a}_i \sim \mathsf {Unif}(B_d)\). Theorem 2.5, proven in Section 4, shows that the sequence of hypercubes is a memory-sensitive base.
Theorem 2.5.
Recall that \(\mathcal {H}_d = \left\lbrace \pm 1 \right\rbrace ^d\) denotes the hypercube. The sequence \(\left\lbrace \mathcal {H}_d \right\rbrace _{d = d_0}^{\infty }\) is a \((k,c_{\mathcal {H}})\) memory-sensitive base for all \(k\in [d]\) with some absolute constant \(c_{\mathcal {H}}\gt 0\).
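The second condition of Definition 2.4 can be probed numerically for the hypercube: the sketch below (illustrative parameters; of course not a proof, and only for one randomly chosen \(\mathbf {Z}\), whereas the definition requires the bound uniformly over all orthonormal \(\mathbf {Z}\)) estimates \(\mathbb {P}(\left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _\infty \le 1/2)\) and shows its exponential decay in k.

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials = 128, 50_000
H = rng.choice([-1.0, 1.0], size=(trials, d))              # rows: h ~ Unif({±1}^d)
for k in (1, 2, 4, 6):
    Z, _ = np.linalg.qr(rng.standard_normal((d, k)))       # d x k orthonormal columns
    frac = np.mean(np.max(np.abs(H @ Z), axis=1) <= 0.5)   # empirical P(||Z^T h||_inf <= 1/2)
    print(f"k={k}: estimated probability {frac:.4f}")      # decays roughly like 2^(-c k)
```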
Next, we consider distributions over functions f paired with some subgradient oracle \(\mathbf {g}_f\). Given f and some \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\) we consider minimizing \(F_{f, \mathbf {A}}\) as in Equation (2.1) with access to an induced first-order oracle:
Definition 2.6 (Induced First-Order Oracle \(\mathcal {O}_{F_{f, \mathbf {A}}}\))
Given a subgradient oracle for f, \(\mathbf {g}_f:\mathbb {R}^d \rightarrow \mathbb {R}^d\), matrix \(\mathbf {A}\in \mathbb {R}^{n \times d}\), and parameters \(\eta\) and \(\rho\), let
\[\begin{gather*} \mathbf {g}_{\mathbf {A}}(\mathbf {x}) := \mathbf {a}_{i_{\min }}, \textrm { where } i_{\min } := \min \left\lbrace i \in [n] \textrm { s.t. } \mathbf {a}_i^{\top } \mathbf {x}= \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty \right\rbrace \!, \end{gather*}\]
\[\begin{gather*} \mathbf {g}_{F_{f, \mathbf {A}}}(\mathbf {x}) := \begin{cases} \mathbf {g}_{\mathbf {A}}(\mathbf {x}) & \textrm {if } \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho \ge f(\mathbf {x}), \\ \mathbf {g}_f(\mathbf {x}) & \textrm {otherwise.} \end{cases} \end{gather*}\]
The induced first-order oracle for \(F_{f, \mathbf {A}}(\mathbf {x})\), denoted as \(\mathcal {O}_{F_{f, \mathbf {A}}}\) returns the pair \((F_{f, \mathbf {A}}(\mathbf {x}),\mathbf {g}_{F_{f, \mathbf {A}}}(\mathbf {x}))\).
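A minimal sketch of this induced oracle (the factory name make_induced_oracle is ours; the smallest-index tie-break follows the definition, with attainment of \(\left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty\) checked in absolute value in this sketch):

```python
import numpy as np

def make_induced_oracle(f, g_f, A, eta, rho):
    """Sketch of the induced first-order oracle O_{F_{f,A}} of Definition 2.6."""
    def oracle(x):
        Ax = A @ x
        wall = eta * np.max(np.abs(Ax)) - rho       # eta * ||Ax||_inf - rho
        value = max(wall, f(x))                     # F_{f,A,eta,rho}(x)
        if wall >= f(x):
            # g_A(x) = a_{i_min}: smallest-index row attaining ||Ax||_inf.
            i_min = int(np.flatnonzero(np.abs(Ax) == np.max(np.abs(Ax)))[0])
            g = A[i_min]
        else:
            g = g_f(x)                              # subgradient of f
        return value, g
    return oracle
```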
We also define informative subgradients, formalizing the intuition described at the beginning of this section.
Definition 2.7 (Informative subgradients).
Given a query sequence \(\mathsf {Seq}(\mathcal {A}, \mathcal {O}_{F_{f, \mathbf {A}}}) = \left\lbrace (\mathbf {x}_i, \mathbf {g}_i) \right\rbrace _{i=1}^{T}\), we construct the sub-sequence of informative subgradients as follows: proceed from \(i = 1, \dots , T\) and include the pair \((\mathbf {x}_i, \mathbf {g}_i)\) if and only if \((1)\) \(f(\mathbf {x}_{i}) \gt \eta \left\Vert {\mathbf {A}\mathbf {x}_{i}}\right\Vert _\infty - \rho\) and \((2)\) no pair \((\mathbf {x},\mathbf {g})\) such that \(\mathbf {g}=\mathbf {g}_i\) has been selected in the sub-sequence so far. If \((\mathbf {x}_i, \mathbf {g}_i)\) is the \({j}^{\textrm {th}}\) pair included in the sub-sequence define \(t_j=i\) and we call \(\mathbf {x}_i\) the \({j}^{\textrm {th}}\) informative query and \(\mathbf {g}_i\) the \({j}^{\textrm {th}}\) informative subgradient.
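The construction in Definition 2.7 is mechanical, and the following sketch (a hypothetical helper, for illustration) extracts the informative sub-sequence from a given query sequence.

```python
import numpy as np

def informative_subsequence(seq, f, A, eta, rho):
    """Extract the informative (query, subgradient) pairs of Definition 2.7.

    seq : list of (x_i, g_i) pairs, i.e., Seq(A, O_{F_{f,A}}).
    Returns a list of triples (t_j, x_{t_j}, g_{t_j}).
    """
    informative, seen = [], []
    for i, (x, g) in enumerate(seq, start=1):
        above_wall = f(x) > eta * np.max(np.abs(A @ x)) - rho    # condition (1)
        is_new = not any(np.array_equal(g, h) for h in seen)     # condition (2)
        if above_wall and is_new:
            informative.append((i, x, g))
            seen.append(g)
    return informative
```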
We can now proceed to the definition of an \((L, N, k, \epsilon ^{\star })\)-memory-sensitive class.
Definition 2.8 (\((L, N, k, \epsilon {^\star })\)-memory-sensitive class)
Let \(\mathcal {F}\) be a distribution over functions \(f: \mathbb {R}^d \rightarrow \mathbb {R}\) paired with a valid subgradient oracle \(\mathbf {g}_f:\mathbb {R}^d \rightarrow \mathbb {R}^d\). For Lipschitz constant L, “depth” N, “round-length” k, and “optimality threshold” \(\epsilon {^\star }\), we call \(\mathcal {F}\) a \((L, N, k, \epsilon {^\star })\)-memory-sensitive class if one can sample from \(\mathcal {F}\) with at most \(2^d\) uniformly random bits, and there exists a choice of \(\eta\) and \(\rho\) (from Equation (2.1)) such that for any \(\mathbf {A}\in \mathbb {R}^{n \times d}\) with rows bounded in \(\ell _2\)-norm by d and any (potentially randomized) unbounded-memory algorithm \(\mathcal {A}_{\textrm {rand}}\) which makes at most Nd queries: if \((f, \mathbf {g}_f) \sim \mathcal {F}\) then with probability at least \(2/3\) (over the randomness in \((f,\mathbf {g}_f)\) and \(\mathcal {A}_{\textrm {rand}}\)) the following hold simultaneously:
(1)
Regularity: \(F_{f, \mathbf {A}, \eta , \rho }\) is convex and L-Lipschitz.
(2)
Robust independence of informative queries: If \(\lbrace \mathbf {x}_{t_j}\rbrace\) is the sequence of informative queries generated by \(\mathcal {A}_{\textrm {rand}}\) (as per Definition 2.7) and \(S_{j} := \mathsf {span}(\lbrace \mathbf {x}_{t_i}: \max (1, j-k) \le i \le j \rbrace)\) then \(\forall j \in [N]\)
\begin{equation} \left\Vert {\mathrm{Proj}_{S_{j-1}}({\mathbf {x}}_{t_j})}\right\Vert _2/\left\Vert {{\mathbf {x}}_{t_j}}\right\Vert _2 \le 1 - 1/d^2. \end{equation}
(2.2)
(3)
Approximate orthogonality: Any query \(\mathbf {x}\) with \(F_{f, \mathbf {A}, \eta , \rho }(\mathbf {x}) \ne \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\) satisfies
\begin{equation} \mathbf {g}_{F_{f, \mathbf {A}}}(\mathbf {x}) = \mathbf {v}_1 \textrm { or } \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty /\left\Vert {\mathbf {x}}\right\Vert _2\le d^{-4}. \end{equation}
(2.3)
(4)
Necessity of informative subgradients: If \(r \lt N\) then for any \(i \le t_r\) (where \(t_r\) is as in Definition 2.7),
\begin{equation} F_{f,\mathbf {A},\eta , \rho }(\mathbf {x}_i) - F_{f,\mathbf {A}, \eta , \rho }(\mathbf {x}^\star) \ge \epsilon {^\star }. \end{equation}
(2.4)
Informally, the robust independence of informative queries condition ensures that any informative query is robustly linearly independent with respect to the previous k informative queries. The approximate orthogonality condition ensures that any query \(\mathbf {x}\) which reveals a gradient from f which is not \(\mathbf {v}_1\) is approximately orthogonal to the rows of \(\mathbf {A}\). Finally, the necessity of informative subgradients condition ensures that any algorithm which has queried only \(r \lt N\) informative subgradients from f has an optimality gap of at least \(\epsilon {^\star }\).
The following construction, discussed in the introduction, is a concrete instance of a \((d^6, N, k, 1/(20 \sqrt {N}))\)-memory-sensitive class. Theorem 2.10 proves that the construction is memory-sensitive; the proof appears in Section 4.
Definition 2.9 (Nemirovski class).
For a given \(\gamma \gt 0\) and for \(i \in [N]\) draw \(\mathbf {v}_i \overset{\text{i.i.d.}}{\sim } \mathsf {Unif} \left(d^{-1/2} \mathcal {H}_d \right)\) and set
\begin{equation*} f(\mathbf {x}) := \max _{i \in [N]} \mathbf {v}_i^{\top } \mathbf {x}- i \gamma \qquad \textrm { and } \qquad \mathbf {g}_f(\mathbf {x}) = \mathbf {v}_{i_{\min }}, \end{equation*}
where \(i_{\min } := \min \lbrace i \in [N] \textrm { s.t. }\) \(\mathbf {v}_i^{\top } \mathbf {x}- i \gamma = f(\mathbf {x}) \rbrace\). Let \(\mathcal {N}_{N, \gamma }\) be the distribution of \((f,\mathbf {g}_f)\), and for a fixed matrix \(\mathbf {A}\) let \(\mathcal {N}_{N, \gamma , \mathbf {A}}\) be the distribution of the induced first-order oracle in Definition 2.6.
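For concreteness, here is a sketch of sampling \((f, \mathbf {g}_f)\) from the Nemirovski class (the helper name is ours; NumPy's argmax returns the first maximizer, which implements the smallest-index tie-break); the resulting pair can be plugged into the induced-oracle construction of Definition 2.6.

```python
import numpy as np

def sample_nemirovski(d, N, gamma, rng):
    """Sample (f, g_f) from the Nemirovski class N_{N,gamma} of Definition 2.9."""
    V = rng.choice([-1.0, 1.0], size=(N, d)) / np.sqrt(d)   # v_i ~ Unif(d^{-1/2} H_d)
    offsets = gamma * np.arange(1, N + 1)                   # the i*gamma terms
    def f(x):
        return np.max(V @ x - offsets)
    def g_f(x):
        i_min = int(np.argmax(V @ x - offsets))             # smallest index attaining the max
        return V[i_min]                                     # subgradient v_{i_min}
    return f, g_f
```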
Theorem 2.10.
For large enough dimension d there is an absolute constant \(c \gt 0\) such that for any given k, the Nemirovski function class with \(\gamma = (400 k\log (d) / d)^{1/2}\), \(L = d^6\), \(N= c (d/ (k\log d))^{1/3}\), and \(\epsilon {^\star }=1/(20 \sqrt { N})\) is a \((L, N, k, \epsilon {^\star })\)-memory-sensitive class.
Note that choosing \(k= \lceil 60 M/ (c_{\mathcal {H}} d) \rceil\) gives a \((d^6, N, \lceil 60 M/(c_{\mathcal {H}}d) \rceil , 1/(20 \sqrt {N}))\)-memory-sensitive class with \(N= \frac{1}{205} (\frac{ c_{\mathcal {H}} d^2}{M\log d })^{1/3} \approx (d^2/M)^{1/3}\).

2.3 Main Results

With the definitions from the previous section in place, we now state our main technical theorem for proving lower bounds on memory-constrained optimization.
Theorem 2.11 (From memory-sensitivity to lower bounds).
For \(d \in \mathbb {Z}_{\gt 0}\), \(M\in \mathbb {Z}_{\ge d}\), and \(\epsilon \in \mathbb {R}_{\gt 0}\), let \(\mathbf {A}\sim \mathsf {Unif}(B_d^{\lfloor d/2 \rfloor })\) for \((k, c_{B})\)-memory-sensitive base \(\left\lbrace B_d \right\rbrace _{d \gt 0}\), and let \((f, \mathbf {g}_f) \sim \mathcal {F}\) for \((L, N, k, \epsilon)\)-memory-sensitive class \(\mathcal {F}\), with \(k\ge \lceil 60 M/ (c_{B}d)\rceil\). Any M-bit memory-constrained randomized algorithm (as per Definition 2.2) which outputs an \(\epsilon\)-optimal point for \(F_{f, \mathbf {A}}\) with probability at least \(2/3\) requires \(\Omega (c_B d^2 N / M)\) queries to the first-order oracle, \(\mathcal {O}_{F_{f, \mathbf {A}}}\) (where \(F_{f,\mathbf {A}}\) and \(\mathcal {O}_{F_{f, \mathbf {A}}}\) are as defined in Definition 2.6).
Theorem 2.11 is agnostic to the particular memory-sensitive class and instead describes how any given memory-sensitive class can be used to derive a query complexity lower bound. The ultimate bound derived using a particular memory-sensitive class depends on the class’s relationship between Lipschitz constant L, depth N, and optimality threshold \(\epsilon\).
In Theorem 2.10, we exhibit a \((d^6, N, \lceil 60 M/(c_{\mathcal {H}}d) \rceil , 1/(20 \sqrt {N}))\)-memory-sensitive class. For any given L, by scaling this function class by \(L/d^6\), this implies the existence of an \((L, N, \lceil 60 M/(c_{\mathcal {H}}d) \rceil , L/(20 d^6 \sqrt {N}))\)-memory-sensitive class. Theorem 2.11, along with the existence of a memory-sensitive base from Theorem 2.5, then implies a lower bound of \(\Omega (c_B d^2 N / M)\) first-order oracle queries. Setting \(\epsilon = L/(20 d^6 \sqrt {N})\) and recalling that Theorem 2.10 requires \(N \le (c d^2/(M \log d))^{1/3}\) for some absolute constant \(c \gt 0\), we obtain the following memory-query tradeoff parameterized by \(\epsilon\): For any \(\epsilon \in [LM^{1/6}/(16d^{19/3}), L/(4 \sqrt {d})]\), any M-bit memory-constrained randomized algorithm (as per Definition 2.2) which outputs an \(\epsilon\)-optimal point with probability at least \(2/3\) requires \(\Omega (c_{B}L^2d^2/(16^2 d^{12} \epsilon ^2 M))\) first-order oracle queries. In Theorem 1.1, we explicitly choose \(\epsilon\) as a function of M and d to yield a result that is easy to parse.
Proof of Theorem 1.1
Consider M-bit memory-constrained (potentially) randomized algorithms (as per Definition 2.2) where M can be written in the form \(d^{1.25 - \delta }\) for some \(\delta \in [0,1/4]\). By Theorem 2.5, \(\lbrace \mathcal {H}_d\rbrace _{d = d_0}^{\infty }\) is a \((k,c_{\mathcal {H}})\) memory-sensitive base for all \(k \in [d]\), where \(c_{\mathcal {H}}\gt 0\) is an absolute constant. By Theorem 2.10, for some absolute constant \(c \gt 0\) and for d large enough and any given k, if \(\gamma =\sqrt {400 k\log d/d}\), \(N= c (d/(k\log d))^{1/3}\), and \(\epsilon = 1/(20 d^6 \sqrt {N}) {\ge 1/d^7}\) (since \(N\le d\)), the Nemirovski function class from Definition 2.9 is \((1, N, k, \epsilon)\)-memory-sensitive (where we rescaled by the Lipschitz constant \(1/d^6\)). Set \(k= \lceil 60 M/ (c_{\mathcal {H}} d) \rceil\). Let \((f, \mathbf {g}_f) \sim \mathcal {N}_{N, \gamma }\) and \(\mathbf {A}\sim \mathsf {Unif} (\mathcal {H}_d^n)\) and suppose \(\mathcal {A}_{\textrm {rand}}\) is an M-bit algorithm which outputs an \(\epsilon\)-optimal point for \(F_{f, \mathbf {A}}\) with failure probability at most \(1/3\). By Theorem 2.11, \(\mathcal {A}_{\textrm {rand}}\) requires at least \(c_{\mathcal {H}} Nd^2/M\ge c (d^2/M)^{4/3}(1/\log ^{1/3} d)\) many queries. Recalling that \(M= d^{1.25 - \delta }\) for \(\delta \in [0, 1/4]\), this results in a lower bound of \(\tilde{\Omega }(d^{1+ \frac{4}{3}\delta })\) queries. □
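For the reader's convenience, the exponent arithmetic in the last two sentences of the proof can be spelled out as follows (suppressing absolute constants): since \(k \approx M/d\) gives \(N \approx (d^2/(M \log d))^{1/3}\) and \(M = d^{1.25 - \delta }\),
\begin{align*} \frac{Nd^2}{M} \approx \frac{1}{\log ^{1/3} d} \left(\frac{d^2}{M}\right)^{4/3} = \frac{1}{\log ^{1/3} d} \left(d^{2 - (1.25 - \delta)} \right)^{4/3} = \frac{d^{1 + \frac{4}{3}\delta }}{\log ^{1/3} d} = \tilde{\Omega }\left(d^{1 + \frac{4}{3}\delta } \right). \end{align*}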
Note that by [72], the center of mass algorithm can output an \(\epsilon\)-optimal point for any L-Lipschitz function using \(\mathcal {O}(d^2 \log ^2 \left(LB/\epsilon \right))\) bits of memory and \(\mathcal {O}\left(d \log (LB/\epsilon) \right)\) first-order oracle queries. Comparing this with the lower bound from Theorem 2.11, we see that no algorithm using fewer than \(d N/\log (L/\epsilon)\) bits of memory can achieve the optimal query complexity attained with quadratic memory. Furthermore, Theorem 2.11 yields a natural approach for obtaining improved lower bounds. If one could exhibit a memory-sensitive class with depth \(N = d\), then Theorem 2.11 would imply that any algorithm using memory of size at most \(M= d^{2 - \delta }\) requires \(\Omega (d^{1 + \delta })\) first-order oracle queries in order to output an \(\epsilon\)-optimal point. We note that the subsequent work of [13] shows that a memory-sensitive class with a similar property can be adaptively constructed for any deterministic algorithm, showing that quadratic memory is necessary for deterministic algorithms.
Remark 2.12 (Why \(d^{4/3}\): Parameter tradeoffs in our lower bound.)
An idealized lower bound in this setting would arise from exhibiting an f such that to obtain any informative gradient requires \(\Theta (d)\) more queries to re-learn \(\mathbf {A}\) and in order to optimize f we need to observe at least \(\Omega (d)\) such informative gradients. Our proof deviates from this ideal on both fronts. We do prove that we need \(\Theta (d)\) queries to get any informative gradients, though with this number we cannot preclude getting \(M/d\) informative gradients. Second, our Nemirovski function can be optimized while learning only \(\mathcal {O}((d^{2}/M)^{1/3})\) informative gradients (there are modifications to Nemirovski that increase this number [18], though for those functions, informative gradients can be observed while working within a small subspace). For our analysis, we have, very roughly, that \(\Omega (d)\) queries are needed to observe \(M/d\) informative gradients and we must observe \((d^{2}/M)^{1/3}\) informative gradients to optimize the Nemirovski function. Therefore, optimizing the Nemirovski function requires \(\Omega \left(d \right) \times \frac{(d^{2}/M)^{1/3}}{M/d} = (d^2/M)^{4/3}\) many queries; and so when \(M = \mathcal {O}(d)\) we have a lower bound of \(d^{4/3}\) queries.

3 Proof Overview

Here we give the proof of Theorem 2.11. In Section 3.1, we present our first key idea, which relates the optimization problem in (2.1) to the Orthogonal Vector Game (which was introduced informally in Definition 1.2). Next, in Section 3.2, we show that winning this Orthogonal Vector Game is difficult with memory constraints.

3.1 The Orthogonal Vector Game

We formally define the Orthogonal Vector Game in Game 1. The game proceeds in three parts. In Part 1, the oracle samples a random \((d/2) \times d\) matrix \(\mathbf {A}\), and the player gets to observe the matrix and write down an M-bit message \(\mathsf {Message}\) about the matrix. In Part 2, the player can make queries to the oracle. The oracle responds to any query \(\mathbf {x}\) with a row of the matrix \(\mathbf {A}\) which has the largest magnitude inner product with \(\mathbf {x}\). In Part 3, the player uses \(\mathsf {Message}\) and all the queries and responses from the previous part to find a set of k vectors which are robustly linearly independent and roughly orthogonal to all the rows of \(\mathbf {A}\). The player wins the game if she can do this successfully.
Next, we relate winning the Orthogonal Vector Game to minimizing \(F_{f,\mathbf {A}}\) (Equation (2.1)).
Lemma 3.1 (Optimizing \(F_{f, \mathbf {A}}\) is harder than winning the Orthogonal Vector Game)
Let \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\) and \((f,\mathbf {g}_f) \sim \mathcal {F}\), where \(\mathcal {F}\) is a \((L, N, k, \epsilon {^\star })\) memory-sensitive class. If there exists an M-bit algorithm with first-order oracle access that outputs an \(\epsilon {^\star }\)-optimal point for minimizing \(F_{f, \mathbf {A}}\) using \(m \lfloor N/(k+ 1) \rfloor\) queries with probability at least \(2/3\) over the randomness in the algorithm and choice of f and \(\mathbf {A}\), then the Player can win the Orthogonal Vector Game with probability at least \(1/3\) over the randomness in the Player’s strategy and \(\mathbf {A}\).
Here we provide a brief proof sketch before the detailed proof: Let \(\mathcal {A}_{\textrm {rand}}\) denote a hypothetical M-bit memory algorithm for optimizing \(F_{f,\mathbf {A}}\) (Definition 2.2). The Player samples a random function/oracle pair \((f,\mathbf {g}_f)\) from the memory-sensitive class \(\mathcal {F}\). Then, she uses \(\mathcal {A}_{\textrm {rand}}\) to optimize \(F_{f, \mathbf {A}}\), and issues the queries that \(\mathcal {A}_{\textrm {rand}}\) makes as her own queries to the oracle in the Orthogonal Vector Game. Notice that with access to \(\mathbf {g}_f(\mathbf {x})\) and the oracle’s response \(\mathbf {g}_{\mathbf {A}}(\mathbf {x})\), she can implement the sub-gradient oracle \(\mathbf {g}_{F_{f, \mathbf {A}}}\) (Definition 2.6). We then use the memory-sensitive properties of f to argue that informative queries made by \(\mathcal {A}_{\textrm {rand}}\) must also be successful queries in the Orthogonal Vector Game, and that \(\mathcal {A}_{\textrm {rand}}\) must make enough informative queries to find an \(\epsilon {^\star }\)-optimal point.
Proof of Lemma 3.1
In Algorithm 2, we provide a strategy for the Player which uses \(\mathcal {A}_{\textrm {rand}}\) (the M-bit algorithm that minimizes \(F_{f,\mathbf {A}}\)) to win the Orthogonal Vector Game with probability at least \(1/3\) over the randomness in R and \(\mathbf {A}\).
The Player uses the random string \(R = (R_1, R_2, R_3)\) to sample a function/oracle pair \((f, \mathbf {g}_f) \sim \mathcal {F}\), and to run the algorithm \(\mathcal {A}_{\textrm {rand}}\) (if it is randomized). The first section \(R_1\) of the random string is used to sample \((f, \mathbf {g}_f) \sim \mathcal {F}\). Note by Definition 2.8, \((f, \mathbf {g}_f)\) can be sampled from \(\mathcal {F}\) with at most \(2^d\) random bits, therefore the Player can sample \((f, \mathbf {g}_f) \sim \mathcal {F}\) using \(R_1\). The Player uses \(R_2\) and \(R_3\) to supply any random bits for the execution of \(\mathcal {A}_{\textrm {rand}}\), which by Definition 2.2 of \(\mathcal {A}_{\textrm {rand}}\), is a sufficient number of random bits.
Let G be the event that \(R_1,R_2,R_3\), and \(\mathbf {A}\) have the property that (1) \(\mathcal {A}_{\textrm {rand}}\) succeeds in finding an \(\epsilon ^\star\)-optimal point for \(F_{f,\mathbf {A}}(\mathbf {x})\), and (2) all the properties of a memory-sensitive class in Definition 2.8 are satisfied. By the guarantee in Lemma 3.1, \(\mathcal {A}_{\textrm {rand}}\) finds an \(\epsilon ^\star\)-optimal point with failure probability at most \(1/3\) over the randomness in \(\mathbf {A}\), \(R_1\) (the randomness in \((f, \mathbf {g}_f)\)) and \(R_2, R_3\) (the internal randomness used by \(\mathcal {A}_{\textrm {rand}}\)). Also, the properties in Definition 2.8 are satisfied with failure probability at most \(1/3\) by definition. Therefore, by a union bound, \(\Pr [G]\ge 1/3\). We condition on G for the rest of the proof.
We now prove that Part 1, defined in Algorithm 2, does not fail under event G. Recall the definition of informative subgradients from Definition 2.7, and note that by the necessity of informative subgradients condition in Definition 2.8, if \(\mathcal {A}_{\textrm {rand}}\) finds an \(\epsilon ^\star\)-optimal point, then \(\mathcal {A}_{\textrm {rand}}\) must observe at least N informative subgradients. Therefore, since \(\mathcal {A}_{\textrm {rand}}\) can find an \(\epsilon ^\star\)-optimal point using \(m \lfloor N/(k+ 1) \rfloor\) queries, it observes N informative subgradients from f within \(m \lfloor N/(k+ 1) \rfloor\) queries. Therefore, under event G, Part 1 does not fail for any index i in the for loop. Now we show that under event G there must exist some index i such that Part 1 returns \(\mathsf {Memory}_{i}\) as the \(\mathsf {Message}\) to be stored. For any execution of the algorithm, let \(\mathbf {v}_i\) denote the \({i}^{\textrm {th}}\) informative subgradient from f. Block the \(\mathbf {v}_i\)’s into \(\lfloor N/(k+ 1) \rfloor\) groups of size \((k+ 1)\): \(\mathsf {Block}_i = \left\lbrace \mathbf {v}_{{(i-1) (k+ 1) + 1}}, \dots , \mathbf {v}_{i (k+ 1)} \right\rbrace\). If \(\mathcal {A}_{\textrm {rand}}\) observes N informative subgradients from f using \(m \lfloor N/(k+ 1) \rfloor\) queries, then by the pigeonhole principle there is an index \(i^\star\) such that \(\mathcal {A}_{\textrm {rand}}\) observes \(\mathsf {Block}_{i^\star }\) using at most m queries. Therefore, the number of queries to observe \((k+1)\) more informative subgradients is at most m for iteration \(i^\star\). Hence Part 1 does not fail under event G and always returns \(\mathsf {Memory}_{i}\) as the \(\mathsf {Message}\) for some index i.
In Part 2 of the strategy, the Player no longer has access to \(\mathbf {A}\), but by receiving the Oracle’s responses \(\mathbf {g}_i\) she can still implement the first-order oracle \(\mathcal {O}_{F_{f, \mathbf {A}}}\) (as defined in Definition 2.6) and hence run \(\mathcal {A}_{\textrm {rand}}\). Consequently, to complete the proof we will now show that under event G the Player can find a set of successful vectors in Part 3 and win. By the guarantee of Part 1, the Player has made at least \(k+ 1\) informative queries among the m queries she made in Part 2. By Definition 2.8, if \(\mathbf {x}_i\) is an informative query, then the query satisfies Equations (2.3) and (2.2). Using this, if the Player has observed \(k+ 1\) informative subgradients then she has made at least k queries which are successful (where the extra additive factor of one comes from the fixed vector \(\mathbf {v}_1\) in Equation (2.3), and note that the subspace in Equation (2.2) is defined as the span of all previous k informative queries, and this will contain the subspace defined in the robust linear independence condition). Therefore, she will find a set of k successful vectors among the m queries made in Part 2 under event G. Since G happens with probability at least \(1/3\), she wins with probability at least \(1/3\). □

3.2 Analyzing the Orthogonal Vector Game: Proof Sketch

With Lemma 3.1 in hand, the next step towards proving Theorem 2.11 is to establish a query lower bound for any memory-constrained algorithm which wins the Orthogonal Vector Game. Although our results hold in greater generality, i.e., when \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\) for some memory-sensitive base \(\lbrace B_d\rbrace _{d \gt 0}\), we first provide some intuition by considering the concrete case where the rows of \(\mathbf {A}\) are drawn uniformly at random from the hypercube. We also consider an alternative oracle model which we call the Index-Oracle model, for which it is perhaps easier to get intuition and for which our lower bound still holds: suppose the Player can specify any index \(i \in [n]\) as a query and ask the oracle to reveal the \({i}^{\textrm {th}}\) row of \(\mathbf {A}\) (this is the setup in the informal version of the game in Definition 1.2). Note that this is in contrast to the oracle in the Orthogonal Vector Game which instead responds with the row of the matrix \(\mathbf {A}\) which is a subgradient of the query made by the algorithm. As described in Section 1.1, the instance where \(M=\Omega (kd), m=0\) and the instance where \(M=0, m=n\) are trivially achievable for the Player. We show that these trivial strategies are essentially all that is possible:
Theorem 3.2 (Informal version of Theorem 3.3).
For some constant \(c\gt 0\) and any \(k\ge \Omega (\log (d))\), if \(M\le cdk\) and the Player wins the Orthogonal Vector Game with probability at least \(1/3\), then \(m\ge d/5\).
Note that when \(m=0\), it is not difficult to show that the Player requires memory roughly \(\Omega (kd)\) to win the game with any decent probability. This is because each vector which is approximately orthogonal to the rows of \(\mathbf {A}\) will not be compressible below \(\Omega (d)\) bits with high probability, since rows of \(\mathbf {A}\) are drawn uniformly at random from the hypercube. Therefore, the Player must use \(M\ge \Omega (kd)\) to store k such vectors. The main challenge in the analysis is to bound the power of additional, adaptively issued queries. In particular, since observing every row \(\mathbf {a}_i\) gives \(\Theta (d)\) bits of information about \(\mathbf {A}\), and k successful vectors will only have about \(\mathcal {O}(kd)\) bits of information about \(\mathbf {A}\), perhaps only observing k additional rows of \(\mathbf {A}\) is enough to win the game. Slightly more concretely, why can’t the Player store some \(M \ll kd\) bits of information about \(\mathbf {A}\), such that by subsequently strategically requesting \(m\ll n\) rows of \(\mathbf {A}\) she now knows k vectors which are approximately orthogonal to the rows of \(\mathbf {A}\)? Our result guarantees that such a strategy is not possible.
We now give some intuition for our analysis. The underlying idea is to calculate the mutual information between \(\mathbf {A}\) and \(\mathbf {Y}\) conditioned on a fixed random string R and the information \(\mathbf {G}\) gleaned by the Player (in our proof we slightly augment the matrix \(\mathbf {G}\) and condition on that, but we ignore that detail for this proof sketch). We will use the notation \(I(\mathbf {X}; \mathbf {Y} \vert \mathbf {Z})\) to denote the mutual information between \(\mathbf {X}\) and \(\mathbf {Y}\) conditioned on \(\mathbf {Z}\) and \(H(\mathbf {X})\) to denote the entropy of \(\mathbf {X}\) (see [22]). Our analysis relies on computing a suitable upper bound and lower bound for \(I(\mathbf {A}; \mathbf {Y} \vert \mathbf {G}, R)\). We begin with the upper bound. We first argue that \(\mathbf {Y}\) can be computed just based on \(\mathbf {G}, \mathsf {Message}\) and R (without using \(\mathbf {X}\)),
\begin{align*} \mathbf {Y} = g(\mathbf {G}, \mathsf {Message}, R), \quad \text{for some function } g. \end{align*}
The proof of this statement relies on the fact that the Player’s strategy is deterministic conditioned on the random string R. Now using the data processing inequality [22] we can say,
\begin{equation*} I(\mathbf {A}; \mathbf {Y} \vert \mathbf {G}, R) \le I(\mathbf {A}; \mathbf {G}, \mathsf {Message}, R \vert \mathbf {G}, R) = I(\mathbf {A}; \mathsf {Message}\vert \mathbf {G}, R). \end{equation*}
Next, we use the definition of mutual information, the fact that conditioning only reduces entropy, and Shannon’s source coding theorem [22] to write,
\[\begin{gather} I(\mathbf {A}; \mathsf {Message}\vert \mathbf {G}, R) \le H(\mathsf {Message}\vert \mathbf {G}, R) \le H(\mathsf {Message}) \le M\\ \Rightarrow I(\mathbf {A}; \mathbf {Y} \vert \mathbf {G}, R) \le M. \end{gather}\]
(3.1)
On the other hand, we argue that if \(\mathbf {Y}\) is successful in the Orthogonal Vector Game with probability at least \(1/3\), then there must be a larger amount of mutual information between \(\mathbf {Y}\) and \(\mathbf {A}\) after conditioning on \(\mathbf {G}\) and R. Recall that \(\mathbf {Y}\) must contain k robustly linearly independent vectors in order to win the Orthogonal Vector Game; assume for now that the rows of \(\mathbf {Y}\) are also orthonormal. Then, we may use the fact that the hypercube is a memory-sensitive base (Definition 2.4) to derive that for some constant \(c \gt 0\)
\begin{align} \left| \left\lbrace \mathbf {a} \in \lbrace \pm 1\rbrace ^d \textrm { s.t. } \left\Vert {\mathbf {Y} \mathbf {a}}\right\Vert _\infty \le d^{-4}\right\rbrace \right| \le 2^{d-ck}. \end{align}
(3.2)
Since \(\mathbf {G}\) contains m rows of \(\mathbf {A}\), only the remaining \((n-m)\) rows of \(\mathbf {A}\) are random after conditioning on \(\mathbf {G}\). Assume for now that the remaining \((n-m)\) rows are independent of \(\mathbf {G}\) (as would be true under the Index-Oracle model defined earlier in this section). Then for these remaining rows, we use Equation (3.2) to bound their entropy conditioned on \(\mathbf {Y}\):
\begin{align} I(\mathbf {A}; \mathbf {Y}\vert \mathbf {G},R) & = H(\mathbf {A}\vert \mathbf {G}, R) - H(\mathbf {A}\vert \mathbf {Y}, \mathbf {G}, R)\\ &\ge (n-m)d-(n-m)(d-ck)= c(n-m)k= c(d/2-m)k, \end{align}
(3.3)
where we chose \(n = d/2\). Now combining Equations (3.1) and (3.3), we obtain a lower bound on the query complexity, \(m \ge (d/2) - M/(c k)\). Thus if \(M \ll cdk\), then \(m=\Omega (d)\).
While this proof sketch contains the skeleton of our proof, as also hinted in the sketch it oversimplifies many points. For example, note that the definition of a memory-sensitive base in Definition 2.4 requires that \(\mathbf {Y}\) be orthonormal to derive Equation (3.2), but successful vectors only satisfy a robust linear independence property. Additionally, in deriving Equation (3.3), we implicitly assumed that \(\mathbf {G}\) does not reduce the entropy of the rows of \(\mathbf {A}\) which are not contained in it. This is not true for the Orthogonal Vector Game, since every response \(\mathbf {g}_i\) of the oracle is a subgradient of \(\left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty\) and, therefore, depends on all the rows of the matrix \(\mathbf {A}\). Our full proof which handles these issues and dependencies appears in Section 5. As noted before, our lower bound holds not only for the subgradient oracle, but also when the oracle response \(\mathbf {g}_i\) can be an arbitrary (possibly randomized) function from \(\mathbb {R}^d\rightarrow \lbrace \mathbf {a}_1, \dots , \mathbf {a}_n\rbrace\) (for example, the model where the Player can query any row \(\mathbf {a}_i\) of the matrix \(\mathbf {A}\)). We state the lower bound for the Orthogonal Vector Game which is the main result from Section 5 below.
Theorem 3.3.
Suppose for a memory-sensitive base \(\left\lbrace B_d \right\rbrace _{d = 1}^{\infty }\) with constant \(c_{B}\gt 0\), \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\). Given \(\mathbf {A}\) let the oracle response \(\mathbf {g}_i\) to any query \(\mathbf {x}_i \in \mathbb {R}^d\) be any (possibly randomized) function from \(\mathbb {R}^d\rightarrow \lbrace \mathbf {a}_1, \dots , \mathbf {a}_{d/2}\rbrace\), where \(\mathbf {a}_i\) is the \({i}^{\textrm {th}}\) row of \(\mathbf {A}\) (note this includes the subgradient response in the Orthogonal Vector Game). Assume \(M \ge d \log (2d)\) and set \(k=\left\lceil 60M/(c_{B}d) \right\rceil\). For these values of k and M, if the Player wins the Orthogonal Vector Game with probability at least \(1/3\), then \(m\ge d/5\).

3.3 Putting Things Together: Proof of Main Theorem

Here we combine the results of the previous subsections to prove Theorem 2.11. Very roughly, by Lemma 3.1, any algorithm for finding an \(\epsilon ^\star\)-optimal point with probability \(2/3\) can be used to win the Orthogonal Vector Game with probability \(1/3\). Therefore, combining Theorem 3.3 and Lemma 3.1, and noting \(k \approx M/d\), any algorithm for finding an \(\epsilon\)-optimal point with probability \(2/3\) must make at least \(m \lfloor N/(k+1)\rfloor \ge c Nd^2/M\) queries.
Proof of Theorem 2.11
We take \(\epsilon =1/(20\sqrt {N})\) throughout the proof. By [72, Theorem 5], any algorithm for finding an \(\epsilon\)-optimal point for an L-Lipschitz function in the unit ball must use memory at least \(d\log (\frac{L}{2\epsilon })\). Therefore, since \(\epsilon \le L/(4\sqrt {d})\), we may assume without loss of generality that \(M \ge d\log (\frac{L}{2\epsilon }) \ge d\log (2\sqrt {d})\ge (d/2)\log (4d)\). Therefore, by Theorem 3.3, if the Player wins the Orthogonal Vector Game with failure probability at most \(2/3\), then the number of queries m satisfies \(m\ge d/5\). By Lemma 3.1, any algorithm for finding an \(\epsilon\)-optimal point with a failure probability at most \(1/3\) can be used to win the Orthogonal Vector Game with a probability of at least \(1/3\). Therefore, combining Theorem 3.3 and Lemma 3.1, any algorithm for finding an \(\epsilon\)-optimal point with failure probability \(1/3\) must make at least
\begin{align*} m \lfloor N/(k+1)\rfloor \ge \frac{c_{B}N d^2}{(5 \cdot 60) M} \end{align*}
queries. Therefore, if \(\epsilon = 1/(20\sqrt {N})\) and \(c = c_{B}/(2\cdot 20\cdot 5\cdot 3\cdot 20^2)\ge c_{B}/10^6\), any memory-constrained algorithm requires at least \(c d^2/(\epsilon ^2 M)\) first-order queries. □

4 The Hypercube and Nemirovski Function Class Are Memory-Sensitive

In this section, we prove that the hypercube and the Nemirovski function class have the stated memory-sensitive properties. The proofs will rely on a few concentration bounds stated and proved in Appendix A.1.

4.1 The Hypercube is a Memory-Sensitive Base

Here we prove that the hypercube is a memory-sensitive base. Recall Theorem 2.5.
Theorem 2.5.
Recall that \(\mathcal {H}_d = \left\lbrace \pm 1 \right\rbrace ^d\) denotes the hypercube. The sequence \(\left\lbrace \mathcal {H}_d \right\rbrace _{d = d_0}^{\infty }\) is a \((k,c_{\mathcal {H}})\) memory-sensitive base for all \(k\in [d]\) with some absolute constant \(c_{\mathcal {H}}\gt 0\).
Proof of Theorem 2.5
First observe that for any \(\mathbf {h}\in \mathcal {H}_d\), \(\left\Vert {\mathbf {h}}\right\Vert _2 = \sqrt {d}\). Next let \(\mathbf {Z}= (\mathbf {z}_1, \dots , \mathbf {z}_k) \in \mathbb {R}^{d \times k}\) be k orthonormal d-dimensional vectors and let \(\mathbf {h}\sim \mathsf {Unif}(\mathcal {H}_d)\). Note that each coordinate of \(\mathbf {h}\) is sub-Gaussian with sub-Gaussian norm bounded by 2 and \(\mathbb {E}[\mathbf {h}\mathbf {h}^{\top }]=\mathbf {I}_d\). By Lemma A.6 in Appendix A.1, there is an absolute constant \(c_{\textrm {hw}}\gt 0\) such that for any \(s \ge 0\),
\begin{equation*} \mathbb {P} \left(k- \left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _2^2 \gt s \right) \le \exp \left(- c_{\textrm {hw}}\min \left\lbrace \frac{s^2}{16 k}, \frac{s}{4} \right\rbrace \right). \end{equation*}
Taking \(s = k/2\) we observe,
\begin{equation*} \mathbb {P} \left(\left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _2 \lt \sqrt {k/2} \right) \le \mathbb {P} \left(k- \left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _2^2 \gt k/ 2 \right) \le \exp \left(- c_{\textrm {hw}}k/ 64 \right). \end{equation*}
Noting that \(\left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _2 \le \sqrt {k} \left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _\infty\) we must have that for any \(t \le 1/\sqrt {2}\),
\begin{equation*} \mathbb {P} \left(\left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _\infty \le t \right) \le \exp \left(- c_{\textrm {hw}}k/ 64 \right) \le 2^{ - c k}, \end{equation*}
for \(c = c_{\textrm {hw}}/(64 \log _e 2)\). □
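As a quick numerical sanity check on Theorem 2.5 (ours, not part of the proof), the sketch below estimates how often \(\left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _\infty\) falls below the threshold \(t \le 1/\sqrt {2}\) for a random hypercube vector \(\mathbf {h}\); the dimension, the number of columns of \(\mathbf {Z}\), and the sample size are arbitrary choices.
```python
# Empirical check that ||Z^T h||_inf is rarely small when h ~ Unif({+-1}^d)
# and Z has k orthonormal columns, as quantified in Theorem 2.5.
import numpy as np

rng = np.random.default_rng(1)
d, k, trials = 256, 40, 20000
t = 1.0 / np.sqrt(2)

Z, _ = np.linalg.qr(rng.standard_normal((d, k)))   # k orthonormal columns
H = rng.choice([-1.0, 1.0], size=(trials, d))      # h ~ Unif({+-1}^d)
inf_norms = np.abs(H @ Z).max(axis=1)              # ||Z^T h||_inf per trial

print("empirical P(||Z^T h||_inf <= t) =", np.mean(inf_norms <= t))
# The theorem bounds this probability by 2^(-c k) for an absolute constant
# c > 0, so at these parameters the empirical frequency should be zero.
```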

4.2 The Nemirovski Function Class is Memory-Sensitive

Recall that we define \(\mathcal {N}_{N, \gamma }\) to be the distribution of \((f, \mathbf {g}_f)\) where, for \(i \in [N]\), we draw \(\mathbf {v}_i \overset{\text{i.i.d.}}{\sim } (1/\sqrt {d}) \mathsf {Unif} \left(\mathcal {H}_d \right)\) and set
\begin{equation*} f(\mathbf {x}) := \max _{i \in [N]} \mathbf {v}_i^{\top } \mathbf {x}- i \gamma \quad \textrm { and } \quad \mathbf {g}_f(\mathbf {x}) = \mathbf {v}_{i_{\min }} \quad \textrm { for } \quad i_{\min } := \min \left\lbrace i \in [N] \textrm { s.t. } \mathbf {v}_i^{\top } \mathbf {x}- i \gamma = f(\mathbf {x}) \right\rbrace . \end{equation*}
We will show that for \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\), where \(c=400\) is a fixed constant, \(\eta = d^5\), and \(\rho = 1\), if \(N \le (1/(32 \gamma))^{2/3}\) then \(\mathcal {N}_{N, \gamma }\) is a \((d^6, N, k, 1/(20 \sqrt {N}))\)-memory-sensitive class. To show the required properties, we define the Resisting oracle, which iteratively constructs the Nemirovski function. The Resisting oracle will be defined via a sequence of N Phases, where the \({j}^{\textrm {th}}\) Phase ends and the next begins when the algorithm makes a query which reveals the \({j}^{\textrm {th}}\) Nemirovski vector. We begin with some definitions for these Phases.
Definition 4.1.
For a fixed \(\mathbf {A}\) and vectors \(\lbrace \mathbf {v}_i, i \in [j]\rbrace\), the Nemirovski-induced function for Phase j is
\begin{equation*} F_{ \mathcal {N}, \mathbf {A}}^{(j)} (\mathbf {x}) := \max \left\lbrace \max _{i \in [j]}\left(\mathbf {v}_i^{\top } \mathbf {x}- i \gamma \right), \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho \right\rbrace \qquad \textrm { with } \qquad F_{ \mathcal {N}, \mathbf {A}}^{(0)} (\mathbf {x}) := \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho . \end{equation*}
Analogous to Definition 2.9, we define the following subgradient oracle,
\begin{align*} \mathbf {g}_f^{(j)}(\mathbf {x}) := \mathbf {v}_{k_{\textrm {min}}}, \textrm { where } k_{\textrm {min}} := \min \left\lbrace i \in [j] \textrm { s.t. } \mathbf {v}_i^{\top } \mathbf {x}- i \gamma = F_{ \mathcal {N}, \mathbf {A}}^{(j)}(\mathbf {x}) \right\rbrace . \end{align*}
Following Definition 2.6, this defines a subgradient oracle \(\mathbf {g}_{F}^{(j)}(\mathbf {x})\) for the induced function \(F_{ \mathcal {N}, \mathbf {A}}^{(j)} (\mathbf {x})\).
\begin{align*} \mathbf {g}_{F}^{(j)}(\mathbf {x}) := {\left\lbrace \begin{array}{ll} \mathbf {g}_{\mathbf {A}}(\mathbf {x}), & \textrm {if } \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho \ge f(\mathbf {x}), \\ \mathbf {g}_f^{(j)}(\mathbf {x}), & \textrm {otherwise.} \end{array}\right.} \end{align*}
The first-order oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(j)}}\) which returns the pair \((F_{ \mathcal {N}, \mathbf {A}}^{(j)}(\mathbf {x}), \mathbf {g}_{F}^{(j)}(\mathbf {x}))\) will be used by the Resisting oracle in Phase j. With these definitions in place, we can now define the Resisting oracle in Algorithm 3.
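For intuition, here is a minimal sketch (ours, not Algorithm 3 itself) of how the phase-j induced oracle of Definition 4.1 could be evaluated. The subgradient \(\mathbf {g}_{\mathbf {A}}\) of the \(\eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\) term is not spelled out in this section, so the row and sign choice below is an assumption, as are all parameter values in the usage example.
```python
# Sketch of the phase-j induced first-order oracle (F^{(j)}, g_F^{(j)}).
import numpy as np

def induced_oracle(x, A, V, j, eta, rho, gamma):
    """Return (F^{(j)}(x), subgradient) for the Nemirovski-induced function."""
    wall = eta * np.max(np.abs(A @ x)) - rho            # eta*||Ax||_inf - rho
    i_star = int(np.argmax(np.abs(A @ x)))
    g_wall = eta * np.sign(A[i_star] @ x) * A[i_star]   # assumed subgradient of the wall term
    if j == 0:
        return wall, g_wall
    nem_vals = np.array([V[i] @ x - (i + 1) * gamma for i in range(j)])  # v_i^T x - i*gamma
    if wall >= nem_vals.max():                          # the ||Ax||_inf term is active
        return wall, g_wall
    k_min = int(np.argmax(nem_vals))                    # argmax returns the first maximizer
    return nem_vals.max(), V[k_min]

# Illustrative usage; gamma is an arbitrary small value here, whereas the
# construction takes gamma = sqrt(c*k*log(d)/d) in much higher dimension.
rng = np.random.default_rng(2)
d, n, N = 32, 16, 5
eta, rho, gamma = float(d) ** 5, 1.0, 0.05
A = rng.choice([-1.0, 1.0], size=(n, d))
V = rng.choice([-1.0, 1.0], size=(N, d)) / np.sqrt(d)   # v_i ~ (1/sqrt(d)) Unif(hypercube)
x = rng.standard_normal(d); x /= np.linalg.norm(x)
value, grad = induced_oracle(x, A, V, j=3, eta=eta, rho=rho, gamma=gamma)
```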
As we prove in the following simple lemma, for any fixed algorithm the distribution over the responses from \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\) is the same as the distribution over the responses from the oracle \(\mathcal {N}_{N, \gamma , \mathbf {A}}\) for the Nemirovski function defined in Definition 2.9. Therefore, it will be sufficient to analyze the Resisting oracle and \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\).
Lemma 4.2.
The vectors \(\lbrace \mathbf {v}_i, i \in [N]\rbrace\) for the final induced oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}(\mathbf {x})=(F_{ \mathcal {N}, \mathbf {A}}^{(N)}(\mathbf {x}), \mathbf {g}_{F}^{(N)}(\mathbf {x}))\) are sampled from the same distribution as the vectors \(\lbrace \mathbf {v}_i, i \in [N]\rbrace\) for the Nemirovski function \(\mathcal {N}_{N, \gamma , \mathbf {A}}\) (Definition 2.9). Moreover, for the same setting of the vectors \(\lbrace \mathbf {v}_i, i \in [N]\rbrace\), any fixed matrix \(\mathbf {A}\) and any fixed algorithm, the responses of both these oracles are identical.
Proof.
The proof follows from definitions. For the first part, note that the Nemirovski vectors in Algorithm 3 are sampled at random independent of the Algorithm’s queries, and from the same distribution as in Definition 2.9. For the second part, note that by definition, for a fixed set of vectors \(\lbrace \mathbf {v}_i, i \in [N]\rbrace\) and matrix \(\mathbf {A}\), the first-order oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\) is identical to \(\mathcal {N}_{N, \gamma , \mathbf {A}}\) from Definition 2.9. □
Our main goal now is to prove the following two statements: (1) with high probability, the Resisting oracle is consistent with itself, i.e., its outputs are the same as the outputs of \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}=(F_{ \mathcal {N}, \mathbf {A}}^{(N)}(\mathbf {x}), \mathbf {g}_{F}^{(N)}(\mathbf {x}))\); (2) the Resisting oracle has the properties required of a memory-sensitive class. To prove these statements, we define an event E in Definition 4.3, and then argue that the failure probability of E is at most \(1/d\). Event E allows us to argue both that the Resisting oracle is identical to \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\) and that \(\mathcal {N}_{N, \gamma }\) is a memory-sensitive class.
Definition 4.3.
Let E denote the event that \(\forall j \in [N]\), the Nemirovski vector \(\mathbf {v}_j\) has the following properties:
(1)
For any query \(\mathbf {x}\) submitted by the Algorithm in Phases 1 through j, \(|\mathbf {x}^\top \mathbf {v}_j| \le \sqrt {\frac{10\log (d)}{d}}\).
(2)
\(\Vert {\mathrm{Proj}_{S_{j}}(\mathbf {v}_{j})}\Vert _2 \le \sqrt {\frac{30 k\log (d) }{d}}\) where \(S_j\) is defined as in Definition 2.8.
Lemma 4.4.
Recall the definition of event E in Definition 4.3. Suppose that for every Phase \(j\in [N]\), the Algorithm makes at most \(d^2\) queries in that Phase. Then \(\Pr [E]\ge 1 - (1/d)\).
Proof.
We prove that for any fixed \(j \in [N]\), the failure probability of event E in Phase j is at most \(1/d^2\), and then apply a union bound over all \(j \in [N]\) (where \(N \le d\)). Define \(T_j\) to be the sequence of the next \(d^2\) vectors the Algorithm would query in the \({j}^{\textrm {th}}\) Phase, where the next query is made if the previous query does not get the new Nemirovski vector \(\mathbf {v}_{j}\) as the response (i.e., the previous query does not end the \({j}^{\textrm {th}}\) Phase). Since we can think of the Algorithm’s strategy as deterministic conditioned on its memory state and some random string R (as in Definition 2.2), the sequence \(T_j\) is fully determined by the function \(F_{ \mathcal {N}, \mathbf {A}}^{(j-1)}\), R and the state of the Algorithm when it receives \(\mathbf {v}_{j-1}\) as the response from the Resisting oracle at the end of the \({(j-1)}^{\textrm {th}}\) Phase. Critically, this sequence is independent of \(\mathbf {v}_j\). Separately, the Resisting oracle samples \(\mathbf {v}_j \overset{\text{i.i.d.}}{\sim } ((1/\sqrt {d}) \mathsf {Unif} (\mathcal {H}_d))\), and, therefore, \(\mathbf {v}_j\) is sampled independently of all these \(d^2\) queries in the set \(T_j\).
The proof now follows via simple concentration bounds. First consider property (1) of event E. By the argument above we see that for any vector \(\mathbf {x}\) which is in the set \(T_j\) or was submitted by the Algorithm in any of the previous \((j-1)\) Phases, by Corollary A.3, \(\mathbf {x}^{\top } \mathbf {v}_j\) is a zero-mean random variable with sub-Gaussian parameter at most \(\left\Vert {\mathbf {x}}\right\Vert _2/\sqrt {d} \le 1/\sqrt {d}\). Then by Fact A.4, for any \(t \ge 0\),
\begin{equation*} \mathbb {P} \left(\left| \mathbf {x}^{\top } \mathbf {v}_j \right| \ge t \right) \le 2 \exp (-d t^2/2). \end{equation*}
We remark that to invoke Corollary A.3, we do not condition on whether or not vector \(\mathbf {x}\) will ultimately be queried by the Algorithm (this is critical to maintain the independence of \(\mathbf {x}\) and \(\mathbf {v}_j\)). Picking \(t = \sqrt {10 \log (d)/d}\) we have with failure probability at most \(2/d^5\),
\begin{equation*} \left| \mathbf {x}^{\top } \mathbf {v}_{j} \right| \le \sqrt {\frac{10\log (d)}{d}}. \end{equation*}
Since the Algorithm makes at most \(d^2\) queries in any Phase, there are at most \(Nd^2\) vectors in the union of the set \(T_j\) of possible queries which the Algorithm makes in the \({j}^{\textrm {th}}\) Phase, and the set of queries submitted by the Algorithm in any of the previous Phases. Therefore, by a union bound, property (1) of event E is satisfied by \(\mathbf {v}_{j}\) with failure probability at most \(2Nd^2/d^5\le 2/d^2\), where we used the fact that \(N\le d\).
We now turn to property (2) of event E. Note that \(S_{j}\) depends on \(\mathbf {x}_{t_{j}}\), which is the first query for which the Resisting oracle returns \(\mathbf {v}_j\). However, using the same idea as above, we consider the set \(T_j\) of the next \(d^2\) queries which the Algorithm would make, and then do a union bound over every such query. For any query \(\mathbf {x}\in T_j\), consider the subspace
\begin{equation*} S_{j-1,\mathbf {x}} := \mathsf {span}\left(\lbrace \mathbf {x}_{t_i}: j-k\le i \lt j, i\gt 0 \rbrace , \mathbf {x}\right). \end{equation*}
Note that \(S_j = S_{j-1, \mathbf {x}_{t_j}}\). Recalling the previous argument, \(\mathbf {v}_j\) is independent of all vectors defining the above subspace \(S_{j-1,\mathbf {x}}\). Therefore, by Corollary A.7 there is an absolute constant c such that with failure probability at most \(c/d^5\),
\begin{equation*} \left\Vert {\mathrm{Proj}_{S_{j-1,\mathbf {x}}}(\mathbf {v}_j)}\right\Vert _2 \le \sqrt { \frac{30 k \log (d)}{d}}. \end{equation*}
Under the assumption that the Algorithm makes at most \(d^2\) queries in the \({j}^{\textrm {th}}\) Phase, the successful query \(\mathbf {x}_{t_{j}}\) must be in the set \(T_j\). Therefore, by a union bound over the \(d^2\) queries which the Algorithm submits,
\begin{equation*} \left\Vert { \mathrm{Proj}_{S_{j}} \left(\mathbf {v}_{j} \right) }\right\Vert _2 \le \sqrt {\frac{30 k\log (d) }{d}}, \end{equation*}
with failure probability at most \(c/d^{3} \le 1/d^2\) (for d large enough). Therefore, both properties (1) and (2) are satisfied for vector \(\mathbf {v}_j\) with failure probability at most \(1/d^{2}\) and thus by a union bound over all N Nemirovski vectors \(\mathbf {v}_j\), E happens with failure probability at most \(1/d\). □
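The concentration step behind property (1) of event E can also be seen empirically; the quick simulation below (ours, not part of the proof) uses an arbitrary dimension and number of trials.
```python
# For a fixed unit vector x and independent v ~ (1/sqrt(d)) Unif({+-1}^d),
# |x^T v| exceeds sqrt(10*log(d)/d) only with probability of order d^{-5}.
import numpy as np

rng = np.random.default_rng(4)
d, trials = 256, 20000
x = rng.standard_normal(d); x /= np.linalg.norm(x)          # a fixed query
V = rng.choice([-1.0, 1.0], size=(trials, d)) / np.sqrt(d)  # fresh Nemirovski vectors
threshold = np.sqrt(10 * np.log(d) / d)

print("empirical exceedance rate:", np.mean(np.abs(V @ x) > threshold))
# The sub-Gaussian tail bound used above is 2*exp(-d*t^2/2) = 2/d^5 at this
# threshold, so the empirical rate should be exactly zero at this sample size.
```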
Now that we have justified that event E has a small failure probability, we will argue that, conditioned on E, the Resisting oracle returns consistent first-order information. This is critical to our analysis because we prove the memory-sensitive properties through the lens of the Resisting oracle.
Lemma 4.5.
Under event E defined in Definition 4.3, the Resisting oracle’s responses are consistent with itself, i.e., for any query made by the Algorithm, the Resisting oracle returns an identical first-order oracle response to the final oracle \((F^{(N)}_{ \mathcal {N}, \mathbf {A}}(\mathbf {x}), \mathbf {g}^{(N)}_F(\mathbf {x}))\).
Proof.
Assume event E and fix any \(j \in [N-1]\). Let \(\mathbf {x}\) be any query the Algorithm submits sometime before the end of the \({j}^{\textrm {th}}\) Phase. Under event E for any \(j^{\prime } \gt j\),
\begin{equation*} |\mathbf {x}^\top \mathbf {v}_j| \le \sqrt {\frac{10\log (d)}{d}}, \quad |\mathbf {x}^\top \mathbf {v}_{j^{\prime }}| \le \sqrt {\frac{10\log (d)}{d}}. \end{equation*}
Therefore, since \(\gamma =\sqrt {ck\log (d)/d}\), for any \(c\gt 40\) and any \(j^{\prime }\gt j\),
\begin{equation*} \mathbf {x}^\top (\mathbf {v}_{j^{\prime }}-\mathbf {v}_j) \lt \gamma . \end{equation*}
In particular, since \(j^{\prime }\gamma \ge (j+1)\gamma\), this implies that for \(j^{\prime } \gt j\)
\begin{equation*} \mathbf {x}^\top \mathbf {v}_j - j\gamma \gt \mathbf {x}^\top \mathbf {v}_{j^{\prime }} - j^{\prime }\gamma . \end{equation*}
Therefore, for all \(j^{\prime } \gt j\), \(\mathbf {v}_{j^{\prime }}\) is not the subgradient \(\mathbf {g}^{(N)}_F(\mathbf {x})\). This implies that in the \({j}^{\textrm {th}}\) Phase, the Resisting oracle always responds with the correct function value \(F^{(j)}_{ \mathcal {N}, \mathbf {A}}(\mathbf {x}) = F^{(N)}_{ \mathcal {N}, \mathbf {A}}(\mathbf {x})\) and subgradient \(\mathbf {g}^{(j)}_F(\mathbf {x}) = \mathbf {g}^{(N)}_F(\mathbf {x})\). □
Now we address the memory-sensitive properties of the function class by analyzing the Resisting oracle. Here, event E is crucial since it ensures that the Nemirovski vectors lie outside the span of previous queries which returned a unique gradient.
Remark 4.6.
Recall from Definition 2.7 that \(\mathbf {x}_{t_{j}}\) is the query which returns the \({j}^{\textrm {th}}\) informative subgradient from the function f. In terms of the Nemirovski function and the Resisting oracle, we note that \(\mathbf {x}_{t_{j}}\) is the query which ends Phase j of the Resisting oracle (the first query such that the Resisting oracle returns \(\mathbf {v}_j\) as the gradient in response).
Lemma 4.7.
Recall the definition of \(\mathbf {x}_{t_j}\) and \(S_{j}\) from Definitions 2.7 and 2.8 and that \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\). If \(c \ge 400\), then under event E from Definition 4.3,
\begin{equation*} \frac{\left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2}{\left\Vert {\mathbf {x}_{t_j} }\right\Vert _2 }\le 1 - (1/d^{2}). \end{equation*}
Proof.
We prove Lemma 4.7 by providing both an upper and lower bound on \(\mathbf {x}_{t_{j}}^{\top }(\mathbf {v}_{j}-\mathbf {v}_{j-1})\) under event E. Towards this end, observe that for any \(\mathbf {x}\in S_{j-1}\),
\begin{equation*} \left| \mathbf {x}^{\top } \mathbf {v}_{j-1} \right| \le \left\Vert {\mathbf {x}}\right\Vert _2 \left\Vert {\mathrm{Proj}_{S_{j-1}}\left(\mathbf {v}_{j-1} \right)}\right\Vert _2. \end{equation*}
Therefore, recalling that every query \(\mathbf {x}\) satisfies \(\left\Vert {\mathbf {x}}\right\Vert _2\le 1\), we see that under E,
\begin{equation*} \left| \mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})^{\top }\mathbf {v}_{j-1} \right| \le \sqrt {\frac{30 k\log (d)}{d}}. \end{equation*}
Next, writing \(\mathbf {x}_{t_j}\) as \(\mathbf {x}_{t_j} = \mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j}) + \mathrm{Proj}_{S_{j-1}^{\perp }}(\mathbf {x}_{t_j})\) and recalling that \(\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2 \le 1\) we observe that
\begin{equation*} \left\Vert {\mathrm{Proj}_{S_{j-1}^{\perp }}(\mathbf {x}_{t_j})}\right\Vert _2^2 = \left\Vert {\mathbf {x}_{t_j}}\right\Vert _2^2 - \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2^2 \le 1 - \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2^2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2^2. \end{equation*}
Recalling that \(\left\Vert {\mathbf {v}_{j-1}}\right\Vert _2 = 1\),
\begin{equation*} \left| \mathrm{Proj}_{S_{j-1}^{\perp }}(\mathbf {x}_{t_j})^{\top } \mathbf {v}_{j-1} \right| \le \left\Vert {\mathrm{Proj}_{S_{j-1}^{\perp }}(\mathbf {x}_{t_j})}\right\Vert _2 \left\Vert {\mathbf {v}_{j-1}}\right\Vert _2 \le \left(1 - \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2^2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2^2\right)^{1/2}. \end{equation*}
We also note that under event E,
\begin{equation*} |\mathbf {x}_{t_j}^\top \mathbf {v}_j| \le \sqrt {\frac{10\log (d)}{d}}. \end{equation*}
Therefore,
\begin{align*} \mathbf {x}_{t_{j}}^{\top }(\mathbf {v}_{j}-\mathbf {v}_{j-1}) & \le \left| \mathbf {x}_{t_{j}}^{\top }\mathbf {v}_{j} \right| + \left| \mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})^{\top }\mathbf {v}_{j-1} \right| + \left| \mathrm{Proj}_{S_{j-1}^{\perp }}(\mathbf {x}_{t_j})^{\top } \mathbf {v}_{j-1} \right| \\ & \le \sqrt {\frac{10\log (d)}{d}} + \sqrt {\frac{30 k\log (d) }{d}} + \sqrt {1 - \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2^2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2^2} \\ & \le \sqrt {\frac{80 k\log (d) }{d}} + \sqrt {1 - \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2^2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2^2}. \end{align*}
Note that \(\mathbf {v}_j\) is an informative gradient and, therefore, \(\mathbf {v}_j \in \partial \mathcal {N}(\mathbf {x}_{t_j})\). Thus,
\begin{align*} \mathbf {x}_{t_{j}}^{\top }\mathbf {v}_{j-1}- (j-1) \gamma &\le \mathbf {x}_{t_j}^{\top }\mathbf {v}_{j} - j \gamma , \qquad \textrm { or equivalently} \qquad \mathbf {x}_{t_j}^{\top }\left(\mathbf {v}_j - \mathbf {v}_{j-1} \right) \ge \gamma . \end{align*}
Therefore,
\begin{align*} \sqrt {\frac{80 k\log d }{d}} + \sqrt {1 - \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2^2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2^2} &\ge \gamma \\ \Rightarrow \sqrt {1 - \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2^2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2^2} &\ge \gamma -\sqrt {\frac{80 k\log d }{d}}, \\ \Rightarrow 1 - \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2^2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2^2 &\ge \left(\gamma -\sqrt {\frac{80 k\log d }{d}}\right)^2, \end{align*}
where the last step uses the fact that \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\) for \(c\ge 400\). Rearranging terms and using that \((1 - x)^{1/2} \le 1 - (1/4)x\) for \(x \ge 0\) gives us
\begin{equation*} \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2 \le \sqrt {1 - \left(\gamma - \sqrt {\frac{80 k\log d }{d}} \right)^2} \le 1 - \frac{1}{4} \left(\gamma - \sqrt {\frac{80 k\log d }{d}} \right)^2. \end{equation*}
Plugging in \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\), for \(c\ge 400\) gives us the claimed result for d large enough,
\begin{equation*} \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert _2 \le 1 - \frac{1}{4} \frac{k\log d}{d} \left(\sqrt {400} - \sqrt {80} \right)^2 \le 1 - \frac{1}{d^2}. \end{equation*}
 □
We next turn our attention to the optimality achievable by an algorithm which has seen a limited number of Nemirovski vectors. Lemma 4.8 lower bounds the best function value achievable by such an algorithm.
Lemma 4.8.
Recall that \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\) and suppose \(c \gt 10\). Assume the Algorithm has proceeded to Phase r of the Resisting oracle and chooses to not make any more queries. Let Q be the set of all queries made so far by the Algorithm. Then under event E defined in Definition 4.3,
\begin{equation*} \min _{ \mathbf {x}\in Q} F_{ \mathcal {N}, \mathbf {A}}^{(N)}(\mathbf {x}) \ge -(r+1)\gamma . \end{equation*}
Proof.
We first note that
\begin{equation*} \min _{ \mathbf {x}\in Q} F_{ \mathcal {N}, \mathbf {A}}^{(N)}(\mathbf {x}) \ge \min _{ \mathbf {x}\in Q} F_{ \mathcal {N}, \mathbf {A}}^{(r)}(\mathbf {x}). \end{equation*}
Now under event E, all queries \(\mathbf {x}\in Q\) made by the Algorithm satisfy,
\begin{equation*} |\mathbf {x}^\top \mathbf {v}_r| \le \sqrt {\frac{10\log (d)}{d}}. \end{equation*}
Assuming event E, by Lemma 4.5, whenever the Resisting oracle responds to a query with some Nemirovski vector as the subgradient, that vector must be some \(\mathbf {v}_i\) with \(i \lt r\); since the returned subgradients are valid subgradients of the final function \(F_{ \mathcal {N}, \mathbf {A}}^{(N)}\), the Nemirovski vector \(\mathbf {v}_r\) could not have been a valid subgradient of the query. Therefore, all queries \(\mathbf {x}\in Q\) must satisfy,
\begin{align*} F_{ \mathcal {N}, \mathbf {A}}^{(r)}(\mathbf {x}) \ge \mathbf {x}^\top \mathbf {v}_r -r\gamma \ge -\sqrt {\frac{10\log (d)}{d}} -r\gamma \ge -(r+1)\gamma , \end{align*}
where in the last inequality we use the fact that \(\gamma \ge \sqrt {\frac{10\log (d)}{d}}\). □
Lemma 4.9 upper bounds the minimum value of the final function \(F_{ \mathcal {N}, \mathbf {A}}^{(N)} (\mathbf {x})\), and will be used to establish an optimality gap later.
Lemma 4.9.
Recall that \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\) and suppose \(c \gt 3\). For any \(\mathbf {A}\in \mathbb {R}^{n\times d}\) where \(n\le d/2\) and sufficiently large d, with failure probability at most \(2/d\) over the randomness of the Nemirovski vectors \(\lbrace \mathbf {v}_j, j \in [N]\rbrace\),
\begin{equation} \min _{\left\Vert {\mathbf {x}}\right\Vert _2 \le 1} F_{ \mathcal {N}, \mathbf {A}}^{(N)} (\mathbf {x}) \le - \frac{1}{8 \sqrt {N}}. \end{equation}
(4.1)
Proof.
Let the rank of \(\mathbf {A}\) be r. Let \(\mathbf {Z}\in \mathbb {R}^{d \times (d-r)}\) be a matrix whose columns form an orthonormal basis for the null space \(A^{\perp }\) of \(\mathbf {A}\). We construct a vector \(\overline{\mathbf {x}}\) which (as we will show) attains \(F_{ \mathcal {N}, \mathbf {A}}(\overline{\mathbf {x}}) \le -1/(8\sqrt {N})\) as follows:
\begin{equation*} \overline{\mathbf {x}}= \frac{-1}{2 \sqrt {N}} \sum _{i \in [N]} \mathrm{Proj}_{A^{\perp }}(\mathbf {v}_{i}) = \frac{-1}{2 \sqrt {N}} \sum _{i \in [N]} \mathbf {Z}\mathbf {Z}^{\top } \mathbf {v}_i = \mathbf {Z}\mathbf {Z}^{\top } \left(\frac{-1}{2 \sqrt {N}}\sum _{i \in [N]} \mathbf {v}_i \right). \end{equation*}
Our proof proceeds by providing an upper bound for \(F_{ \mathcal {N}, \mathbf {A}}(\overline{\mathbf {x}})\). First we bound \(\left\Vert {\overline{\mathbf {x}}}\right\Vert _2^2\). Let \(\bar{\mathbf {v}} = (\frac{-1}{2 \sqrt {N}}\sum _{i \in [N]} \mathbf {v}_i)\). Note that
\begin{equation*} \left\Vert {\overline{\mathbf {x}}}\right\Vert _2^2 = \left\Vert {\mathbf {Z}^{\top } \bar{\mathbf {v}}}\right\Vert _2^2. \end{equation*}
By Fact A.2, each coordinate of \(\bar{\mathbf {v}}\) is sub-Gaussian with sub-Gaussian norm \(\alpha /\sqrt {d}\) (where \(\alpha\) is the absolute constant in Fact A.2). Also \(\mathbb {E}[\bar{\mathbf {v}} \bar{\mathbf {v}}^{\top }]=(1/4d)\mathbf {I}_d\). Therefore, by Lemma A.6, with failure probability at most \(2 \exp (-c_1d)\) for some absolute constant \(c_1\gt 0\),
\begin{equation*} \left\Vert {\overline{\mathbf {x}}}\right\Vert _2^2 \le \frac{d-r}{4d}+ \frac{1}{2} \le 1. \end{equation*}
Therefore, \(\overline{\mathbf {x}}\) lies in the unit ball with failure probability at most \(2 \exp (-c_1d)\). Next fix any \(\mathbf {v}_j \in \left\lbrace \mathbf {v}_i \right\rbrace _{i \in [N]}\) and consider
\begin{align} \mathbf {v}_j^{\top } \overline{\mathbf {x}}& = \left(\mathbf {v}_j^{\top } \left(\mathbf {Z}\mathbf {Z}^{\top }\left(\frac{-1}{2 \sqrt {N}} \sum _{i \in [N], i \ne j}\mathbf {v}_i \right)\right) \right) - \frac{1}{2 \sqrt {N}} \mathbf {v}_j^{\top } \mathbf {Z}\mathbf {Z}^{\top } \mathbf {v}_j \\ & = \left((\mathbf {Z}\mathbf {Z}^{\top } \mathbf {v}_j)^{\top } \left(\frac{-1}{2 \sqrt {N}} \sum _{i \in [N], i \ne j}\mathbf {v}_i \right) \right) - \frac{1}{2 \sqrt {N}} \left\Vert {\mathbf {Z}^{\top }\mathbf {v}_j}\right\Vert _2^2. \end{align}
(4.2)
Let \(\bar{\mathbf {v}}_{-j} = (\frac{-1}{2 \sqrt {N}}\sum _{i \in [N], i \ne j} \mathbf {v}_i)\). By Fact A.2, each coordinate of \(\bar{\mathbf {v}}_{-j}\) is sub-Gaussian with sub-Gaussian parameter \(1/\sqrt {4d}\). Also by Fact A.2, \((\mathbf {Z}\mathbf {Z}^{\top } \mathbf {v}_j)^{\top }\bar{\mathbf {v}}_{-j}\) is sub-Gaussian with sub-Gaussian parameter \(\left\Vert {\mathbf {Z}\mathbf {Z}^{\top } \mathbf {v}_j}\right\Vert _2/\sqrt {4d}=\left\Vert {\mathbf {Z}^{\top } \mathbf {v}_j}\right\Vert _2/\sqrt {4d}\). Now using Fact A.4, with failure probability at most \(2/d^C\),
\begin{equation} \left| (\mathbf {Z}\mathbf {Z}^{\top } \mathbf {v}_j)^{\top }\bar{\mathbf {v}}_{-j} \right| \le \frac{\sqrt {2C \log (d)} \left\Vert {\mathbf {Z}^{\top } \mathbf {v}_j}\right\Vert _2}{\sqrt {4d}}. \end{equation}
(4.3)
We now turn to bounding \(\left\Vert {\mathbf {Z}^{\top } \mathbf {v}_j}\right\Vert _2\). Note that each coordinate of \(\mathbf {v}_j\) is sub-Gaussian with sub-Gaussian norm \(2/\sqrt {d}\) and \(\mathbb {E}[\mathbf {v}_j \mathbf {v}_j^{\top }]=(1/d) \mathbf {I}_d\). Therefore, by Lemma A.6, with failure probability at most \(2 \exp (-c_2d)\) for some absolute constant \(c_2\gt 0\),
\begin{equation} \frac{1}{4} \le \frac{d-r}{d}-1/4 \le \left\Vert {\mathbf {Z}^{\top } \mathbf {v}_j}\right\Vert _2^2 \le \frac{d-r}{d} +1/4 \le 2 , \end{equation}
(4.4)
where we have used the fact that \(r\le d/2\). Then if Equations (4.3) and (4.4) hold,
\begin{align} \left| (\mathbf {Z}\mathbf {Z}^{\top } \mathbf {v}_j)^{\top }\bar{\mathbf {v}}_{-j} \right| \le \sqrt {\frac{ {C \log (d)} }{ {d} }}. \end{align}
(4.5)
Moreover, under Equation (4.4),
\begin{equation} \frac{1}{2 \sqrt {N}} \left\Vert {\mathbf {Z}^{\top }\mathbf {v}_j}\right\Vert _2^2 \ge \frac{1}{8 \sqrt {N}}. \end{equation}
(4.6)
Combining Equations (4.2), (4.5), and (4.6), we find that with failure probability at most \(2/d^C +2 \exp (-c_2 d)\),
\begin{equation*} \mathbf {v}_j^{\top } \overline{\mathbf {x}}\le - \frac{1}{8 \sqrt {N}} + \sqrt {\frac{C \log (d)}{ {d}}}. \end{equation*}
Then, using a union bound we conclude that with failure probability at most \(N\left(2/d^C +2\exp (-c_2 d) \right)\),
\begin{equation*} \max _{j \in [N]} \mathbf {v}_j^{\top } \overline{\mathbf {x}}\le - \frac{1}{8 \sqrt {N}} + \sqrt {\frac{C \log (d)}{ {d}}}. \end{equation*}
Finally, since \(\eta \left\Vert {\mathbf {A}\overline{\mathbf {x}}}\right\Vert _\infty - \rho = - \rho = -1 \le -1/(8 \sqrt {N})\), we conclude
\begin{equation*} F_{ \mathcal {N}, \mathbf {A}}^{(N)} (\overline{\mathbf {x}}) \le -\frac{1}{8 \sqrt {N}} + \sqrt {\frac{C \log (d)}{ {d}}} - \gamma . \end{equation*}
We now take \(C = 3\), which gives us an overall failure probability of \(2 \exp (-c_1 d) + N\left(2/d^C +2 \exp (-c_2 d) \right) \le 2/d\) (for sufficiently large d and since \(N\le d\)). Noting that \(\gamma \gt \sqrt {3\log (d)/{d}}\) finishes the proof. □
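The construction of \(\overline{\mathbf {x}}\) above can be checked numerically. The sketch below (ours, with deliberately small and purely illustrative dimensions) verifies that \(\overline{\mathbf {x}}\) lies in the unit ball, is annihilated by \(\mathbf {A}\) (so the \(\eta \left\Vert {\mathbf {A}\overline{\mathbf {x}}}\right\Vert _\infty - \rho\) term contributes \(-\rho\) in exact arithmetic), and makes \(\max _j \mathbf {v}_j^{\top } \overline{\mathbf {x}}\) roughly \(-1/(4\sqrt {N})\), comfortably below \(-1/(8\sqrt {N})\).
```python
# Numerical illustration of the candidate point x_bar from Lemma 4.9.
import numpy as np

rng = np.random.default_rng(3)
d, n, N = 1000, 500, 10
A = rng.choice([-1.0, 1.0], size=(n, d))
V = rng.choice([-1.0, 1.0], size=(N, d)) / np.sqrt(d)       # Nemirovski vectors

_, _, Vt = np.linalg.svd(A, full_matrices=True)
Z = Vt[n:].T                       # orthonormal basis of null(A); assumes full row rank
x_bar = Z @ (Z.T @ (-V.sum(axis=0) / (2.0 * np.sqrt(N))))

print("||x_bar||_2       =", np.linalg.norm(x_bar))         # comfortably below 1
print("||A x_bar||_inf   =", np.max(np.abs(A @ x_bar)))     # zero up to rounding error
print("max_j v_j^T x_bar =", np.max(V @ x_bar))             # about -1/(4*sqrt(N))
print("-1/(8*sqrt(N))    =", -1.0 / (8.0 * np.sqrt(N)))
```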
The following simple lemma shows that all the Nemirovski vectors are unique with high probability, and will be useful for our definition of informative subgradients.
Lemma 4.10.
For sufficiently large d, with failure probability at most \(1/d\), \(\mathbf {v}_i \ne \mathbf {v}_j \; \forall \; i, j \in [N] , {i \ne j}\).
Proof.
The proof follows from the birthday paradox since \(\mathbf {v}_i\) are drawn from the uniform distribution over a support of size \(2^d\): the probability of \(\mathbf {v}_i \ne \mathbf {v}_j \; \forall \; i, j \in [N] , i \ne j\) is equal to,
\begin{align*} 1 \cdot \frac{2^d-1}{2^d} \cdot \frac{2^d-2}{2^d} \dots \frac{2^d-(N-1)}{2^d} \ge \left(\frac{2^d-d}{2^d}\right)^d = \left(1-\frac{d}{2^d}\right)^d. \end{align*}
For d sufficiently large we can write,
\begin{align*} \left(1-\frac{d}{2^d}\right)^d \ge e^{-2d^2/2^d} \ge 1-\frac{2d^2}{2^d}\ge 1-\frac{1}{d} \end{align*}
which completes the proof. □
Finally, we note that the orthogonality condition from the definition of a memory-sensitive class is satisfied with our definition of the subgradient oracle.
Lemma 4.11.
For any \(j \in [N]\) the following is true: For any \(\mathbf {x}\in \mathbb {R}^d\) such that \(F_{ \mathcal {N}, \mathbf {A}}^{(j)}(\mathbf {x}) \ne \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\), either \(\mathbf {g}_{F}^{(j)}(\mathbf {x}) = \mathbf {v}_1\) or \(\left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty /\left\Vert {\mathbf {x}}\right\Vert _2 \le 1/d^4\).
Proof.
Fix any \(j \in [N]\) and assume \(\mathbf {x}\) is such that \(F_{ \mathcal {N}, \mathbf {A}}^{(j)}(\mathbf {x}) \ne \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\). Note that since \(F_{ \mathcal {N}, \mathbf {A}}^{(j)}(\mathbf {x}) \ne \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\), \(\mathbf {g}_{F}^{(j)}(\mathbf {x})=\mathbf {v}_1\), or \(\mathbf {g}_{F}^{(j)}(\mathbf {x})=\mathbf {v}_k\) for some \(k \in \left\lbrace 2, \dots , j \right\rbrace\). We consider the case when \(\mathbf {g}_{F}^{(j)}(\mathbf {x})=\mathbf {v}_k\) for some \(k \in \left\lbrace 2, \dots , j \right\rbrace\). This implies that \(\eta \left\Vert {\mathbf {A} \mathbf {x}}\right\Vert _\infty - \rho \le \mathbf {v}_k^{\top } \mathbf {x}- k \gamma\). Observe that \(\mathbf {v}_k^{\top } \mathbf {x}- k \gamma \le \left\Vert {\mathbf {v}_k}\right\Vert _2 \left\Vert {\mathbf {x}}\right\Vert _2\) and \(\left\Vert {\mathbf {v}_k}\right\Vert _2 = 1\). Therefore,
\begin{equation*} \eta \left\Vert {\mathbf {A} \mathbf {x}}\right\Vert _\infty - \rho \le \left\Vert {\mathbf {x}}\right\Vert _2. \end{equation*}
Next we bound \(\left\Vert {\mathbf {x}}\right\Vert _2\) from below. For \(k \in \left\lbrace 2, \dots , j \right\rbrace\), \(\mathbf {g}_{F}^{(j)}(\mathbf {x}) = \mathbf {v}_k\) implies that
\begin{equation*} \mathbf {v}_k^{\top } \mathbf {x}- k \gamma \ge \mathbf {v}_{k-1}^{\top } \mathbf {x}- (k-1) \gamma \Rightarrow (\mathbf {v}_k - \mathbf {v}_{k-1})^{\top } \mathbf {x}\ge \gamma . \end{equation*}
Therefore,
\begin{equation*} \left\Vert {\mathbf {x}}\right\Vert _2 \ge \frac{(\mathbf {v}_k - \mathbf {v}_{k-1})^{\top }}{\left\Vert {\mathbf {v}_k - \mathbf {v}_{k-1}}\right\Vert _2} \mathbf {x}\ge \frac{\gamma }{\left\Vert {\mathbf {v}_k - \mathbf {v}_{k-1}}\right\Vert _2} \ge \frac{\gamma }{2}. \end{equation*}
Since \(\left\Vert {\mathbf {x}}\right\Vert _2\gt 0\), we can write
\begin{equation*} \eta \left\Vert {\mathbf {A} \mathbf {x}}\right\Vert _\infty - \rho \le \left\Vert {\mathbf {x}}\right\Vert _2 \Rightarrow \frac{\left\Vert {\mathbf {A} \mathbf {x}}\right\Vert _\infty }{\left\Vert {\mathbf {x}}\right\Vert _2} \le \frac{1}{\eta } + \frac{\rho }{\eta \left\Vert {\mathbf {x}}\right\Vert _2} \le \frac{1}{\eta } + \frac{2 \rho }{\eta \gamma }. \end{equation*}
Then recalling that \(\eta = d^5\), \(\rho = 1\) and \(\gamma = \sqrt {c k\log (d)/d} \ge \sqrt {\log (d)/d}\), for d large enough,
\begin{equation*} \frac{\left\Vert {\mathbf {A} \mathbf {x}}\right\Vert _\infty }{\left\Vert {\mathbf {x}}\right\Vert _2} \le \frac{1}{d^5} + \frac{2}{d^5}\sqrt {\frac{d}{\log (d)}} \le \frac{1}{d^5} + \frac{2}{d^{9/2}} \le d^{-4} \end{equation*}
which completes the proof. □
Putting together all the previous results, we show Theorem 4.12 which establishes that the Nemirovski function class has the required memory-sensitive properties.
Theorem 4.12.
Consider any \(\mathbf {A}\in \mathbb {R}^{n\times d}\) where \(n\le d/2\) and every row has \(\ell _2\)-norm bounded by d. Fix \(\eta = d^5\), \(\rho = 1\) and \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\) for \(c = 400\). Let \(N\le (1/(32 \gamma))^{2/3}\). Then \(\mathcal {N}_{N, \gamma }\) is a \((d^6, N, k, 1/(20\sqrt {N}))\)-memory-sensitive class. That is, we can sample \((f, \mathbf {g}_f)\) from \(\mathcal {N}_{N, \gamma }\) with at most \(d^2\) random bits, and for \((f, \mathbf {g}_f) \sim \mathcal {N}_{N, \gamma }\), with failure probability at most \(5/d\), the following properties hold for any algorithm for optimizing \(F_{f, \mathbf {A}}(\mathbf {x})\) which makes fewer than \(d^2\) queries to the oracle \(\mathbf {g}_f\):
(1)
\(F_{f, \mathbf {A}}\) is convex and \(d^6\)-Lipschitz.
(2)
With \(S_j\) and \(\mathbf {x}_{t_j}\) as defined in Definition 2.8,
\begin{equation*} \left\Vert {\mathrm{Proj}_{S_{j-1}}({\mathbf {x}}_{t_j})}\right\Vert _2/\left\Vert {{\mathbf {x}}_{t_j}}\right\Vert _2 \le 1 - 1/d^2. \end{equation*}
(3)
Any query \(\mathbf {x}\) such that \(F_{f, \mathbf {A}, \eta , \rho }(\mathbf {x}) \ne \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\) satisfies
\begin{equation*} \mathbf {g}_{F_{f, \mathbf {A}}}(\mathbf {x}) = \mathbf {v}_1 \textrm { or } \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty /\left\Vert {\mathbf {x}}\right\Vert _2\le d^{-4}. \end{equation*}
(4)
Any algorithm which has queried \(r \lt N\) unique gradients from f has a sub-optimality gap of at least
\begin{equation*} F_{f,\mathbf {A}}(\mathbf {x}_{{T}}) - F_{f,\mathbf {A}}(\mathbf {x}^\star)\ge \frac{1}{20 \sqrt {N}}. \end{equation*}
Proof.
Note that by Lemma 4.2, the first-order oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}(\mathbf {x}) = (F_{ \mathcal {N}, \mathbf {A}}^{(N)}(\mathbf {x}), \mathbf {g}_F^{(N)}(\mathbf {x}))\) has the same distribution over responses as \(\mathcal {N}_{N,\gamma , \mathbf {A}}\). Therefore, we will analyze \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\), using the Resisting oracle.
We consider the following sequence of events: (a) event E defined in Definition 4.3 holds, (b) no two Nemirovski vectors are identical, i.e., \(\mathbf {v}_i \ne \mathbf {v}_j\) for \(i\ne j\), and (c) Equation (4.1) regarding the value of \(F_{f,\mathbf {A}}^{(N)}(\mathbf {x}^\star)\) holds. We claim that for any algorithm which makes fewer than \(d^2\) queries, events (a) through (c) happen with failure probability at most \(5/d\). To verify this, we do a union bound over the failure probability of each event: for (a), Lemma 4.4 shows that E fails with probability at most \(1/d\) for any algorithm which makes fewer than \(d^2\) queries, (b) fails with probability at most \(1/d\) by Lemma 4.10, and (c) fails with probability at most \(2/d\) by Lemma 4.9. Now note that by Lemma 4.5, the Resisting oracle’s responses are identical to the responses of the oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\) under E, and therefore when conditioned on events (a) through (c). Therefore, we will condition on events (a) through (c) all holding and consider the Resisting oracle to prove all four parts of the theorem.
We begin with the first part. Note that \(f(\mathbf {x})\) has Lipschitz constant bounded by 1. Next note that since each \(\mathbf {a} \in B_d\) has \(\left\Vert {\mathbf {a}}\right\Vert _2 \le d\) and \(\eta = d^5\) we have that the \(\left\lbrace \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho \right\rbrace\) term has Lipschitz constant bounded by \(L \le d^6\). Therefore, \(F_{f, \mathbf {A}}\) has Lipschitz constant bounded by \(d^6\). For the second part, we first note that under event (b) all Nemirovski vectors are unique, therefore, the vectors \(\mathbf {x}_{t_j}\) and subspace \(S_{j-1}\) are well-defined. Now under event E by applying Lemma 4.7,
\begin{equation*} \left\Vert {\mathrm{Proj}_{S_{j-1}}(\mathbf {x}_{t_j})}\right\Vert _2/\left\Vert {\mathbf {x}_{t_j}}\right\Vert \le 1 - 1/d^2. \end{equation*}
The third part holds by Lemma 4.11. For the final part, we first note that under event (b), if the Algorithm observes r unique Nemirovski vectors then it has only proceeded to the \({(r+1)}^{\textrm {th}}\) Phase of the Resisting oracle. Now by Lemma 4.8, under event E, any algorithm in the \({(r+1)}^{\textrm {th}}\) Phase of the Resisting oracle has function value at least \(-(r+2)\gamma\). Combining this with Equation (4.1) holding, for any algorithm which observes at most r unique Nemirovski vectors as gradients,
\begin{align*} F_{f,\mathbf {A}}^{(N)}(\mathbf {x}_{{T}}) - F_{f,\mathbf {A}}^{(N)}(\mathbf {x}^\star) &\ge - (r+2)\gamma + {1}/{(8 \sqrt {N})} \ge - (N+ 2) \gamma + {1}/{(8 \sqrt {N})} \ge {1}/{(20\sqrt {N})} , \end{align*}
where the first inequality is by the above observation and Lemma 4.9, and the final inequality holds by noting that \(N\le (1/(32 \gamma))^{2/3}\) and so \(-(N+ 2) \gamma \ge -1/(20 \sqrt {N})\). □

5 Memory Lower Bounds for the Orthogonal Vector Game

In this section, we prove the lower bound for the Orthogonal Vector Game (Game 1), building on the proof sketch in Section 3.2. We prove a slightly stronger result in that we allow the oracle response \({\mathbf {g}}_i \in \mathbb {R}^{d}\) to any query \(\mathbf {x}_i \in \mathbb {R}^d\) to be an arbitrary (possibly randomized) function from \(\mathbb {R}^d\rightarrow \lbrace \mathbf {a}_1, \dots , \mathbf {a}_n\rbrace\).
Recall that \({\mathbf {G}}=\left(\mathbf {g}_1, \dots , \mathbf {g}_m \right)^{\top } \in \mathbb {R}^{m \times d}\). We perform a small augmentation of the matrix \({\mathbf {G}}\) to streamline our analysis, defined by the following matching between the rows of \(\mathbf {A}\) and \(\mathbf {G}\). For every row of \(\mathbf {A}\), if that row is also present in the matrix \(\mathbf {G}\), then we match it to that row of \(\mathbf {G}\). If some row of \(\mathbf {A}\) is present multiple times in \(\mathbf {G}\), then we match it to its first occurrence in \(\mathbf {G}\). We define any matched row in \(\mathbf {G}\) as a unique row. Let \(m_{\textrm {U}}\in [m]\) denote the number of such unique rows. If \(m_{\textrm {U}}\lt m\), construct a matrix \(\tilde {{\mathbf {G}}} \in \mathbb {R}^{(2m-m_{\textrm {U}}) \times d}\) by appending \(m-m_{\textrm {U}}\) additional rows of \(\mathbf {A}\) to \({\mathbf {G}}\), choosing the first \(m-m_{\textrm {U}}\) rows of \(\mathbf {A}\) which were previously unmatched. We now match \(\mathbf {A}\) to \(\tilde {{\mathbf {G}}}\) in the same way, and note that \(\tilde {\mathbf {G}}\) now has exactly m unique rows by definition. Given \(\mathbf {A}\) and \(\tilde {\mathbf {G}}\) let \(\mathbf {A^{\prime }}\in \mathbb {R}^{(n-m) \times d}\) denote the matrix \(\mathbf {A}\) when all the rows of \(\mathbf {A}\) which were matched to some row in \(\tilde {\mathbf {G}}\) are removed. Drawing on these definitions, we begin with the following relation between the entropy of \(\mathbf {A}\) and \(\mathbf {A^{\prime }}\) conditioned on \(\tilde {\mathbf {G}}\) and R.
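The augmentation above is purely combinatorial; the sketch below (ours) spells it out for concreteness, assuming the rows of \(\mathbf {A}\) are distinct (which holds almost surely for the distributions we consider).
```python
# Build G_tilde and A_prime from A and G as described in the text.
import numpy as np

def augment(A, G):
    n, m = A.shape[0], G.shape[0]
    matched_A, unique_G = set(), set()
    for i in range(n):                          # match row i of A to its first occurrence in G
        for j in range(m):
            if np.array_equal(A[i], G[j]):
                matched_A.add(i)
                unique_G.add(j)                 # row j of G is a matched ("unique") row
                break
    m_U = len(unique_G)                         # number of unique rows of G
    extra = [i for i in range(n) if i not in matched_A][: m - m_U]
    G_tilde = np.vstack([G, A[extra]])          # append m - m_U previously unmatched rows of A
    matched_A.update(extra)
    A_prime = A[[i for i in range(n) if i not in matched_A]]
    return G_tilde, A_prime                     # shapes: (2m - m_U) x d and (n - m) x d
```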
Lemma 5.1.
\(H(\mathbf {A}\vert \tilde {\mathbf {G}}, R) \le H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}, R) + 2m\log (4n)\).
Proof.
Note that for any random variable X, there exists a description of the random variable with expected description length \(L_X\) bounded as [22, Chapter 5],
\begin{align} H(X)\le L_X \le H(X)+1. \end{align}
(5.1)
Now fix any \(\tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }\) and \(R=R^{\prime }\). Let \(h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }} := H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }, R=R^{\prime })\). Note that by definition,
\begin{align*} H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}, R) = \sum _{\tilde {\mathbf {G}}^{\prime }}\sum _{R^{\prime }} h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }} \mathbb {P}(\tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }) \mathbb {P}(R=R^{\prime }). \end{align*}
Now by Equation (5.1) if \(H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }, R=R^{\prime })=h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }}\), then given \(\tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }\) and \(R=R^{\prime }\) there exists some description of \(\mathbf {A^{\prime }}\) which has expected description length at most \(h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }}+1\). Using this we construct a description of \(\mathbf {A}\) given \(\tilde {\mathbf {G}}=\tilde{\mathbf {G}}^{\prime }\) and \(R=R^{\prime }\) which has expected description length at most \(h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }}+1 + m\log (2n) + m\log (4n)\). To do this, note that if there is already a description of \(\mathbf {A^{\prime }}\) and we are given \(\tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }\), then to specify \(\mathbf {A}\) we only need to additionally specify the m rows which were removed from \(\mathbf {A}\) to construct \(\mathbf {A^{\prime }}\). For every row which was removed from \(\mathbf {A}\), we can specify its row index in the original matrix \(\mathbf {A}\) using \(\lceil \log (n) \rceil \le \log (2n)\) bits, and specify which of the rows of \(\tilde {\mathbf {G}}^{\prime }\) it is using another \(\lceil \log (2m-m_{\textrm {U}}) \rceil \le \lceil \log (2m) \rceil \le \log (4n)\) bits. Therefore, given \(\tilde {\mathbf {G}}=\tilde{\mathbf {G}}^{\prime }\) and \(R=R^{\prime }\),
\begin{align*} H(\mathbf {A}\vert \tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }, R=R^{\prime }) &\le h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }}+1 + m\log (2n)+m\log (4n)\le h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }} + 2m\log (4n) \\ \Rightarrow H(\mathbf {A}\vert \tilde {\mathbf {G}}, R) &= \sum _{\tilde {\mathbf {G}}^{\prime }}\sum _{R^{\prime }} H(\mathbf {A}\vert \tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }, R=R^{\prime }) \mathbb {P}(\tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }) \mathbb {P}(R=R^{\prime }) \\ &\le H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}, R)+2m\log (4n). \end{align*}
 □
By a similar analysis we can bound the entropy of \(\tilde {\mathbf {G}}\).
Lemma 5.2.
\(H(\tilde {\mathbf {G}})\le m\log (\left| B_d \right|) + 2m\log (2n)\).
Proof.
We claim that \(\tilde {\mathbf {G}}\) can be written with at most \(m\log (\left| B_d \right|) + 2m\log (2n)\) bits, therefore, its entropy is at most \(m\log (\left| B_d \right|) + 2m\log (2n)\) (by Equation (5.1)). To prove this, note that \(\tilde {\mathbf {G}}\) consists of rows of \(\mathbf {A}\) which are sampled from \(B_d\), and recall that \(\tilde{\mathbf {G}}\) has exactly m unique rows. Therefore, we can first write down all the m unique rows of \(\tilde {\mathbf {G}}\) using \(m\log (\left| B_d \right|)\) bits, and then for each of the at most \((2m-m_{\textrm {U}})\le 2m\) rows of \(\tilde {\mathbf {G}}\) we can write down which of the unique rows it is with \(\lceil \log (m)\rceil \le \log (2n)\) bits each, for a total of at most \(m\log (\left| B_d \right|) + 2m\log (2n)\) bits. □
The following lemma shows that \(\mathbf {Y}\) is a deterministic function of \(\tilde{\mathbf {G}}, \mathsf {Message}\), and R.
Lemma 5.3.
The queries \(\lbrace \mathbf {x}_i, i \in [m]\rbrace\) are a deterministic function of \(\mathbf {G}, \mathsf {Message}\) and R. Therefore, \(\mathbf {Y}=g(\tilde{\mathbf {G}},\mathsf {Message},R)\) for some function g.
Proof.
Given \(\mathbf {A}\) and R, we observe that the Player is constrained to using a deterministic algorithm both to construct the stored message \(\mathsf {Message}\) and to determine the queries \((\mathbf {x}_1, \dots , \mathbf {x}_m)\). Therefore, we see that for any i there exists a function \(\phi _i\) such that for any \(\mathsf {Message}\), R, and responses \((\mathbf {g}_1, \dots , \mathbf {g}_{i-1})\), \(\mathbf {x}_i = \phi _i(\mathsf {Message}, R, \mathbf {x}_1, \mathbf {g}_1, \dots , \mathbf {x}_{i-1}, \mathbf {g}_{i-1})\). Next, we remark that there exists a function \(\phi _i^{\prime }\) such that \(\mathbf {x}_i = \phi _i^{\prime }(\mathsf {Message}, R, \mathbf {g}_1, \dots , \mathbf {g}_{i-1})\). We can prove this by induction. For the base case, we simply note that \(\phi _1^{\prime } = \phi _1\) and \(\mathbf {x}_1 = \phi _1^{\prime }(\mathsf {Message}, R)\). Next, assume the inductive hypothesis that for any \(j \le i-1\) we have \(\mathbf {x}_j = \phi _j^{\prime }(\mathsf {Message}, R, \mathbf {g}_1, \dots , \mathbf {g}_{j-1})\). Then
\begin{align*} \mathbf {x}_i & = \phi _i(\mathsf {Message}, R, \mathbf {x}_1, \mathbf {g}_1, \dots , \mathbf {x}_{i-1}, \mathbf {g}_{i-1})\\ & = \phi _i(\mathsf {Message}, R, \phi ^{\prime }_1(\mathsf {Message}, R), \mathbf {g}_1, \dots , \phi ^{\prime }_{i-1}(\mathsf {Message}, R, \mathbf {g}_1, \dots , \mathbf {g}_{i-2}), \mathbf {g}_{i-1}). \end{align*}
Thus, we define
\begin{align*} \phi ^{\prime }_i (\mathsf {Message}, R, \mathbf {g}_1, \dots , \mathbf {g}_{i-1}) := \; \phi _i\Big (\mathsf {Message}, R, \phi ^{\prime }_1(\mathsf {Message}, R), \mathbf {g}_1, \dots , \phi ^{\prime }_{i-1}(\mathsf {Message}, R, \mathbf {g}_1, \dots , \mathbf {g}_{i-2}), \mathbf {g}_{i-1} \Big). \end{align*}
Therefore, given \((\mathsf {Message}, R, \mathbf {G})\), it is possible to reconstruct the queries \(\mathbf {X}\). Since \(\mathbf {Y}\) is a deterministic function of \((\mathbf {X},\mathbf {G},\mathsf {Message},R)\) in the Orthogonal Vector Game, and \(\mathbf {G}\) is just the first m rows of \(\tilde{\mathbf {G}},\) \(\mathbf {Y}=g(\tilde{\mathbf {G}},\mathsf {Message},R)\) for some function g. □
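The replay argument in Lemma 5.3 can be phrased as a small procedure; the sketch below (ours) regenerates the queries by simulating the deterministic Player on the recorded responses. The callable player_step is a hypothetical stand-in for the maps \(\phi _i\).
```python
# Reconstruct the queries x_1, ..., x_m from (Message, R) and the recorded
# oracle responses g_1, ..., g_m by replaying the deterministic Player.
def replay_queries(player_step, message, R, responses):
    # player_step(message, R, history) -> next query, where history is the
    # list of (query, response) pairs seen so far.
    history, queries = [], []
    for g in responses:
        x = player_step(message, R, history)
        queries.append(x)
        history.append((x, g))
    return queries
```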
As sketched out in Section 3.2, in the following two lemmas, we compute \(I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R)\) in two different ways.
Lemma 5.4.
\(I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R) \le M\).
Proof.
We note that by Lemma 5.3, \(\mathbf {Y} = g(\tilde {\mathbf {G}}, \mathsf {Message}, R)\) and, therefore, by the data processing inequality,
\begin{equation*} I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R) \le I(\mathbf {A^{\prime }}; \tilde {\mathbf {G}},\mathsf {Message}, R \vert \tilde {\mathbf {G}}, R)=I(\mathbf {A^{\prime }}; \mathsf {Message}\vert \tilde {\mathbf {G}}, R). \end{equation*}
We conclude by noting that
\begin{equation*} I(\mathbf {A^{\prime }}; \mathsf {Message}\vert \tilde {\mathbf {G}}, R) \le H(\mathsf {Message}\vert \tilde {\mathbf {G}}, R) \le H(\mathsf {Message}) \le M \end{equation*}
where in the last step, we use the fact that \(\mathsf {Message}\) is M-bit long. □
Lemma 5.5.
Suppose \(\mathbf {Y}\) wins the Two Player Orthogonal Vector Game with failure probability at most \(\delta\). Then if \(\left\lbrace B_d \right\rbrace _{d = 1}^{\infty }\) is a memory sensitive base with constant \(c_{B}\gt 0\),
\begin{equation*} I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R) \ge (n - m)(1-\delta) \frac{c_{B}}{2} k- 4 m\log (4n). \end{equation*}
Proof.
We have
\begin{equation} I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R) = H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}, R) - H(\mathbf {A^{\prime }}\vert \mathbf {Y} ,\tilde {\mathbf {G}}, R). \end{equation}
(5.2)
We will lower bound \(I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R)\) by providing a lower bound for \(H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}, R)\) and an upper bound for \(H(\mathbf {A^{\prime }}\vert \mathbf {Y}, \tilde {\mathbf {G}}, R)\). To that end, by Lemma 5.1 we have
\begin{equation*} H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}, R) \ge H(\mathbf {A}\vert \tilde {\mathbf {G}}, R) - 2m\log (4n). \end{equation*}
We can lower bound \(H(\mathbf {A}\vert \tilde {\mathbf {G}}, R)\) as follows:
\begin{align*} H(\mathbf {A}\vert \tilde {\mathbf {G}}, R) &= H(\mathbf {A}\vert R) - I(\mathbf {A}; \tilde {\mathbf {G}} \vert R) = H(\mathbf {A}\vert R) - (H(\tilde {\mathbf {G}} \vert R) - H(\tilde {\mathbf {G}} \vert \mathbf {A}, R)) \\ &\ge H(\mathbf {A}\vert R) - H(\tilde {\mathbf {G}} \vert R) \ge H(\mathbf {A}\vert R) - H(\tilde {\mathbf {G}}), \end{align*}
where the inequalities use the fact that entropy is non-negative, and that conditioning reduces entropy. We now note that since \(\mathbf {A}\) is independent of R, \(H(\mathbf {A}\vert R)=H(\mathbf {A})\). Thus, we have
\begin{equation*} H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}, R) \ge H(\mathbf {A}) - H(\tilde {\mathbf {G}}) - 2m\log (4n), \end{equation*}
and so recalling Lemma 5.2 and that \(H(\mathbf {A}) = n \log (\left| B_d \right|)\) we conclude
\begin{equation} H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}, R) \ge n \log (\left| B_d \right|) \ - m\log (\left| B_d \right|) - 4m\log (4n). \end{equation}
(5.3)
Next, we upper bound \(H(\mathbf {A^{\prime }}\vert \mathbf {Y}, \tilde {\mathbf {G}}, R)\). First we note
\begin{equation} H(\mathbf {A^{\prime }}\vert \mathbf {Y}, \tilde {\mathbf {G}}, R) \le H(\mathbf {A^{\prime }}\vert \mathbf {Y}) \le \sum _{j=1}^{n-m} H(\mathbf {a}^{\prime }_j \vert \mathbf {Y}), \end{equation}
(5.4)
where \(\mathbf {a}^{\prime }_j\) is the \({j}^{\textrm {th}}\) row of \(\mathbf {A^{\prime }}\). Next recall that if \(\mathbf {Y} = (\mathbf {y}_1, \dots , \mathbf {y}_{k})^{\top } \in \mathbb {R}^{k\times d}\) wins the Two Player Orthogonal Vector Game then for any \(\mathbf {y}_i\), \(\left\Vert {\mathbf {A}\mathbf {y}_i}\right\Vert _\infty /\left\Vert {\mathbf {y}_i}\right\Vert _2 \le d^{-4}\) and for \(S_i = \mathsf {span}(\mathbf {y}_1, \dots , \mathbf {y}_{i-1})\), we have \(\Vert {\mathrm{Proj}_{S_i}(\mathbf {y}_i)\Vert }_2/\left\Vert {\mathbf {y}_i}\right\Vert _2 \le 1 - (1/d^{2})\). Since these properties are normalized by \(\left\Vert {\mathbf {y}_i}\right\Vert _2\) we may, without loss of generality, overload our notation for \(\mathbf {y}_i\) by normalizing, we set \(\mathbf {y}_i \leftarrow \mathbf {y}_i/\left\Vert {\mathbf {y}_i}\right\Vert _2\). Note that for any j the entropy of \(\mathbf {a}^{\prime }_j\) conditioned on any value \(\mathbf {Y}^{\prime }\) taken by the random variable \(\mathbf {Y}\) is bounded above by the entropy of \(\mathbf {a}^{\prime }\) where the law of \(\mathbf {a}^{\prime }\) corresponds to the uniform distribution over the set \(\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y}^{\prime } \mathbf {a}}\right\Vert _\infty \le \frac{1}{d^4} \rbrace\) and this has entropy \(\log (|{\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y}^{\prime } \mathbf {a}}\right\Vert _\infty \le \frac{1}{d^4}\rbrace }|)\). We also note that since \(\mathbf {Y} = g(\tilde {\mathbf {G}}, \mathsf {Message}, R)\) (Lemma 5.3), and all of \(\tilde{\mathbf {G}}, \mathsf {Message}\) and R take values on a finite support, \(\mathbf {Y}\) also takes values on some finite support (\(\mathbf {Y}\) can still be real-valued, but its support must be finite). Therefore, for any j, we can write,
\begin{align} H(\mathbf {a}^{\prime }_j \vert \mathbf {Y}) & = \sum _{\mathbf {Y}^{\prime }} \left[ H \left(\mathbf {a}^{\prime }_j\vert \mathbf {Y} = \mathbf {Y}^{\prime } \right) \right]\mathbb {P}(\mathbf {Y}=\mathbf {Y}^{\prime })\\ & = \sum _{\mathbf {Y}^{\prime }: \mathbf {Y}^{\prime } \textrm { wins} } \left[ H \left(\mathbf {a}^{\prime }_j \vert \mathbf {Y} = \mathbf {Y}^{\prime } \right) \right] \mathbb {P}(\mathbf {Y}=\mathbf {Y}^{\prime }) + \sum _{\mathbf {Y}^{\prime }: \mathbf {Y}^{\prime } \textrm { loses } } \left[ H \left(\mathbf {a}^{\prime }_j \vert \mathbf {Y} = \mathbf {Y}^{\prime } \right) \right] \mathbb {P}(\mathbf {Y}=\mathbf {Y}^{\prime }) \\ & \le \sum _{\mathbf {Y}^{\prime }: \mathbf {Y}^{\prime } \textrm { wins} } \left[ H \left(\mathbf {a}^{\prime }_j \vert \mathbf {Y} = \mathbf {Y}^{\prime } \right) \right]\mathbb {P}(\mathbf {Y}=\mathbf {Y}^{\prime })+ p(\mathbf {Y} \textrm { loses}) H(\mathbf {a}^{\prime }_j) \\ & \le \sum _{\mathbf {Y}^{\prime }: \mathbf {Y}^{\prime } \textrm { wins} } \log \left(\left| \left\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y}^{\prime } \mathbf {a}}\right\Vert _\infty \le \frac{1}{d^4} \right\rbrace \right| \right) \mathbb {P}(\mathbf {Y}=\mathbf {Y}^{\prime }) + p(\mathbf {Y} \textrm { loses}) H(\mathbf {a}^{\prime }_j)\\ & \le (1 - \delta) \log \left(\max _{\mathbf {Y}^{\prime }: \mathbf {Y}^{\prime } \textrm {wins}} \left| \left\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y}^{\prime } \mathbf {a}}\right\Vert _\infty \le \frac{1}{d^4} \right\rbrace \right| \right) + \delta \log (\left| B_d \right|). \end{align}
(5.5)
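For clarity, the final inequality above can be verified as follows. Write \(p := \mathbb {P}(\mathbf {Y} \textrm { loses})\), which is at most \(\delta\) since the Player is assumed to win with failure probability at most \(\delta\), and write \(\ell := \log (\max _{\mathbf {Y}^{\prime }: \mathbf {Y}^{\prime } \textrm {wins}} | \lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y}^{\prime } \mathbf {a}}\right\Vert _\infty \le \frac{1}{d^4} \rbrace |)\). Since \(\mathbf {a}^{\prime }_j\) is supported on \(B_d\) we have \(H(\mathbf {a}^{\prime }_j) \le \log (\left| B_d \right|)\), and since each of the sets in the maximum is a subset of \(B_d\) we have \(\ell \le \log (\left| B_d \right|)\). Hence the right-hand side of the penultimate line is at most
\begin{equation*} (1-p) \ell + p \log (\left| B_d \right|) = \ell + p \left(\log (\left| B_d \right|) - \ell \right) \le (1-\delta) \ell + \delta \log (\left| B_d \right|). \end{equation*}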
Thus, for any \(\mathbf {Y} = (\mathbf {y}_1, \dots , \mathbf {y}_{k})\) which wins the Two Player Orthogonal Vector Game, we will upper bound \(|{\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y} \mathbf {a}}\right\Vert _\infty \le 1/d^4\rbrace }|\). We will use the following lemma (proved in Appendix A.2) which constructs a partial orthonormal basis from a set of robustly linearly independent vectors.
Lemma 5.6.
Let \(\lambda \in (0, 1]\) and suppose we have \(n_0\le d\) unit norm vectors \(\mathbf {y}_1, \dots , \mathbf {y}_{n_0} \in \mathbb {R}^d\). Let \(S_i := \mathsf {span}(\lbrace \mathbf {y}_1, \dots , \mathbf {y}_{i} \rbrace), S_0 := \phi\). Suppose that for any \(i \in [n_0]\),
\begin{equation*} \left\Vert {\mathrm{Proj}_{S_{i-1}}(\mathbf {y}_i)}\right\Vert _2 \le 1 - \lambda . \end{equation*}
Let \(\mathbf {Y} = (\mathbf {y}_1, \dots , \mathbf {y}_{n_0}) \in \mathbb {R}^{d \times n_0}\). There exist \(n_1= \lfloor n_0/2 \rfloor\) orthonormal vectors \(\mathbf {m}_1, \dots , \mathbf {m}_{n_1}\) such that, letting \(\mathbf {M} = (\mathbf {m}_1, \dots , \mathbf {m}_{n_1})\), for any \(\mathbf {a} \in \mathbb {R}^d\),
\begin{equation*} \left\Vert {\mathbf {M}^{\top } \mathbf {a}}\right\Vert _\infty \le \frac{d}{\lambda } \left\Vert {\mathbf {Y}^{\top } \mathbf {a}}\right\Vert _\infty . \end{equation*}
 □
The winning conditions (after the normalization above) state that \(\left\Vert {\mathrm{Proj}_{S_i}(\mathbf {y}_i)}\right\Vert _2 \le 1 - 1/d^2\) for every i, so we may apply Lemma 5.6 to \(\mathbf {y}_1, \dots , \mathbf {y}_{k}\) with \(\lambda = 1/d^2\) (and hence \(d/\lambda = d^3\)): there exist \(\lfloor k/2 \rfloor\) orthonormal vectors \(\lbrace \mathbf {z}_1, \dots , \mathbf {z}_{\lfloor k/2 \rfloor }\rbrace\) such that for \(\mathbf {Z} = (\mathbf {z}_1, \dots , \mathbf {z}_{\lfloor k/2 \rfloor })\) and for any \(\mathbf {a} \in \mathbb {R}^d\),
\begin{equation} \left\Vert {\mathbf {Z}^{\top } \mathbf {a}}\right\Vert _\infty \le d^3 \left\Vert {\mathbf {Y} \mathbf {a}}\right\Vert _\infty . \end{equation}
(5.6)
Define \(S := \left\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Z}^{\top } \mathbf {a}}\right\Vert _\infty \le 1/d \right\rbrace\). By Equation (5.6),
\begin{equation*} \left| \left\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y} \mathbf {a}}\right\Vert _\infty \le \frac{1}{d^4} \right\rbrace \right| \le \left| S \right|. \end{equation*}
Observing that \(B_d\) is a memory-sensitive base as per Definition 2.4,
\begin{equation*} \left| S \right| \le \mathbb {P}_{\mathbf {a} \sim \mathsf {Unif}(B_d)} \left(\left\Vert {\mathbf {Z}^{\top } \mathbf {a}}\right\Vert _\infty \le 1/d \right) \left| B_d \right| \le 2^{- c_{B}k/2} \left| B_d \right|, \end{equation*}
for some constant \(c_{B}\gt 0\). From Equation (5.5),
\begin{equation*} H(\mathbf {a}^{\prime }_j \vert \mathbf {Y}) \le (1 - \delta) \log \left(2^{- c_{B}k/2} \left| B_d \right| \right) + \delta \log \left(\left| B_d \right| \right) = \log (\left| B_d \right|) - (1- \delta) c_{B}k/2. \end{equation*}
Plugging this into Equation (5.4),
\begin{equation*} H(\mathbf {A^{\prime }}\vert \mathbf {Y}, \tilde {\mathbf {G}}, R) \le (n-m) \left(\log (\left| B_d \right|) - (1- \delta) c_{B}k/2 \right). \end{equation*}
Recalling Equations (5.2) and (5.3), we conclude,
\begin{align*} I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R) & \ge \log (\left| B_d \right|) (n - m) - 4m\log (4n) - (n-m) \left(\log (\left| B_d \right|) - (1- \delta) c_{B}k/2 \right) \\ & = (n -m)(1-\delta) \frac{c_{B}}{2} k- 4m\log (4n). \end{align*}
By combining Lemmas 5.4 and 5.5, we prove the memory-query tradeoff for the Orthogonal Vector Game.
Theorem 3.3.
Suppose for a memory-sensitive base \(\left\lbrace B_d \right\rbrace _{d = 1}^{\infty }\) with constant \(c_{B}\gt 0\), \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\). Given \(\mathbf {A}\) let the oracle response \(\mathbf {g}_i\) to any query \(\mathbf {x}_i \in \mathbb {R}^d\) be any (possibly randomized) function from \(\mathbb {R}^d\rightarrow \lbrace \mathbf {a}_1, \dots , \mathbf {a}_{d/2}\rbrace\), where \(\mathbf {a}_i\) is the \({i}^{\textrm {th}}\) row of \(\mathbf {A}\) (note this includes the subgradient response in the Orthogonal Vector Game). Assume \(M \ge d \log (2d)\) and set \(k=\left\lceil 60M/(c_{B}d) \right\rceil\). For these values of k and M, if the Player wins the Orthogonal Vector Game with probability at least \(1/3\), then \(m\ge d/5\).
Proof.
By Lemma 5.4, we have that
\begin{equation*} I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R) \le M. \end{equation*}
However, if the player returns some \(\mathbf {Y}\) which wins the Orthogonal Vector Game with failure probability at most \(\delta\) then by Lemma 5.5,
\begin{equation*} I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R) \ge (n - m) (1-\delta)\frac{c_{B}}{2} k- 4 m\log (4n). \end{equation*}
Therefore, we must have
\begin{equation*} (n - m) (1-\delta)\frac{c_{B}}{2} k- 4 m\log (4n) \le M. \end{equation*}
Rearranging terms gives us
\begin{equation*} m\ge \frac{n (1-\delta) c_{B}k/2 - M}{(1-\delta)c_{B}k/2 + 4 \log (4n)}. \end{equation*}
Recall that \(k = \lceil 60 M / (c_{B}d) \rceil\) and that \(M \ge d \log (2d) \ge d \log (4n)\) (since \(4n \le 2d\)); thus, for \(\delta = 2/3\) we have \((1 - \delta) c_{B}k/2 \ge 4 \log (4n)\). Using this and the fact that \(n = \lfloor d/2 \rfloor\),
\begin{align*} m\ge \frac{n (1-\delta) c_{B}k/2 - M}{(1-\delta)c_{B}k} \ge \frac{n}{2} -\frac{M}{(1-\delta)c_{B}k} \ge \frac{n}{2}-\frac{d}{20} \ge d/5. \end{align*}
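For concreteness, both parameter facts used above follow directly from \(k \ge 60M/(c_{B}d)\), \(\delta = 2/3\), and \(M \ge d\log (4n)\):
\begin{equation*} (1-\delta) \frac{c_{B}k}{2} \ge \frac{1}{3} \cdot \frac{30M}{d} = \frac{10M}{d} \ge 10 \log (4n) \ge 4 \log (4n), \qquad \textrm {and} \qquad \frac{M}{(1-\delta)c_{B}k} \le \frac{M}{\frac{1}{3} \cdot \frac{60M}{d}} = \frac{d}{20}. \end{equation*}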
 □

Acknowledgments

Thank you to the anonymous reviewers for helpful feedback on earlier drafts of this paper.

Footnotes

1
This arises from \(\Omega (d \log (1/\epsilon))\) iterations, each of which involves working in a different basis or solving a linear system and hence naively takes \(\Omega (d^2)\) time.
2
By a black-box reduction in Appendix A.3, our results extend to unconstrained optimization while only losing \(\mathrm{poly}(d)\) factors in the accuracy \(\epsilon\) for which the lower bound applies.
3
We remark that \(2^d\) can be replaced by any finite-valued function of d.
4
Note that 0 queries are needed when \(\epsilon \ge L\).

A Additional Helper Lemmas

A.1 Useful Concentration Bounds

In this section, we establish some concentration bounds which we repeatedly use in our analysis.
Definition A.1 (sub-Gaussian random variable).
A zero-mean random variable X is sub-Gaussian if for some constant \(\sigma\) and for all \(\lambda \in \mathbb {R}\), \(\mathbb {E}[e^{\lambda X}]\le e^{\lambda ^2 \sigma ^2/2}\). We refer to \(\sigma\) as the sub-Gaussian parameter of X. We also define the sub-Gaussian norm \(\left\Vert {X}\right\Vert _{\psi _2}\) of a sub-Gaussian random variable X as follows:
\begin{equation*} \left\Vert {X}\right\Vert _{\psi _2} := \inf \left\lbrace K \gt 0 \textrm { such that } \mathbb {E}[\exp (X^2/K^2)] \le 2 \right\rbrace . \end{equation*}
We use that \(X\sim \mathsf {Unif}(\lbrace \pm 1\rbrace)\) is sub-Gaussian with parameter \(\sigma =1\) and sub-Gaussian norm \(\left\Vert {X}\right\Vert _{\psi _2} \le 2\). The following fact about sums of independent sub-Gaussian random variables is useful in our analysis.
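For completeness, both claims about \(X\sim \mathsf {Unif}(\lbrace \pm 1\rbrace)\) can be verified directly: for any \(\lambda \in \mathbb {R}\),
\begin{equation*} \mathbb {E}[e^{\lambda X}] = \cosh (\lambda) = \sum _{j \ge 0} \frac{\lambda ^{2j}}{(2j)!} \le \sum _{j \ge 0} \frac{\lambda ^{2j}}{2^j j!} = e^{\lambda ^2/2}, \end{equation*}
so \(\sigma = 1\) suffices, and \(\mathbb {E}[\exp (X^2/K^2)] = e^{1/K^2} \le 2\) whenever \(K \ge 1/\sqrt {\ln 2} \approx 1.2\), so \(\left\Vert {X}\right\Vert _{\psi _2} \le 2\).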
Fact A.2.
[69] For any n, if \(\lbrace X_i, i \in [n]\rbrace\) are independent zero-mean sub-Gaussian random variables with sub-Gaussian parameter \(\sigma _i\), then \(X^{\prime }=\sum _{i \in [n]} X_i\) is a zero-mean sub-Gaussian with parameter \(\sqrt {\sum _{i \in [n]} \sigma _i^2}\) and norm bounded as \(\left\Vert {X^{\prime }}\right\Vert _{\psi _2}\le \alpha \sqrt {\sum _{i \in [n]} \left\Vert {X_i}\right\Vert _{\psi _2}^2}\), for some universal constant \(\alpha\).
Using Fact A.2, we obtain the following corollary which we use directly in our analysis.
Corollary A.3.
For \(\mathbf {v}\sim (1/\sqrt {d}) \mathsf {Unif} \left(\mathcal {H}_d \right)\) and any fixed vector \(\mathbf {x}\in \mathbb {R}^d\), \(\mathbf {x}^{\top } \mathbf {v}\) is a zero-mean sub-Gaussian with sub-Gaussian parameter at most \(\left\Vert {\mathbf {x}}\right\Vert _2/\sqrt {d}\) and sub-Gaussian norm at most \(\left\Vert {\mathbf {x}^{\top } \mathbf {v}}\right\Vert _{\psi _2} \le 2\alpha \left\Vert {\mathbf {x}}\right\Vert _2/\sqrt {d}\).
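Corollary A.3 follows by applying Fact A.2 coordinatewise; here we use that (as the scaling and the proof of Corollary A.7 suggest) \(\mathcal {H}_d\) denotes the hypercube \(\lbrace \pm 1\rbrace ^d\), so that the coordinates \(\mathbf {v}_1, \dots , \mathbf {v}_d\) of \(\mathbf {v}\) are i.i.d. uniform on \(\lbrace \pm 1/\sqrt {d}\rbrace\). Writing
\begin{equation*} \mathbf {x}^{\top } \mathbf {v}= \sum _{i \in [d]} \mathbf {x}_i \mathbf {v}_i, \end{equation*}
each summand is zero-mean sub-Gaussian with parameter \(\left| \mathbf {x}_i \right|/\sqrt {d}\) and sub-Gaussian norm at most \(2\left| \mathbf {x}_i \right|/\sqrt {d}\), so Fact A.2 gives a sub-Gaussian parameter of at most \(\sqrt {\sum _{i \in [d]} \mathbf {x}_i^2/d} = \left\Vert {\mathbf {x}}\right\Vert _2/\sqrt {d}\) and a sub-Gaussian norm of at most \(\alpha \sqrt {\sum _{i \in [d]} 4\mathbf {x}_i^2/d} = 2\alpha \left\Vert {\mathbf {x}}\right\Vert _2/\sqrt {d}\).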
The following result bounds the tail probability of sub-Gaussian random variables.
Fact A.4.
[70] If X is a zero-mean sub-Gaussian random variable with parameter \(\sigma\), then for any \(t \ge 0\), \(\mathbb {P}(\left| X \right| \ge t) \le 2 \exp (-t^2/(2\sigma ^2))\).
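As an example of how these bounds combine: for \(\mathbf {v}\sim (1/\sqrt {d}) \mathsf {Unif}(\mathcal {H}_d)\), any fixed \(\mathbf {x}\in \mathbb {R}^d\) with \(\mathbf {x}\ne \mathbf {0}\), and any \(t \ge 0\), Corollary A.3 together with Fact A.4 gives
\begin{equation*} \mathbb {P} \left(\left| \mathbf {x}^{\top } \mathbf {v} \right| \ge t \right) \le 2 \exp \left(- \frac{d t^2}{2 \left\Vert {\mathbf {x}}\right\Vert _2^2} \right). \end{equation*}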
We will use the following concentration bound for quadratic forms.
Theorem A.5 (Hanson–Wright inequality [32, 57]).
Let \(\mathbf {x}= (X_1, \dots , X_d) \in \mathbb {R}^d\) be a random vector with i.i.d components \(X_i\) which satisfy \(\mathbb {E}[X_i] = 0\) and \(\left\Vert {X_i}\right\Vert _{\psi _2} \le K\) and let \(\mathbf {M} \in \mathbb {R}^{d \times d}\). Then for some absolute constant \(c_{\textrm {hw}}\gt 0\) and for every \(t \ge 0\),
\[\begin{gather*} \max \left\lbrace \mathbb {P} \left(\mathbf {x}^{\top } \mathbf {M} \mathbf {x}- \mathbb {E}\left[ \mathbf {x}^{\top } \mathbf {M} \mathbf {x}\right] \gt t \right), \mathbb {P} \left(\mathbb {E}\left[ \mathbf {x}^{\top } \mathbf {M} \mathbf {x}\right] - \mathbf {x}^{\top } \mathbf {M} \mathbf {x}\gt t \right) \right\rbrace \\ \le \exp \left(- c_{\textrm {hw}}\min \left\lbrace \frac{t^2}{K^4 \left\Vert {\mathbf {M}}\right\Vert _{\textrm {F}}^2}, \frac{t}{K^2 \left\Vert {\mathbf {M}}\right\Vert _2} \right\rbrace \right), \end{gather*}\]
where \(\left\Vert {\mathbf {M}}\right\Vert _2\) is the operator norm of \(\mathbf {M}\) and \(\left\Vert {\mathbf {M}}\right\Vert _{\textrm {F}}\) is the Frobenius norm of \(\mathbf {M}\).
The following lemma will let us analyze projections of sub-Gaussian random vectors, yielding concentration bounds analogous to those for projections of Gaussian random vectors.
Lemma A.6.
Let \(\mathbf {x}= (X_1, \dots , X_d) \in \mathbb {R}^d\) be a random vector with i.i.d sub-Gaussian components \(X_i\) which satisfy \(\mathbb {E}[X_i] = 0\), \(\left\Vert {X_i}\right\Vert _{\psi _2} \le K\), and \(\mathbb {E}[\mathbf {x}\mathbf {x}^{\top }]=s^2 \mathbf {I}_d\). Let \(\mathbf {Z}\in \mathbb {R}^{d \times r}\) be a matrix with orthonormal columns \((\mathbf {z}_1,\dots , \mathbf {z}_r)\). There exists some absolute constant \(c_{\textrm {hw}}\gt 0\) (which comes from the Hanson–Wright inequality) such that for \(t\ge 0\),
\begin{equation} \max \left\lbrace \mathbb {P} \left(\left\Vert {\mathbf {Z}^{\top } \mathbf {x}}\right\Vert _2^2 - rs^2 \gt t \right), \mathbb {P} \left(rs^2 - \left\Vert {\mathbf {Z}^{\top } \mathbf {x}}\right\Vert _2^2 \gt t \right) \right\rbrace \le \exp \left(- c_{\textrm {hw}}\min \left\lbrace \frac{t^2}{r K^4 }, \frac{t}{K^2} \right\rbrace \right). \end{equation}
(A.1)
Proof.
We will use the Hanson–Wright inequality (see Theorem A.5). We begin by computing \(\mathbb {E}[ \Vert {\mathbf {Z}^{\top } \mathbf {x}}\Vert _2^2]\),
\begin{align*} \mathbb {E}\left[ \left\Vert {\mathbf {Z}^{\top } \mathbf {x}}\right\Vert _2^2 \right] & = \mathbb {E}\left[ \mathbf {x}^{\top } \mathbf {Z} \mathbf {Z}^{\top } \mathbf {x}\right] = \mathbb {E}\left[ \mathbf {tr}\left(\mathbf {Z}^{\top } \mathbf {x}\mathbf {x}^{\top } \mathbf {Z} \right) \right] = \mathbf {tr}\left(\mathbf {Z}^{\top } \mathbb {E}\left[ \mathbf {x}\mathbf {x}^{\top } \right] \mathbf {Z} \right) \\ & = s^2 \mathbf {tr}\left(\mathbf {Z}^{\top } \mathbf {Z} \right) = s^2 \mathbf {tr}(\mathbf {I}_{r \times r}) = r s^2 . \end{align*}
Next, we bound the operator norm and Frobenius-norm of \(\mathbf {Z}\mathbf {Z}^{\top }\). The operator norm of \(\mathbf {Z}\mathbf {Z}^{\top }\) is bounded by 1 since the columns of \(\mathbf {Z}\) are orthonormal vectors. For the Frobenius-norm of \(\mathbf {Z}\mathbf {Z}^{\top }\), note that
\begin{align*} \left\Vert {\mathbf {Z} \mathbf {Z}^{\top }}\right\Vert _{\textrm {F}}^2 & = \mathbf {tr}\left(\mathbf {Z} \mathbf {Z}^{\top } \mathbf {Z} \mathbf {Z}^{\top } \right) = \mathbf {tr}\left(\mathbf {Z} \mathbf {Z}^{\top } \right) = \sum _{i \in [r]}\mathbf {tr}\left(\mathbf {z}_i \mathbf {z}_i^{\top } \right) = \sum _{i \in [r]} \left\Vert {\mathbf {z}_i}\right\Vert _2^2 = r. \end{align*}
Applying the Hanson–Wright inequality (Theorem A.5) with these bounds on \(\mathbb {E}[ \Vert {\mathbf {Z}^{\top } \mathbf {x}}\Vert _2^2]\), \(\Vert {\mathbf {Z} \mathbf {Z}^{\top }}\Vert _{2}\), and \(\Vert {\mathbf {Z} \mathbf {Z}^{\top }}\Vert _{\textrm {F}}^2\) yields (A.1) and completes the proof. □
Corollary A.7.
Let \(\mathbf {v}\sim (1/\sqrt {d}) \mathsf {Unif} \left(\mathcal {H}_d \right)\) and let S be a fixed r-dimensional subspace of \(\mathbb {R}^d\) (independent of \(\mathbf {v}\)). Then with probability at least \(1-2 \exp (c_{\textrm {hw}})/d^5\),
\begin{equation*} \left\Vert { \mathrm{Proj}_{S}(\mathbf {v}) }\right\Vert _2 \le \sqrt {\frac{30 \log (d) r}{d}}. \end{equation*}
Proof.
Let \(\mathbf {z}_1, \dots , \mathbf {z}_r\) be an orthonormal basis for the subspace S and let \(\mathbf {Z} = (\mathbf {z}_1, \dots , \mathbf {z}_r) \in \mathbb {R}^{d \times r}\). Note that
\begin{equation*} \left\Vert {\mathrm{Proj}_S(\mathbf {v})}\right\Vert _2^2 = \left\Vert {\mathbf {Z}^{\top } \mathbf {v}}\right\Vert _2^2. \end{equation*}
Next, for \(\mathbf {v}= (\mathbf {v}_1, \dots , \mathbf {v}_d)\) we have \(\mathbb {E}[\mathbf {v}\mathbf {v}^{\top }] = (1/d) \mathbf {I}\) and \(\left\Vert {\mathbf {v}_i}\right\Vert _{\psi _2} \le 2/\sqrt {d}\), thus, we may set \(s = 1/\sqrt {d}\) and \(K = 2/\sqrt {d}\) and apply Lemma A.6,
\begin{equation*} \mathbb {P} \left(\left| \left\Vert { \mathrm{Proj}_{S}(\mathbf {v}) }\right\Vert _2^2 - \frac{r}{d} \right| \gt t \right) \le 2 \exp \left(-c_{\textrm {hw}}\min \left\lbrace \frac{d^2 t^2}{16 r} , \frac{dt}{4} \right\rbrace \right). \end{equation*}
Picking \(t = 20 \log (d)r/d\) concludes the proof. □

A.2 A Property of Robustly Linearly Independent Vectors: Proof of Lemma 5.6

Lemma 5.6. Let \(\lambda \in (0, 1]\) and suppose we have \(n_0\le d\) unit norm vectors \(\mathbf {y}_1, \dots , \mathbf {y}_{n_0} \in \mathbb {R}^d\). Let \(S_i := \mathsf {span}\left(\left\lbrace \mathbf {y}_1, \dots , \mathbf {y}_{i} \right\rbrace \right), S_0 := \phi\). Suppose that for any \(i \in [n_0]\),
\begin{equation*} \left\Vert {\mathrm{Proj}_{S_{i-1}}(\mathbf {y}_i)}\right\Vert _2 \le 1 - \lambda . \end{equation*}
Let \(\mathbf {Y} = \left(\mathbf {y}_1, \dots , \mathbf {y}_{n_0} \right) \in \mathbb {R}^{d \times n_0}\). There exist \(n_1= \lfloor n_0/2 \rfloor\) orthonormal vectors \(\mathbf {m}_1, \dots , \mathbf {m}_{n_1}\) such that, letting \(\mathbf {M} = \left(\mathbf {m}_1, \dots , \mathbf {m}_{n_1} \right)\), for any \(\mathbf {a} \in \mathbb {R}^d\),
\begin{equation*} \left\Vert {\mathbf {M}^{\top } \mathbf {a}}\right\Vert _\infty \le \frac{d}{\lambda } \left\Vert {\mathbf {Y}^{\top } \mathbf {a}}\right\Vert _\infty . \end{equation*}
Proof.
Since for any \(i \in [n_0]\), \(\Vert {\mathrm{Proj}_{S_{i-1}}(\mathbf {y}_i)}\Vert _2 \le 1 - \lambda \lt 1\),
\begin{equation*} \mathsf {dim} \hspace{2.84526pt} \mathsf {span}(\mathbf {y}_1, \dots , \mathbf {y}_{n_0}) = n_0. \end{equation*}
Therefore, we may construct an orthonormal basis \(\mathbf {b}_1, \dots , \mathbf {b}_{n_0}\) via the Gram–Schmidt process,
\begin{equation*} \mathbf {b}_i := \frac{\mathbf {y}_i - \mathrm{Proj}_{S_{i-1}}(\mathbf {y}_i) }{ \left\Vert {\mathbf {y}_i - \mathrm{Proj}_{S_{i-1}}(\mathbf {y}_i) }\right\Vert _2}. \end{equation*}
Let \(\mathbf {B} = (\mathbf {b}_1, \dots , \mathbf {b}_{n_0})\in \mathbb {R}^{d \times n_0}\) and note that with this construction there exists a vector of coefficients \(\mathbf {c}^{(i)} \in \mathbb {R}^{n_0}\) such that
\begin{equation*} \mathbf {y}_i = \sum _{j \in [i]} \mathbf {c}^{(i)}_j \mathbf {b}_j = \mathbf {B} \mathbf {c}^{(i)}. \end{equation*}
Let \(\mathbf {C} = (\mathbf {c}^{(1)}, \dots , \mathbf {c}^{(n_0)}) \in \mathbb {R}^{n_0\times n_0}\) and observe that \(\mathbf {Y} = \mathbf {B} \mathbf {C}\). Denote the singular values of \(\mathbf {C}\) as \(\sigma _1 \ge \dots \ge \sigma _{n_0}\). We aim at bounding the singular values of \(\mathbf {C}\) from below. To this end, observe that \(\mathbf {C}\) is an upper triangular matrix such that for any \(i \in [n_0]\), \(\left| \mathbf {C}_{ii} \right| \ge \sqrt {\lambda }\). Indeed, note that
\begin{align*} \mathbf {C}_{ii}^2 & = \left\Vert {\mathbf {y}_i - \mathrm{Proj}_{S_{i-1}}(\mathbf {y}_i)}\right\Vert _2^2 = \left\Vert {\mathbf {y}_i}\right\Vert _2^2 - \left\Vert {\mathrm{Proj}_{S_{i-1}}(\mathbf {y}_i)}\right\Vert _2^2 \ge 1 - \left(1 - \lambda \right)^2 \ge \lambda . \end{align*}
Therefore,
\begin{equation*} \prod _{i \in [n_0]} \sigma _i = \det (\mathbf {C}) \ge \lambda ^{n_0/2} \end{equation*}
where in the last step we use the fact that the determinant of a triangular matrix is the product of its diagonal entries. Next, we consider the singular value decomposition of \(\mathbf {C}\): let \(\mathbf {U}\in \mathbb {R}^{n_0\times n_0}, \mathbf {V}\in \mathbb {R}^{n_0\times n_0}\) be orthogonal matrices and \(\mathbf {\Sigma }\in \mathbb {R}^{n_0\times n_0}\) be the diagonal matrix such that \(\mathbf {C} = \mathbf {U} \mathbf {\Sigma } \mathbf {V}^{\top }\). For the remainder of the proof assume without loss of generality that the columns of \(\mathbf {Y}\) and \(\mathbf {B}\) are ordered such that \(\Sigma _{ii} = \sigma _i\). We will upper bound the singular values of \(\mathbf {C}\) as follows: observe that since \(\left\Vert {\mathbf {y}_i}\right\Vert _2 = 1\), for any vector \(\mathbf {w}\) such that \(\left\Vert {\mathbf {w}}\right\Vert _2 \le 1\),
\begin{equation*} \left\Vert {\mathbf {Y}^{\top } \mathbf {w}}\right\Vert _2 \le \sqrt {d}, \qquad \textrm { and, therefore, } \qquad \left\Vert {\mathbf {C}^{\top } \mathbf {B}^{\top } \mathbf {w}}\right\Vert _2 \le \sqrt {d}. \end{equation*}
In particular, consider \(\mathbf {w} = \mathbf {B} \mathbf {u}_i\) where \(\mathbf {u}_i\) denotes the \({i}^{\textrm {th}}\) column of \(\mathbf {U}\). Since \(\mathbf {B}^{\top } \mathbf {B} = \mathbf {I}\) and \(\mathbf {U}, \mathbf {V}\) are orthogonal we have
\begin{align*} \left\Vert {\mathbf {C}^{\top } \mathbf {B}^{\top } \mathbf {w} }\right\Vert _2 = \left\Vert {\mathbf {V} \mathbf {\Sigma } \mathbf {e}_i}\right\Vert _2 = \sigma _i \le \sqrt {d}. \end{align*}
Note that for \(m \in [n_0]\), if \(i \gt n_0- m\) then \(\sigma _i \le \sigma _{n_0- m}\) and, therefore,
\begin{align*} \lambda ^{n_0/2} \le \prod _{i \in [n_0]} \sigma _i \le \sigma _{n_0- m}^m \prod _{i \in [n_0-m]} \sigma _i \le \sigma _{n_0- m}^m \left(\sqrt {d} \right)^{n_0- m } \Rightarrow \sigma _{n_0- m} \ge \left(\lambda ^{n_0} d^{(m-n_0)} \right)^{\frac{1}{2m}}. \end{align*}
In particular, taking \(m = \lceil n_0/2 \rceil\), we have \(n_0 \le 2m\) and \(n_0- m \le m\), so (using \(\lambda \le 1\)) \(\sigma _{n_0- m} \ge \lambda /\sqrt {d}\). Therefore, for any \(i \le n_0- m = \lfloor \frac{n_0}{2} \rfloor = n_1\),
\begin{equation*} \sigma _i \ge \lambda /\sqrt {d}. \end{equation*}
Recall that \(\mathbf {u}_i\) denotes the \({i}^{\textrm {th}}\) column of \(\mathbf {U}\) and let \(\mathbf {v}_i\) denote the \({i}^{\textrm {th}}\) column of \(\mathbf {V}\). Observe that
\begin{equation*} \mathbf {U} \mathbf {\Sigma } = \mathbf {B}^{\top } \mathbf {Y} \mathbf {V} \qquad \Rightarrow \qquad \mathbf {u}_i = \frac{1}{\sigma _i} \mathbf {B}^{\top } \mathbf {Y} \mathbf {v}_i. \end{equation*}
Extend the basis \(\lbrace \mathbf {b}_1, \dots , \mathbf {b}_{n_0} \rbrace\) to \(\mathbb {R}^d\) and let \(\tilde{\mathbf {B}} = (\mathbf {B}, \mathbf {b}_{n_0+ 1}, \dots , \mathbf {b}_{d})\) be the \(d \times d\) matrix corresponding to this orthonormal basis. Note that if \(j \gt n_0\), then for any \(\mathbf {y}_i\), \(\mathbf {b}_j^{\top } \mathbf {y}_i = 0\). Define
\begin{equation*} \tilde{\mathbf {u}}_i := \frac{1}{\sigma _i} \tilde{\mathbf {B}}^{\top } \mathbf {Y} \mathbf {v}_i \in \mathbb {R}^d. \end{equation*}
For \(i \in [n_1]\) define \(\mathbf {m}_i = \tilde{\mathbf {B}} \tilde{\mathbf {u}}_i\) and let \(\mathbf {M} = (\mathbf {m}_1, \dots , \mathbf {m}_{n_1})\). Note that \(\lbrace \mathbf {m}_1, \dots , \mathbf {m}_{n_1} \rbrace\) is an orthonormal set: since \(\tilde{\mathbf {B}} \tilde{\mathbf {B}}^{\top } = \mathbf {I}_d\), we have \(\mathbf {m}_i = \frac{1}{\sigma _i} \tilde{\mathbf {B}} \tilde{\mathbf {B}}^{\top } \mathbf {Y} \mathbf {v}_i = \frac{1}{\sigma _i} \mathbf {Y} \mathbf {v}_i = \frac{1}{\sigma _i} \mathbf {B} \mathbf {U} \mathbf {\Sigma } \mathbf {V}^{\top } \mathbf {v}_i = \mathbf {B} \mathbf {u}_i\), and the vectors \(\mathbf {B} \mathbf {u}_i\) are orthonormal because \(\mathbf {B}\) has orthonormal columns and \(\mathbf {U}\) is orthogonal. The result then follows since for any \(\mathbf {a} \in \mathbb {R}^d\),
\begin{equation*} \left| \mathbf {m}_i^{\top } \mathbf {a} \right| = \left| \tilde{\mathbf {u}}_i^{\top } \tilde{\mathbf {B}}^{\top } \mathbf {a} \right| = \left| \frac{1}{\sigma _i} \mathbf {v}_i^{\top } \mathbf {Y}^{\top } \tilde{\mathbf {B}} \tilde{\mathbf {B}}^{\top } \mathbf {a} \right| = \frac{1}{\sigma _i} \left| \mathbf {v}_i^{\top } \mathbf {Y}^{\top } \mathbf {a} \right| \le \frac{1}{\sigma _i} \left\Vert {\mathbf {v}_i}\right\Vert _1 \left\Vert {\mathbf {Y}^{\top } \mathbf {a}}\right\Vert _\infty \le \frac{d}{\lambda } \left\Vert {\mathbf {Y}^{\top } \mathbf {a}}\right\Vert _\infty . \end{equation*}
 □

A.3 From Constrained to Unconstrained Lower Bounds

Here we provide a generic, black-box reduction from approximate Lipschitz convex optimization over a unit ball to unconstrained approximate Lipschitz convex optimization. The reduction leverages a natural extension of functions on the unit ball to \(\mathbb {R}^{d}\), provided and analyzed in the following lemma:
Lemma A.8.
Let \(f:B_{2}^{d}\rightarrow \mathbb {R}\) be a convex, L-Lipschitz function and let \(g_{r,L}^{f}:\mathbb {R}^{d}\rightarrow \mathbb {R}\) and \(h_{r,L}^{f}:\mathbb {R}^{d}\rightarrow \mathbb {R}\) be defined for all \(\mathbf {x}\in \mathbb {R}^{d}\) and some \(r\in (0,1)\) as
\begin{equation*} g_{r,L}^{f}(\mathbf {x}):= {\left\lbrace \begin{array}{ll}\max \lbrace f(\mathbf {x}),h_{r,L}^{f}(\mathbf {x})\rbrace & \text{for }\left\Vert {\mathbf {x}}\right\Vert _{2}\lt 1\\ h_{r,L}^{f}(\mathbf {x}) & \text{for }\left\Vert {\mathbf {x}}\right\Vert _{2}\ge 1 \end{array}\right.}\text{ and }h_{r,L}^{f}(\mathbf {x}):= f(\mathbf {0})+\frac{L}{1-r}\left((1+r)\left\Vert {\mathbf {x}}\right\Vert _2-2r\right)\,. \end{equation*}
Then, \(g_{r,L}^{f}(\mathbf {x})\) is a convex, \(L(\frac{1+r}{1-r})\)-Lipschitz function with \(g_{r,L}^{f}(\mathbf {x})=f(\mathbf {x})\) for all \(\mathbf {x}\) with \(\left\Vert {\mathbf {x}}\right\Vert _{2}\le r\).
Proof.
For all \(\mathbf {x}\in B_{2}^{d}\) note that
\begin{equation} h_{r,L}^{f}(\mathbf {x})-f(\mathbf {x})=f(\mathbf {0})-f(\mathbf {x})+\frac{L}{1-r}\left(\left(1+r\right)\left\Vert {\mathbf {x}}\right\Vert _2-2r\right). \end{equation}
(A.2)
Now \(|f(\mathbf {x})-f(\mathbf {0})|\le L\left\Vert {\mathbf {x}}\right\Vert _2\) since f is L-Lipschitz. Correspondingly, when \(\left\Vert {\mathbf {x}}\right\Vert _2=1\) this implies
\begin{align*} h_{r,L}^{f}(\mathbf {x})-f(\mathbf {x}) & \ge -L+\frac{L}{1-r}\left((1+r)-2r\right)=0\,. \end{align*}
Consequently, \(h_{r,L}^{f}(\mathbf {x})\ge f(\mathbf {x})\) when \(\left\Vert {\mathbf {x}}\right\Vert _2=1\) and \(g_{r,L}^{f}(\mathbf {x})=h_{r,L}^{f}(\mathbf {x})=\max \lbrace f(\mathbf {x}),h_{r,L}^{f}(\mathbf {x})\rbrace\). Furthermore, since \(h_{r,L}^{f}\) is \(L(\frac{1+r}{1-r})\)-Lipschitz and convex and f is L-Lipschitz and convex, this implies that \(g_{r,L}^{f}\) is \(L(\frac{1+r}{1-r})\)-Lipschitz and convex as well. Finally, if \(\left\Vert {\mathbf {x}}\right\Vert _{2}\le r\) then again by (A.2) and that \(|f(\mathbf {x})-f(\mathbf {0})|\le L\left\Vert {\mathbf {x}}\right\Vert _2\) we have
\begin{align*} h_{r,L}^{f}(\mathbf {x})-f(\mathbf {x}) \le L\left\Vert {\mathbf {x}}\right\Vert _2+\frac{L}{1-r}\left((1+r)\left\Vert {\mathbf {x}}\right\Vert _2-2r\right) =\frac{2L}{1-r}\left(\left\Vert {\mathbf {x}}\right\Vert _2-r\right)\le 0. \end{align*}
Therefore, \(g_{r,L}^{f}(\mathbf {x})=f(\mathbf {x})\) for all \(\mathbf {x}\) with \(\left\Vert {\mathbf {x}}\right\Vert _{2}\le r\). □
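As a concrete illustration of this extension (for intuition only), take \(f(\mathbf {x}) = \left\Vert {\mathbf {x}}\right\Vert _2\) on \(B_2^d\), so that \(L = 1\) and \(f(\mathbf {0}) = 0\), and take \(r = 1/2\). Then
\begin{equation*} h_{1/2,1}^{f}(\mathbf {x}) = 3\left\Vert {\mathbf {x}}\right\Vert _2 - 2 \qquad \textrm {and} \qquad g_{1/2,1}^{f}(\mathbf {x})= {\left\lbrace \begin{array}{ll}\left\Vert {\mathbf {x}}\right\Vert _2 & \text{for }\left\Vert {\mathbf {x}}\right\Vert _{2}\lt 1\\ 3\left\Vert {\mathbf {x}}\right\Vert _2 - 2 & \text{for }\left\Vert {\mathbf {x}}\right\Vert _{2}\ge 1 \end{array}\right.} \end{equation*}
since \(3\left\Vert {\mathbf {x}}\right\Vert _2 - 2 \le \left\Vert {\mathbf {x}}\right\Vert _2\) on the unit ball. In this instance the extension agrees with f on the entire ball, is \(3 = L(\frac{1+r}{1-r})\)-Lipschitz, and grows linearly outside the ball.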
Leveraging this extension, we obtain the following result:
Lemma A.9.
Suppose any M-memory-constrained randomized algorithm must make (with high probability in the worst case) at least \(\mathcal {T}_{\epsilon }\) queries to a first-order oracle for convex, L-Lipschitz \(f:B_{2}^{d}\rightarrow \mathbb {R}\) in order to compute an \(\epsilon\)-optimal point for some \(\epsilon \lt L\).4 Then any M-memory-constrained randomized algorithm must make at least \(\mathcal {T}_{\epsilon }/2\) queries to a first-order oracle (with high probability in the worst case) to compute an \(\epsilon /2\)-optimal point for a \(\mathcal {O}(L^{2}/\epsilon)\)-Lipschitz, convex \(g:\mathbb {R}^{d}\rightarrow \mathbb {R}\) even though the minimizer of g is guaranteed to lie in \(B_{2}^{d}\).
Proof.
Let \(\mathbf {x}^\star\) be a minimizer of f. By convexity, for all \(\alpha \in [0,1]\) we have that
\begin{align*} f(\alpha \mathbf {x}^\star)-f(\mathbf {x}^\star) & =f(\alpha \mathbf {x}^\star +(1-\alpha)\mathbf {0})-f(\mathbf {x}^\star)\\ & \le \alpha \left[f(\mathbf {x}^\star)-f(\mathbf {x}^\star)\right]+(1-\alpha)\left[f(\mathbf {0})-f(\mathbf {x}^\star)\right]\\ & \le (1-\alpha)L\left\Vert {\mathbf {0}-\mathbf {x}^\star }\right\Vert _{2}\le (1-\alpha)L\,. \end{align*}
Consequently, for all \(\delta \gt 0\) and \(\alpha _{\delta }:= 1-\delta /L\) we have that \(f(\alpha _{\delta } \mathbf {x}^\star)\) is \(\delta\)-optimal.
Correspondingly, consider the function \(g:= g_{\alpha _{\epsilon /2},L}^{f}\) as defined in Lemma A.8. Note that g is \(L(\frac{1+\alpha _{\epsilon /2}}{1-\alpha _{\epsilon /2}})=\mathcal {O}(L^{2}/\epsilon)\)-Lipschitz. The minimizer of g also lies in \(B_2^d\) since g monotonically increases for \(\left\Vert {\mathbf {x}}\right\Vert _{2}\ge 1\). Suppose there is an algorithm that computes an \(\epsilon /2\)-optimal minimizer of every \(\mathcal {O}(L^{2}/\epsilon)\)-Lipschitz, convex function \(g:\mathbb {R}^d \rightarrow \mathbb {R}\) whose minimizer lies in \(B_2^d\), with high probability, using \(\mathcal {T}\) queries to a first-order oracle for g. This algorithm can be implemented instead with \(2\mathcal {T}\) queries to f by simply querying \(f(\mathbf {0})\) along with any query, and then using the definition of g from Lemma A.8 (it is also possible to compute and store \(f(\mathbf {0})\) once in the beginning and hence only require \(\mathcal {T}+1\) queries, but this would require us to also analyze the number of bits of precision to which \(f(\mathbf {0})\) should be stored, which this proof avoids for simplicity). If \(\mathbf {x}_{\mathrm{out}}\) is the output of the algorithm then
\begin{align*} \frac{\epsilon }{2} & \ge g(\mathbf {x}_{\mathrm{out}})-\inf _{x}g(\mathbf {x})\ge g(\mathbf {x}_{\mathrm{out}})-g(\alpha _{\epsilon /2}\mathbf {x}^\star)\ge g(\mathbf {x}_{\mathrm{out}})-f(\mathbf {x}^\star)-\frac{\epsilon }{2}\,. \end{align*}
Thus, \(g(\mathbf {x}_{\mathrm{out}})\le f(\mathbf {x}^\star)+\epsilon\). Let \(\tilde{\mathbf {x}}_{\mathrm{out}}=\min \lbrace 1,\left\Vert {\mathbf {x}_{\mathrm{out}}}\right\Vert _{2}^{-1}\rbrace \mathbf {x}_{\mathrm{out}}\). As g monotonically increases for \(\left\Vert {\mathbf {x}}\right\Vert _{2}\ge 1\) and \(g \ge f\) on \(B_2^d\), we have that \(g(\mathbf {x}_{\mathrm{out}})\ge g(\tilde{\mathbf {x}}_{\mathrm{out}})\ge f(\tilde{\mathbf {x}}_{\mathrm{out}})\) and thus, with \(2\mathcal {T}\) queries to f the algorithm can compute an \(\epsilon\)-approximate minimizer of f. Consequently, \(\mathcal {T}\ge \mathcal {T}_{\epsilon }/2\) as desired and the result follows. □
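For concreteness, here is one way the simulation described in the proof can answer a first-order query at a point \(\mathbf {x}\) (a sketch using the standard subgradient rule for a pointwise maximum of convex functions, with \(r = \alpha _{\epsilon /2}\)). From the query \(f(\mathbf {0})\) one can evaluate
\begin{equation*} h(\mathbf {x}) = f(\mathbf {0})+\frac{L}{1-r}\left((1+r)\left\Vert {\mathbf {x}}\right\Vert _2-2r\right) \qquad \textrm {and} \qquad \nabla h(\mathbf {x}) = \frac{L(1+r)}{1-r} \cdot \frac{\mathbf {x}}{\left\Vert {\mathbf {x}}\right\Vert _2} \quad (\textrm {for } \mathbf {x}\ne \mathbf {0}), \end{equation*}
and return \((h(\mathbf {x}), \nabla h(\mathbf {x}))\) whenever \(\left\Vert {\mathbf {x}}\right\Vert _2 \ge 1\) or \(h(\mathbf {x}) \ge f(\mathbf {x})\), and otherwise return the pair \((f(\mathbf {x}), \nabla f(\mathbf {x}))\) obtained from the oracle for f; in either case the returned pair is a valid value and subgradient for \(g = g_{r,L}^{f}\).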
In this section, we focused on the unit \(\ell _2\) norm ball; however, the analysis generalizes to arbitrary norms.

References

[1]
Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. 2018. The convergence of sparsified gradient methods. Advances in Neural Information Processing Systems 31 (2018).
[2]
Noga Alon, Yossi Matias, and Mario Szegedy. 1999. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences 58, 1 (1999), 137–147.
[3]
Kurt M. Anstreicher. 2000. The volumetric barrier for semidefinite programming. Mathematics of Operations Research 25, 3 (2000), 365–380.
[4]
Yossi Arjevani and Ohad Shamir. 2015. Communication complexity of distributed convex learning and optimization. Advances in Neural Information Processing Systems 28 (2015).
[5]
David S. Atkinson and Pravin M. Vaidya. 1995. A cutting plane algorithm for convex programming that uses analytic centers. Mathematical Programming 69, 1 (1995), 1–43.
[6]
Maria Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. 2012. Distributed learning, communication complexity and privacy. In Proceedings of the Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 26–1.
[7]
Eric Balkanski and Yaron Singer. 2018. Parallelization does not accelerate convex optimization: Adaptivity lower bounds for non-smooth convex minimization. arXiv:1808.03880. Retrieved from https://arxiv.org/abs/1808.03880
[8]
Ziv Bar-Yossef, Thathachar S. Jayram, Ravi Kumar, and D. Sivakumar. 2004. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences 68, 4 (2004), 702–732.
[9]
Paul Beame, Shayan Oveis Gharan, and Xin Yang. 2018. Time-space tradeoffs for learning finite functions from random evaluations, with applications to polynomials. In Proceedings of the Conference On Learning Theory. PMLR, 843–856.
[10]
Dimitris Bertsimas and Santosh Vempala. 2004. Solving convex programs by random walks. Journal of the ACM 51, 4 (2004), 540–556.
[11]
Moise Blanchard. 2024. Gradient descent is pareto-optimal in the oracle complexity and memory tradeoff for feasibility problems. arXiv:2404.06720. Retrieved from https://arxiv.org/abs/2404.06720
[12]
Moise Blanchard, Junhui Zhang, and Patrick Jaillet. 2023. Memory-constrained algorithms for convex optimization via recursive cutting-planes. Advances in Neural Information Processing Systems (2023).
[13]
Moïse Blanchard, Junhui Zhang, and Patrick Jaillet. 2023. Quadratic memory is necessary for optimal query complexity in convex optimization: Center-of-mass is pareto-optimal. arXiv:2302.04963. Retrieved from https://arxiv.org/abs/2302.04963
[14]
Gábor Braun, Cristóbal Guzmán, and Sebastian Pokutta. 2017. Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Transactions on Information Theory 63, 7 (2017), 4709–4724.
[15]
Mark Braverman, Ankit Garg, Tengyu Ma, Huy L. Nguyen, and David P. Woodruff. 2016. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing. 1011–1020.
[16]
Mark Braverman, Elad Hazan, Max Simchowitz, and Blake Woodworth. 2020. The gradient complexity of linear regression. In Proceedings of the Conference on Learning Theory. PMLR, 627–647.
[17]
Sébastien Bubeck. 2014. Convex optimization: Algorithms and complexity. arXiv:1405.4980. Retrieved from https://arxiv.org/abs/1405.4980
[18]
Sébastien Bubeck, Qijia Jiang, Yin-Tat Lee, Yuanzhi Li, and Aaron Sidford. 2019. Complexity of highly parallel non-smooth convex optimization. Advances in Neural Information Processing Systems 32 (2019).
[19]
Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. 2021. Thinking inside the ball: Near-optimal minimization of the maximal loss. In Proceedings of the Conference on Learning Theory. PMLR, 866–882.
[20]
Xi Chen and Binghui Peng. 2023. Memory-query tradeoffs for randomized convex optimization. arXiv:2306.12534. Retrieved from https://arxiv.org/abs/2306.12534
[21]
Kenneth L. Clarkson and David P. Woodruff. 2009. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing. 205–214.
[22]
T. M. Cover and J. A. Thomas. 1991. Elements of Information Theory. Wiley.
[23]
Yuval Dagan, Gil Kur, and Ohad Shamir. 2019. Space lower bounds for linear prediction in the streaming model. In Proceedings of the Conference on Learning Theory. PMLR, 929–954.
[24]
Yuval Dagan and Ohad Shamir. 2018. Detecting correlations with little memory and communication. In Proceedings of the Conference On Learning Theory. PMLR, 1145–1198.
[25]
Jelena Diakonikolas and Cristóbal Guzmán. 2019. Lower bounds for parallel and randomized convex optimization. In Proceedings of the Conference on Learning Theory. PMLR, 1132–1157.
[26]
John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. 2013. Local privacy and statistical minimax rates. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 429–438.
[27]
R. Fletcher and C. M. Reeves. 1964. Function minimization by conjugate gradients. The Computer Journal 7, 2 (1964).
[28]
Ankit Garg, Tengyu Ma, and Huy Nguyen. 2014. On communication cost of distributed statistical estimation and dimensionality. Advances in Neural Information Processing Systems 27 (2014).
[29]
Sumegha Garg, Ran Raz, and Avishay Tal. 2018. Extractor-based time-space lower bounds for learning. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. 990–1002.
[30]
Martin Grötschel, László Lovász, and Alexander Schrijver. 2012. Geometric Algorithms and Combinatorial Optimization. Vol. 2. Springer Science and Business Media.
[31]
William W. Hager and Hongchao Zhang. 2006. A survey of nonlinear conjugate gradient methods. Pacific Journal of Optimization 2, 1 (2006).
[32]
David Lee Hanson and Farroll Tim Wright. 1971. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics 42, 3 (1971), 1079–1083.
[33]
M.R. Hestenes and E. Stiefel. 1952. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49, 6 (1952).
[34]
Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. 2014. Communication-efficient distributed dual coordinate ascent. Advances in Neural Information Processing Systems 27 (2014).
[35]
Albert Xin Jiang and Kevin Leyton-Brown. 2011. Polynomial-time computation of exact correlated equilibrium in compact games. In Proceedings of the 12th ACM Conference on Electronic Commerce. 119–126.
[36]
Haotian Jiang. 2021. Minimizing convex functions with integral minimizers. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 976–985.
[37]
Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. 2020. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. 944–953.
[38]
Michael I. Jordan, Jason D. Lee, and Yun Yang. 2018. Communication-efficient distributed statistical inference. Journal of the American Statistical Association (2018).
[39]
Leonid G. Khachiyan, Sergei Pavlovich Tarasov, and I. I. Erlikh. 1988. The method of inscribed ellipsoids. Soviet Mathematics - Doklady 37, 1 (1988), 226–230.
[40]
Gillat Kol, Ran Raz, and Avishay Tal. 2017. Time-space hardness of learning sparse parities. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. 1067–1080.
[41]
Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. 2015. A faster cutting plane method and its implications for combinatorial and convex optimization. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 1049–1065.
[42]
Anatoly Yur’evich Levin. 1965. An algorithm for minimizing convex functions. In Proceedings of the Doklady Akademii Nauk, Vol. 160. Russian Academy of Sciences, 1244–1247.
[43]
Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 1-3 (1989).
[44]
S. Thomas McCormick. 2005. Submodular function minimization. Handbooks in Operations Research and Management Science 12 (2005), 321–391.
[45]
Dana Moshkovitz and Michal Moshkovitz. 2017. Mixing implies lower bounds for space bounded learning. In Proceedings of the Conference on Learning Theory. PMLR, 1516–1566.
[46]
Dana Moshkovitz and Michal Moshkovitz. 2018. Entropy samplers and strong generic lower bounds for space bounded learning. In Proceedings of the 9th Innovations in Theoretical Computer Science Conference. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[47]
Arkadi Nemirovski. 1994. On parallel complexity of nonsmooth convex optimization. Journal of Complexity 10, 4 (1994), 451–463.
[48]
Arkadij Nemirovski and David Yudin. 1983. Problem complexity and method efficiency in optimization. (1983).
[49]
Ju E. Nesterov. 1989. Self-concordant functions and polynomial-time methods in convex programming. Report, Central Economic and Mathematic Institute, USSR Acad. Sci (1989).
[50]
Yurii Nesterov. 2003. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87. Springer Science and Business Media.
[51]
Jorge Nocedal. 1980. Updating quasi-newton matrices with limited storage. Mathematics of Computation 35, 151 (1980).
[52]
Christos H. Papadimitriou and Tim Roughgarden. 2008. Computing correlated equilibria in multi-player games. Journal of the ACM 55, 3 (2008), 1–29.
[53]
Mert Pilanci and Martin J. Wainwright. 2017. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization 27, 1 (2017), 205–245.
[54]
Ran Raz. 2017. A time-space lower bound for a large class of learning problems. In Proceedings of the 2017 IEEE 58th Annual Symposium on Foundations of Computer Science. IEEE, 732–742.
[55]
Ran Raz. 2018. Fast learning requires good memory: A time-space lower bound for parity learning. Journal of the ACM 66, 1 (2018), 1–18.
[56]
Farbod Roosta-Khorasani and Michael W. Mahoney. 2019. Sub-sampled newton methods. Mathematical Programming 174, 1 (2019), 293–326.
[57]
Mark Rudelson and Roman Vershynin. 2013. Hanson-wright inequality and sub-gaussian concentration. Electronic Communications in Probability 18 (2013), 1–9.
[58]
Ohad Shamir. 2014. Fundamental limits of online and distributed algorithms for statistical learning and estimation. Advances in Neural Information Processing Systems 27 (2014).
[59]
Ohad Shamir, Nati Srebro, and Tong Zhang. 2014. Communication-efficient distributed optimization using an approximate newton-type method. In Proceedings of the International Conference on Machine Learning. PMLR, 1000–1008.
[60]
Vatsal Sharan, Aaron Sidford, and Gregory Valiant. 2019. Memory-sample tradeoffs for linear regression with small error. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. 890–901.
[61]
Naum Z. Shor. 1977. Cut-off method with space extension in convex programming problems. Cybernetics 13, 1 (1977), 94–96.
[62]
Max Simchowitz. 2018. On the randomized complexity of minimizing a convex quadratic function. arXiv:1807.09386. Retrieved from https://arxiv.org/abs/1807.09386
[63]
Max Simchowitz, Ahmed El Alaoui, and Benjamin Recht. 2018. Tight query complexity lower bounds for PCA via finite sample deformed wigner law. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. 1249–1259.
[64]
Kartik Sivaramakrishnan and John Mitchell. 2012. Properties of a cutting plane method for semidefinite programming. Pacific Journal of Optimization 8 (2012), 779–802.
[65]
Jacob Steinhardt and John Duchi. 2015. Minimax rates for memory-bounded sparse linear regression. In Proceedings of the Conference on Learning Theory. PMLR, 1564–1587.
[66]
Jacob Steinhardt, Gregory Valiant, and Stefan Wager. 2016. Memory, communication, and statistical queries. In Proceedings of the Conference on Learning Theory. PMLR, 1490–1516.
[67]
Xiaoming Sun, David P. Woodruff, Guang Yang, and Jialin Zhang. 2021. Querying a matrix through matrix-vector products. ACM Transactions on Algorithms 17, 4 (2021), 1–19.
[68]
Pravin M. Vaidya. 1989. A new algorithm for minimizing convex functions over convex sets. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 338–343.
[69]
Roman Vershynin. 2018. High-dimensional Probability: An Introduction with Applications in Data Science. Vol. 47. Cambridge University Press.
[70]
Martin J. Wainwright. 2019. High-dimensional Statistics: A Non-asymptotic Viewpoint. Vol. 48. Cambridge University Press.
[71]
Blake Woodworth and Nathan Srebro. 2017. Lower bound for randomized first order convex optimization. arXiv:1709.03594. Retrieved from https://arxiv.org/abs/1709.03594
[72]
Blake Woodworth and Nathan Srebro. 2019. Open problem: The oracle complexity of convex optimization with limited memory. In Proceedings of the Conference on Learning Theory. PMLR, 3202–3210.
[73]
Blake E. Woodworth, Brian Bullins, Ohad Shamir, and Nathan Srebro. 2021. The min-max complexity of distributed stochastic convex optimization with intermittent communication. In Proceedings of the Conference on Learning Theory. PMLR, 4386–4437.
[74]
Blake E. Woodworth and Nati Srebro. 2016. Tight complexity bounds for optimizing composite objectives. Advances in Neural Information Processing Systems 29 (2016).
[75]
Peng Xu, Fred Roosta, and Michael W. Mahoney. 2020. Newton-type methods for non-convex optimization under inexact hessian information. Mathematical Programming 184, 1 (2020), 35–70.
[76]
Min Ye and Emmanuel Abbe. 2018. Communication-computation efficient gradient coding. In Proceedings of the International Conference on Machine Learning. PMLR, 5610–5619.
[77]
D. B. Yudin and Arkadi S. Nemirovskii. 1976. Informational complexity and efficient methods for the solution of convex extremal problems. Matekon 13, 2 (1976), 22–45.
[78]
Yuchen Zhang, John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. 2013. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Proceedings of the Conference on Neural Information Processing Systems. Citeseer, 2328–2336.
