We show that any memory-constrained, first-order algorithm which minimizes d-dimensional, 1-Lipschitz convex functions over the unit ball to 1/poly(d) accuracy using at most \(d^{1.25 - \delta }\) bits of memory must make at least \(\tilde{\Omega }(d^{1 + (4/3)\delta })\) first-order queries (for any constant \(\delta \in [0, 1/4]\)). Consequently, the performance of such memory-constrained algorithms is at least a polynomial factor worse than the optimal \(\tilde{\mathcal {O}}(d)\) query bound for this problem obtained by cutting plane methods that use \(\tilde{\mathcal {O}}(d^2)\) memory. This resolves one of the open problems in the COLT 2019 open problem publication of Woodworth and Srebro.
1 Introduction
Minimizing a convex objective function \(f: \mathbb {R}^d \rightarrow \mathbb {R}\) given access to a first-order oracle—that returns the function evaluation and (sub)gradient \((f(\mathbf {x}),\nabla f(\mathbf {x}))\) when queried for point \(\mathbf {x}\)—is a canonical problem and fundamental primitive in machine learning and optimization. There are methods that, given any 1-Lipschitz, convex \(f : \mathbb {R}^d \rightarrow \mathbb {R}\) accessible via a first-order oracle, compute an \(\epsilon\)-approximate minimizer over the unit ball with just \(\mathcal {O}(\min \lbrace \epsilon ^{-2}, d \log (1/\epsilon)\rbrace)\) queries to the oracle. This query complexity is worst-case optimal [48] and foundational in optimization theory. \(\mathcal {O}(\epsilon ^{-2})\) queries are achievable using subgradient descent; this is a simple, widely-used, eminently practical method that solves the problem using a total of \(\mathcal {O}(d \epsilon ^{-2})\) arithmetic operations on \(\mathcal {O}(\log (d/\epsilon))\)-bit numbers. On the other hand, building on the \(\mathcal {O}(d^2 \log (1/\epsilon))\) query complexity of the well-known ellipsoid method [61, 77], different cutting plane methods achieve a query complexity of \(\mathcal {O}(d \log (1/\epsilon))\), e.g., center of mass with sampling-based techniques [10, 42], volumetric center [5, 68], inscribed ellipsoid [39, 49]; these methods are perhaps less frequently used in practice and large-scale learning and all use at least \(\Omega (d^3 \log (1/\epsilon))\)-time, even with recent improvements [37, 41].1
Though state-of-the-art cutting plane methods have larger computational overhead and are sometimes regarded as impractical, for small enough \(\epsilon\), they give the best known query bounds. Furthermore, in different theoretical settings, e.g., semidefinite programming [3, 41, 64], submodular optimization [30, 36, 41, 44], and equilibrium computation [35, 52], cutting plane methods have yielded state-of-the-art runtimes at various points in time. This leads to a natural question: what is needed for a method to significantly outperform gradient descent and take advantage of the improved query complexity enjoyed by cutting plane methods? Can we design methods that obtain optimal query complexities while maintaining the practicality of gradient descent methods?
The COLT 2019 open problem [72] proposed tackling this question through the lens of memory: to what extent is large memory necessary to achieve improved query complexity? Any algorithm for the problem must use \(\Omega (d \log (1/\epsilon))\) memory irrespective of its query complexity since that much memory is needed to represent the answer [72]. Subgradient descent matches this lower bound and can be implemented with only \(\mathcal {O}(d \log (1/\epsilon))\) bits of memory [72]. However, all known methods that achieve a query complexity significantly better than gradient descent, e.g., cutting plane methods, use \(\Omega (d^2 \log (1/\epsilon))\) bits of memory. Understanding the tradeoffs in memory and query complexity could inform the design of future efficient optimization methods.
In this article, we show that memory does play a critical role in attaining an optimal query complexity for convex optimization. Our main result is the following theorem, which shows that any algorithm whose memory usage is sufficiently small (though potentially still superlinear) must make polynomially more queries to a first-order oracle than cutting plane methods. Specifically, any algorithm that uses significantly less than \(d^{1.25}\) bits of memory requires a polynomial factor more first-order queries than the optimal \(\mathcal {O}(d \log (d))\) queries achieved by quadratic-memory cutting plane methods. This establishes the first non-trivial tradeoff between memory and query complexity for convex optimization with access to a first-order oracle.
Fig. 1.
Beyond shedding light on the complexity of a fundamental memory-constrained optimization problem, we provide several tools for establishing such lower bounds. In particular, we introduce a set of properties which are sufficient for an optimization problem to exhibit a memory lower bound and provide an information-theoretic framework to prove these lower bounds. We hope these tools are an aid to future work on the role of memory in optimization.
This work fits within the broader context of understanding fundamental resource tradeoffs for optimization and learning. For many settings, establishing (unconditional) query/time or memory/time tradeoffs is notoriously hard—perhaps akin to P vs. NP (e.g., providing time lower bounds for cutting plane methods). Questions of memory/query and memory/data tradeoffs, however, have a more information-theoretic nature and hence seem more approachable. Together with the increasing importance of memory considerations in large-scale optimization and learning, there is a strong case for pinning down the landscape of such tradeoffs, which may offer a new perspective on the current suite of algorithms and inform the effort to develop new ones.
1.1 Technical Overview and Contributions
To prove Theorem 1.1, we provide an explicit distribution over functions that is hard for any memory-constrained randomized algorithm to optimize. Though the proof requires care and we introduce a variety of machinery to obtain it, this lower-bounding family of functions is simple to state. The function is a variant of the so-called “Nemirovski function,” which has been used to show lower bounds for highly parallel non-smooth convex optimization [7, 18, 47].
Formally, our difficult class of functions for memory size M is constructed as follows: for some \(\gamma \gt 0\) and \(N= \tilde{\mathcal {O}}((d^2/M)^{1/3})\), let \(\mathbf {v}_1,\ldots ,\mathbf {v}_N\) be unit vectors drawn i.i.d. from the d-dimensional scaled hypercube, \(\mathbf {v}_i \overset{\text{i.i.d.}}{\sim } \mathsf {Unif} (d^{-1/2} \mathcal {H}_d)\), and let \(\mathbf {a}_1,\ldots ,\mathbf {a}_{\lfloor d/2 \rfloor }\) be drawn i.i.d. from the hypercube, \(\mathbf {a}_j \sim \mathsf {Unif} (\mathcal {H}_d)\), where \(\alpha \mathcal {H}_d := \lbrace \pm \alpha \rbrace ^d\). Let \(\mathbf {A} = (\mathbf {a}_1, \dots , \mathbf {a}_{\lfloor d/2 \rfloor })^{\top }\) and
\begin{align*} F(\mathbf {x}) := \max \Big \lbrace \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho ,\; \max _{i \in [N]} \big (\mathbf {v}_i^{\top } \mathbf {x}- i \gamma \big) \Big \rbrace , \tag{1.1} \end{align*}
where \(\eta\) is a large scaling parameter and \(\rho\) a shift (the analysis takes \(\eta = d^5\) and \(\rho = 1\); see Section 2.1 and Theorem 4.12).
We describe the key components of this function and the intuition behind their role in Section 2.1. At a high level, we show that any algorithm that optimizes this function must, with high probability, observe each of the \(\mathbf {v}_1,\ldots ,\mathbf {v}_N\) and will observe \(\mathbf {v}_i\) before \(\mathbf {v}_{i+1}\) due to the \(i \gamma\) term. The intuition for the memory-query tradeoff is that after seeing \(\mathbf {v}_{i}\), in order to see \(\mathbf {v}_{i+1}\), the algorithm must query a vector near the nullspace of \(\mathbf {A}\) but with a significant component in the direction of \(\mathbf {v}_{i}\). Since \(\mathbf {v}_{i}\) is random, this requires finding a “new” vector in the nullspace of \(\mathbf {A}\). This implies that for some appropriately chosen k and a block of k Nemirovski vectors \(\lbrace \mathbf {v}_i,\mathbf {v}_{i+1},\ldots ,\mathbf {v}_{i+k-1}\rbrace\), to see \(\lbrace \mathbf {v}_{i+1},\mathbf {v}_{i+2},\ldots ,\mathbf {v}_{i+k}\rbrace\) after seeing \(\mathbf {v}_{i}\) the algorithm must query k sufficiently independent vectors in the null space of \(\mathbf {A}\). We show that an algorithm with insufficient memory to store \(\mathbf {A}\) cannot find such a large set of sufficiently independent vectors in the nullspace of \(\mathbf {A}\), and must essentially “re-learn” \(\mathbf {A}\) for every such block of k Nemirovski vectors.
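To make the construction concrete, here is a minimal numpy sketch (ours, not code from the paper) of the function in Equation (1.1) together with a first-order oracle for it. The parameter values are illustrative toys, and the subgradient tie-breaking rule is an assumption of the sketch; the paper's oracle is fixed formally in Definition 2.6 and Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, gamma = 128, 8, 0.05          # illustrative toy values, not the theorem's parameter choices
eta, rho = float(d) ** 5, 1.0       # the scaling/shift used later (Section 2.1, Theorem 4.12)

V = rng.choice([-1.0, 1.0], size=(N, d)) / np.sqrt(d)   # Nemirovski vectors v_1, ..., v_N
A = rng.choice([-1.0, 1.0], size=(d // 2, d))           # rows a_1, ..., a_{d/2} from the hypercube

def first_order_oracle(x):
    """Return (F(x), g) for the function in Equation (1.1), with g one subgradient.
    The tie-breaking (first maximizer, Nemirovski terms preferred) is our own choice."""
    nem = V @ x - gamma * np.arange(1, N + 1)            # v_i^T x - i*gamma
    wall = eta * np.max(np.abs(A @ x)) - rho             # eta*||Ax||_inf - rho
    value = max(float(nem.max()), wall)
    if nem.max() >= wall:
        return value, V[int(np.argmax(nem))]
    j = int(np.argmax(np.abs(A @ x)))
    return value, eta * np.sign(A[j] @ x) * A[j]

val, g = first_order_oracle(np.zeros(d))                 # the zero query reveals v_1 first
print(val, np.allclose(g, V[0]))
```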
Rather than give a direct proof of Theorem 1.1 using the explicit function in Equation (1.1), we provide a more abstract framework which gives broader insight into which kinds of functions could lead to non-trivial memory-constrained lower bounds, and which might lead to tighter lower bounds in the future. To that end, we introduce the notion of a memory-sensitive class which delineates the key properties of a distribution over functions that lead to memory-constrained lower bounds. We show that for such functions, the problem of memory-constrained optimization is at least as hard as the following problem of finding a set of vectors which are approximately orthogonal to another set of vectors. The problem is motivated by the proof intuition discussed above, where we argued that minimizing Equation (1.1) requires finding a large number of sufficiently independent vectors in the null space of \(\mathbf {A}\).
Note that the Player can trivially win the game for \(M\ge {\Omega }(dk), m=0\) (by just storing a satisfactory set of k vectors in the \(\mathsf {Message}\)) and for \(M=0, m=d/2\) (by querying all rows of \(\mathbf {A}\)). We prove a lower bound showing that this is essentially all that is possible: for \(\mathbf {A}\) sampled uniformly at random from \(\left\lbrace \pm 1 \right\rbrace ^{d/2 \times d}\), if M is a constant factor smaller than dk, then the Player must make at least \(d/5\) queries to win with probability at least \(2/3\). Our analysis proceeds via an intuitive information-theoretic framework, which could have applications for showing query lower bounds for memory-constrained algorithms in other optimization and learning settings. We sketch the analysis in Section 3.2.
1.2 Related Work
Memory-sample tradeoffs for learning. There is a recent line of work to understand learning under information constraints such as limited memory or communication constraints [4, 6, 15, 23, 24, 26, 28, 58, 65, 66, 73, 78]. Most of these results obtain lower bounds for the regime when the available memory is less than that required to store a single datapoint (with the notable exception of [24] and [23]). However, the breakthrough article [54] showed an exponential lower bound on the number of random examples needed for learning parities with memory as large as quadratic. Subsequent work extended and refined this result to multiple learning problems over finite fields [9, 29, 40, 45, 46, 55].
Most related to our line of work is [60], which considers the continuous-valued learning/optimization problem of performing linear regression given access to randomly drawn examples from an isotropic Gaussian. They show that any sub-quadratic memory algorithm for the problem needs \(\Omega (d\log \log (1/\epsilon))\) samples to find an \(\epsilon\)-optimal solution for \(\epsilon \le 1/d^{\Omega (\log d)}\), whereas in this regime an algorithm with memory \(\tilde {\mathcal {O}}(d^2)\) can find an \(\epsilon\)-optimal solution with only d examples. Since each example provides an unbiased estimate of the expected regression loss, this translates to a lower bound for convex optimization given access to a stochastic gradient oracle. However, the upper bound of d examples does not come from a generic convex optimization algorithm or convergence rate, but from the fact that the underlying linear system can be solved to the required accuracy using d examples.
There is also significant work on memory lower bounds for streaming algorithms, e.g., [2, 8, 21, 23], where the setup is that the algorithm gets only a single pass over a data stream. The work of [23] mentioned earlier uses an Approximate Null-Vector Problem (ANVP) to show memory lower bounds, which shares some similarities with the Orthogonal Vector Game used in our proof (informally presented in Definition 1.2). In the ANVP, the objective is to find a single vector approximately orthogonal to a stream of vectors. The goal of the ANVP is similar to the goal of the Orthogonal Vector Game, which is to find a set of such vectors. However, the key difference is that the ANVP is in the (one-pass) streaming setting, whereas the Orthogonal Vector Game allows for stronger access to the input in two senses. First, in the Orthogonal Vector Game the algorithm sees the entire input and can store an M-bit \(\mathsf {Message}\) about the input, and second, the algorithm adaptively queries the input. The main challenge in our analysis (discussed later in Section 3.2) is to bound the power of these adaptively issued queries in conjunction with the \(\mathsf {Message}\).
Lower bounds for convex optimization. Starting with the early work of [48], there is extensive literature on lower bounds for convex optimization. Some of the key results in this area include classic lower bounds for finding approximate minimizers of Lipschitz functions with access to a subgradient oracle [14, 48, 50], including recent progress on lower bounds for randomized algorithms [16, 62, 63, 67, 71, 74]. There is also work on the effect of parallelism on these lower bounds [7, 18, 25, 47]. For a broader introduction to the oracle complexity of optimization, we refer the reader to surveys such as [50] and [17].
Memory-limited optimization algorithms. While the focus of this work is lower bounds, there is a long line of work on developing memory-efficient optimization algorithms, including methods that leverage second-order structure via first-order methods. Such methods include Limited-memory BFGS [43, 51], the conjugate gradient (CG) method for solving linear systems [33] and its various non-linear extensions [27, 31], and methods based on subsampling and sketching the Hessian [53, 56, 75]. A related line of work is on communication-efficient optimization algorithms for distributed settings (see [1, 34, 38, 59, 76] and references therein).
1.3 Subsequent and Future Work
Our work is a first step towards establishing optimal memory/query tradeoffs for optimization. An immediate open problem suggested by our work is that of improving our bounds: to match the quadratic memory upper bound of cutting plane methods, to establish tight bounds for all values of the desired accuracy \(\epsilon\), and to fully resolve the open problems of Woodworth and Srebro [72]. Since the initial publication of this work, there have been several advances in this direction.
First, recent work of Blanchard, Zhang, and Jaillet [13] built upon our proof framework to show that quadratic memory is indeed necessary for deterministic algorithms to obtain near-optimal query complexities. In our construction in Equation (1.1), we have \(N= \tilde{\mathcal {O}}((d^2/M)^{1/3})\) Nemirovski vectors \(\mathbf {v}_1,\ldots ,\mathbf {v}_N\); however, packing more Nemirovski vectors would lead to a stronger lower bound. We discuss this further in Remark 2.12. In the case of deterministic algorithms, [13] provide a clever adaptive strategy which packs \(N=\Omega (d)\) Nemirovski vectors and obtains a quadratic memory lower bound.
A second recent article, by Chen and Peng [20], established that quadratic memory is necessary to achieve the optimal convergence rate even for randomized algorithms, at least in the parameter regime where \(\epsilon\) is quasipolynomially small in d. Their proof leverages the same hard distribution of functions as we describe (Equation (1.1)), but introduces a variant of our Orthogonal Vector Game and develops an elegant analysis for it which allows them to show that a significant number of queries must be expended to observe each new Nemirovski vector \(\mathbf {v}_i\). In contrast, our proof blocks together groups of such vectors.
Another recent work of Blanchard [11] considers the problem of finding a feasible point in a given set with access to a separation oracle. They construct an oracle which requires the algorithm to find robustly linearly independent vectors which are orthogonal to a sequence of subspaces. They build upon our proof framework and develop a novel recursive lower bound technique which allows them to show that gradient descent is almost optimal for the feasibility problem among linear memory algorithms. Their result also implies an exponential lower bound for sub-quadratic memory algorithms to solve the feasibility problem to sufficiently small error.
On the upper bound side, [12] develops a family of recursive cutting-plane algorithms whose memory usage can vary between \(\tilde{O}(d)\) and \(\tilde{O}(d^2)\). Previously, algorithms with established query complexities used either \(\tilde{O}(d)\) or \(\tilde{\Omega }(d^2)\) memory. This work establishes new upper bounds on memory-constrained query complexity for any memory constraint that falls between linear and quadratic. In the regime when \(\epsilon \le d^{-\Omega (d)}\), their algorithm matches the memory usage of gradient descent while improving on gradient descent’s query complexity.
Even in light of these advances, multiple open problems remain. For example, relaxing the type of functions and oracles for which we can prove memory-constrained lower bounds is a natural further direction. Interestingly, our proofs (and those of [13, 20]) rely on the assumption that we are optimizing a Lipschitz but non-smooth function and that at a point queried we only observe the function’s value and a single subgradient (rather than the set of all subgradients or all information about the function in a neighborhood) and consequently standard smoothing techniques do not readily apply [19, 25]. Proving lower bounds for smooth functions and beyond could illuminate the necessity of larger memory footprints for prominent memory-intensive optimization methods with practical appeal (e.g., interior point methods and quasi-Newton methods). Additionally, finding algorithms with new tradeoffs between space usage and query complexity would be very interesting. In particular, improving the memory versus query tradeoffs for more structured functions, e.g., piecewise linear functions, is an interesting question for future work.
2 Setup and Overview of Results
We consider optimization methods for minimizing convex, Lipschitz-continuous functions \(F: \mathbb {R}^d \rightarrow \mathbb {R}\) over the unit ball with access to a first-order oracle. Our goal is to understand how the (oracle) query complexity of algorithms is affected by restrictions on the memory available to the algorithm. Our results apply to the rather general definition of memory-constrained first-order algorithms given in Definition 2.1 below. Such algorithms include those that use arbitrarily large memory at query time but can save at most M bits of information in between interactions with the first-order oracle. More formally we have:
Note that our definition of a memory-constrained algorithm is equivalent to the definition given by [72]. Our analysis also allows for randomized algorithms which will often be denoted as \(\mathcal {A}_{\textrm {rand}}\).
In what follows, we use the notation \(\mathcal {O}_F(\mathbf {x})\) to denote the first-order oracle which, when queried at some vector \(\mathbf {x}\), returns the pair \((F(\mathbf {x}), \mathbf {g}_{\mathbf {x}})\) where \(\mathbf {g}_{\mathbf {x}} \in \partial F(\mathbf {x})\) is a subgradient of F at \(\mathbf {x}\). We will also refer to a sub-gradient oracle and use the overloaded notation \(\mathbf {g}_F(\mathbf {x})\) to denote the oracle which simply returns \(\mathbf {g}_{\mathbf {x}}\). It will be useful to refer to the sequence of vectors \(\mathbf {x}\) queried by some algorithm \(\mathcal {A}\) paired with the subgradient \(\mathbf {g}\) returned by the oracle.
2.1 Proof Strategy
We describe a broad family of optimization problems which may be sensitive to memory constraints (also see Figure 2 for an overview). As suggested by the Orthogonal Vector Game (Definition 1.2), the primitive we leverage is that finding vectors orthogonal to the rows of a given matrix requires either large memory or many queries to observe the rows of the matrix. With that intuition in mind, let \(f : \mathbb {R}^d \rightarrow \mathbb {R}\) be a convex function, let \(\mathbf {A}\in \mathbb {R}^{n \times d}\), let \(\eta\) be a scaling parameter, and let \(\rho\) be a shift parameter; define \(F_{f, \mathbf {A}, \eta , \rho }(\mathbf {x})\) as the maximum of \(f(\mathbf {x})\) and \(\eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\):
\begin{align*} F_{f, \mathbf {A}, \eta , \rho }(\mathbf {x}) := \max \left\lbrace f(\mathbf {x}),\; \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho \right\rbrace . \tag{2.1} \end{align*}
We often drop the dependence of \(\eta\) and \(\rho\) and write \(F_{f, \mathbf {A}, \eta , \rho }\) simply as \(F_{f, \mathbf {A}}\). Intuitively, for large enough scaling \(\eta\) and appropriate shift \(\rho\), minimizing the function \(F_{f, \mathbf {A}}(\mathbf {x})\) requires minimizing \(f(\mathbf {x})\) close to the null space of the matrix \(\mathbf {A}\). Any algorithm which uses memory \(\Omega (nd)\) can learn and store \(\mathbf {A}\) in \(\mathcal {O}(d)\) queries so that all future queries are sufficiently orthogonal to \(\mathbf {A}\); thus this memory rich algorithm can achieve the information-theoretic lower bound for minimizing \(f(\mathbf {x})\) roughly constrained to the nullspace of \(\mathbf {A}\).
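The following sketch (under our own simplifications) illustrates the quadratic-memory strategy described above: once \(\mathbf {A}\) is stored, projecting every later query onto the null space of \(\mathbf {A}\) keeps the \(\eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty\) term from interfering.

```python
import numpy as np

def null_space_projector(A, tol=1e-10):
    """Orthogonal projector onto null(A); storing it takes Theta(d^2) memory."""
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    Z = Vt[rank:].T                     # columns form an orthonormal basis of null(A)
    return Z @ Z.T

rng = np.random.default_rng(1)
d = 200
A = rng.choice([-1.0, 1.0], size=(d // 2, d))
P = null_space_projector(A)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
print(np.max(np.abs(A @ (P @ x))))      # ~1e-13: projected queries make ||Ax||_inf negligible
```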
Fig. 2.
However, if \(\mathbf {A}\) is a random matrix with sufficiently large entropy, then \(\mathbf {A}\) cannot be compressed to fewer than \(\Omega (n d)\) bits. Thus, for \(n = \Omega (d)\), an algorithm which uses only \(o(d^{2-\delta })\) bits of memory for some constant \(0 \lt \delta \le 1\) cannot remember all the information about \(\mathbf {A}\). Suppose the function f is such that, in order to continue to observe new information about the function, it is insufficient to submit queries that belong to some low-dimensional subspace of the null space of \(\mathbf {A}\). Then a memory-constrained algorithm must relearn enough information about \(\mathbf {A}\) in order to find new vectors in the null space of \(\mathbf {A}\) and make queries which return new information about f.
To show our lower bound, we set f in (2.1) to be the Nemirovski function:
\begin{align*} f(\mathbf {x}) := \max _{i \in [N]} \left(\mathbf {v}_i^{\top } \mathbf {x}- i \gamma \right), \end{align*}
where the vectors \(\mathbf {v}_1,\ldots ,\mathbf {v}_N\) are unit vectors drawn i.i.d. from the d-dimensional scaled hypercube, \(\mathbf {v}_i \overset{\text{i.i.d.}}{\sim } \mathsf {Unif} (d^{-1/2} \mathcal {H}_d)\), and we refer to them as “Nemirovski vectors.” With this choice of f, the overall function \(F_{f, \mathbf {A}}(\mathbf {x})\) has certain “memory-sensitive” properties. In particular, to reveal new Nemirovski vectors an algorithm cannot make queries which lie in a low-dimensional subspace. In addition, because of the \(\left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty\) term in the definition of \(F_{f, \mathbf {A}}(\mathbf {x})\), queries which reveal new Nemirovski vectors must also be sufficiently orthogonal to \(\mathbf {A}\). Together, these properties imply that a sequence of queries which reveals a new set of Nemirovski vectors must also be winning queries for the Orthogonal Vector Game (Definition 1.2). This allows us to leverage our information-theoretic memory-query tradeoffs for the Orthogonal Vector Game. We show that the algorithm must reveal a sufficiently large number of Nemirovski vectors to optimize \(F_{f, \mathbf {A}}(\mathbf {x})\); therefore, we can repeatedly apply the lower bound for the Orthogonal Vector Game to show our final lower bound (also see Figure 2 for a visual overview of the approach).
2.2 Proof Components
To summarize Section 2.1, there are two critical features of \(F_{f,\mathbf {A}}\) which lead to a difficult optimization problem for memory-constrained algorithms. First, the distribution underlying random instances of \(\mathbf {A}\) must ensure that \(\mathbf {A}\) cannot be compressed. Second, the function f must be such that queries which belong to a low-dimensional subspace cannot return too many informative subgradients. We formalize these features by defining a memory-sensitive base and a memory-sensitive class. We show that if \(\mathbf {A}\) is drawn uniformly at random from a memory-sensitive base, and if f and a corresponding first-order oracle are drawn according to a memory-sensitive class, then the resulting distribution over \(F_{f, \mathbf {A}}(\mathbf {x})\) and its first-order oracle provides a difficult instance for memory-constrained algorithms to optimize.
Setting \(n = \lfloor d/2 \rfloor\), we use \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\) to mean \(\mathbf {A}= \left(\mathbf {a}_1, \dots , \mathbf {a}_n \right)^{\top }\) where each \(\mathbf {a}_i \sim \mathsf {Unif}(B_d)\). Theorem 2.5, which is proven in Section 4, shows that the sequence of hypercubes is a memory-sensitive base.
Next, we consider distributions over functions f paired with some subgradient oracle \(\mathbf {g}_f\). Given f and some \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\) we consider minimizing \(F_{f, \mathbf {A}}\) as in Equation (2.1) with access to an induced first-order oracle:
We also define informative subgradients, formalizing the intuition described at the beginning of this section.
We can now proceed to the definition of an \((L, N, k, \epsilon ^{\star })\)-memory-sensitive class.
Informally, the robust independence of informative queries condition ensures that any informative query is robustly linearly independent with respect to the previous k informative queries. The approximate orthogonality condition ensures that any query \(\mathbf {x}\) which reveals a gradient from f which is not \(\mathbf {v}_1\) is approximately orthogonal to the rows of \(\mathbf {A}\). Finally, the necessity of informative subgradients condition ensures that any algorithm which has queried only \(r \lt N\) informative subgradients from f has an optimality gap of at least \(\epsilon {^\star }\).
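The precise quantitative conditions are given in Definitions 2.7 and 2.8 (not reproduced above). Purely as an illustration of the kind of quantity involved, the hypothetical check below measures the component of a new query orthogonal to the span of the previous informative queries; the threshold is a placeholder, not the paper's \(\gamma\)-dependent bound.

```python
import numpy as np

def orthogonal_residual(x_new, prev_queries):
    """Norm of the component of x_new orthogonal to span(prev_queries)."""
    if len(prev_queries) == 0:
        return float(np.linalg.norm(x_new))
    Q, _ = np.linalg.qr(np.stack(prev_queries, axis=1))   # orthonormal basis of the span
    return float(np.linalg.norm(x_new - Q @ (Q.T @ x_new)))

def looks_robustly_independent(x_new, prev_queries, threshold):
    # Placeholder threshold; the paper's condition uses a specific gamma-dependent bound.
    return orthogonal_residual(x_new, prev_queries) >= threshold

rng = np.random.default_rng(2)
d = 50
prev = [rng.standard_normal(d) for _ in range(5)]
print(looks_robustly_independent(rng.standard_normal(d), prev, threshold=0.5))
```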
The following construction, discussed in the introduction, is a concrete instance of a \((d^6, N, k, 1/(20 \sqrt {N}))\)-memory-sensitive class. Theorem 2.10 proves that the construction is memory-sensitive, and the proof appears in Section 4.
Note that choosing \(k= \lceil 60 M/ (c_{\mathcal {H}} d) \rceil\) gives a \((d^6, N, \lceil 60 M/(c_{\mathcal {H}}d) \rceil , 1/(20 \sqrt {N}))\)-memory-sensitive class with \(N= \frac{1}{205} (\frac{ c_{\mathcal {H}} d^2}{M\log d })^{1/3} \approx (d^2/M)^{1/3}\).
2.3 Main Results
With the definitions from the previous section in place, we now state our main technical theorem for proving lower bounds on memory-constrained optimization.
Theorem 2.11 is agnostic to the particular memory-sensitive class and instead describes how any given memory-sensitive class can be used to derive a query complexity lower bound. The ultimate bound derived using a particular memory-sensitive class depends on the class’s relationship between the Lipschitz constant L, the depth N, and the optimality threshold \(\epsilon\).
In Theorem 2.10, we exhibit a \((d^6, N, \lceil 60 M/(c_{\mathcal {H}}d) \rceil , 1/(20 \sqrt {N}))\)-memory-sensitive class. For any given L, scaling this function class by \(L/d^6\) yields an \((L, N, \lceil 60 M/(c_{\mathcal {H}}d) \rceil , L/(20 d^6 \sqrt {N}))\)-memory-sensitive class. Theorem 2.11, along with the existence of a memory-sensitive base from Theorem 2.5, then implies a lower bound of \(\Omega (c_B d^2 N / M)\) first-order oracle queries. Setting \(\epsilon = L/(20 d^6 \sqrt {N})\) and recalling that Theorem 2.10 requires \(N \le (c d^2/M \log (d))^{1/3}\) for some absolute constant \(c \gt 0\), we obtain the following memory-query tradeoff parameterized by \(\epsilon\): For any \(\epsilon \in [LM^{1/6}/(16d^{19/3}), L/(4 \sqrt {d})]\), any M-bit memory-constrained randomized algorithm (as per Definition 2.2) which outputs an \(\epsilon\)-optimal point with probability at least \(2/3\) requires \(\Omega (c_{B}L^2d^2/(16^2 d^{12} \epsilon ^2 M))\) first-order oracle queries. In Theorem 1.1, we explicitly choose \(\epsilon\) as a function of M and d to yield a result that is easy to parse.
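To make the exponent bookkeeping explicit, the following heuristic arithmetic (ours; constants and polylogarithmic factors are suppressed) shows how the \(\epsilon\)-parameterized bound above follows from \(\Omega (c_B d^2 N / M)\), and how instantiating \(M = d^{1.25-\delta }\) recovers the exponent of Theorem 1.1:
\begin{align*} \epsilon = \frac{L}{20 d^{6}\sqrt {N}} \;\Longrightarrow \; N = \frac{L^{2}}{400\, d^{12}\, \epsilon ^{2}} \;\Longrightarrow \; \frac{d^{2} N}{M} = \frac{L^{2} d^{2}}{400\, d^{12}\, \epsilon ^{2} M}, \qquad M = d^{1.25-\delta } \;\Longrightarrow \; \frac{d^{2} N}{M} \approx \frac{d^{2}\, (d^{2}/M)^{1/3}}{M} = d^{\,1 + \frac{4\delta }{3}}. \end{align*}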
Note that by [72], the center of mass algorithm can output an \(\epsilon\)-optimal point for any L-Lipschitz function using \(\mathcal {O}(d^2 \log ^2 \left(LB/\epsilon \right))\) bits of memory and \(\mathcal {O}\left(d \log (LB/\epsilon) \right)\) first-order oracle queries. Comparing this with the lower bound from Theorem 2.11, we see that no M-bit algorithm can match the query complexity achieved with quadratic memory when M is less than \(d N/\log (L/\epsilon)\) bits. Furthermore, Theorem 2.11 yields a natural approach for obtaining improved lower bounds. If one exhibits a memory-sensitive class with depth \(N = d\), then Theorem 2.11 would imply that any algorithm using memory of size at most \(M= d^{2 - \delta }\) requires at least \(\Omega (d^{1 + \delta })\) first-order oracle queries in order to output an \(\epsilon\)-optimal point. We note that the subsequent work of [13] shows that a memory-sensitive class with a similar property can be adaptively constructed for any deterministic algorithm, showing that quadratic memory is necessary for deterministic algorithms.
3 Proof Overview
Here we give the proof of Theorem 2.11. In Section 3.1, we present our first key idea, which relates the optimization problem in (2.1) to the Orthogonal Vector Game (which was introduced informally in Definition 1.2). Next, in Section 3.2, we show that winning this Orthogonal Vector Game is difficult with memory constraints.
3.1 The Orthogonal Vector Game
We formally define the Orthogonal Vector Game in Game 1. The game proceeds in three parts. In Part 1, the oracle samples a random \((d/2)\times d\) matrix \(\mathbf {A}\), and the player gets to observe the matrix and write down an M-bit message \(\mathsf {Message}\) about the matrix. In Part 2, the player can make queries to the oracle. The oracle responds to any query \(\mathbf {x}\) with a row of the matrix \(\mathbf {A}\) which has the largest-magnitude inner product with \(\mathbf {x}\). In Part 3, the player uses \(\mathsf {Message}\) and all the queries and responses from the previous part to find a set of k vectors which are robustly linearly independent and roughly orthogonal to all the rows of \(\mathbf {A}\). The player wins the game if she can do this successfully.
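Game 1 itself is not reproduced above; the following is a minimal sketch (ours) of its mechanics, with the trivial full-query strategy from Section 1.1 included for concreteness.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 64, 4
A = rng.choice([-1.0, 1.0], size=(d // 2, d))    # Part 1: the oracle samples A

def oracle(x):
    """Part 2: respond with a row of A of largest-magnitude inner product with the query x."""
    return A[int(np.argmax(np.abs(A @ x)))]

def trivial_player(A_known, k):
    """Part 3 under the trivial strategy (m = d/2 in the Index-Oracle variant of Section 3.2):
    with every row in hand, output k orthonormal vectors from null(A)."""
    _, s, Vt = np.linalg.svd(A_known)
    Z = Vt[int(np.sum(s > 1e-10)):]              # rows span null(A_known)
    return Z[:k]

Y = trivial_player(A, k)
print(np.max(np.abs(A @ Y.T)))                   # ~1e-13: near-orthogonal to all rows of A
print(np.linalg.norm(Y @ Y.T - np.eye(k)))       # ~0: the k returned vectors are orthonormal
```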
Next, we relate winning the Orthogonal Vector Game to minimizing \(F_{f,\mathbf {A}}\) (Equation (2.1)).
Here we provide a brief proof sketch before the detailed proof: Let \(\mathcal {A}_{\textrm {rand}}\) denote a hypothetical M-bit memory algorithm for optimizing \(F_{f,\mathbf {A}}\) (Definition 2.2). The Player samples a random function/oracle pair \((f,\mathbf {g}_f)\) from the memory-sensitive class \(\mathcal {F}\). Then, she uses \(\mathcal {A}_{\textrm {rand}}\) to optimize \(F_{f, \mathbf {A}}\), and issues queries that \(\mathcal {A}_{\textrm {rand}}\) makes to the oracle in the Orthogonal Vector Game. Notice that with access to \(\mathbf {g}_f(\mathbf {x})\) and the oracle’s response \(\mathbf {g}_{\mathbf {A}}(\mathbf {x})\), she can implement the sub-gradient oracle \(\mathbf {g}_{F_{f, \mathbf {A}}}\) (Definition 2.6). We then use the memory-sensitive properties of f to argue that informative queries made by \(\mathcal {A}_{\textrm {rand}}\) must also be successful queries in the Orthogonal Vector Game, and that \(\mathcal {A}_{\textrm {rand}}\) must make enough informative queries to find an \(\epsilon {^\star }\)-optimal point.
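A small sketch (ours) of how the Player can assemble the subgradient oracle for \(F_{f, \mathbf {A}}\) from \(\mathbf {g}_f\) and the game's oracle, as described in the paragraph above; the tie-breaking rule below is an assumption of the sketch, while the paper's rule is fixed in Definition 2.6.

```python
import numpy as np

def g_F(x, f, g_f, g_A, eta, rho):
    """First-order information for F_{f,A}(x) = max{f(x), eta*||Ax||_inf - rho}, assembled
    from g_f and one call to the game's oracle g_A (which returns a row a_j maximizing
    |<a_j, x>|).  Ties are broken towards f here; the paper fixes its own rule."""
    a_j = g_A(x)
    wall = eta * abs(a_j @ x) - rho        # equals eta*||Ax||_inf - rho since a_j attains the max
    if f(x) >= wall:
        return f(x), g_f(x)
    return wall, eta * np.sign(a_j @ x) * a_j

# Toy demo with f(x) = ||x||^2 (any convex f with a subgradient oracle would do).
rng = np.random.default_rng(4)
d = 32
A = rng.choice([-1.0, 1.0], size=(d // 2, d))
value, grad = g_F(rng.standard_normal(d),
                  f=lambda x: x @ x, g_f=lambda x: 2 * x,
                  g_A=lambda x: A[int(np.argmax(np.abs(A @ x)))],
                  eta=2.0, rho=1.0)
print(value, grad.shape)
```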
3.2 Analyzing the Orthogonal Vector Game: Proof Sketch
With Lemma 3.1 in hand, the next step towards proving Theorem 2.11 is to establish a query lower bound for any memory-constrained algorithm which wins the Orthogonal Vector Game. Although our results hold in greater generality, i.e., when \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\) for some memory-sensitive base \(\lbrace B_d\rbrace _{d \gt 0}\), we first provide some intuition by considering the concrete case where the rows of \(\mathbf {A}\) are drawn uniformly at random from the hypercube. We also consider an alternative oracle model which we call the Index-Oracle model, for which it is perhaps easier to get intuition and for which our lower bound still holds: suppose the Player can specify any index \(i \in [n]\) as a query and ask the oracle to reveal the \({i}^{\textrm {th}}\) row of \(\mathbf {A}\) (this is the setup in the informal version of the game in Definition 1.2). Note that this is in contrast to the oracle in the Orthogonal Vector Game, which instead responds with the row of the matrix \(\mathbf {A}\) which is a subgradient of the query made by the algorithm. As described in Section 1.1, the instance where \(M=\Omega (kd), m=0\) and the instance where \(M=0, m=n\) are trivially achievable for the Player. We show that these trivial strategies are essentially all that is possible:
Theorem 3.2 (Informal version of Theorem 3.3).
For some constant \(c\gt 0\) and any \(k\ge \Omega (\log (d))\), if \(M\le cdk\) and the Player wins the Orthogonal Vector Game with probability at least \(1/3\), then \(m\ge d/5\).
Note that when \(m=0\), it is not difficult to show that the Player requires memory roughly \(\Omega (kd)\) to win the game with any decent probability. This is because each vector which is approximately orthogonal to the rows of \(\mathbf {A}\) will not be compressible below \(\Omega (d)\) bits with high probability, since rows of \(\mathbf {A}\) are drawn uniformly at random from the hypercube. Therefore, the Player must use \(M\ge \Omega (kd)\) to store k such vectors. The main challenge in the analysis is to bound the power of additional, adaptively issued queries. In particular, since observing every row \(\mathbf {a}_i\) gives \(\Theta (d)\) bits of information about \(\mathbf {A}\), and k successful vectors will only have about \(\mathcal {O}(kd)\) bits of information about \(\mathbf {A}\), perhaps only observing k additional rows of \(\mathbf {A}\) is enough to win the game. Slightly more concretely, why can’t the Player store some \(M \ll kd\) bits of information about \(\mathbf {A}\), such that by subsequently strategically requesting \(m\ll n\) rows of \(\mathbf {A}\) she now knows k vectors which are approximately orthogonal to the rows of \(\mathbf {A}\)? Our result guarantees that such a strategy is not possible.
We now give some intuition for our analysis. The underlying idea is to calculate the mutual information between \(\mathbf {A}\) and \(\mathbf {Y}\) conditioned on a fixed random string R and the information \(\mathbf {G}\) gleaned by the Player (in our proof we slightly augment the matrix \(\mathbf {G}\) and condition on that, but we ignore that detail for this proof sketch). We will use the notation \(I(\mathbf {X}; \mathbf {Y} \vert \mathbf {Z})\) to denote the mutual information between \(\mathbf {X}\) and \(\mathbf {Y}\) conditioned on \(\mathbf {Z}\) and \(H(\mathbf {X})\) to denote the entropy of \(\mathbf {X}\) (see [22]). Our analysis relies on computing a suitable upper bound and lower bound for \(I(\mathbf {A}; \mathbf {Y} \vert \mathbf {G}, R)\). We begin with the upper bound. We first argue that \(\mathbf {Y}\) can be computed just based on \(\mathbf {G}, \mathsf {Message}\) and R (without using \(\mathbf {X}\)),
\begin{align*} \mathbf {Y} = g(\mathbf {G}, \mathsf {Message}, R), \quad \text{for some function } g. \end{align*}
The proof of this statement relies on the fact that the Player’s strategy is deterministic conditioned on the random string R. Now using the data processing inequality [22] we can say,
On the other hand, we argue that if \(\mathbf {Y}\) is successful as per the Orthogonal Vector Game with probability \(1/3\), then there must be a larger amount of mutual information between \(\mathbf {Y}\) and \(\mathbf {A}\) after conditioning on \(\mathbf {G}\) and R. Recall that \(\mathbf {Y}\) must contain k robustly linearly independent vectors in order to win the Orthogonal Vector Game; assume for now that \(\mathbf {Y}\) is also orthonormal. Then, we may use the fact that the hypercube is a memory-sensitive base (Definition 2.4) to derive that for some constant \(c \gt 0\)
Since \(\mathbf {G}\) contains m rows of \(\mathbf {A}\), only the remaining \((n-m)\) rows of \(\mathbf {A}\) are random after conditioning on \(\mathbf {G}\). Assume for now that the remaining \((n-m)\) rows are independent of \(\mathbf {G}\) (as would be true under the Index-Oracle model defined earlier in this section). Then for these remaining rows, we use Equation (3.2) to bound their entropy conditioned on \(\mathbf {Y}\):
where we chose \(n = d/2\). Now combining Equations (3.1) and (3.3), we obtain a lower bound on the query complexity, \(m \ge (d/2) - M/(c k)\). Thus if \(M \ll cdk\), then \(m=\Omega (d)\).
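The displays for Equations (3.1)–(3.3) are not reproduced above, so the following spells out the final combination under our reading of the sketch: the message length upper bounds the conditioned mutual information, while each of the \(d/2 - m\) unqueried rows contributes roughly ck bits to it.
\begin{align*} c k \Big (\frac{d}{2} - m\Big) \;\le \; I(\mathbf {A}; \mathbf {Y} \mid \mathbf {G}, R) \;\le \; H(\mathsf {Message}) \;\le \; M \qquad \Longrightarrow \qquad m \;\ge \; \frac{d}{2} - \frac{M}{c k}. \end{align*}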
While this proof sketch contains the skeleton of our proof, as also hinted above it oversimplifies several points. For example, note that the definition of a memory-sensitive base in Definition 2.4 requires that \(\mathbf {Y}\) be orthonormal to derive Equation (3.2), but successful vectors only satisfy a robust linear independence property. Additionally, in deriving Equation (3.3), we implicitly assumed that \(\mathbf {G}\) does not reduce the entropy of the rows of \(\mathbf {A}\) which are not contained in it. This is not true for the Orthogonal Vector Game, since every response \(\mathbf {g}_i\) of the oracle is a subgradient of \(\left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty\) and, therefore, depends on all the rows of the matrix \(\mathbf {A}\). Our full proof, which handles these issues and dependencies, appears in Section 5. As noted before, our lower bound holds not only for the subgradient oracle, but also when the oracle response \(\mathbf {g}_i\) can be an arbitrary (possibly randomized) function from \(\mathbb {R}^d\rightarrow \lbrace \mathbf {a}_1, \dots , \mathbf {a}_n\rbrace\) (for example, the model where the Player can query any row \(\mathbf {a}_i\) of the matrix \(\mathbf {A}\)). We state below the lower bound for the Orthogonal Vector Game, which is the main result of Section 5.
Theorem 3.3.
Suppose for a memory-sensitive base \(\left\lbrace B_d \right\rbrace _{d = 1}^{\infty }\) with constant \(c_{B}\gt 0\), \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\). Given \(\mathbf {A}\) let the oracle response \(\mathbf {g}_i\) to any query \(\mathbf {x}_i \in \mathbb {R}^d\) be any (possibly randomized) function from \(\mathbb {R}^d\rightarrow \lbrace \mathbf {a}_1, \dots , \mathbf {a}_{d/2}\rbrace\), where \(\mathbf {a}_i\) is the \({i}^{\textrm {th}}\) row of \(\mathbf {A}\) (note this includes the subgradient response in the Orthogonal Vector Game). Assume \(M \ge d \log (2d)\) and set \(k=\left\lceil 60M/(c_{B}d) \right\rceil\). For these values of k and M, if the Player wins the Orthogonal Vector Game with probability at least \(1/3\), then \(m\ge d/5\).
3.3 Putting Things Together: Proof of Main Theorem
Here we combine the results of the previous subsections to prove Theorem 2.11. Very roughly, by Lemma 3.1, any algorithm for finding an \(\epsilon ^\star\)-optimal point with probability \(2/3\) can be used to win the Orthogonal Vector Game with probability \(1/3\). Therefore, combining Theorem 3.3 and Lemma 3.1, and noting \(k \approx M/d\), any algorithm for finding an \(\epsilon\)-optimal point with probability \(2/3\) must use \(m \lfloor N/(k+1)\rfloor \ge c Nd^2/M\) queries.
Proof of Theorem 2.11
We take \(\epsilon =1/(20\sqrt {N})\) throughout the proof. By [72, Theorem 5], any algorithm for finding an \(\epsilon\)-optimal point for an L-Lipschitz function in the unit ball must use memory at least \(d\log (\frac{L}{2\epsilon })\). Therefore, since \(\epsilon \le L/(4\sqrt {d})\), we may assume without loss of generality that \(M \ge d\log (\frac{L}{2\epsilon }) \ge d\log (2\sqrt {d})\ge (d/2)\log (4d)\). Therefore, by Theorem 3.3, if the Player wins the Orthogonal Vector Game with failure probability at most \(2/3\), then the number of queries m satisfies \(m\ge d/5\). By Lemma 3.1, any algorithm for finding an \(\epsilon\)-optimal point with failure probability at most \(1/3\) can be used to win the Orthogonal Vector Game with probability at least \(1/3\). Therefore, combining Theorem 3.3 and Lemma 3.1, any algorithm for finding an \(\epsilon\)-optimal point with failure probability \(1/3\) must use
queries. Therefore, if \(\epsilon = 1/(20\sqrt {N})\) and \(c = c_{B}/(2\cdot 20\cdot 5\cdot 3\cdot 20^2)\ge c_{B}/10^6\), any memory-constrained algorithm requires at least \(c d^2/(\epsilon ^2 M)\) first-order queries. □
4 The Hypercube and Nemirovski Function Class Are Memory-Sensitive
In this section, we prove that the hypercube and the Nemirovski function class have the stated memory-sensitive properties. The proofs will rely on a few concentration bounds stated and proved in Appendix A.1.
4.1 The Hypercube is a Memory-Sensitive Base
Here we prove that the hypercube is a memory-sensitive base. Recall Theorem 2.5.
Theorem 2.5.
Recall that \(\mathcal {H}_d = \left\lbrace \pm 1 \right\rbrace ^d\) denotes the hypercube. The sequence \(\left\lbrace \mathcal {H}_d \right\rbrace _{d = d_0}^{\infty }\) is a \((k,c_{\mathcal {H}})\) memory-sensitive base for all \(k\in [d]\) with some absolute constant \(c_{\mathcal {H}}\gt 0\).
Proof of Theorem 2.5
First observe that for any \(\mathbf {h}\in \mathcal {H}_d\), \(\left\Vert {\mathbf {h}}\right\Vert _2 = \sqrt {d}\). Next let \(\mathbf {Z}= (\mathbf {z}_1, \dots , \mathbf {z}_k) \in \mathbb {R}^{d \times k}\) be k orthonormal d-dimensional vectors and let \(\mathbf {h}\sim \mathsf {Unif}(\mathcal {H}_d)\). Note that each coordinate of \(\mathbf {h}\) is sub-Gaussian with sub-Gaussian norm bounded by 2 and \(\mathbb {E}[\mathbf {h}\mathbf {h}^{\top }]=\mathbf {I}_d\). By Lemma A.6 in Appendix A.1, there is an absolute constant \(c_{\textrm {hw}}\gt 0\) such that for any \(s \ge 0\),
Noting that \(\left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _2 \le \sqrt {k} \left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _\infty\) we must have that for any \(t \le 1/\sqrt {2}\),
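The displayed inequalities of this proof are not reproduced above; as a quick numerical illustration of the phenomenon they quantify, the following Monte Carlo check (ours) shows that for a random hypercube vector \(\mathbf {h}\) and a fixed orthonormal \(\mathbf {Z}\), the projection \(\left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _2\) concentrates near \(\sqrt {k}\), so \(\left\Vert {\mathbf {Z}^{\top } \mathbf {h}}\right\Vert _\infty\) is rarely below \(1/\sqrt {2}\).

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, trials = 2000, 20, 2000

Z, _ = np.linalg.qr(rng.standard_normal((d, k)))   # a fixed set of k orthonormal d-dim vectors
H = rng.choice([-1.0, 1.0], size=(trials, d))      # rows h ~ Unif(H_d)
proj = H @ Z                                       # each row is Z^T h

l2 = np.linalg.norm(proj, axis=1)
linf = np.max(np.abs(proj), axis=1)
print(l2.mean(), np.sqrt(k))                       # ||Z^T h||_2 concentrates around sqrt(k)
print(np.mean(linf <= 1 / np.sqrt(2)))             # ||Z^T h||_inf <= 1/sqrt(2) is very rare
```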
4.2 The Nemirovski Function Class is Memory-Sensitive
Recall we define \(\mathcal {N}_{N, \gamma }\) to be the distribution of \((f, \mathbf {g}_f)\) where, for \(i \in [N]\), we draw \(\mathbf {v}_i \overset{\text{i.i.d.}}{\sim } (1/\sqrt {d}) \mathsf {Unif} \left(\mathcal {H}_d \right)\) and set
\begin{align*} f(\mathbf {x}) := \max _{i \in [N]} \left(\mathbf {v}_i^{\top } \mathbf {x}- i \gamma \right). \end{align*}
We will show that for \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\), where \(c=400\) is a fixed constant, \(\eta = d^5\), and \(\rho = 1\), if \(N \le (1/(32 \gamma))^{2/3}\) then \(\mathcal {N}_{N, \gamma }\) is a \((d^6, N, k, 1/(20 \sqrt {N}))\)-memory-sensitive class. To show the required properties of the Nemirovski function, we define the Resisting oracle, which iteratively constructs the Nemirovski function. The Resisting oracle will be defined via a sequence of N Phases, where the \({j}^{\textrm {th}}\) Phase ends and the next begins when the algorithm makes a query which reveals the \({j}^{\textrm {th}}\) Nemirovski vector. We begin with some definitions for these Phases.
Definition 4.1.
For a fixed \(\mathbf {A}\) and vectors \(\lbrace \mathbf {v}_i, i \in [j]\rbrace\), the Nemirovski-induced function for Phase j is
\begin{align*} F_{ \mathcal {N}, \mathbf {A}}^{(j)} (\mathbf {x}) := \max \Big \lbrace \max _{i \in [j]} \left(\mathbf {v}_i^{\top } \mathbf {x}- i \gamma \right),\; \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho \Big \rbrace . \end{align*}
Following Definition 2.6, this defines a subgradient oracle \(\mathbf {g}_{F}^{(j)}(\mathbf {x})\) for the induced function \(F_{ \mathcal {N}, \mathbf {A}}^{(j)} (\mathbf {x})\).
The first-order oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(j)}}\) which returns the pair \((F_{ \mathcal {N}, \mathbf {A}}^{(j)}(\mathbf {x}), \mathbf {g}_{F}^{(j)}(\mathbf {x}))\) will be used by the Resisting oracle in Phase j. With these definitions in place, we can now define the Resisting oracle in Algorithm 3.
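Since Algorithm 3 is not reproduced above, here is a minimal sketch (ours) of the Phase structure it implements, with our own tie-breaking convention: in Phase j the oracle answers with first-order information for \(F_{ \mathcal {N}, \mathbf {A}}^{(j)}\), and a query whose returned subgradient is the newest vector \(\mathbf {v}_j\) ends the Phase.

```python
import numpy as np

class ResistingOracle:
    """Sketch of the Phase structure described above (Algorithm 3 itself is not reproduced
    here): Phase j answers with first-order information for F^{(j)}, and a query whose
    returned subgradient is the newest vector v_j ends Phase j.  Tie-breaking is our choice."""

    def __init__(self, A, N, gamma, eta, rho, rng):
        self.A, self.N, self.gamma, self.eta, self.rho = A, N, gamma, eta, rho
        d = A.shape[1]
        # Nemirovski vectors, sampled independently of any queries (cf. Lemma 4.2).
        self.V = rng.choice([-1.0, 1.0], size=(N, d)) / np.sqrt(d)
        self.phase = 1

    def query(self, x):
        j = self.phase
        nem = self.V[:j] @ x - self.gamma * np.arange(1, j + 1)
        wall = self.eta * np.max(np.abs(self.A @ x)) - self.rho
        value = max(float(nem.max()), wall)
        if nem.max() >= wall:                        # a Nemirovski term attains the max
            i = int(np.argmax(nem))
            if i == j - 1 and self.phase < self.N:   # newest vector v_j revealed: Phase j ends
                self.phase += 1
            return value, self.V[i]
        r = int(np.argmax(np.abs(self.A @ x)))       # otherwise the ||Ax||_inf term dominates
        return value, self.eta * np.sign(self.A[r] @ x) * self.A[r]

rng = np.random.default_rng(6)
d = 128
oracle = ResistingOracle(A=rng.choice([-1.0, 1.0], size=(d // 2, d)),
                         N=6, gamma=0.05, eta=float(d) ** 5, rho=1.0, rng=rng)
_ = oracle.query(np.zeros(d))
print(oracle.phase)                                  # 2: the zero query reveals v_1, ending Phase 1
```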
As we prove in the following simple lemma, for any fixed algorithm the distribution over the responses from \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\) is the same as the distribution over the responses from the oracle \(\mathcal {N}_{N, \gamma , \mathbf {A}}\) for the Nemirovski function defined in Definition 2.9. Therefore, it will be sufficient to analyze the Resisting oracle and \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\).
Lemma 4.2.
The vectors \(\lbrace \mathbf {v}_i, i \in [N]\rbrace\) for the final induced oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}(\mathbf {x})=(F_{ \mathcal {N}, \mathbf {A}}^{(N)}(\mathbf {x}), \mathbf {g}_{F}^{(N)}(\mathbf {x}))\) are sampled from the same distribution as the vectors \(\lbrace \mathbf {v}_i, i \in [N]\rbrace\) for the Nemirovski function \(\mathcal {N}_{N, \gamma , \mathbf {A}}\) (Definition 2.9). Moreover, for the same setting of the vectors \(\lbrace \mathbf {v}_i, i \in [N]\rbrace\), any fixed matrix \(\mathbf {A}\) and any fixed algorithm, the responses of both these oracles are identical.
Proof.
The proof follows from the definitions. For the first part, note that the Nemirovski vectors in Algorithm 3 are sampled at random, independently of the Algorithm’s queries, and from the same distribution as in Definition 2.9. For the second part, note that by definition, for a fixed set of vectors \(\lbrace \mathbf {v}_i, i \in [N]\rbrace\) and matrix \(\mathbf {A}\), the first-order oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\) is identical to \(\mathcal {N}_{N, \gamma , \mathbf {A}}\) from Definition 2.9. □
Our main goal now is to prove the following two statements: (1) with high probability, the Resisting oracle is consistent with itself, i.e., its outputs are the same as the outputs of \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}=(F_{ \mathcal {N}, \mathbf {A}}^{(N)}(\mathbf {x}), \mathbf {g}_{F}^{(N)}(\mathbf {x}))\); (2) the Resisting oracle has the properties required of a memory-sensitive class. To prove these statements, we define an event E in Definition 4.3, and then argue that the failure probability of E is at most \(1/d\). Event E allows us to argue both that the Resisting oracle is identical to \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\) and that \(\mathcal {N}_{N, \gamma }\) is a memory-sensitive class.
Definition 4.3.
Let E denote the event that \(\forall j \in [N]\), the Nemirovski vector \(\mathbf {v}_j\) has the following properties:
(1)
For any query \(\mathbf {x}\) submitted by the Algorithm in Phases 1 through j, \(|\mathbf {x}^\top \mathbf {v}_j| \le \sqrt {\frac{10\log (d)}{d}}\).
(2)
\(\Vert {\mathrm{Proj}_{S_{j}}(\mathbf {v}_{j})}\Vert _2 \le \sqrt {\frac{30 k\log (d) }{d}}\) where \(S_j\) is defined as in Definition 2.8.
Lemma 4.4.
Recall the definition of event E in Definition 4.3. Suppose that for every Phase \(j\in [N]\), the Algorithm makes at most \(d^2\) queries in that Phase. Then \(\Pr [E]\ge 1 - (1/d)\).
Proof.
We prove that, for any fixed \(j \in [N]\), the failure probability of event E in Phase j is at most \(1/d^2\), and then apply a union bound over all \(j \in [N]\) (where \(N \le d\)). Define \(T_j\) to be the sequence of the next \(d^2\) vectors the Algorithm would query in the \({j}^{\textrm {th}}\) Phase, where the next query is made if the previous query does not get the new Nemirovski vector \(\mathbf {v}_{j}\) as the response (i.e., the previous query does not end the \({j}^{\textrm {th}}\) Phase). Since we can think of the Algorithm’s strategy as deterministic conditioned on its memory state and some random string R (as in Definition 2.2), the sequence \(T_j\) is fully determined by the function \(F_{ \mathcal {N}, \mathbf {A}}^{(j-1)}\), R, and the state of the Algorithm when it receives \(\mathbf {v}_{j-1}\) as the response from the Resisting oracle at the end of the \({(j-1)}^{\textrm {th}}\) Phase. Critically, this sequence is independent of \(\mathbf {v}_j\). Separately, the Resisting oracle samples \(\mathbf {v}_j \overset{\text{i.i.d.}}{\sim } (1/\sqrt {d}) \mathsf {Unif} (\mathcal {H}_d)\), and, therefore, \(\mathbf {v}_j\) is sampled independently of all these \(d^2\) queries in the set \(T_j\).
The proof now follows via simple concentration bounds. First consider property (1) of event E. By the argument above we see that for any vector \(\mathbf {x}\) which is in the set \(T_j\) or was submitted by the Algorithm in any of the previous \((j-1)\) Phases, by Corollary A.3, \(\mathbf {x}^{\top } \mathbf {v}_j\) is a zero-mean random variable with sub-Gaussian parameter at most \(\left\Vert {\mathbf {x}}\right\Vert _2/\sqrt {d} \le 1/\sqrt {d}\). Then by Fact A.4, for any \(t \ge 0\),
We remark that to invoke Corollary A.3, we do not condition on whether or not vector \(\mathbf {x}\) will ultimately be queried by the Algorithm (this is critical to maintain the independence of \(\mathbf {x}\) and \(\mathbf {v}_j\)). Picking \(t = \sqrt {10 \log (d)/d}\) we have with failure probability at most \(2/d^5\),
Since the Algorithm makes at most \(d^2\) queries in any Phase, there are at most \(Nd^2\) vectors in the union of the set \(T_j\) of possible queries which the Algorithm makes in the \({j}^{\textrm {th}}\) Phase, and the set of queries submitted by the Algorithm in any of the previous Phases. Therefore, by a union bound, property (1) of event E is satisfied by \(\mathbf {v}_{j}\) with failure probability at most \(2Nd^2/d^5\le 2/d^2\), where we used the fact that \(N\le d\).
We now turn to property (2) of event E. Note that \(S_{j}\) depends on \(\mathbf {x}_{t_{j}}\), which is the first query for which the Resisting oracle returns \(\mathbf {v}_j\). However, using the same idea as above, we consider the set \(T_j\) of the next \(d^2\) queries which the Algorithm would make, and then do a union bound over every such query. For any query \(\mathbf {x}\in T_j\), consider the subspace
Note that \(S_j = S_{j-1, \mathbf {x}_{t_j}}\). Recalling the previous argument, \(\mathbf {v}_j\) is independent of all vectors in the above set \(S_{j-1,\mathbf {x}}\) . Therefore, by Corollary A.7 there is an absolute constant c such that with failure probability at most \(c/d^5\),
Under the assumption that the Algorithm makes at most \(d^2\) queries in the \({j}^{\textrm {th}}\) Phase, the successful query \(\mathbf {x}_{t_{j}}\) must be in the set \(T_j\). Therefore, by a union bound over the \(d^2\) queries which the Algorithm submits,
with failure probability at most \(c/d^{3} \le 1/d^2\) (for d large enough). Therefore, both properties (1) and (2) are satisfied for vector \(\mathbf {v}_j\) with failure probability at most \(1/d^{2}\) and thus by a union bound over all N Nemirovski vectors \(\mathbf {v}_j\), E happens with failure probability at most \(1/d\). □
Now that we have justified that event E has a small failure probability we will argue that conditioned on E, the Resisting oracle returns consistent first-order information. This is critical to our analysis because we prove memory-sensitive properties through the lens of the Resisting oracle.
Lemma 4.5.
Under event E defined in Definition 4.3, the Resisting oracle’s responses are consistent with itself, i.e., for any query made by the Algorithm, the Resisting oracle returns an identical first-order oracle response to the final oracle \((F^{(N)}_{ \mathcal {N}, \mathbf {A}}(\mathbf {x}), \mathbf {g}^{(N)}_F(\mathbf {x}))\).
Proof.
Assume event E and fix any \(j \in [N-1]\). Let \(\mathbf {x}\) be any query the Algorithm submits sometime before the end of the \({j}^{\textrm {th}}\) Phase. Under event E for any \(j^{\prime } \gt j\),
Therefore, for all \(j^{\prime } \gt j\), \(\mathbf {v}_{j^{\prime }}\) is not the subgradient \(\mathbf {g}^{(N)}_F(\mathbf {x})\). This implies that in the \({j}^{\textrm {th}}\) Phase, the Resisting oracle always responds with the correct function value \(F^{(j)}_{ \mathcal {N}, \mathbf {A}}(\mathbf {x}) = F^{(N)}_{ \mathcal {N}, \mathbf {A}}(\mathbf {x})\) and subgradient \(\mathbf {g}^{(j)}_F(\mathbf {x}) = \mathbf {g}^{(N)}_F(\mathbf {x})\). □
Now we address the memory-sensitive properties of the function class by analyzing the Resisting oracle. Here, event E is crucial since it ensures that the Nemirovski vectors lie outside the span of previous queries which returned a unique gradient.
Remark 4.6.
Recall from Definition 2.7 that \(\mathbf {x}_{t_{j}}\) is the query which returns the \({j}^{\textrm {th}}\) informative subgradient from the function f. In terms of the Nemirovski function and the Resisting oracle, we note that \(\mathbf {x}_{t_{j}}\) is the query which ends Phase j of the Resisting oracle (the first query such that the Resisting oracle returns \(\mathbf {v}_j\) as the gradient in response).
Lemma 4.7.
Recall the definition of \(\mathbf {x}_{t_j}\) and \(S_{j}\) from Definitions 2.7 and 2.8 and that \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\). If \(c \ge 400\), then under event E from Definition 4.3,
We prove Lemma 4.7 by providing both an upper and lower bound on \(\mathbf {x}_{t_{j}}^{\top }(\mathbf {v}_{j}-\mathbf {v}_{j-1})\) under event E. Towards this end, observe that for any \(\mathbf {x}\in S_{j-1}\),
where the last step uses the fact that \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\) for \(c\ge 400\). Rearranging terms and using that \((1 - x)^{1/2} \le 1 - (1/4)x\) for \(x \ge 0\) gives us
We next turn our attention to the optimality achievable by an algorithm which has seen a limited number of Nemirovski vectors. Lemma 4.8 lower bounds the best function value achievable by such an algorithm.
Lemma 4.8.
Recall that \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\) and suppose \(c \gt 10\). Assume the Algorithm has proceeded to Phase r of the Resisting oracle and chooses to not make any more queries. Let Q be the set of all queries made so far by the Algorithm. Then under event E defined in Definition 4.3,
Assume event E. By Lemma 4.5, if the Resisting oracle responds to a query with some Nemirovski vector as the subgradient, then this vector must be some \(\mathbf {v}_i\) with \(i \lt r\); since the returned subgradients are valid subgradients of the final function \(F_{ \mathcal {N}, \mathbf {A}}^{(N)}\), the Nemirovski vector \(\mathbf {v}_r\) could not have been a valid subgradient at that query. Therefore, all queries \(\mathbf {x}\in Q\) must satisfy
where in the last inequality we use the fact that \(\gamma \ge \sqrt {\frac{10\log (d)}{d}}\). □
Lemma 4.9 upper bounds the minimum value of the final function \(F_{ \mathcal {N}, \mathbf {A}}^{(N)} (\mathbf {x})\), and will be used to establish an optimality gap later.
Lemma 4.9.
Recall that \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\) and suppose \(c \gt 3\). For any \(\mathbf {A}\in \mathbb {R}^{n\times d}\) where \(n\le d/2\) and sufficiently large d, with failure probability at most \(2/d\) over the randomness of the Nemirovski vectors \(\lbrace \mathbf {v}_j, j \in [N]\rbrace\),
Let the rank of \(\mathbf {A}\) be r. Let \(\mathbf {Z}\in \mathbb {R}^{d \times (d-r)}\) be a matrix whose columns are an orthonormal basis for the null space \(A^{\perp }\) of \(\mathbf {A}\). We construct a vector \(\overline{\mathbf {x}}\) which (as we will show) attains \(F_{ \mathcal {N}, \mathbf {A}}(\overline{\mathbf {x}}) \le -1/(8\sqrt {N})\) as follows:
\begin{align*} \bar{\mathbf {v}} := -\frac{1}{2 \sqrt {N}}\sum _{i \in [N]} \mathbf {v}_i, \qquad \overline{\mathbf {x}} := \mathbf {Z}\mathbf {Z}^{\top } \bar{\mathbf {v}}. \end{align*}
By Fact A.2, each coordinate of \(\bar{\mathbf {v}}\) is sub-Gaussian with sub-Gaussian norm \(\alpha /\sqrt {d}\) (where \(\alpha\) is the absolute constant in Fact A.2). Also \(\mathbb {E}[\bar{\mathbf {v}} \bar{\mathbf {v}}^{\top }]=(1/4d)\mathbf {I}_d\). Therefore, by Lemma A.6, with failure probability at most \(2 \exp (-c_1d)\) for some absolute constant \(c_1\gt 0\),
Therefore, \(\overline{\mathbf {x}}\) lies in the unit ball with failure probability at most \(2 \exp (-c_1d)\). Next fix any \(\mathbf {v}_j \in \left\lbrace \mathbf {v}_i \right\rbrace _{i \in [N]}\) and consider
Let \(\bar{\mathbf {v}}_{-j} = (\frac{-1}{2 \sqrt {N}}\sum _{i \in [N], i \ne j} \mathbf {v}_i)\). By Fact A.2, each coordinate of \(\bar{\mathbf {v}}_{-j}\) is sub-Gaussian with sub-Gaussian parameter \(1/\sqrt {4d}\). Also by Fact A.2, \((\mathbf {Z}\mathbf {Z}^{\top } \mathbf {v}_j)^{\top }\bar{\mathbf {v}}_{-j}\) is sub-Gaussian with sub-Gaussian parameter \(\left\Vert {\mathbf {Z}\mathbf {Z}^{\top } \mathbf {v}_j}\right\Vert _2/\sqrt {4d}=\left\Vert {\mathbf {Z}^{\top } \mathbf {v}_j}\right\Vert _2/\sqrt {4d}\). Now using Fact A.4, with failure probability at most \(2/d^C\),
We now turn to bounding \(\left\Vert {\mathbf {Z}^{\top } \mathbf {v}_j}\right\Vert _2\). Note that each coordinate of \(\mathbf {v}_j\) is sub-Gaussian with sub-Gaussian norm \(2/\sqrt {d}\) and \(\mathbb {E}[\mathbf {v}_j \mathbf {v}_j^{\top }]=(1/d) \mathbf {I}_d\). Therefore, by Lemma A.6, with failure probability at most \(2 \exp (-c_2d)\) for some absolute constant \(c_2\gt 0\),
We now take \(C = 3\), which gives us an overall failure probability of \(2 \exp (-c_1 d) + N\left(2/d^C +2 \exp (-c_2 d) \right) \le 2/d\) (for sufficiently large d and since \(N\le d\)). Noting that \(\gamma \gt \sqrt {3\log (d)/{d}}\) finishes the proof. □
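For concreteness, the union-bound arithmetic in this last step can be spelled out as follows (with \(C = 3\) and \(N \le d\) as above):
\begin{align*} 2 \exp (-c_1 d) + N\left(\frac{2}{d^3} + 2 \exp (-c_2 d)\right) \;\le\; 2 \exp (-c_1 d) + \frac{2}{d^2} + 2d \exp (-c_2 d) \;\le\; \frac{2}{d}, \end{align*}
for sufficiently large d, since the exponential terms are eventually smaller than any polynomial in \(1/d\).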
The following simple lemma shows that all the Nemirovski vectors are unique with high probability, and will be useful for our definition of informative subgradients.
Lemma 4.10.
For sufficiently large d, with failure probability at most \(1/d\), \(\mathbf {v}_i \ne \mathbf {v}_j \; \forall \; i, j \in [N] , {i \ne j}\).
Proof.
The proof follows from the birthday paradox since \(\mathbf {v}_i\) are drawn from the uniform distribution over a support of size \(2^d\): the probability of \(\mathbf {v}_i \ne \mathbf {v}_j \; \forall \; i, j \in [N] , i \ne j\) is equal to,
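For completeness, the standard birthday-paradox estimate gives
\begin{align*} \prod _{i=1}^{N-1}\left(1 - \frac{i}{2^d}\right) \;\ge\; 1 - \sum _{i=1}^{N-1} \frac{i}{2^d} \;\ge\; 1 - \frac{N^2}{2^{d+1}}, \end{align*}
which is at least \(1 - 1/d\) for sufficiently large d since \(N \le d\).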
Finally, we note that the orthogonality condition from the definition of a memory-sensitive class is satisfied by our definition of the subgradient oracle.
Lemma 4.11.
For any \(j \in [N]\) the following is true: For any \(\mathbf {x}\in \mathbb {R}^d\) such that \(F_{ \mathcal {N}, \mathbf {A}}^{(j)}(\mathbf {x}) \ne \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\), either \(\mathbf {g}_{F}^{(j)}(\mathbf {x}) = \mathbf {v}_1\) or \(\left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty /\left\Vert {\mathbf {x}}\right\Vert _2 \le 1/d^4\).
Proof.
Fix any \(j \in [N]\) and assume \(\mathbf {x}\) is such that \(F_{ \mathcal {N}, \mathbf {A}}^{(j)}(\mathbf {x}) \ne \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\). Note that since \(F_{ \mathcal {N}, \mathbf {A}}^{(j)}(\mathbf {x}) \ne \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho\), \(\mathbf {g}_{F}^{(j)}(\mathbf {x})=\mathbf {v}_1\), or \(\mathbf {g}_{F}^{(j)}(\mathbf {x})=\mathbf {v}_k\) for some \(k \in \left\lbrace 2, \dots , j \right\rbrace\). We consider the case when \(\mathbf {g}_{F}^{(j)}(\mathbf {x})=\mathbf {v}_k\) for some \(k \in \left\lbrace 2, \dots , j \right\rbrace\). This implies that \(\eta \left\Vert {\mathbf {A} \mathbf {x}}\right\Vert _\infty - \rho \le \mathbf {v}_k^{\top } \mathbf {x}- k \gamma\). Observe that \(\mathbf {v}_k^{\top } \mathbf {x}- k \gamma \le \left\Vert {\mathbf {v}_k}\right\Vert _2 \left\Vert {\mathbf {x}}\right\Vert _2\) and \(\left\Vert {\mathbf {v}_k}\right\Vert _2 = 1\). Therefore,
Putting together all the previous results, we show Theorem 4.12 which establishes that the Nemirovski function class has the required memory-sensitive properties.
Theorem 4.12.
Consider any \(\mathbf {A}\in \mathbb {R}^{n\times d}\) where \(n\le d/2\) and any row has \(\ell _2\)-norm bounded by d. Fix \(\eta = d^5\), \(\rho = 1\) and \(\gamma =\sqrt {\frac{c k\log (d) }{d}}\) for \(c = 400\). Let \(N\le (1/32 \gamma)^{2/3}\). Then \(\mathcal {N}_{N, \gamma }\) is a \((d^6, N, k, 1/(20\sqrt {N}))\)-memory-sensitive class. That is, we can sample \((f, \mathbf {g}_f)\) from \(\mathcal {N}_{N, \gamma }\) with at most \(d^2\) random bits, and for \((f, \mathbf {g}_f) \sim \mathcal {N}_{N, \gamma }\), with failure probability at most \(5/d\), any algorithm for optimizing \(F_{f, \mathbf {A}}(\mathbf {x})\) which makes fewer than \(d^2\) queries to oracle \(\mathbf {g}_f\) has the following properties:
(1)
\(F_{f, \mathbf {A}}\) is convex and \(d^6\)-Lipschitz.
(2)
With \(S_j\) and \(\mathbf {x}_{t_j}\) as defined in Definition 2.8,
Note that by Lemma 4.2, the first-order oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}(\mathbf {x}) = (F_{ \mathcal {N}, \mathbf {A}}^{(N)}(\mathbf {x}), \mathbf {g}_F^{(N)}(\mathbf {x}))\) has the same distribution over responses as \(\mathcal {N}_{N,\gamma , \mathbf {A}}\). Therefore, we will analyze \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\), using the Resisting oracle.
We consider the following sequence of events: (a) event E defined in Definition 4.3 holds, (b) no two Nemirovski vectors are identical, \(\mathbf {v}_i \ne \mathbf {v}_j\) for \(i\ne j\), (c) Equation (4.1) regarding the value for \(F_{f,\mathbf {A}}^{(N)}(\mathbf {x}^\star)\) holds. We claim that for any algorithm which makes fewer than \(d^2\) queries, events (a) through (c) happen with failure probability at most \(5/d\). To verify this, we do a union bound over the failure probability of each event: for (a), Lemma 4.4 shows that E holds with failure probability at most \(1/d\) for any algorithm which makes fewer than \(d^2\) queries, (b) fails with probability at most \(1/d\) by Lemma 4.10, and (c) fails with probability at most \(2/d\) by Lemma 4.9. Now note that by Lemma 4.5, the Resisting oracle’s responses are identical to the responses of oracle \(\mathcal {O}_{F_{ \mathcal {N}, \mathbf {A}}^{(N)}}\) under E, and therefore when conditioned on events (a) through (c). Therefore, we will condition on events (a) through (c) all holding and consider the Resisting oracle to prove all four parts of the theorem.
We begin with the first part. Note that \(f(\mathbf {x})\) has Lipschitz constant bounded by 1. Next note that since each \(\mathbf {a} \in B_d\) has \(\left\Vert {\mathbf {a}}\right\Vert _2 \le d\) and \(\eta = d^5\) we have that the \(\left\lbrace \eta \left\Vert {\mathbf {A}\mathbf {x}}\right\Vert _\infty - \rho \right\rbrace\) term has Lipschitz constant bounded by \(L \le d^6\). Therefore, \(F_{f, \mathbf {A}}\) has Lipschitz constant bounded by \(d^6\). For the second part, we first note that under event (b) all Nemirovski vectors are unique; therefore, the vectors \(\mathbf {x}_{t_j}\) and subspace \(S_{j-1}\) are well-defined. Now under event E, by applying Lemma 4.7,
The third part holds by Lemma 4.11. For the final part, we first note that under event (b), if the Algorithm observes r unique Nemirovski vectors then it has only proceeded to the \({(r+1)}^{\textrm {th}}\) Phase of the Resisting oracle. Now by Lemma 4.8, under event E, any algorithm in the \({(r+1)}^{\textrm {th}}\) Phase of the Resisting oracle has function value at least \(-(r+2)\gamma\). Combining this with Equation (4.1) holding, for any algorithm which observes at most r unique Nemirovski vectors as gradients,
where the first inequality is by the above observation and Lemma 4.9 and the final inequality holds by noting that \(N\le (1/32 \gamma)^{2/3}\) and so \(-(N+ 2) \gamma \ge -1/(20 \sqrt {N})\). □
5 Memory Lower Bounds for the Orthogonal Vector Game
In this section, we prove the lower bound for the Orthogonal Vector Game (Game 1), building on the proof sketch in Section 3.2. We prove a slightly stronger result in that we allow the oracle response \({\mathbf {g}}_i \in \mathbb {R}^{d}\) to any query \(\mathbf {x}_i \in \mathbb {R}^d\) to be an arbitrary (possibly randomized) function from \(\mathbb {R}^d\) to \(\lbrace \mathbf {a}_1, \dots , \mathbf {a}_n\rbrace\).
Recall that \({\mathbf {G}}=\left(\mathbf {g}_1, \dots , \mathbf {g}_m \right)^{\top } \in \mathbb {R}^{m \times d}\). We perform a small augmentation of the matrix \({\mathbf {G}}\) to streamline our analysis, defined by the following matching between the rows of \(\mathbf {A}\) and \(\mathbf {G}\). For every row of \(\mathbf {A}\), if that row is also present in the matrix \(\mathbf {G}\), then we match it to that row of \(\mathbf {G}\). If some row of \(\mathbf {A}\) is present multiple times in \(\mathbf {G}\), then we match it to its first occurrence in \(\mathbf {G}\). We define any matched row in \(\mathbf {G}\) as a unique row. Let \(m_{\textrm {U}}\in [m]\) denote the number of such unique rows. If \(m_{\textrm {U}}\lt m\), construct a matrix \(\tilde {{\mathbf {G}}} \in \mathbb {R}^{(2m-m_{\textrm {U}}) \times d}\) by appending \(m-m_{\textrm {U}}\) additional rows of \(\mathbf {A}\) to \({\mathbf {G}}\), choosing the first \(m-m_{\textrm {U}}\) rows of \(\mathbf {A}\) which were previously unmatched. We now match \(\mathbf {A}\) to \(\tilde {{\mathbf {G}}}\) in the same way, and note that \(\tilde {\mathbf {G}}\) now has exactly m unique rows by definition. Given \(\mathbf {A}\) and \(\tilde {\mathbf {G}}\) let \(\mathbf {A^{\prime }}\in \mathbb {R}^{(n-m) \times d}\) denote the matrix \(\mathbf {A}\) when all the rows of \(\mathbf {A}\) which were matched to some row in \(\tilde {\mathbf {G}}\) are removed. Drawing on these definitions, we begin with the following relation between the entropy of \(\mathbf {A}\) and \(\mathbf {A^{\prime }}\) conditioned on \(\tilde {\mathbf {G}}\) and R.
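Before stating that relation, the following sketch makes the matching and augmentation just described concrete. It is purely illustrative (the function name is ours, and it assumes for simplicity that the rows of \(\mathbf {A}\) are distinct, which holds with high probability):

```python
import numpy as np

def build_G_tilde_and_A_prime(A: np.ndarray, G: np.ndarray):
    """Illustrative sketch of the augmentation described above (assumes the
    rows of A are distinct, which holds with high probability)."""
    m = G.shape[0]
    matched_A_rows = []      # rows of A matched to some row of G (or to the padding)
    matched_G_rows = set()   # rows of G that received a match ("unique" rows of G)
    for i, a in enumerate(A):
        for j, g in enumerate(G):
            if np.array_equal(a, g):
                # Match row i of A to its first occurrence in G.
                matched_A_rows.append(i)
                matched_G_rows.add(j)
                break
    m_U = len(matched_G_rows)
    # Append the first (m - m_U) previously unmatched rows of A, so that the
    # augmented matrix G_tilde has exactly m unique (matched) rows.
    unmatched = [i for i in range(A.shape[0]) if i not in matched_A_rows]
    padding = unmatched[: m - m_U]
    G_tilde = np.vstack([G, A[padding]]) if padding else G.copy()
    matched_A_rows.extend(padding)
    # A' is A with every matched row removed; it has n - m rows.
    keep = [i for i in range(A.shape[0]) if i not in matched_A_rows]
    return G_tilde, A[keep]
```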
Note that for any random variable X, there exists a description of the random variable with expected description length \(L_X\) bounded as [22, Chapter 5],
\begin{align} H(X)\le L_X \le H(X)+1. \end{align}
(5.1)
Now fix any \(\tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }\) and \(R=R^{\prime }\). Let \(h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }} := H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }, R=R^{\prime })\). Note that by definition,
Now by Equation (5.1) if \(H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }, R=R^{\prime })=h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }}\), then given \(\tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }\) and \(R=R^{\prime }\) there exists some description of \(\mathbf {A^{\prime }}\) which has expected description length at most \(h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }}+1\). Using this we construct a description of \(\mathbf {A}\) given \(\tilde {\mathbf {G}}=\tilde{\mathbf {G}}^{\prime }\) and \(R=R^{\prime }\) which has expected description length at most \(h_{\tilde {\mathbf {G}}^{\prime },R^{\prime }}+1 + 2m\log (4n)\). To do this, note that if there is already a description of \(\mathbf {A^{\prime }}\) and we are given \(\tilde {\mathbf {G}}=\tilde {\mathbf {G}}^{\prime }\), then to specify \(\mathbf {A}\) we only need to additionally specify the m rows which were removed from \(\mathbf {A}\) to construct \(\mathbf {A^{\prime }}\). For every row which was removed from \(\mathbf {A}\), we can specify its row index in the original matrix \(\mathbf {A}\) using \(\lceil \log (n) \rceil \le \log (2n)\) bits, and specify which of the rows of \(\tilde {\mathbf {G}}^{\prime }\) it is using another \(\lceil \log (2m-m_{\textrm {U}}) \rceil \le \lceil \log (2m) \rceil \le \log (4n)\) bits, for a total of at most \(2m\log (4n)\) additional bits. Therefore, given \(\tilde {\mathbf {G}}=\tilde{\mathbf {G}}^{\prime }\) and \(R=R^{\prime }\),
We claim that \(\tilde {\mathbf {G}}\) can be written with at most \(m\log (\left| B_d \right|) + 2m\log (2n)\) bits, therefore, its entropy is at most \(m\log (\left| B_d \right|) + 2m\log (2n)\) (by Equation (5.1)). To prove this, note that \(\tilde {\mathbf {G}}\) consists of rows of \(\mathbf {A}\) which are sampled from \(B_d\), and recall that \(\tilde{\mathbf {G}}\) has exactly m unique rows. Therefore, we can first write down all the m unique rows of \(\tilde {\mathbf {G}}\) using \(m\log (\left| B_d \right|)\) bits, and then for each of the at most \((2m-m_{\textrm {U}})\le 2m\) rows of \(\tilde {\mathbf {G}}\) we can write down which of the unique rows it is with \(\lceil \log (m)\rceil \le \log (2n)\) bits each, for a total of at most \(m\log (\left| B_d \right|) + 2m\log (2n)\) bits. □
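As an illustration of this bit-accounting (the function name is ours), the two-part code described above can be tallied as follows; the second return value is the slightly looser bound \(m\log (\left| B_d \right|) + 2m\log (2n)\) used in the text:

```python
import math

def g_tilde_code_length_bits(m: int, m_unique: int, B_d_size: int, n: int):
    """Bit count for the two-part code of G_tilde sketched above.

    Assumes 1 <= m_unique <= m <= n.  Part 1 writes the m unique rows of
    G_tilde explicitly (log2|B_d| bits each); Part 2 records, for each of the
    (2m - m_U) <= 2m rows, which unique row it is (ceil(log2 m) <= log2(2n)
    bits each).
    """
    part_unique_rows = m * math.log2(B_d_size)
    part_row_indices = (2 * m - m_unique) * math.ceil(math.log2(m))
    bound_from_text = m * math.log2(B_d_size) + 2 * m * math.log2(2 * n)
    return part_unique_rows + part_row_indices, bound_from_text
```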
The following lemma shows that Y is a deterministic function of \(\tilde{\mathbf {G}}, \mathsf {Message}\) and R.
Lemma 5.3.
The queries \(\lbrace \mathbf {x}_i, i \in [m]\rbrace\) are a deterministic function of \(\mathbf {G}, \mathsf {Message}\) and R. Therefore, \(\mathbf {Y}=g(\tilde{\mathbf {G}},\mathsf {Message},R)\) for some function g.
Proof.
Given \(\mathbf {A}\) and R, we observe that the Player is constrained to using a deterministic algorithm to both construct message \(\mathsf {Message}\) to store and to determine the queries \((\mathbf {x}_1, \dots , \mathbf {x}_m)\). Therefore, we see that for any i there exists a function \(\phi _i\) such that for any \(\mathsf {Message}\), R, and responses \((\mathbf {g}_1, \dots , \mathbf {g}_{i-1})\), \(\mathbf {x}_i = \phi _i(\mathsf {Message}, R, \mathbf {x}_1, \mathbf {g}_1, \dots , \mathbf {x}_{i-1}, \mathbf {g}_{i-1})\). Next, we remark that there exists a function \(\phi _i^{\prime }\) such that \(\mathbf {x}_i = \phi _i^{\prime }(\mathsf {Message}, R, \mathbf {g}_1, \dots , \mathbf {g}_{i-1})\). We can prove this by induction. For the base case, we simply note that \(\phi _1^{\prime } = \phi _1\) and \(\mathbf {x}_1 = \phi _1^{\prime }(\mathsf {Message}, R)\). Next, assume the inductive hypothesis that for any \(j \le i-1\) we have \(\mathbf {x}_j = \phi _j^{\prime }(\mathsf {Message}, R, \mathbf {g}_1, \dots , \mathbf {g}_{j-1})\). Then
Therefore, given \((\mathsf {Message}, R, \mathbf {G})\), it is possible to reconstruct the queries \(\mathbf {X}\). Since \(\mathbf {Y}\) is a deterministic function of \((\mathbf {X},\mathbf {G},\mathsf {Message},R)\) in the Orthogonal Vector Game, and \(\mathbf {G}\) is just the first m rows of \(\tilde{\mathbf {G}}\), we have \(\mathbf {Y}=g(\tilde{\mathbf {G}},\mathsf {Message},R)\) for some function g. □
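The induction above amounts to a simple replay procedure; the following sketch (names ours, illustrative only) reconstructs the queries from \((\mathsf {Message}, R, \mathbf {G})\) given the Player's deterministic query rule:

```python
def replay_queries(message, R, G_rows, query_rule):
    """Reconstruct X = (x_1, ..., x_m) from (Message, R, G), as in Lemma 5.3.

    `query_rule(message, R, responses_so_far)` is a stand-in for the Player's
    deterministic map phi'_i: it returns the next query given Message, R, and
    the oracle responses received so far.
    """
    queries, responses = [], []
    for g_i in G_rows:                     # g_i is the oracle's response to x_i
        x_i = query_rule(message, R, tuple(responses))
        queries.append(x_i)
        responses.append(g_i)
    return queries
```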
As sketched out in Section 3.2, in the following two lemmas, we compute \(I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R)\) in two different ways.
where in the last step we use the fact that \(\mathsf {Message}\) is M bits long. □
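The omitted chain of inequalities can plausibly be reconstructed as follows (a sketch on our part, using only Lemma 5.3 and standard entropy facts):
\begin{align*} I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R) \;\le\; H(\mathbf {Y} \vert \tilde {\mathbf {G}}, R) \;\le\; H(\mathsf {Message} \vert \tilde {\mathbf {G}}, R) \;\le\; H(\mathsf {Message}) \;\le\; M, \end{align*}
where the second inequality holds because, conditioned on \((\tilde {\mathbf {G}}, R)\), \(\mathbf {Y}\) is a deterministic function of \(\mathsf {Message}\) by Lemma 5.3.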
Lemma 5.5.
Suppose \(\mathbf {Y}\) wins the Two Player Orthogonal Vector Game with failure probability at most \(\delta\). Then if \(\left\lbrace B_d \right\rbrace _{d = 1}^{\infty }\) is a memory-sensitive base with constant \(c_{B}\gt 0\),
We will lower bound \(I(\mathbf {A^{\prime }}; \mathbf {Y} \vert \tilde {\mathbf {G}}, R)\) by providing a lower bound for \(H(\mathbf {A^{\prime }}\vert \tilde {\mathbf {G}}, R)\) and an upper bound for \(H(\mathbf {A^{\prime }}\vert \mathbf {Y}, \tilde {\mathbf {G}}, R)\). To that end, by Lemma 5.1 we have
where the inequalities use the fact that entropy is non-negative, and that conditioning reduces entropy. We now note that since \(\mathbf {A}\) is independent of R, \(H(\mathbf {A}\vert R)=H(\mathbf {A})\). Thus, we have
where \(\mathbf {a}^{\prime }_j\) is the \({j}^{\textrm {th}}\) row of \(\mathbf {A^{\prime }}\). Next recall that if \(\mathbf {Y} = (\mathbf {y}_1, \dots , \mathbf {y}_{k})^{\top } \in \mathbb {R}^{k\times d}\) wins the Two Player Orthogonal Vector Game then for any \(\mathbf {y}_i\), \(\left\Vert {\mathbf {A}\mathbf {y}_i}\right\Vert _\infty /\left\Vert {\mathbf {y}_i}\right\Vert _2 \le d^{-4}\) and for \(S_i = \mathsf {span}(\mathbf {y}_1, \dots , \mathbf {y}_{i-1})\), we have \(\left\Vert {\mathrm{Proj}_{S_i}(\mathbf {y}_i)}\right\Vert _2/\left\Vert {\mathbf {y}_i}\right\Vert _2 \le 1 - (1/d^{2})\). Since these properties are normalized by \(\left\Vert {\mathbf {y}_i}\right\Vert _2\) we may, without loss of generality, overload our notation for \(\mathbf {y}_i\) by normalizing: we set \(\mathbf {y}_i \leftarrow \mathbf {y}_i/\left\Vert {\mathbf {y}_i}\right\Vert _2\). Note that for any j the entropy of \(\mathbf {a}^{\prime }_j\) conditioned on any value \(\mathbf {Y}^{\prime }\) taken by the random variable \(\mathbf {Y}\) is bounded above by the entropy of \(\mathbf {a}^{\prime }\) where the law of \(\mathbf {a}^{\prime }\) corresponds to the uniform distribution over the set \(\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y}^{\prime } \mathbf {a}}\right\Vert _\infty \le \frac{1}{d^4} \rbrace\) and this has entropy \(\log (|{\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y}^{\prime } \mathbf {a}}\right\Vert _\infty \le \frac{1}{d^4}\rbrace }|)\). We also note that since \(\mathbf {Y} = g(\tilde {\mathbf {G}}, \mathsf {Message}, R)\) (Lemma 5.3), and all of \(\tilde{\mathbf {G}}, \mathsf {Message}\) and R take values on a finite support, \(\mathbf {Y}\) also takes values on some finite support (\(\mathbf {Y}\) can still be real-valued, but its support must be finite). Therefore, for any j, we can write,
Thus, for any \(\mathbf {Y} = (\mathbf {y}_1, \dots , \mathbf {y}_{k})\) which wins the Two Player Orthogonal Vector Game, we will upper bound \(|{\lbrace \mathbf {a} \in B_d \textrm { s.t. } \left\Vert {\mathbf {Y} \mathbf {a}}\right\Vert _\infty \le 1/d^4\rbrace }|\). We will use the following lemma (proved in Appendix A.2) which constructs a partial orthonormal basis from a set of robustly linearly independent vectors.
Lemma 5.6.
Let \(\lambda \in (0, 1]\) and suppose we have \(n_0\le d\) unit norm vectors \(\mathbf {y}_1, \dots , \mathbf {y}_{n_0} \in \mathbb {R}^d\). Let \(S_i := \mathsf {span}(\lbrace \mathbf {y}_1, \dots , \mathbf {y}_{i} \rbrace), S_0 := \phi\). Suppose that for any \(i \in [n_0]\),
By combining Lemmas 5.4 and 5.5, we prove the memory-query tradeoff for the Orthogonal Vector Game.
Theorem 3.3.
Suppose for a memory-sensitive base \(\left\lbrace B_d \right\rbrace _{d = 1}^{\infty }\) with constant \(c_{B}\gt 0\), \(\mathbf {A}\sim \mathsf {Unif}(B_d^n)\). Given \(\mathbf {A}\) let the oracle response \(\mathbf {g}_i\) to any query \(\mathbf {x}_i \in \mathbb {R}^d\) be any (possibly randomized) function from \(\mathbb {R}^d\rightarrow \lbrace \mathbf {a}_1, \dots , \mathbf {a}_{d/2}\rbrace\), where \(\mathbf {a}_i\) is the \({i}^{\textrm {th}}\) row of \(\mathbf {A}\) (note this includes the subgradient response in the Orthogonal Vector Game). Assume \(M \ge d \log (2d)\) and set \(k=\left\lceil 60M/(c_{B}d) \right\rceil\). For these values of k and M, if the Player wins the Orthogonal Vector Game with probability at least \(1/3\), then \(m\ge d/5\).
However, if the player returns some \(\mathbf {Y}\) which wins the Orthogonal Vector Game with failure probability at most \(\delta\) then by Lemma 5.5,
Recall that \(k = \lceil 60 M / (c_{B}d) \rceil\) and thus, if \(M \ge d \log (4n)\) then for \(\delta = 2/3\) we have \((1 - \delta) c_{B}k/2 \ge 4 \log (4n)\). Using this and the fact that \(n = \lfloor d/2 \rfloor\),
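For completeness, the claimed inequality can be checked directly:
\begin{align*} (1 - \delta)\, \frac{c_{B}k}{2} \;=\; \frac{c_{B}k}{6} \;\ge\; \frac{c_{B}}{6} \cdot \frac{60 M}{c_{B}d} \;=\; \frac{10 M}{d} \;\ge\; 10 \log (4n) \;\ge\; 4 \log (4n), \end{align*}
using \(\delta = 2/3\), \(k = \lceil 60 M/(c_{B}d) \rceil\), and \(M \ge d \log (4n)\).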
Acknowledgments
Thank you to anonymous reviewers for helpful feedback on earlier drafts of this paper.
Footnotes
1
This arises from \(\Omega (d \log (1/\epsilon))\) iterations, each of which performs a change of basis or solves a linear system and naively takes \(\Omega (d^2)\) time.
2
By a blackbox-reduction in Appendix A.3 our results extend to unconstrained optimization while only losing \(\mathrm{poly}(d)\) factors in the accuracy \(\epsilon\) for which the lower bound applies.
3
We remark that \(2^d\) can be replaced by any finite-valued function of d.
4
Note that 0 queries are needed when \(\epsilon \ge L\).
A Additional Helper Lemmas
A.1 Useful Concentration Bounds
In this section, we establish some concentration bounds which we repeatedly use in our analysis.
Definition A.1 (sub-Gaussian random variable).
A zero-mean random variable X is sub-Gaussian if for some constant \(\sigma\) and for all \(\lambda \in \mathbb {R}\), \(\mathbb {E}[e^{\lambda X}]\le e^{\lambda ^2 \sigma ^2/2}\). We refer to \(\sigma\) as the sub-Gaussian parameter of X. We also define the sub-Gaussian norm \(\left\Vert {X}\right\Vert _{\psi _2}\) of a sub-Gaussian random variable X as follows:
\begin{equation*} \left\Vert {X}\right\Vert _{\psi _2} := \inf \left\lbrace K \gt 0 \textrm { such that } \mathbb {E}[\exp (X^2/K^2)] \le 2 \right\rbrace . \end{equation*}
We use that \(X\sim \mathsf {Unif}(\lbrace \pm 1\rbrace)\) is sub-Gaussian with parameter \(\sigma =1\) and sub-Gaussian norm \(\left\Vert {X}\right\Vert _{\psi _2} \le 2\). The following fact about sums of independent sub-Gaussian random variables is useful in our analysis.
Fact A.2.
[69] For any n, if \(\lbrace X_i, i \in [n]\rbrace\) are independent zero-mean sub-Gaussian random variables with sub-Gaussian parameter \(\sigma _i\), then \(X^{\prime }=\sum _{i \in [n]} X_i\) is a zero-mean sub-Gaussian with parameter \(\sqrt {\sum _{i \in [n]} \sigma _i^2}\) and norm bounded as \(\left\Vert {X^{\prime }}\right\Vert _{\psi _2}\le \alpha \sqrt {\sum _{i \in [n]} \left\Vert {X_i}\right\Vert _{\psi _2}^2}\), for some universal constant \(\alpha\).
Using Fact A.2, we obtain the following corollary which we use directly in our analysis.
Corollary A.3.
For \(\mathbf {v}\sim (1/\sqrt {d}) \mathsf {Unif} \left(\mathcal {H}_d \right)\) and any fixed vector \(\mathbf {x}\in \mathbb {R}^d\), \(\mathbf {x}^{\top } \mathbf {v}\) is a zero-mean sub-Gaussian with sub-Gaussian parameter at most \(\left\Vert {\mathbf {x}}\right\Vert _2/\sqrt {d}\) and sub-Gaussian norm at most \(\left\Vert {\mathbf {x}^{\top } \mathbf {v}}\right\Vert _{\psi _2} \le 2\alpha \left\Vert {\mathbf {x}}\right\Vert _2/\sqrt {d}\).
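For intuition, the corollary follows by writing \(\mathbf {x}^{\top } \mathbf {v}= \sum _{i \in [d]} x_i v_i\), where each term \(x_i v_i\) is zero-mean sub-Gaussian with parameter \(|x_i|/\sqrt {d}\) and norm at most \(2|x_i|/\sqrt {d}\); Fact A.2 then gives
\begin{align*} \sqrt {\sum _{i \in [d]} \frac{x_i^2}{d}} = \frac{\left\Vert {\mathbf {x}}\right\Vert _2}{\sqrt {d}} \quad \text{for the parameter, and} \quad \left\Vert {\mathbf {x}^{\top } \mathbf {v}}\right\Vert _{\psi _2} \le \alpha \sqrt {\sum _{i \in [d]} \frac{4 x_i^2}{d}} = \frac{2\alpha \left\Vert {\mathbf {x}}\right\Vert _2}{\sqrt {d}}. \end{align*}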
The following result bounds the tail probability of sub-Gaussian random variables.
Fact A.4.
[70] If X is a zero-mean sub-Gaussian random variable with parameter \(\sigma\), then for any \(t \ge 0\), \(\mathbb {P}(\left| X \right| \ge t) \le 2 \exp (-t^2/(2\sigma ^2))\).
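For completeness, this tail bound follows from the standard Chernoff argument: for any \(\lambda \gt 0\),
\begin{align*} \mathbb {P}(X \ge t) \;\le\; e^{-\lambda t}\, \mathbb {E}[e^{\lambda X}] \;\le\; e^{\lambda ^2 \sigma ^2/2 - \lambda t} \;=\; e^{-t^2/(2\sigma ^2)} \quad \text{for } \lambda = t/\sigma ^2, \end{align*}
and applying the same bound to \(-X\) gives the factor of 2.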
We will use the following concentration bound for quadratic forms.
Theorem A.5 (Hanson–Wright inequality [32, 57]).
Let \(\mathbf {x}= (X_1, \dots , X_d) \in \mathbb {R}^d\) be a random vector with i.i.d components \(X_i\) which satisfy \(\mathbb {E}[X_i] = 0\) and \(\left\Vert {X_i}\right\Vert _{\psi _2} \le K\) and let \(\mathbf {M} \in \mathbb {R}^{n \times n}\). Then for some absolute constant \(c_{\textrm {hw}}\gt 0\) and for every \(t \ge 0\),
where \(\left\Vert {\mathbf {M}}\right\Vert _2\) is the operator norm of \(\mathbf {M}\) and \(\left\Vert {\mathbf {M}}\right\Vert _{\textrm {F}}\) is the Frobenius norm of \(\mathbf {M}\).
The following lemma will let us analyze projections of sub-Gaussian random variables (getting concentration bounds which look like projections of Gaussian random variables).
Lemma A.6.
Let \(\mathbf {x}= (X_1, \dots , X_d) \in \mathbb {R}^d\) be a random vector with i.i.d sub-Gaussian components \(X_i\) which satisfy \(\mathbb {E}[X_i] = 0\), \(\left\Vert {X_i}\right\Vert _{\psi _2} \le K\), and \(\mathbb {E}[\mathbf {x}\mathbf {x}^{\top }]=s^2 \mathbf {I}_d\). Let \(\mathbf {Z}\in \mathbb {R}^{d \times r}\) be a matrix with orthonormal columns \((\mathbf {z}_1,\dots , \mathbf {z}_r)\). There exists some absolute constant \(c_{\textrm {hw}}\gt 0\) (which comes from the Hanson–Wright inequality) such that for \(t\ge 0\),
Next, we bound the operator norm and Frobenius-norm of \(\mathbf {Z}\mathbf {Z}^{\top }\). The operator norm of \(\mathbf {Z}\mathbf {Z}^{\top }\) is bounded by 1 since the columns of \(\mathbf {Z}\) are orthonormal vectors. For the Frobenius-norm of \(\mathbf {Z}\mathbf {Z}^{\top }\), note that
Applying the Hanson–Wright inequality (Theorem A.5) with these bounds on \(\mathbb {E}[ \Vert {\mathbf {Z}^{\top } \mathbf {x}}\Vert _2^2]\), \(\Vert {\mathbf {Z} \mathbf {Z}^{\top }}\Vert _{2}^2\), and \(\Vert {\mathbf {Z} \mathbf {Z}^{\top }}\Vert _{\textrm {F}}^2\) yields (A.1) and completes the proof. □
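For reference, the quantities plugged into the Hanson–Wright inequality here can be computed directly, using \(\mathbf {Z}^{\top } \mathbf {Z}= \mathbf {I}_r\) and \(\mathbb {E}[\mathbf {x}\mathbf {x}^{\top }]=s^2 \mathbf {I}_d\):
\begin{align*} \mathbb {E}\left[\left\Vert {\mathbf {Z}^{\top } \mathbf {x}}\right\Vert _2^2\right] = \mathrm{tr}\left(\mathbf {Z}\mathbf {Z}^{\top } \mathbb {E}[\mathbf {x}\mathbf {x}^{\top }]\right) = s^2\, \mathrm{tr}\left(\mathbf {Z}^{\top } \mathbf {Z}\right) = s^2 r, \qquad \left\Vert {\mathbf {Z}\mathbf {Z}^{\top }}\right\Vert _{\textrm {F}}^2 = \mathrm{tr}\left(\mathbf {Z}^{\top } \mathbf {Z}\, \mathbf {Z}^{\top } \mathbf {Z}\right) = \mathrm{tr}(\mathbf {I}_r) = r. \end{align*}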
Corollary A.7.
Let \(\mathbf {v}\sim (1/\sqrt {d}) \mathsf {Unif} \left(\mathcal {H}_d \right)\) and let S be a fixed r-dimensional subspace of \(\mathbb {R}^d\) (independent of \(\mathbf {v}\)). Then with probability at least \(1-2 \exp (c_{\textrm {hw}})/d^5\),
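As a purely numerical illustration of this corollary (function name ours; the exact high-probability bound is the one stated above), the projection of such a \(\mathbf {v}\) onto a fixed r-dimensional subspace concentrates around \(\sqrt {r/d}\):

```python
import numpy as np

def projection_norm_samples(d: int, r: int, trials: int = 10_000, seed: int = 0):
    """Monte Carlo samples of ||Proj_S(v)||_2 for v ~ (1/sqrt(d)) Unif({+-1}^d)
    and a fixed r-dimensional subspace S chosen independently of v."""
    rng = np.random.default_rng(seed)
    # Orthonormal basis Z for a fixed r-dimensional subspace of R^d.
    Z, _ = np.linalg.qr(rng.standard_normal((d, r)))
    v = rng.choice([-1.0, 1.0], size=(trials, d)) / np.sqrt(d)
    return np.linalg.norm(v @ Z, axis=1)

# Example: projection_norm_samples(400, 20).mean() is close to sqrt(20/400) ~ 0.22.
```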
Let \(\mathbf {B} = (\mathbf {b}_1, \dots , \mathbf {b}_{n_0})\in \mathbb {R}^{d \times n_0}\) and note that with this construction there exists a vector of coefficients \(\mathbf {c}^{(i)} \in \mathbb {R}^{n_0}\) such that
Let \(\mathbf {C} = (\mathbf {c}^{(1)}, \dots , \mathbf {c}^{(n_0)}) \in \mathbb {R}^{n_0\times n_0}\) and observe that \(\mathbf {Y} = \mathbf {B} \mathbf {C}\). Denote the singular values of \(\mathbf {C}\) as \(\sigma _1 \ge \dots \ge \sigma _{n_0}\). We aim at bounding the singular values of \(\mathbf {C}\) from below. To this end, observe that \(\mathbf {C}\) is an upper triangular matrix such that for any \(i \in [n_0]\), \(\left| \mathbf {C}_{ii} \right| \ge \sqrt {\lambda }\). Indeed, note that
where in the last step we use the fact that the determinant of a triangular matrix is the product of its diagonal entries. Next, we consider the singular value decomposition of \(\mathbf {C}\): let \(\mathbf {U}\in \mathbb {R}^{n_0\times n_0}, \mathbf {V}\in \mathbb {R}^{n_0\times n_0}\) be orthogonal matrices and \(\mathbf {\Sigma }\in \mathbb {R}^{n_0\times n_0}\) be the diagonal matrix such that \(\mathbf {C} = \mathbf {U} \mathbf {\Sigma } \mathbf {V}^{\top }\). For the remainder of the proof assume without loss of generality that the columns of \(\mathbf {Y}\) and \(\mathbf {B}\) are ordered such that \(\Sigma _{ii} = \sigma _i\). We will upper bound the singular values of \(\mathbf {C}\) as follows: observe that since \(\left\Vert {\mathbf {y}_i}\right\Vert _2 = 1\), for any vector \(\mathbf {w}\) such that \(\left\Vert {\mathbf {w}}\right\Vert _2 \le 1\),
In particular, consider \(\mathbf {w} = \mathbf {B} \mathbf {u}_i\) where \(\mathbf {u}_i\) denotes the \({i}^{\textrm {th}}\) column of \(\mathbf {U}\). Since \(\mathbf {B}^{\top } \mathbf {B} = \mathbf {I}\) and \(\mathbf {U}, \mathbf {V}\) are orthogonal we have
In particular, when \(m = n_1= \lceil n_0/2 \rceil\) we have \(\sigma _{n_0- n_1} \ge \lambda /\sqrt {d}\). Therefore, for any \(i \le n_0- n_1= \lfloor \frac{n_0}{2} \rfloor\),
Recall that \(\mathbf {u}_i\) denotes the \({i}^{\textrm {th}}\) column of \(\mathbf {U}\) and let \(\mathbf {v}_i\) denote the \({i}^{\textrm {th}}\) column of \(\mathbf {V}\). Observe that
Extend \(\lbrace \mathbf {b}_1, \dots , \mathbf {b}_{n_0} \rbrace\) to an orthonormal basis of \(\mathbb {R}^d\) and let \(\tilde{\mathbf {B}} = (\mathbf {B}, \mathbf {b}_{n_0+ 1}, \dots , \mathbf {b}_{d})\) be the \(d \times d\) matrix corresponding to this orthonormal basis. Note that if \(j \gt n_0\), then for any \(\mathbf {y}_i\), \(\mathbf {b}_j^{\top } \mathbf {y}_i = 0\). Define
For \(i \in [n_1]\) define \(\mathbf {m}_i = \tilde{\mathbf {B}} \tilde{\mathbf {u}}_i\) and let \(\mathbf {M} = (\mathbf {m}_1, \dots , \mathbf {m}_{n_1})\). Note that \(\lbrace \mathbf {m}_1, \dots , \mathbf {m}_{n_1} \rbrace\) is an orthonormal set. The result then follows since for any \(\mathbf {a} \in \mathbb {R}^d\),
A.3 From Constrained to Unconstrained Lower Bounds
Here we provide a generic, black-box reduction from approximate Lipschitz convex optimization over a unit ball to unconstrained approximate Lipschitz convex optimization. The reduction leverages a natural extension of functions on the unit ball to \(\mathbb {R}^{d}\), provided and analyzed in the following lemma:
Lemma A.8.
Let \(f:B_{2}^{d}\rightarrow \mathbb {R}\) be a convex, L-Lipschitz function and let \(g_{r,L}^{f}:\mathbb {R}^{d}\rightarrow \mathbb {R}\) and \(h_{r,L}^{f}:\mathbb {R}^{d}\rightarrow \mathbb {R}\) be defined for all \(\mathbf {x}\in \mathbb {R}^{d}\) and some \(r\in (0,1)\) as
Then, \(g_{r,L}^{f}(\mathbf {x})\) is a convex, \(L(\frac{1+r}{1-r})\)-Lipschitz function with \(g_{r,L}^{f}(\mathbf {x})=f(\mathbf {x})\) for all \(\mathbf {x}\) with \(\left\Vert {\mathbf {x}}\right\Vert _{2}\le r\).
Now \(|f(\mathbf {x})-f(\mathbf {0})|\le L\left\Vert {\mathbf {x}}\right\Vert _2\) since f is L-Lipschitz. Correspondingly, when \(\left\Vert {\mathbf {x}}\right\Vert _2=1\) this implies
Consequently, \(h_{r,L}^{f}(\mathbf {x})\ge f(\mathbf {x})\) when \(\left\Vert {\mathbf {x}}\right\Vert _2=1\) and \(g_{r,L}^{f}(\mathbf {x})=h_{r,L}^{f}(\mathbf {x})=\max \lbrace f(\mathbf {x}),h_{r,L}^{f}(\mathbf {x})\rbrace\). Furthermore, since \(h_{r,L}^{f}\) is \(L(\frac{1+r}{1-r})\)-Lipschitz and convex and f is L-Lipschitz and convex, this implies that \(g_{r,L}^{f}\) is \(L(\frac{1+r}{1-r})\)-Lipschitz and convex as well. Finally, if \(\left\Vert {\mathbf {x}}\right\Vert _{2}\le r\) then again by (A.2) and the fact that \(|f(\mathbf {x})-f(\mathbf {0})|\le L\left\Vert {\mathbf {x}}\right\Vert _2\) we have
Therefore, \(g_{r,L}^{f}(\mathbf {x})=f(\mathbf {x})\) for all \(\mathbf {x}\) with \(\left\Vert {\mathbf {x}}\right\Vert _{2}\le r\). □
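The displayed definitions of \(g_{r,L}^{f}\) and \(h_{r,L}^{f}\) are omitted above. One concrete candidate for \(h_{r,L}^{f}\) that is consistent with every inequality used in this proof (an assumption on our part, not necessarily the paper's exact choice) is
\begin{align*} h_{r,L}^{f}(\mathbf {x}) := f(\mathbf {0}) + L\,\frac{1+r}{1-r}\left\Vert {\mathbf {x}}\right\Vert _2 - \frac{2Lr}{1-r}. \end{align*}
With this choice, \(h_{r,L}^{f}\) is convex and \(L(\frac{1+r}{1-r})\)-Lipschitz; when \(\left\Vert {\mathbf {x}}\right\Vert _2=1\) it equals \(f(\mathbf {0})+L \ge f(\mathbf {x})\), and when \(\left\Vert {\mathbf {x}}\right\Vert _2\le r\) it is at most \(f(\mathbf {0})-L\left\Vert {\mathbf {x}}\right\Vert _2 \le f(\mathbf {x})\), matching the two inequalities established above.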
Leveraging this extension, we obtain the following result:
Lemma A.9.
Suppose any M-memory-constrained randomized algorithm must make (with high probability in the worst case) at least \(\mathcal {T}_{\epsilon }\) queries to a first-order oracle for convex, L-Lipschitz \(f:B_{2}^{d}\rightarrow \mathbb {R}\) in order to compute an \(\epsilon\)-optimal point for some \(\epsilon \lt L\).4 Then any M-memory-constrained randomized algorithm must make at least \(\mathcal {T}_{\epsilon }/2\) queries to a first-order oracle (with high probability in the worst case) to compute an \(\epsilon /2\)-optimal point for a \(\mathcal {O}(L^{2}/\epsilon)\)-Lipschitz, convex \(g:\mathbb {R}^{d}\rightarrow \mathbb {R}\) even though the minimizer of g is guaranteed to lie in \(B_{2}^{d}\).
Proof.
Let \(\mathbf {x}^\star\) be a minimizer of f. By convexity, for all \(\alpha \in [0,1]\) we have that
Consequently, for all \(\delta \gt 0\) and \(\alpha _{\delta }:= 1-\delta /L\) we have that \(f(\alpha _{\delta } \mathbf {x}^\star)\) is \(\delta\)-optimal.
Correspondingly, consider the function \(g:= g_{\alpha _{\epsilon /2},L}^{f}\) as defined in Lemma A.8, i.e., with \(\delta = \epsilon /2\). Note that g is \(L(\frac{1+\alpha _{\delta }}{1-\alpha _{\delta }})=\mathcal {O}(L^{2}/\epsilon)\)-Lipschitz. The minimizer of g also lies in \(B_2^d\) since g monotonically increases for \(\left\Vert {\mathbf {x}}\right\Vert _{2}\ge 1\). Suppose there is an algorithm which computes an \(\epsilon /2\)-optimal minimizer of every \(\mathcal {O}(L^{2}/\epsilon)\)-Lipschitz, convex function \(g:\mathbb {R}^d \rightarrow \mathbb {R}\) whose minimizer lies in \(B_2^d\), with high probability, using \(\mathcal {T}\) queries to a first-order oracle for g. Note that this algorithm can instead be implemented with \(2\mathcal {T}\) queries to f by simply querying \(f(\mathbf {0})\) along with any query, and then using the definition of g from Lemma A.8 (it is also possible to compute and store \(f(\mathbf {0})\) once in the beginning and hence only require \(\mathcal {T}+1\) queries, but this would require us to also analyze the number of bits of precision to which \(f(\mathbf {0})\) should be stored, which this proof avoids for simplicity). If \(\mathbf {x}_{\mathrm{out}}\) is the output of the algorithm then
Thus, \(g(\mathbf {x}_{\mathrm{out}})\le f(\mathbf {x}^\star)+\epsilon\). Let \(\tilde{\mathbf {x}}_{\mathrm{out}}=\min \lbrace 1,\left\Vert {\mathbf {x}_{\mathrm{out}}}\right\Vert _{2}^{-1}\rbrace \mathbf {x}_{\mathrm{out}}\). As g monotonically increases for \(\left\Vert {\mathbf {x}}\right\Vert _{2}\ge 1\), we have that \(g(\mathbf {x}_{\mathrm{out}})\ge g(\tilde{\mathbf {x}}_{\mathrm{out}})=f(\tilde{\mathbf {x}}_{\mathrm{out}})\) and thus, with \(2\mathcal {T}\) queries to f the algorithm can compute an \(\epsilon\)-approximate minimizer of f. Consequently, \(\mathcal {T}\ge \mathcal {T}_{\epsilon }/2\) as desired and the result follows. □
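To make the query-doubling step concrete, the following sketch implements the wrapper used in this proof (all names are ours, and `g_from_f` stands in for evaluating the extension of Lemma A.8 from first-order information about f):

```python
import numpy as np

def reduce_unconstrained_to_ball(unconstrained_alg, f_oracle, g_from_f, d):
    """Illustrative wrapper for the reduction in Lemma A.9.

    - f_oracle(x): first-order oracle for f on the unit ball.
    - g_from_f(f_at_x, f_at_0, x): first-order oracle of the extension g of
      Lemma A.8, computed from f's response at (the projection of) x and at 0.
    Each g-query costs two f-queries, so T queries to g cost 2*T queries to f.
    """
    def project_to_ball(x):
        nrm = np.linalg.norm(x)
        return x if nrm <= 1.0 else x / nrm

    def g_oracle(x):
        f_at_0 = f_oracle(np.zeros(d))           # queried alongside every g-query
        f_at_x = f_oracle(project_to_ball(x))
        return g_from_f(f_at_x, f_at_0, x)

    x_out = unconstrained_alg(g_oracle, d)
    # Map the output back into the unit ball, as done with x_out in the proof.
    return project_to_ball(x_out)
```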
In this section, we focused on the unit \(\ell _2\) norm ball; however, the analysis generalizes to arbitrary norms.
References
[1] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. 2018. The convergence of sparsified gradient methods. Advances in Neural Information Processing Systems 31 (2018).
Noga Alon, Yossi Matias, and Mario Szegedy. 1999. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences 58, 1 (1999), 137–147.
Yossi Arjevani and Ohad Shamir. 2015. Communication complexity of distributed convex learning and optimization. Advances in Neural Information Processing Systems 28 (2015).
David S. Atkinson and Pravin M. Vaidya. 1995. A cutting plane algorithm for convex programming that uses analytic centers. Mathematical Programming 69, 1 (1995), 1–43.
Maria Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. 2012. Distributed learning, communication complexity and privacy. In Proceedings of the Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 26–1.
Eric Balkanski and Yaron Singer. 2018. Parallelization does not accelerate convex optimization: Adaptivity lower bounds for non-smooth convex minimization. arXiv:1808.03880. Retrieved from https://arxiv.org/abs/1808.03880
Ziv Bar-Yossef, Thathachar S. Jayram, Ravi Kumar, and D. Sivakumar. 2004. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences 68, 4 (2004), 702–732.
Paul Beame, Shayan Oveis Gharan, and Xin Yang. 2018. Time-space tradeoffs for learning finite functions from random evaluations, with applications to polynomials. In Proceedings of the Conference On Learning Theory. PMLR, 843–856.
Moise Blanchard. 2024. Gradient descent is pareto-optimal in the oracle complexity and memory tradeoff for feasibility problems. arXiv:2404.06720. Retrieved from https://arxiv.org/abs/2404.06720
Moise Blanchard, Junhui Zhang, and Patrick Jaillet. 2023. Memory-constrained algorithms for convex optimization via recursive cutting-planes. Advances in Neural Information Processing Systems (2023).
Moïse Blanchard, Junhui Zhang, and Patrick Jaillet. 2023. Quadratic memory is necessary for optimal query complexity in convex optimization: Center-of-mass is pareto-optimal. arXiv:2302.04963. Retrieved from https://arxiv.org/abs/2302.04963
Gábor Braun, Cristóbal Guzmán, and Sebastian Pokutta. 2017. Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Transactions on Information Theory 63, 7 (2017), 4709–4724.
Mark Braverman, Ankit Garg, Tengyu Ma, Huy L. Nguyen, and David P. Woodruff. 2016. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing. 1011–1020.
Mark Braverman, Elad Hazan, Max Simchowitz, and Blake Woodworth. 2020. The gradient complexity of linear regression. In Proceedings of the Conference on Learning Theory. PMLR, 627–647.
Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. 2021. Thinking inside the ball: Near-optimal minimization of the maximal loss. In Proceedings of the Conference on Learning Theory. PMLR, 866–882.
Xi Chen and Binghui Peng. 2023. Memory-query tradeoffs for randomized convex optimization. arXiv:2306.12534. Retrieved from https://arxiv.org/abs/2306.12534
Kenneth L. Clarkson and David P. Woodruff. 2009. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing. 205–214.
Yuval Dagan, Gil Kur, and Ohad Shamir. 2019. Space lower bounds for linear prediction in the streaming model. In Proceedings of the Conference on Learning Theory. PMLR, 929–954.
Yuval Dagan and Ohad Shamir. 2018. Detecting correlations with little memory and communication. In Proceedings of the Conference On Learning Theory. PMLR, 1145–1198.
Jelena Diakonikolas and Cristóbal Guzmán. 2019. Lower bounds for parallel and randomized convex optimization. In Proceedings of the Conference on Learning Theory. PMLR, 1132–1157.
John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. 2013. Local privacy and statistical minimax rates. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 429–438.
Ankit Garg, Tengyu Ma, and Huy Nguyen. 2014. On communication cost of distributed statistical estimation and dimensionality. Advances in Neural Information Processing Systems 27 (2014).
Sumegha Garg, Ran Raz, and Avishay Tal. 2018. Extractor-based time-space lower bounds for learning. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. 990–1002.
Martin Grötschel, László Lovász, and Alexander Schrijver. 2012. Geometric Algorithms and Combinatorial Optimization. Vol. 2. Springer Science and Business Media.
David Lee Hanson and Farroll Tim Wright. 1971. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics 42, 3 (1971), 1079–1083.
M.R. Hestenes and E. Stiefel. 1952. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49, 6 (1952).
Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. 2014. Communication-efficient distributed dual coordinate ascent. Advances in Neural Information Processing Systems 27 (2014).
Albert Xin Jiang and Kevin Leyton-Brown. 2011. Polynomial-time computation of exact correlated equilibrium in compact games. In Proceedings of the 12th ACM Conference on Electronic Commerce. 119–126.
Haotian Jiang. 2021. Minimizing convex functions with integral minimizers. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 976–985.
Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. 2020. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. 944–953.
Michael I. Jordan, Jason D. Lee, and Yun Yang. 2018. Communication-efficient distributed statistical inference. Journal of the American Statistical Association (2018).
Leonid G. Khachiyan, Sergei Pavlovich Tarasov, and I. I. Erlikh. 1988. The method of inscribed ellipsoids. Soviet Mathematics-Doklady 37, 1 (1988), 226–230.
Gillat Kol, Ran Raz, and Avishay Tal. 2017. Time-space hardness of learning sparse parities. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. 1067–1080.
Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. 2015. A faster cutting plane method and its implications for combinatorial and convex optimization. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 1049–1065.
Anatoly Yur’evich Levin. 1965. An algorithm for minimizing convex functions. In Proceedings of the Doklady Akademii Nauk, Vol. 160. Russian Academy of Sciences, 1244–1247.
Dana Moshkovitz and Michal Moshkovitz. 2017. Mixing implies lower bounds for space bounded learning. In Proceedings of the Conference on Learning Theory. PMLR, 1516–1566.
Dana Moshkovitz and Michal Moshkovitz. 2018. Entropy samplers and strong generic lower bounds for space bounded learning. In Proceedings of the 9th Innovations in Theoretical Computer Science Conference. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
Ju E. Nesterov. 1989. Self-concordant functions and polynomial-time methods in convex programming. Report, Central Economic and Mathematic Institute, USSR Acad. Sci (1989).
Mert Pilanci and Martin J. Wainwright. 2017. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization 27, 1 (2017), 205–245.
Ran Raz. 2017. A time-space lower bound for a large class of learning problems. In Proceedings of the 2017 IEEE 58th Annual Symposium on Foundations of Computer Science. IEEE, 732–742.
Mark Rudelson and Roman Vershynin. 2013. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability 18 (2013), 1–9.
Ohad Shamir. 2014. Fundamental limits of online and distributed algorithms for statistical learning and estimation. Advances in Neural Information Processing Systems 27 (2014).
Ohad Shamir, Nati Srebro, and Tong Zhang. 2014. Communication-efficient distributed optimization using an approximate newton-type method. In Proceedings of the International Conference on Machine Learning. PMLR, 1000–1008.
Vatsal Sharan, Aaron Sidford, and Gregory Valiant. 2019. Memory-sample tradeoffs for linear regression with small error. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. 890–901.
Max Simchowitz. 2018. On the randomized complexity of minimizing a convex quadratic function. arXiv:1807.09386. Retrieved from https://arxiv.org/abs/1807.09386
Max Simchowitz, Ahmed El Alaoui, and Benjamin Recht. 2018. Tight query complexity lower bounds for PCA via finite sample deformed wigner law. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. 1249–1259.
Kartik Sivaramakrishnan and John Mitchell. 2012. Properties of a cutting plane method for semidefinite programming. Pacific Journal of Optimization 8 (2012), 779–802.
Jacob Steinhardt and John Duchi. 2015. Minimax rates for memory-bounded sparse linear regression. In Proceedings of the Conference on Learning Theory. PMLR, 1564–1587.
Jacob Steinhardt, Gregory Valiant, and Stefan Wager. 2016. Memory, communication, and statistical queries. In Proceedings of the Conference on Learning Theory. PMLR, 1490–1516.
Xiaoming Sun, David P. Woodruff, Guang Yang, and Jialin Zhang. 2021. Querying a matrix through matrix-vector products. ACM Transactions on Algorithms 17, 4 (2021), 1–19.
Pravin M. Vaidya. 1989. A new algorithm for minimizing convex functions over convex sets. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 338–343.
Blake Woodworth and Nathan Srebro. 2017. Lower bound for randomized first order convex optimization. arXiv:1709.03594. Retrieved from https://arxiv.org/abs/1709.03594
Blake Woodworth and Nathan Srebro. 2019. Open problem: The oracle complexity of convex optimization with limited memory. In Proceedings of the Conference on Learning Theory. PMLR, 3202–3210.
Blake E. Woodworth, Brian Bullins, Ohad Shamir, and Nathan Srebro. 2021. The min-max complexity of distributed stochastic convex optimization with intermittent communication. In Proceedings of the Conference on Learning Theory. PMLR, 4386–4437.
Blake E. Woodworth and Nati Srebro. 2016. Tight complexity bounds for optimizing composite objectives. Advances in Neural Information Processing Systems 29 (2016).
Peng Xu, Fred Roosta, and Michael W. Mahoney. 2020. Newton-type methods for non-convex optimization under inexact hessian information. Mathematical Programming 184, 1 (2020), 35–70.
Min Ye and Emmanuel Abbe. 2018. Communication-computation efficient gradient coding. In Proceedings of the International Conference on Machine Learning. PMLR, 5610–5619.
D. B. Yudin and Arkadi S. Nemirovskii. 1976. Informational complexity and efficient methods for the solution of convex extremal problems. Matekon 13, 2 (1976), 22–45.
Yuchen Zhang, John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. 2013. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Proceedings of the Conference on Neural Information Processing Systems. Citeseer, 2328–2336.