Sharp Guarantees and Optimal Performance for Inference in Binary and Gaussian-Mixture Models
Figure 1. (a) Comparison between theoretical (solid lines) and empirical (markers) performance of least-squares (LS) and least-absolute deviations (LAD), as predicted by Theorem 1, and the optimal performance, as predicted by the upper bound of Theorem 2, for the signed model. Squares and circles denote the empirical performance for Gaussian and Rademacher features, respectively. (b) Illustrations of the optimal loss functions for the signed model for different values of δ, according to Theorem 3.
Figure 2. Theoretical (solid lines) and empirical (markers) classification risk in the GMM, as in Theorem 4 and (39), for the LS, LAD and logistic loss functions as a function of δ for r = 1. The vertical line marks the threshold δ⋆ ≈ 3.7, evaluated by (36). The logistic loss yields an unbounded solution if and only if δ < δ⋆.
Figure 3. (a) Comparison between analytical and empirical results for the performance of the LS, logistic, hinge and optimal loss functions for the logistic model. The vertical dashed line marks δ_f⋆ ≈ 2.275, evaluated by (35). (b) Illustrations of the optimal loss functions for different values of δ, derived according to Theorem 3 for the logistic model. To highlight the similarity of the optimal loss to the LS loss, the optimal loss functions (hardly visible) are scaled so that ℓ(1) = 0 and ℓ(2) = 1.
Figure 4. (a) Comparison between analytical and empirical results for the performance of the LS, hinge and optimal loss functions for the Probit model. The vertical dashed line marks δ_f⋆ ≈ 2.699, evaluated by (35). (b) Illustrations of the optimal loss functions for different values of δ, derived according to Theorem 3 for the Probit model. To highlight the similarity of the optimal loss to the LS loss, the optimal loss functions (hardly visible) are scaled so that ℓ(1) = 0 and ℓ(2) = 1.
Figure A1. The value of κ(σ) as in Theorem 2 for various measurement models. Since κ(σ) is a monotonic function of σ, the solution of κ(σ) = 1/δ determines the minimum possible value of σ.
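Since κ(σ) is monotone (as the caption for Figure A1 notes), the calibration equation κ(σ) = 1/δ can be solved by simple bisection. A minimal sketch, using a hypothetical decreasing κ as a stand-in for the model-dependent κ of Theorem 2:

```python
import math

def kappa(sigma):
    # Hypothetical, strictly decreasing stand-in for the kappa(sigma)
    # of Theorem 2 (the true kappa depends on the measurement model).
    return 1.0 / (1.0 + sigma ** 2)

def solve_sigma(delta, lo=0.0, hi=1e6, tol=1e-10):
    """Find the root of kappa(sigma) = 1/delta by bisection,
    assuming kappa decreases monotonically in sigma."""
    target = 1.0 / delta
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kappa(mid) > target:
            lo = mid          # kappa still too large: increase sigma
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

sigma_min = solve_sigma(delta=4.0)    # 1/(1+sigma^2) = 1/4  =>  sigma = sqrt(3)
```

Monotonicity is exactly what makes the minimum-σ value in the figure well defined: the root is unique whenever 1/δ lies in the range of κ.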
Figure A2. Probability density function of SY for the logistic and Probit models (‖x₀‖₂ = 1), compared with the probability density function of a Gaussian random variable (dashed lines) with the same mean and variance, i.e., N(E[SY], Var[SY]).
Figure A3. The value of the threshold δ_{f_ε}⋆ in (A69) as a function of the probability of error ε ∈ [0, 1/2]. For the logistic and hinge losses, the set of minimizers in (3) is bounded (as required by Theorem 1) if and only if δ > δ_{f_ε}⋆.
Figure A4. Comparison between analytical and empirical results for the least-squares (LS), least-absolute deviations (LAD) and hinge loss functions, together with the upper bound on performance and the empirical performance of the optimal loss function as in Theorem 3, for the noisy-signed measurement model with (a) ε = 0.1 and (b) ε = 0.25. The vertical dashed lines, evaluated by (A59), mark δ_{f_ε}⋆ ≈ 3 and 2.25 for ε = 0.1 and 0.25, respectively.
Abstract
1. Introduction
1.1. Motivation
1.2. Data Models
- (Noisy) Signed:
- Logistic:
- Probit:
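The three data models above admit simple samplers in their textbook forms. The parametrizations below (flip probability ε for the noisy-signed model, unit noise scale inside the Probit link) are illustrative assumptions, not necessarily the paper's exact definitions:

```python
import math
import random

def sample_label(model, margin, eps=0.1, rng=random):
    """Draw a binary label y in {-1, +1} given the linear margin a^T x0.

    Standard textbook forms, assumed here for illustration:
      - "signed":   y = sign(margin), flipped with probability eps
      - "logistic": P(y = +1) = 1 / (1 + exp(-margin))
      - "probit":   P(y = +1) = Phi(margin), Phi the standard normal CDF
    """
    if model == "signed":
        y = 1.0 if margin >= 0.0 else -1.0
        return -y if rng.random() < eps else y
    if model == "logistic":
        p = 1.0 / (1.0 + math.exp(-margin))
    elif model == "probit":
        p = 0.5 * (1.0 + math.erf(margin / math.sqrt(2.0)))  # Phi(margin)
    else:
        raise ValueError(model)
    return 1.0 if rng.random() < p else -1.0
```

For large |margin| the logistic and Probit links saturate to the signed model, which is the limit discussed in Section 4.1.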
1.3. Empirical Risk Minimization
- Least Squares (LS):
- Least-Absolute Deviations (LAD):
- Logistic Loss:
- Exponential Loss:
- Hinge Loss:
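In margin form, with t = y · (aᵀx), these losses are commonly written as below. These are the standard parametrizations; the paper's definitions may differ by an affine reparametrization:

```python
import math

# Margin-form binary-classification losses, t = y * (a^T x).
# Common textbook parametrizations, assumed here for illustration.
def ls(t):          return (t - 1.0) ** 2            # least squares
def lad(t):         return abs(t - 1.0)              # least-absolute deviations
def logistic(t):    return math.log(1.0 + math.exp(-t))
def exponential(t): return math.exp(-t)
def hinge(t):       return max(0.0, 1.0 - t)
```

All five are convex in t, which is what places them inside the class of estimators analyzed by Theorems 1-3.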
1.4. Contributions and Organization
- Precise Asymptotics: We show that the absolute correlation of the estimator to the true vector is sharply predicted by an explicit expression whose "effective noise" parameter can be computed by solving a system of three non-linear equations in three unknowns. We find that the system of equations (and, thus, the value of the effective noise parameter) depends on the loss function ℓ through its Moreau envelope function. Our prediction holds in the linear asymptotic regime, in which the number of measurements and the signal dimension grow large at a fixed ratio (see Section 2).
- Fundamental Limits: We establish fundamental limits on the performance of convex optimization-based estimators by computing an upper bound on the best possible correlation performance among all convex loss functions. We compute the upper bound by solving a certain nonlinear equation, and we show that such a solution exists for all values of δ (see Section 3.1).
- Optimal Performance and (sub)-optimality of LS for binary models: For certain binary models, including signed and logistic, we find the loss functions that achieve the optimal performance, i.e., that attain the previously derived upper bound (see Section 3.2). Interestingly, for the logistic and Probit models, we prove that the correlation performance of least-squares (LS) is at least as good as 0.9972 and 0.9804 times the optimal performance, respectively. However, as the signal strength grows large, the logistic and Probit models approach the signed model, in which case LS becomes sub-optimal (see Section 4.1).
- Extension to the Gaussian-Mixture Model: In Section 5, we extend the fundamental limits and the system of equations to the Gaussian-mixture model. Interestingly, our results indicate that, for this model, LS is optimal among all convex loss functions for all values of δ.
- Numerical Simulations: We conduct numerous experiments specializing our results to popular models and loss functions, and we provide simulation results that demonstrate the accuracy of the theoretical predictions (see Section 6 and Appendix E).
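The empirical markers in the figures come from experiments of this flavor. Below is a self-contained toy version (tiny dimensions, nowhere near the asymptotic regime, and plain gradient descent rather than the solvers used in the paper) that fits LS to noiseless signed measurements and reports the absolute correlation with the true vector:

```python
import math
import random

random.seed(1)
n, d = 200, 10                        # 20 measurements per dimension
x0 = [random.gauss(0.0, 1.0) for _ in range(d)]
nrm = math.sqrt(sum(v * v for v in x0))
x0 = [v / nrm for v in x0]            # unit-norm ground-truth vector

A = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [1.0 if sum(a[j] * x0[j] for j in range(d)) >= 0.0 else -1.0 for a in A]

# Gradient descent on the LS objective (1/n) * sum_i (a_i^T x - y_i)^2
x = [0.0] * d
for _ in range(400):
    grad = [0.0] * d
    for a, yi in zip(A, y):
        r = sum(a[j] * x[j] for j in range(d)) - yi
        for j in range(d):
            grad[j] += 2.0 * r * a[j] / n
    x = [xj - 0.05 * gj for xj, gj in zip(x, grad)]

# Absolute correlation of the LS estimate with the (unit-norm) true vector
corr = abs(sum(xj * tj for xj, tj in zip(x, x0))) / math.sqrt(sum(xj * xj for xj in x))
```

With many measurements per dimension the correlation is close to one; the theory in Section 2 predicts its exact limiting value as n and d grow proportionally.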
1.5. Related Works
- (a) Sharp asymptotics for linear measurements.
- (b) One-bit compressed sensing.
- (c) Classification in high dimensions.
2. Sharp Performance Guarantees
2.1. Definitions
2.2. A System of Equations
2.3. Asymptotic Prediction
3. On Optimal Performance
3.1. Fundamental Limits
3.2. On the Optimal Loss Function
4. Special Cases
4.1. Least-Squares
4.2. Logistic and Hinge Loss
5. Extensions to Gaussian-Mixture Models
5.1. System of Equations for GMM
5.2. Theoretical Prediction of Error for Convex Loss Functions
5.3. Special Case: Least-Squares
5.4. Optimal Risk for GMM
6. Numerical Experiments
Numerical Experiments for GMM
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
Appendix A. Properties of Moreau Envelopes
Appendix A.1. Derivatives
- (a) The proximal operator is unique and continuous in both of its arguments.
- (b) The value e_{λℓ}(x) is finite and depends continuously on (λ, x), with e_{λℓ}(x) → ℓ(x) for all x as λ → 0⁺.
- (c) The Moreau envelope function is differentiable with respect to both arguments.
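The derivative identities are not reproduced above. Assuming the standard definition e_{λℓ}(x) = min_v [ℓ(v) + (x − v)²/(2λ)], they read ∂e/∂x = (x − prox_{λℓ}(x))/λ and ∂e/∂λ = −(x − prox_{λℓ}(x))²/(2λ²). A numerical spot check for the squared loss ℓ(v) = (v − 1)², whose envelope has the closed form (x − 1)²/(1 + 2λ):

```python
# Moreau envelope of ell(v) = (v - 1)^2 in closed form, plus a
# finite-difference check of the two standard derivative identities.

def prox(x, lam):
    # argmin_v [(v - 1)^2 + (x - v)^2 / (2 lam)]
    return (x + 2.0 * lam) / (1.0 + 2.0 * lam)

def envelope(x, lam):
    return (x - 1.0) ** 2 / (1.0 + 2.0 * lam)

x, lam, h = 0.3, 0.7, 1e-6

# d/dx e(x) = (x - prox(x)) / lam
dx_fd = (envelope(x + h, lam) - envelope(x - h, lam)) / (2.0 * h)
dx_id = (x - prox(x, lam)) / lam

# d/dlam e(x) = -(x - prox(x))^2 / (2 lam^2)
dl_fd = (envelope(x, lam + h) - envelope(x, lam - h)) / (2.0 * h)
dl_id = -(x - prox(x, lam)) ** 2 / (2.0 * lam ** 2)
```

These identities are what let the system of equations in Section 2 be written entirely in terms of Moreau envelopes of ℓ.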
Appendix A.2. Alternative Representations of (8)
Appendix A.3. Examples of Proximal Operators
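Closed-form proximal operators for the losses used in this paper are standard. A sketch with numerical verification against the defining minimization, in the convention prox_{λf}(v) = argmin_x [f(x) + (x − v)²/(2λ)] (the formulas below are the textbook ones, stated here as assumptions):

```python
def prox_abs(v, lam):
    # prox of f(x) = |x|: soft-thresholding
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def prox_hinge(v, lam):
    # prox of f(x) = max(0, 1 - x)
    if v >= 1.0:
        return v
    if v <= 1.0 - lam:
        return v + lam
    return 1.0

def prox_square(v, lam):
    # prox of f(x) = (x - 1)^2
    return (v + 2.0 * lam) / (1.0 + 2.0 * lam)

def prox_numeric(f, v, lam, lo=-5.0, hi=5.0, steps=100000):
    # Brute-force argmin of f(x) + (x - v)^2 / (2 lam) on a grid,
    # used only to sanity-check the closed forms above.
    best_x, best_val = lo, float("inf")
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        val = f(x) + (x - v) ** 2 / (2.0 * lam)
        if val < best_val:
            best_x, best_val = x, val
    return best_x
```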
Appendix A.4. Fenchel–Legendre Conjugate Representation
Appendix A.5. Convexity of the Moreau Envelope
Appendix A.6. The Expected Moreau-Envelope (EME) Function and its Properties
Appendix A.6.1. Derivatives
Appendix A.6.2. Strict Convexity
“ If ℓ is strictly convex and does not attain its minimum at 0, then the expected Moreau envelope is also strictly convex. ”
Appendix A.6.3. Strict Concavity
“ If ℓ is convex, continuously differentiable and ℓ′(0) ≠ 0, then the expected Moreau envelope is strictly concave. ”
Appendix A.6.4. Summary of Properties of the EME Function Ω
- (a) The function Ω is differentiable.
- (b) The function Ω is jointly convex, and it is concave in γ.
- (c) The function Ω is increasing in α.

For the statements below, further assume that ℓ is strictly convex and continuously differentiable with ℓ′(0) ≠ 0.

- (d) The function Ω is strictly convex in α and strictly concave in λ.
- (e) The function Ω is strictly increasing in α.
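Properties (c) and (e) can be illustrated numerically. Assuming, for the sake of a simple sketch, an EME of the one-Gaussian form Ω(α, λ) = E_G[e_{λℓ}(αG)] with ℓ(v) = (v − 1)² (the paper's Ω involves more arguments), the closed-form envelope e_{λℓ}(x) = (x − 1)²/(1 + 2λ) gives Ω(α, λ) = (α² + 1)/(1 + 2λ), which is strictly increasing and strictly convex in α > 0:

```python
import random

def envelope(x, lam):
    # Moreau envelope of ell(v) = (v - 1)^2 (standard definition assumed)
    return (x - 1.0) ** 2 / (1.0 + 2.0 * lam)

def eme_mc(alpha, lam, n=200000, seed=0):
    # Monte-Carlo estimate of Omega(alpha, lam) = E_G[envelope(alpha * G, lam)]
    rng = random.Random(seed)
    return sum(envelope(alpha * rng.gauss(0.0, 1.0), lam) for _ in range(n)) / n

def eme_exact(alpha, lam):
    # E[(alpha * G - 1)^2] = alpha^2 + 1, divided by (1 + 2 lam)
    return (alpha ** 2 + 1.0) / (1.0 + 2.0 * lam)
```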
Appendix B. Proof of Theorem 1
Appendix B.1. Technical Tool: CGMT
Appendix B.1.1. Gordon’s Min-Max Theorem (GMT)
Appendix B.1.2. Convex Gaussian Min-Max Theorem (CGMT)
Appendix B.2. Applying the CGMT to ERM for Binary Classification
Appendix B.3. Analysis of the Auxiliary Optimization
Appendix B.4. Convex-Concavity and First-Order Optimality Conditions
Appendix B.5. On the Uniqueness of Solutions to (A57): Proof of Proposition 1
Appendix C. Discussions on the Fundamental Limits for Binary Models
- On the Uniqueness of Solutions to Equation
Appendix C.1. Distribution of SY in Special Cases
- Signed:
- Logistic:
- Probit:
Appendix D. Proofs and Discussions on the Optimal Loss Function
Appendix D.1. Proof of Theorem 3
Appendix D.2. On the Convexity of Optimal Loss Function
Appendix D.2.1. Provable Convexity of the Optimal Loss Function for Signed Model
Appendix E. Noisy-Signed Measurement Model
Appendix F. On LS Performance for Binary Models
Appendix F.1. Proof of Corollary 2
Appendix F.2. Discussion
Linear vs. Binary
Appendix G. Fundamental Limits for Gaussian-Mixture Models: Proofs for Section 5
Appendix G.1. Proof of Corollary 3
Appendix G.2. Proof of Theorem 5
Table: predicted performance and empirical performance (Gaussian and Rademacher features) for δ = 2, 3, …, 9.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Taheri, H.; Pedarsani, R.; Thrampoulidis, C. Sharp Guarantees and Optimal Performance for Inference in Binary and Gaussian-Mixture Models. Entropy 2021, 23, 178. https://doi.org/10.3390/e23020178