
LECTURE 5

Expectation

5.1 DEFINITION AND PROPERTIES

Let X ∈ 𝒳 be a discrete random variable with pmf p_X(x) and let g(x) be a function of x.
The expectation (or expected value or mean) of g(X) is defined as

E[g(X)] = \sum_{x ∈ 𝒳} g(x) p_X(x).

For a continuous random variable X ∼ f_X(x), the expected value of g(X) is defined as

E[g(X)] = \int_{-∞}^{∞} g(x) f_X(x) dx.

The expectation operation E[⋅] satisfies the following properties:


1. E[c] = c for every constant c.
2. If g(X) ≥ 0 w.p. 1, then E[g(X)] ≥ 0.
3. Linearity. For any constant a and functions g_1(x) and g_2(x),

   E[a g_1(X) + g_2(X)] = a E[g_1(X)] + E[g_2(X)].

By considering Y = g(X) as a random variable on its own, we can compute the same expectation.

Fundamental theorem of expectation. If X ∼ p_X(x) and Y = g(X) ∼ p_Y(y), then

E[Y] = \sum_{y ∈ 𝒴} y p_Y(y) = \sum_{x ∈ 𝒳} g(x) p_X(x) = E[g(X)].

Similarly, if X ∼ f_X(x) and Y = g(X) ∼ f_Y(y), then

E[Y] = \int_{-∞}^{∞} y f_Y(y) dy = \int_{-∞}^{∞} g(x) f_X(x) dx = E[g(X)].

We prove the theorem for the discrete case. Consider


E[Y ] = 󵠈 y pY (y)
y

=󵠈y 󵠈 p X (x)
y {x: (x)=y}

=󵠈 󵠈 y p X (x)
y {x: (x)=y}

=󵠈 󵠈 (x)p X (x)
y {x: (x)=y}

= 󵠈 (x)p X (x).
x

Thus, E[Y] = E[g(X)] can be found using either p_X(x) or p_Y(y). It is often much easier to
use p_X(x) than to first find p_Y(y) and then compute E[Y].
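As a quick numerical sanity check, the short Python sketch below computes E[g(X)] both ways for a small pmf; the particular pmf and the function g are arbitrary choices used only for illustration.

    from collections import defaultdict

    # A small pmf for X and an arbitrary function g (both chosen only for illustration).
    p_X = {-1: 0.2, 0: 0.5, 2: 0.3}
    g = lambda x: x ** 2

    # Direct computation: E[g(X)] = sum_x g(x) p_X(x).
    E_gX = sum(g(x) * p for x, p in p_X.items())

    # Via the pmf of Y = g(X): p_Y(y) = sum_{x: g(x) = y} p_X(x), then E[Y] = sum_y y p_Y(y).
    p_Y = defaultdict(float)
    for x, p in p_X.items():
        p_Y[g(x)] += p
    E_Y = sum(y * p for y, p in p_Y.items())

    print(E_gX, E_Y)  # both equal 0.2*1 + 0.5*0 + 0.3*4 = 1.4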
We already know that a random variable is completely specified by its cumulative distribution function (or by its pmf or pdf in the discrete and continuous cases, respectively), in the sense that the probability of any Borel set involving the random variable can be determined from it. As a simple summary of the random variable, however, its expectation has several applications.

1. Expectation can be used to bound or estimate probabilities of interesting events, as we will see in Section 5.3.
2. Expectation provides the optimal estimate of a random variable under the mean square error criterion, as we will see in Section 5.6.
3. It is far easier to estimate the expectation of a random variable from data than to estimate its distribution, as we will see in a later lecture.

5.2 MEAN AND VARIANCE

The first moment (or mean) of X ∼ f_X(x) is the expectation

E[X] = \int_{-∞}^{∞} x f_X(x) dx.
The following trick is useful for computing the expectation of a nonnegative random variable. If X ≥ 0 is continuous, then

E[X] = \int_0^{∞} u f_X(u) du
     = \int_0^{∞} \left( \int_0^{u} dx \right) f_X(u) du
     = \int_0^{∞} \int_x^{∞} f_X(u) du \, dx
     = \int_0^{∞} (1 − F_X(x)) dx.

The same identity also follows by integration by parts, \int \phi \psi' = \phi\psi − \int \phi' \psi, with \phi = x and \psi = 1 − F_X(x). Similarly, if X ≥ 0 is discrete (integer valued), then

E[X] = \sum_{k=0}^{∞} (1 − F_X(k)).
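The tail-sum formula can be checked numerically, for instance for a geometric random variable; in the sketch below the parameter p and the truncation point of the infinite sum are arbitrary choices.

    import numpy as np

    # Check E[X] = sum_{k=0}^inf (1 - F_X(k)) for X ~ Geom(p) on {1, 2, ...}.
    # For this X, P{X > k} = (1 - p)^k and E[X] = 1/p.
    p = 0.3
    k = np.arange(0, 500)                 # truncate the infinite sum (500 terms is plenty here)
    tail_sum = np.sum((1 - p) ** k)       # sum of P{X > k} = 1 - F_X(k)
    print(tail_sum, 1 / p)                # both approximately 3.333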

Let

1_A(x) = \begin{cases} 1 & x ∈ A, \\ 0 & \text{otherwise}. \end{cases}

Then the expectation of the indicator variable is

E[1_A(X)] = \int_{-∞}^{∞} 1_A(x) f_X(x) dx = \int_A f_X(x) dx = P{X ∈ A}.

The second moment (or mean square or average power) of X is

E[X^2] = \int_{-∞}^{∞} x^2 f_X(x) dx.

The variance of X is

Var(X) = E[(X − E[X])^2]
       = E[X^2 − 2X E[X] + (E[X])^2]
       = E[X^2] − 2(E[X])^2 + (E[X])^2
       = E[X^2] − (E[X])^2.

The standard deviation of X is defined as σ_X = \sqrt{Var(X)}, i.e., Var(X) = σ_X^2.


Example .. We find E[X], E[X 2 ], and Var(X) for (X, Y ) ∼ f X ,Y (x, y) where

x ≥ 0, y ≥ 0, x + y ≤ 1,
f X ,Y (x, y) = 󶁇
2
0 otherwise.
Consider
∞ ∞
E[X] = 󵐐 󵐐 x f (x, y) dx d y
−∞ −∞

=󵐐 󵐐
1 1−x
2x d y dx
0 0

= 2 󵐐 (1 − x)x dx
1

= 2󶀣 − 󶀳 =
1 1 1
2 3 3
and
E[X 2 ] = 2 󵐐 (1 − x)x 2 dx = 2󶀣 − 󶀳 = .
1
1 1 1
0 3 4 6
Hence,
Var(X) = E[X 2 ] − (E[X])2 − ( )2 = − = .
1 1 1 1 1
6 3 6 9 18
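These values can be spot-checked by Monte Carlo, sampling uniformly from the triangle by rejection; the seed and sample size below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    # Sample uniformly from the triangle {x >= 0, y >= 0, x + y <= 1} by rejection.
    n = 1_000_000
    x, y = rng.random(n), rng.random(n)
    x = x[x + y <= 1]

    print(x.mean(), (x**2).mean(), x.var())   # approximately 1/3, 1/6, 1/18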

The following useful identities are direct consequences of the linearity of expectation:

E(aX + b) = a E(X) + b,
Var(aX + b) = a^2 Var(X).

Table . summarizes the mean and variance of famous random variables.

Random Variable Mean Variance

Bern(p) p p(1 − p)
1 1−p
Geom(p)
p p2
Binom(n, p) np np(1 − p)

Poisson(λ) λ λ
a+b (b − a)2
Unif[ a, b ]
2 12
1 1
Exp(λ)
λ λ2
N󶀡μ, σ 2 󶀱 μ σ2

Table .. The mean and variance of common random variables.

Remark 5.1. Expectation can be infinite. For example, consider

f_X(x) = \begin{cases} 1/x^2 & 1 ≤ x < ∞, \\ 0 & \text{otherwise}. \end{cases}

Then

E[X] = \int_1^{∞} \frac{x}{x^2} dx = ∞.

Remark 5.2. Expectation may not exist. To find conditions under which the expectation exists, consider

E[X] = \int_{-∞}^{∞} x f_X(x) dx = -\int_{-∞}^{0} |x| f_X(x) dx + \int_{0}^{∞} |x| f_X(x) dx.

For this expression to be well defined, at least one of \int_{-∞}^{0} |x| f_X(x) dx and \int_{0}^{∞} |x| f_X(x) dx must be finite.

Example .. The standard Cauchy random variable has the pdf

f X (x) =
1
π(1 + x 2 )

Since both ∫−∞ |x| f X (x) dx and ∫0 |x| f X (x) dx are infinite, its mean does not exist! (The
0

second moment of the Cauchy is E[X 2 ] = ∞, so it exists.)

5.3 INEQUALITIES

In many cases we do not know the distribution of a random variable X, but wish to find the
probability of an event such as {X > a} or {|X − E(X)| > a}. The Markov and Chebyshev
inequalities provide upper bounds on the probabilities of such events in terms of the mean
and variance of the random variable.

Markov inequality. Let X ≥ 0 be a random variable with finite mean. Then for any a > 1,

P{X ≥ a E[X]} ≤ \frac{1}{a}.

Example .. If the average age in the San Diego is , then at most half of the population
is  or older.

To prove the Markov inequality, let A = {x : x ≥ a E[X]} and consider the indicator function 1_A(x). As illustrated in Figure 5.1,

1_A(x) ≤ \frac{x}{a E[X]} \quad \text{for } x ≥ 0.

[Figure 5.1. Proof of the Markov inequality: the indicator 1_A(x) lies below the line x/(a E[X]).]

Since E[1_A(X)] = P{X ≥ a E[X]}, taking expectations of both sides establishes the inequality.

The Markov inequality can be very loose. For example, if X ∼ Exp(1), then

P{X ≥ 10} = e^{-10} ≈ 4.54 × 10^{-5}.

The Markov inequality yields

P{X ≥ 10} ≤ \frac{1}{10},

which is very pessimistic. But it is the tightest possible bound on P{X ≥ a E[X]} when we are given only E[X]. To show this, note that the inequality is tight for

X = \begin{cases} a E[X] & \text{w.p. } 1/a, \\ 0 & \text{w.p. } 1 − 1/a. \end{cases}

In Example 5.3, if half of the population is 0 years old and the other half is 2μ years old, then the average age is μ and the Markov inequality (with a = 2) is tight.
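A quick numerical comparison of the Markov bound with the exact tail of an Exp(1) random variable; the values of a are arbitrary.

    import numpy as np

    # Compare the exact tail of X ~ Exp(1) with the Markov bound P{X >= a E[X]} <= 1/a.
    for a in [2, 5, 10]:
        exact = np.exp(-a)        # P{X >= a}, since E[X] = 1
        markov = 1 / a
        print(a, exact, markov)   # the bound is valid but loose for every a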

Chebyshev inequality. Let X be a random variable with finite mean E[X] and variance σ_X^2. Then for any a > 1,

P{|X − E[X]| ≥ a σ_X} ≤ \frac{1}{a^2}.

Example .. Let X be a device parameter in an integrated circuit (IC) with known mean
and variance. The IC is out-of-spec if X is more than, say, 3σX away from its mean. Then,
by the Chebyshev inequality, the fraction of out-of-spec ICs, namely, P{|X − E(X)| ≥ 3σX }
is no larger than 1/9.

The proof of the Chebyshev inequality uses the Markov inequality (a slight twist on the teacher–student relationship between Prof. Pafnuty Chebyshev and his student Andrey Markov at Saint Petersburg University in Russia). Define the random variable Y = (X − E[X])^2 ≥ 0. Since E[Y] = σ_X^2, the Markov inequality implies that

P{|X − E[X]| ≥ a σ_X} = P{Y ≥ a^2 σ_X^2} ≤ \frac{1}{a^2}.
The Chebyshev inequality can be very loose. Let X ∼ N(0, 1). Then, by the Chebyshev inequality,

P{|X| ≥ 3} ≤ \frac{1}{9},

which is very pessimistic compared to the actual value 2Q(3) ≈ 2.7 × 10^{-3}. But it is the tightest upper bound on P{|X − E[X]| ≥ a σ_X} given knowledge of only the mean and variance of X. Indeed, the inequality holds with equality for the random variable

X = \begin{cases} E[X] + a σ_X & \text{w.p. } 1/(2a^2), \\ E[X] − a σ_X & \text{w.p. } 1/(2a^2), \\ E[X] & \text{w.p. } 1 − 1/a^2. \end{cases}
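The tightness of the Chebyshev bound can be verified numerically for the three-point distribution above; here a = 2, mean 0, and standard deviation 1 are arbitrary choices.

    import numpy as np

    a, mu, sigma = 2.0, 0.0, 1.0
    values = np.array([mu + a * sigma, mu - a * sigma, mu])
    probs = np.array([1 / (2 * a**2), 1 / (2 * a**2), 1 - 1 / a**2])

    mean = np.sum(values * probs)
    var = np.sum((values - mean) ** 2 * probs)
    p_tail = probs[np.abs(values - mean) >= a * sigma].sum()

    print(mean, var, p_tail, 1 / a**2)   # mean 0, variance 1, tail probability exactly 1/a^2 = 0.25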

We now discuss an extremely useful inequality, named after the Danish mathematician Johan Jensen, that is centered around the notion of convexity. A function g(x) is said to be convex if

g(x) ≤ \frac{g(b) − g(a)}{b − a}(x − a) + g(a)

for all x ∈ [a, b] and all a < b, that is, the function curve lies below every chord between two points on the curve. If g(x) is twice differentiable, then g(x) is convex iff g''(x) ≥ 0. If −g(x) is convex, then g(x) is called concave.

Example 5.5. The following functions are convex: (a) g(x) = ax + b. (b) g(x) = x^2.
(c) g(x) = |x|^p, p ≥ 1. (d) g(x) = x log x, x > 0. (e) g(x) = 1/x, x > 0.

Example 5.6. The following functions are concave: (a) g(x) = ax + b. (b) g(x) = \sqrt{x}, x > 0. (c) g(x) = log x, x > 0.

Jensen's inequality. Let X be a random variable with finite mean E[X] and let g(x) be a function such that E[g(X)] is finite. If g(x) is convex, then

E[g(X)] ≥ g(E[X]).

If g(x) is concave, then

E[g(X)] ≤ g(E[X]).

Jensen's inequality is a powerful tool for deriving many other inequalities.

Example 5.7. Since g(x) = x^2 is convex,

E[g(X)] = E[X^2] ≥ (E[X])^2 = g(E[X]),

which we already know since Var(X) = E[X^2] − (E[X])^2 ≥ 0.

Example 5.8. Since g(x) = 1/x, x > 0, is convex,

E[g(X^2)] = E\left[\frac{1}{X^2}\right] ≥ \frac{1}{E[X^2]} = g(E[X^2]).

Example . (Monotonicity of norms). For 1 ≤ p ≤ q,


(E[| X | p ])(1/p) ≤ (E[| X | q ])(1/q) . (.)
To see this, consider a convex function (x) = |x|q/p and use Jensen’s inequality to obtain

E[| X | q ] = E[(| X | p ) 󰑝 ] ≥ (E[| X | p ]) 󰑝 .


󰑞 󰑞

Taking the q-th root of both sides establishes (.).
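A numerical illustration of the monotonicity of norms; the sampling distribution, seed, and values of p below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_normal(1_000_000)     # any distribution works; standard normal chosen for illustration

    # (E|X|^p)^(1/p) should be nondecreasing in p.
    for p in [1, 2, 4, 8]:
        print(p, np.mean(np.abs(x) ** p) ** (1 / p))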



5.4 COVARIANCE AND CORRELATION

Let (X, Y) ∼ f_{X,Y}(x, y) and let g(x, y) be a function of x and y. The expectation of g(X, Y) is

E[g(X, Y)] = \int_{-∞}^{∞} \int_{-∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy.

As an example, the correlation of X and Y is defined as E[XY]. We say that X and Y are orthogonal if E[XY] = 0.


The covariance of X and Y is defined as

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
          = E[XY − X E[Y] − Y E[X] + E[X] E[Y]]
          = E[XY] − E[X] E[Y].

We say that X and Y are uncorrelated if Cov(X, Y) = 0. Note that Cov(X, X) = Var(X).
The correlation coefficient of X and Y is defined as

ρ_{X,Y} = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}}.

For any pair of random variables X and Y,

|ρ_{X,Y}| ≤ 1,

which follows from the Cauchy–Schwarz inequality

(E[XY])^2 ≤ E[X^2] E[Y^2].

Note that ρ_{X,Y} = ±1 iff

\frac{X − E[X]}{σ_X} = ±\frac{Y − E[Y]}{σ_Y},

that is, iff X − E[X] is a linear function of Y − E[Y]. We shall see in Section 5.7 that ρ_{X,Y} is a measure of how closely X − E[X] can be approximated or estimated by a linear function of Y − E[Y].
Example .. We find the correlation, covariance, and correlation coefficient for (X, Y ) ∼
f X ,Y (x, y) where
x ≥ 0, y ≥ 0, x + y ≤ 1,
f X ,Y (x, y) = 󶁇
2
0 otherwise.

Recall from Example 5.1 that E[X] = 1/3 and Var(X) = 1/18. By symmetry, E[Y] = 1/3 and Var(Y) = 1/18 as well. Consider

E[XY] = 2 \int_0^1 \int_0^{1-x} x y \, dy \, dx
      = \int_0^1 x (1 − x)^2 \, dx
      = \frac{1}{12}

and

Cov(X, Y) = E[XY] − E[X] E[Y] = \frac{1}{12} − \frac{1}{9} = −\frac{1}{36}.

Finally,

ρ_{X,Y} = \frac{Cov(X, Y)}{\sqrt{Var(X)} \sqrt{Var(Y)}} = \frac{−1/36}{1/18} = −\frac{1}{2}.

As noted earlier, X and Y are uncorrelated if Cov(X, Y) = 0. If X and Y are independent, then they are uncorrelated, since

E[XY] = \int_{-∞}^{∞} \int_{-∞}^{∞} x y f_{X,Y}(x, y) dx dy
      = \int_{-∞}^{∞} \int_{-∞}^{∞} x y f_X(x) f_Y(y) dx dy
      = \left(\int_{-∞}^{∞} x f_X(x) dx\right) \left(\int_{-∞}^{∞} y f_Y(y) dy\right)
      = E[X] E[Y].

However, X and Y being uncorrelated does not necessarily imply that they are independent.
Example .. Consider the pmf p X ,Y (x, y) described by the following table

x
−1 0 1
−1 1
6
0 1
6
1
y 0 0 3
0
1 1
1 6
0 6

Clearly X and Y are not independent. But it can be readily checked that E[X] = E[Y ] =
E[XY ] = 0. Thus Cov(X, Y ) = 0, that is, X and Y are uncorrelated.
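The claims in Example 5.11 can be checked directly from the table; a short sketch:

    import numpy as np

    # The pmf table above: rows indexed by y in {-1, 0, 1}, columns by x in {-1, 0, 1}.
    p = np.array([[1/6, 0, 1/6],
                  [0, 1/3, 0],
                  [1/6, 0, 1/6]])
    vals = np.array([-1, 0, 1])

    EX = np.sum(p.sum(axis=0) * vals)        # E[X] from the marginal of X (column sums)
    EY = np.sum(p.sum(axis=1) * vals)        # E[Y] from the marginal of Y (row sums)
    EXY = np.sum(p * np.outer(vals, vals))   # E[XY]

    print(EX, EY, EXY - EX * EY)             # 0, 0, covariance 0
    print(np.allclose(p, np.outer(p.sum(axis=1), p.sum(axis=0))))  # False: not independent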

5.5 CONDITIONAL EXPECTATION

Let (X, Y) ∼ f_{X,Y}(x, y). Recall that the conditional pdf of X given Y = y is

f_{X|Y}(x | y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}

if f_Y(y) > 0. Since f_{X|Y}(x|y) is a pdf for X (for each y), we can define the expectation of any function g(X, Y) w.r.t. f_{X|Y}(x|y) as

E[g(X, Y) | Y = y] = \int_{-∞}^{∞} g(x, y) f_{X|Y}(x | y) dx,

which is a function of y.

Example 5.12. If g(X, Y) = X, then the conditional expectation of X given Y = y is

E[X | Y = y] = \int_{-∞}^{∞} x f_{X|Y}(x | y) dx.

Example 5.13. If g(X, Y) = Y, then E[Y | Y = y] = y.

Example 5.14. If g(X, Y) = XY, then E[XY | Y = y] = y E[X | Y = y].

Example 5.15. Let

f_{X,Y}(x, y) = \begin{cases} 2 & x ≥ 0, \; y ≥ 0, \; x + y ≤ 1, \\ 0 & \text{otherwise}. \end{cases}

From an earlier lecture, we already know that X | {Y = y} ∼ Unif[0, 1 − y]. Thus, E[X | Y = y] = (1 − y)/2.

Let ϕ(y) = E[g(X, Y) | Y = y]. We define the conditional expectation of g(X, Y) given Y as

E[g(X, Y) | Y] = ϕ(Y).

In other words, the random variable E[g(X, Y) | Y] is a function of Y that takes the value E[g(X, Y) | Y = y] when Y = y.

Law of iterated expectation. The following observation is very useful in computing expectations:

E[E[g(X, Y) | Y]] = \int_{-∞}^{∞} E[g(X, Y) | Y = y] f_Y(y) dy
                  = \int_{-∞}^{∞} \left(\int_{-∞}^{∞} g(x, y) f_{X|Y}(x | y) dx\right) f_Y(y) dy
                  = \int_{-∞}^{∞} \int_{-∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy
                  = E[g(X, Y)].

Example .. We continue Example .. The conditional expectation of X given Y is

1−Y
the random variable
E[X | Y ] = =: Z.
2
The pdf of Z is
fZ (z) = 8z , 0 < z ≤ 12 ,
which is illustrated in Figure .. Note that
1

E[Z] = 󵐐 8z 2 dz = = E[X],
2 1
0 3
as is expected from the law of iterated expectation. Similarly,
E[XY ] = E[E[XY | Y ]]
Y (1 − Y )
= E󶁤 󶁴
2
1 y(1 − y)
=󵐐 ⋅ 2(1 − y) d y = ,
1
0 2 12
which agrees with the direct integration computed in Example ..

Example 5.17. A coin has random bias P ∈ [0, 1] with pdf f_P(p) = 2(1 − p). The coin is flipped n times. Let N be the number of heads, that is, N | {P = p} ∼ Binom(n, p). Then, by the law of iterated expectation, we can find

E[N] = E[E[N | P]]
     = E[nP]
     = n E[P]
     = n \int_0^1 2(1 − p) p \, dp = \frac{n}{3},


which is much simpler than finding the pmf of N and computing the expectation.
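A simulation of Example 5.17 (here n = 10; the seed and sample size are arbitrary); note that P ∼ f_P(p) = 2(1 − p) is the Beta(1, 2) distribution.

    import numpy as np

    rng = np.random.default_rng(3)

    n, trials = 10, 500_000
    P = rng.beta(1, 2, trials)        # pdf 2(1 - p) on [0, 1]
    N = rng.binomial(n, P)            # N | {P = p} ~ Binom(n, p)

    print(N.mean(), n / 3)            # both approximately 3.33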
Example 5.18. Let E[X | Y] = Y^2 and Y ∼ Unif[0, 1]. In this case, we cannot find the pdf of X, since we do not know f_{X|Y}(x|y). But using iterated expectation we can still find

E[X] = E[E[X | Y]] = E[Y^2] = \int_0^1 y^2 \, dy = \frac{1}{3}.

We define the conditional variance of X given Y = y as the variance of X w.r.t. f_{X|Y}(x|y), i.e.,

Var(X | Y = y) = E[(X − E[X | Y = y])^2 | Y = y]
              = E[X^2 | Y = y] − 2 E[X E[X | Y = y] | Y = y] + E[(E[X | Y = y])^2 | Y = y]
              = E[X^2 | Y = y] − (E[X | Y = y])^2.

The random variable Var(X | Y) is a function of Y that takes on the values Var(X | Y = y). Its expected value is

E[Var(X | Y)] = E[E[X^2 | Y] − (E[X | Y])^2] = E[X^2] − E[(E[X | Y])^2].

Since E[X | Y] is a random variable, it has a variance

Var(E[X | Y]) = E[(E[X | Y] − E[E[X | Y]])^2] = E[(E[X | Y] − E[X])^2] = E[(E[X | Y])^2] − (E[X])^2.

Adding these two identities establishes the law of conditional variances:

Var(X) = E[Var(X | Y)] + Var(E[X | Y]).
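The law of conditional variances can be checked by simulation for the pair of Example 5.16, where Y has pdf 2(1 − y) on [0, 1] and X | {Y = y} ∼ Unif[0, 1 − y]; the seed and sample size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)

    n = 1_000_000
    y = 1 - np.sqrt(rng.random(n))          # inverse-CDF sample from f_Y(y) = 2(1 - y)
    x = rng.random(n) * (1 - y)             # X | {Y = y} ~ Unif[0, 1 - y]

    cond_mean = (1 - y) / 2                 # E[X | Y]
    cond_var = (1 - y) ** 2 / 12            # Var(X | Y)

    print(x.var())                               # ≈ 1/18
    print(cond_var.mean() + cond_mean.var())     # ≈ 1/24 + 1/72 = 1/18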


5.6 MMSE ESTIMATION

Consider the signal estimation system depicted in Figure 5.3, where the original signal is X ∼ f_X(x) and its noisy observation is

Y | {X = x} ∼ f_{Y|X}(y|x).

[Figure 5.3. The signal estimation system: the signal X ∼ f_X(x) passes through a noisy channel f_{Y|X}(y|x), and an estimator g(y) produces the estimate X̂ = g(Y).]


An estimator is a mapping g : ℝ → ℝ that generates an estimate X̂ = g(Y) of the original signal. We wish to find an optimal estimator g*(y) that minimizes the mean square error (MSE)

E[(X − X̂)^2] = E[(X − g(Y))^2].

The estimator g*(y) that attains the smallest MSE is referred to as the minimum mean square error (MMSE) estimator of X given Y, and X̂ = g*(Y) is referred to as the MMSE estimate.

Suppose first that there is no observation and let a* be the MMSE estimate of X, that is,

a* = \arg\min_a E[(X − a)^2].

Then

a* = E[X].

In other words, the mean is the optimal summary of X under the mean square error criterion. To prove this, note that for any estimate a of X,

E[(X − a)^2] = E[(X − E[X] + E[X] − a)^2]
             = E[(X − E[X])^2] + (E[X] − a)^2 + 2(E[X] − a) E[X − E[X]]
             = E[(X − E[X])^2] + (E[X] − a)^2
             ≥ E[(X − E[X])^2],

with equality iff a = E[X].


A similar argument holds with an observation Y = y. Let g*(y) = E[X | Y = y]. Then, by the same reasoning applied to the conditional distribution, for any estimator g(y),

E[(X − g*(y))^2 | Y = y] ≤ E[(X − g(y))^2 | Y = y].

Consequently,

E[(X − g*(Y))^2] ≤ E[(X − g(Y))^2],

and X̂ = g*(Y) = E[X | Y] is the MMSE estimate of X given Y, with corresponding MSE

E[Var(X | Y)] = E[(X − E[X | Y])^2].
The MMSE estimate X̂ = E[X | Y] satisfies the following properties.

1. It is unbiased, i.e.,

   E[X̂] = E[X].

2. The estimation error X − X̂ is unbiased for every Y = y, i.e.,

   E[X − X̂ | Y = y] = 0.

3. The estimation error and the estimate are orthogonal, i.e.,

   E[(X − X̂) X̂] = E[E[(X − X̂) X̂ | Y]] = E[X̂ E[X − X̂ | Y]] = 0.

   In fact, the estimation error is orthogonal to any function g(Y), i.e.,

   E[(X − X̂) g(Y)] = 0.

4. By the law of conditional variances, Var(X) = Var(X̂) + E[Var(X | Y)], i.e., the sum of the variance of the estimate and its MSE equals the variance of the signal.

5. If X and Y are independent, then X̂ = E[X], that is, the observation is ignored.

Example .. Again let

x ≥ 0, y ≥ 0, x + y ≤ 1,
f X ,Y (x, y) = 󶁇
2
0 otherwise.

We find the MMSE estimate of X given Y and its MSE. We already know that the MMSE
1−Y
estimate is
E[X | Y ] =
2

and that the conditional variance is


(1 − Y)2
Var[X | Y ] = .
12

Hence, the MMSE is E[Var(X |Y )] = 1/24, compared to Var(X) = 1/18. The difference is
Var(E[X |Y ]) = 1/72, which is the variance of the estimate.
[Figure 5.4. The conditional mean E[X | Y = y] = (1 − y)/2 and the conditional variance Var(X | Y = y) = (1 − y)^2/12 as functions of y.]

Example . (Additive Gaussian noise channel). Consider a communication channel


with input X ∼ N(μ, P), noise Z ∼ N(0, N), and output Y = X + Z. We assume that X and

We find the MMSE estimate of X given Y and its MSE, i.e., E[X | Y] and E[Var(X | Y)]. Recall that Y | {X = x} ∼ N(x, N) and Y ∼ N(μ, P + N), that is,

f_{Y|X}(y|x) = f_Z(y − x) = \frac{1}{\sqrt{2πN}} e^{-\frac{(y-x)^2}{2N}}

and

f_Y(y) = \frac{1}{\sqrt{2π(P+N)}} e^{-\frac{(y-μ)^2}{2(P+N)}}.

Hence,

f_{X|Y}(x | y) = \frac{f_X(x) f_{Y|X}(y|x)}{f_Y(y)}
              = \frac{\frac{1}{\sqrt{2πP}} e^{-\frac{(x-μ)^2}{2P}} \cdot \frac{1}{\sqrt{2πN}} e^{-\frac{(y-x)^2}{2N}}}{\frac{1}{\sqrt{2π(P+N)}} e^{-\frac{(y-μ)^2}{2(P+N)}}}
              = \frac{1}{\sqrt{2π \frac{PN}{P+N}}} \exp\left(-\frac{\left(x − \left(\frac{P}{P+N} y + \frac{N}{P+N} μ\right)\right)^2}{2 \frac{PN}{P+N}}\right),

or equivalently,

X | {Y = y} ∼ N\left(\frac{P}{P+N} y + \frac{N}{P+N} μ, \; \frac{PN}{P+N}\right).

Thus,

E[X | Y] = \frac{P}{P+N} Y + \frac{N}{P+N} μ,

which is a convex combination of the observation Y and the mean μ (the MMSE estimate without any observation); it tends to Y as N → 0 and to μ as N → ∞. The corresponding MSE is

E[Var(X | Y)] = E\left[\frac{PN}{P+N}\right] = \frac{PN}{P+N},

which is less than P, the MSE without the observation Y. Note that the conditional variance Var(X | Y) does not depend on Y.
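A simulation check of Example 5.20; the values of μ, P, N, the seed, and the sample size are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(5)

    mu, P, N = 1.0, 4.0, 2.0
    n = 1_000_000
    X = mu + np.sqrt(P) * rng.standard_normal(n)
    Y = X + np.sqrt(N) * rng.standard_normal(n)

    X_hat = P / (P + N) * Y + N / (P + N) * mu           # E[X | Y]
    print(np.mean((X - X_hat) ** 2), P * N / (P + N))    # both ≈ 4/3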

In the above two examples, the MMSE estimate turned out to be an affine function of Y (i.e., of the form aY + b). This is not always the case.

Example 5.21. Let

f_{X|Y}(x | y) = \begin{cases} y e^{-yx} & x ≥ 0, \; y > 0, \\ 0 & \text{otherwise}. \end{cases}

Then

E[X | Y] = \frac{1}{Y}.

Remark .. There can be alternative criteria for measuring goodness of estimators. For
example, instead of the MSE criteria in (.) that was introduced in the th century by
Legendre and Gauss, one may measure the mean absolute error (MAE)

E[| X − (Y )|],

which dates back to Boscovich and Laplace in the preceding century. It can be shown that
the minimum MAE estimate is the conditional median, that is,

P{X ≤ ∗ (y) | Y = y} ≥ 1/2,


P{X ≥ ∗ (y) | Y = y} ≥ 1/2.

5.7 LINEAR MMSE ESTIMATION

To find the MMSE estimate, one needs to know the statistics of the signal and the channel, namely f_{X,Y}(x, y), or at least f_{X|Y}(x|y), which is rarely the case in practice. We typically have estimates only of the first and second moments of the signal and the observation, i.e., the means, variances, and covariance of X and Y. This is not, in general, sufficient information for computing the MMSE estimate, but as we shall see it is enough to compute the linear MMSE (LMMSE) estimate of the signal X given the observation Y, i.e., the estimate of the form

X̂ = aY + b

that minimizes the mean square error

E[(X − X̂)^2] = E[(X − aY − b)^2].

We show that the LMMSE estimate of X given Y is

X̂ = a* Y + b*
   = \frac{Cov(X, Y)}{Var(Y)} (Y − E[Y]) + E[X]
   = ρ_{X,Y} σ_X \left(\frac{Y − E[Y]}{σ_Y}\right) + E[X],

and its MSE is

E[(X − a* Y − b*)^2] = Var(X) − \frac{(Cov(X, Y))^2}{Var(Y)} = (1 − ρ_{X,Y}^2) σ_X^2.

First note that for any a,

E[(X − aY − b)^2] = E[((X − aY) − b)^2]

is minimized over b by b*(a) = E[X − aY] = E[X] − a E[Y]. Hence, under this choice, the MSE can be written as a quadratic function of a:

E[(X − aY − b*(a))^2] = E[((X − E[X]) − a(Y − E[Y]))^2]
                      = Var(X) − 2a Cov(X, Y) + a^2 Var(Y),

which is minimized at

a* = \frac{Cov(X, Y)}{Var(Y)}

with minimum value

Var(X) − \frac{(Cov(X, Y))^2}{Var(Y)}.

Alternatively, we can minimize

J(a, b) = E[(X − aY − b)^2]

by finding the (a*, b*) that satisfies

\frac{∂}{∂a} J(a, b) = \frac{∂}{∂b} J(a, b) = 0

and by showing that this solution attains the global minimum, as illustrated in Figure 5.5.

[Figure 5.5. The MSE J(a, b) as a function of a and b.]

The linear MMSE estimate X̂ satisfies the following properties:

1. It is unbiased, i.e., E[X̂] = E[X], which was also true for the (nonlinear) MMSE estimate.

2. The estimation error and the estimate are orthogonal, i.e.,

   E[(X − X̂) X̂] = 0.

   In fact, the estimation error is orthogonal to any affine function aY + b, i.e.,

   E[(X − X̂)(aY + b)] = 0.

3. If ρ_{X,Y} = 0, i.e., X and Y are uncorrelated, then the observation is ignored and X̂ = E[X].

4. If ρ_{X,Y} = ±1, i.e., X − E[X] and Y − E[Y] are linearly dependent, then the linear estimate is perfect and X̂ = X.
The LMMSE estimate is not, in general, as good as the MMSE estimate.

Example 5.22. Let Y ∼ Unif[−1, 1] and X = Y^2. The MMSE estimate of X given Y is Y^2, which is perfect. To find the LMMSE estimate we compute

E[Y] = 0,

E[X] = \int_{-1}^{1} \frac{1}{2} y^2 \, dy = \frac{1}{3},

and

Cov(X, Y) = E[XY] − E[X] E[Y] = E[Y^3] = 0.

Thus, the LMMSE estimate is X̂ = E[X] = 1/3, i.e., the observation Y is totally ignored, even though it completely determines X.
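A simulation of Example 5.22 comparing the two MSEs (seed and sample size arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)

    Y = rng.uniform(-1, 1, 1_000_000)
    X = Y ** 2

    mse_mmse = np.mean((X - Y ** 2) ** 2)     # 0: the nonlinear estimate is perfect
    mse_lmmse = np.mean((X - 1 / 3) ** 2)     # Var(X) = E[Y^4] - 1/9 = 1/5 - 1/9 = 4/45
    print(mse_mmse, mse_lmmse)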

5.8 GEOMETRIC FORMULATION OF ESTIMATION

For both the nonlinear and linear MMSE estimation problems discussed in the previous two sections, we found that the estimation error is orthogonal to the optimal estimate. This orthogonality property is a fundamental characteristic of an optimal estimator that minimizes the MSE among a class of estimators, and it can be used to find the optimal estimator by a simple geometric argument.

First, we introduce some mathematical background. A vector space V consists of a set of vectors that is closed under two operations:

∙ Vector addition: if v, w ∈ V, then v + w ∈ V.
∙ Scalar multiplication: if a ∈ ℝ and v ∈ V, then av ∈ V.

An inner product is a real-valued operation v ⋅ w satisfying the following three conditions:

∙ Commutativity: v ⋅ w = w ⋅ v.
∙ Linearity: (au + v) ⋅ w = a(u ⋅ w) + v ⋅ w.
∙ Nonnegativity: v ⋅ v ≥ 0, with v ⋅ v = 0 iff v = 0.
A vector space with an inner product is referred to as an inner product space. For example, the Euclidean space

ℝ^n = {x = (x_1, x_2, . . . , x_n) : x_1, x_2, . . . , x_n ∈ ℝ}

with vector addition

x + y = (x_1 + y_1, x_2 + y_2, . . . , x_n + y_n),

scalar multiplication

ax = (ax_1, ax_2, . . . , ax_n),

and dot product

x ⋅ y = \sum_{i=1}^{n} x_i y_i

is an inner product space.


The inner product induces generalized notions of length, distance, and angle. The norm of v ∈ V is defined as ‖v‖ = \sqrt{v ⋅ v}, and can be viewed as the "length" of v. A metric is a function d(v, w) satisfying the following three conditions:

∙ Commutativity: d(v, w) = d(w, v).
∙ Nonnegativity: d(v, w) ≥ 0, with equality iff v = w.
∙ Triangle inequality: d(u, w) ≤ d(u, v) + d(v, w).

It is easy to verify that ‖v − w‖ is a metric, and thus it can be interpreted as the "distance" between v and w. We say that v and w are orthogonal (written v ⊥ w) if v ⋅ w = 0. More generally, the "angle" θ between v and w satisfies v ⋅ w = ‖v‖‖w‖ cos θ. For example, the norm (length) of a vector x in the Euclidean space is

‖x‖ = \left(\sum_{i=1}^{n} x_i^2\right)^{1/2},

the distance between the two points represented by x and y is

‖x − y‖ = \left(\sum_{i=1}^{n} (x_i − y_i)^2\right)^{1/2},

and the angle between the two vectors x and y is

θ = \arccos \frac{\sum_{i=1}^{n} x_i y_i}{\left(\sum_{i=1}^{n} x_i^2\right)^{1/2} \left(\sum_{i=1}^{n} y_i^2\right)^{1/2}}.

The Pythagorean theorem holds in an arbitrary inner product space, namely, if v ⊥ w, then

‖v + w‖^2 = (v + w) ⋅ (v + w) = (v ⋅ v) + (w ⋅ w) + 2(v ⋅ w) = ‖v‖^2 + ‖w‖^2.

We say that W is a subspace of a vector space V if W ⊆ V is itself a vector space (i.e., closed under vector addition and scalar multiplication). A subspace of an inner product space inherits the same inner product and is also an inner product space. We now establish the following simple observation on the distance between a vector in an inner product space V and its subspace W.

Orthogonality principle. Let V be an inner product space and let W be a subspace of V. Let v ∈ V. Suppose that there exists w* ∈ W such that v − w* is orthogonal to every w ∈ W. Then

w* = \arg\min_{w ∈ W} ‖v − w‖.

As depicted in Figure 5.6, the orthogonal projection w* of v onto W (if it exists) is the vector in W closest to v. The proof is immediate from the Pythagorean theorem. For any w ∈ W, (v − w*) ⊥ (w* − w) by the orthogonality condition. Hence ‖v − w*‖^2 + ‖w* − w‖^2 = ‖v − w‖^2, and thus ‖v − w*‖ ≤ ‖v − w‖.

[Figure 5.6. Among all vectors in W, the orthogonal projection w* of v is the closest.]

We now consider the inner product space V that consists of all random variables (with finite second moment) on the same probability space, where

∙ the vector addition (sum) V + W of random variables V and W is a random variable,
∙ the scalar (constant) multiplication aV is a random variable, and
∙ the inner product V ⋅ W of V and W is their correlation E[VW] (which satisfies the three inner product axioms).

Fortuitously, two random variables V and W are orthogonal in the sense of Section 5.4, i.e., E[VW] = 0, iff they are orthogonal as vectors, i.e., V ⋅ W = 0. Note that the norm of V is ‖V‖ = \sqrt{E[V^2]}.
The goal of MMSE estimation can now be rephrased as follows: given the vector space V of all random variables (or of all random variables that are functions of X and Y) and a subspace W of estimators, find the X̂ ∈ W that is closest to X, that is, for which the mean square error ‖X̂ − X‖^2 is smallest.
Example 5.23 (MMSE estimator). Let W be the space of all functions g(Y) with finite second moment. It can be easily verified that W is a subspace, and hence itself an inner product space. We already know that the MMSE estimate X̂ = g*(Y) = E[X | Y] found in Section 5.6 has the property that the error X̂ − X is orthogonal to every g(Y). Hence, it minimizes the MSE among all functions of Y.

Example 5.24 (Mean). Let W be the set of all constants a ∈ ℝ. Once again this is a valid subspace. Since X − E[X] is orthogonal to W, i.e., E[(X − E[X]) a] = 0 for every a, X̂ = a* = E[X] minimizes the MSE among all constants.

Example 5.25 (LMMSE estimator). Let W be the subspace that consists of all functions of the form aY + b. Since X − (a* Y + b*), with a* and b* as given in Section 5.7, is orthogonal to every aY + b, X̂ = a* Y + b* minimizes the MSE among all affine functions of Y.

We shall later apply this orthogonality principle to find MMSE estimators in more general subspaces, such as linear combinations of multiple random variables and linear filters of random processes.
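The orthogonality property can be checked numerically, here reusing the setup of Example 5.19; the test functions of Y, the seed, and the sample size are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(8)

    # Example 5.19: X_hat = E[X | Y] = (1 - Y)/2 for the triangle density.
    n = 1_000_000
    y = 1 - np.sqrt(rng.random(n))          # Y has pdf 2(1 - y) on [0, 1]
    x = rng.random(n) * (1 - y)             # X | {Y = y} ~ Unif[0, 1 - y]
    x_hat = (1 - y) / 2

    err = x - x_hat
    # The error is (approximately) orthogonal to the estimate and to other functions of Y.
    print(np.mean(err * x_hat), np.mean(err * np.sin(y)), np.mean(err * y**2))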

5.9 JOINTLY GAUSSIAN RANDOM VARIABLES

We say that two random variables are jointly Gaussian if their joint pdf is of the form

f_{X,Y}(x, y) = \frac{1}{2π σ_X σ_Y \sqrt{1 − ρ_{X,Y}^2}} \exp\left(-\frac{1}{2(1 − ρ_{X,Y}^2)} \left[\frac{(x − μ_X)^2}{σ_X^2} + \frac{(y − μ_Y)^2}{σ_Y^2} − \frac{2ρ_{X,Y}(x − μ_X)(y − μ_Y)}{σ_X σ_Y}\right]\right).

Note that this pdf is a function only of μ_X, μ_Y, σ_X^2, σ_Y^2, and ρ_{X,Y}. Consistent with our notation, these parameters are indeed E[X], E[Y], Var(X), Var(Y), and the correlation coefficient of X and Y. In a later lecture, we shall define jointly Gaussian random variables in a more general way.
Example .. Consider the additive Gaussian noise channel in Example ., where X ∼
N(μ, P) and Z ∼ N(0, N) are independent and Y = X + Z. Then the pair X and Z, the
pair X and Y , and the pair Y and Z are jointly Gaussian.

If X and Y are jointly Gaussian, the contours of equal joint pdf are ellipses defined by the quadratic equation

\frac{(x − μ_X)^2}{σ_X^2} + \frac{(y − μ_Y)^2}{σ_Y^2} − 2ρ_{X,Y} \frac{(x − μ_X)(y − μ_Y)}{σ_X σ_Y} = c ≥ 0.

The orientation of the major axis of these ellipses is

θ = \frac{1}{2} \arctan\left(\frac{2ρ_{X,Y} σ_X σ_Y}{σ_X^2 − σ_Y^2}\right).

Figure 5.7 shows a few examples of the joint pdf.

[Figure 5.7. Joint pdfs of jointly Gaussian random variables for three parameter settings: σ_X = 1, σ_Y = 1, ρ_{X,Y} = 0; σ_X = 1, σ_Y = 1, ρ_{X,Y} = 0.4 (θ = 45°); and σ_X = 1, σ_Y = 3, ρ_{X,Y} = 0.4 (θ = 81.65°).]
Jointly Gaussian random variables X and Y satisfy the following properties.

1. They are marginally Gaussian, i.e.,

   X ∼ N(μ_X, σ_X^2) and Y ∼ N(μ_Y, σ_Y^2).




2. The conditional pdf is Gaussian, i.e.,

   X | {Y = y} ∼ N\left(\frac{ρ_{X,Y} σ_X}{σ_Y}(y − μ_Y) + μ_X, \; (1 − ρ_{X,Y}^2) σ_X^2\right),

   which shows that the MMSE estimate is linear.

3. If X and Y are jointly Gaussian and uncorrelated, i.e., ρ_{X,Y} = 0, then they are also independent.

The converse to the first property is not necessarily true, that is, Gaussian marginals do not necessarily mean that the random variables are jointly Gaussian.

Example 5.27. Let X ∼ N(0, 1) and

Z = \begin{cases} +1 & \text{w.p. } 1/2, \\ -1 & \text{w.p. } 1/2 \end{cases}

be independent, and let Y = XZ. Clearly, Y ∼ N(0, 1). However, X and Y do not have a joint pdf. Using delta functions, "f_{X,Y}(x, y)" has the form shown in Figure 5.8. Note that X and Y are uncorrelated but not independent. This does not contradict the third property, since X and Y are not jointly Gaussian.

[Figure 5.8. The "joint pdf" of X and Y in Example 5.27: delta functions concentrated on the lines y = x and y = −x.]

PROBLEMS

5.1. Inequalities. Label each of the following statements with =, ≤, or ≥. Justify each answer.
(a) 1/E[X^2] vs. E[1/X^2].
(b) (E[X])^2 vs. E[X^2].
(c) Var(X) vs. Var(E[X|Y]).
(d) E[X^2] vs. E[(E[X|Y])^2].
5.2. Cauchy–Schwarz inequality.
(a) Prove the following inequality: (E[XY])^2 ≤ E[X^2] E[Y^2]. (Hint: Use the fact that for any real t, E[(X + tY)^2] ≥ 0.)
(b) Prove that equality holds if and only if X = cY for some constant c. Find c in terms of the second moments of X and Y.
(c) Use the Cauchy–Schwarz inequality to show that the correlation coefficient satisfies |ρ_{X,Y}| ≤ 1.
(d) Prove the triangle inequality: \sqrt{E[(X + Y)^2]} ≤ \sqrt{E[X^2]} + \sqrt{E[Y^2]}.
5.3. Two envelopes. An amount A is placed in one envelope and the amount 2A is placed in another envelope. The amount A is fixed but unknown to you. The envelopes are shuffled and you are given one of them at random. Let X denote the amount you observe in this envelope, and let Y denote the amount in the other envelope. Thus

(X, Y) = \begin{cases} (A, 2A) & \text{w.p. } 1/2, \\ (2A, A) & \text{w.p. } 1/2. \end{cases}

You may keep the envelope you are given, or you can switch envelopes and receive the amount in the other envelope.
(a) Find E[X] and E[Y].
(b) Find E[X/Y] and E[Y/X].
(c) Suppose you switch. What is the expected amount you receive?
5.4. Mean and variance. Let X and Y be random variables with joint pdf

f_{X,Y}(x, y) = \begin{cases} 1 & \text{if } |x| + |y| ≤ 1/\sqrt{2}, \\ 0 & \text{otherwise}. \end{cases}

Define the random variable Z = |X| + |Y|. Find the mean and variance of Z without first finding the pdf of Z.
5.5. Tall trees. Suppose that the average height of trees on campus is h feet. Argue that no more than half of the trees are taller than 2h feet.
5.6. Let X and Y have correlation coefficient ρ_{X,Y}.
(a) What is the correlation coefficient between X and 3Y?
(b) What is the correlation coefficient between 2X and −5Y?

5.7. Random phase signal. Let Y(t) = sin(ωt + Θ) be a sinusoidal signal with random phase Θ ∼ Unif[−π, π]. Assume here that ω and t are constants. Find the mean and variance of Y(t). Do they depend on t?
5.8. Coin tosses. A coin with bias p is tossed independently until two heads or two tails come up in a row. Find the expected value of the number of tosses X.
5.9. Iterated expectation. Let Λ and X be two random variables with

Λ ∼ f_Λ(λ) = \begin{cases} \frac{5}{3} λ^{2/3} & 0 ≤ λ ≤ 1, \\ 0 & \text{otherwise}, \end{cases}

and X | {Λ = λ} ∼ Exp(λ). Find E[X].
5.10. Sum of packet arrivals. Consider a network router with two types of incoming packets, wireline and wireless. Let the random variable N_1(t) denote the number of wireline packets arriving during time (0, t] and let the random variable N_2(t) denote the number of wireless packets arriving during time (0, t]. Suppose N_1(t) and N_2(t) are independent Poisson with pmfs

P{N_1(t) = n} = \frac{(λ_1 t)^n}{n!} e^{-λ_1 t} for n = 0, 1, 2, . . . ,

P{N_2(t) = k} = \frac{(λ_2 t)^k}{k!} e^{-λ_2 t} for k = 0, 1, 2, . . . .

Let N(t) = N_1(t) + N_2(t) be the total number of packets arriving at the router during time (0, t].
(a) Find the mean E[N(t)] and variance Var(N(t)) of the total number of packet arrivals.
(b) Find the pmf of N(t).
(c) Let the random variable Y be the time to receive the first packet of either type. Find the pdf of Y.
(d) What is the probability that the first received packet is wireless?
5.11. Conditioning on an event. Let X be a random variable with pdf

f_X(x) = \begin{cases} 2(1 − x) & 0 ≤ x ≤ 1, \\ 0 & \text{otherwise}, \end{cases}

and let the event A = {X ≥ 1/3}. Find f_{X|A}(x), E[X|A], and Var(X|A).
5.12. Jointly Gaussian random variables. Let X and Y be jointly Gaussian random variables with pdf

f_{X,Y}(x, y) = \frac{1}{π \sqrt{3/4}} e^{-\frac{1}{2}(4x^2/3 + 16y^2/3 + 8xy/3 − 8x − 16y + 16)}.

Find E[X], E[Y], Var(X), Var(Y), and Cov(X, Y).

5.13. Neural net. Let Y = X + Z, where the signal X ∼ Unif[−1, 1] and the noise Z ∼ N(0, 1) are independent.
(a) Find the function g(y) that minimizes

MSE = E[(sgn(X) − g(Y))^2],

where

sgn(x) = \begin{cases} −1 & x ≤ 0, \\ +1 & x > 0. \end{cases}

(b) Plot g(y) vs. y.
5.14. Additive shot noise channel. Consider an additive noise channel Y = X + Z, where the signal X ∼ N(0, 1) and the noise Z | {X = x} ∼ N(0, x^2), i.e., the noise power grows with the square of the signal.
(a) Find E[Z^2].
(b) Find the best linear MSE estimate of X given Y.
5.15. Additive uniform noise channel. Let the signal

X = \begin{cases} +1 & \text{with probability } 1/2, \\ −1 & \text{with probability } 1/2, \end{cases}

and the noise Z ∼ Unif[−2, 2] be independent random variables. Their sum Y = X + Z is observed. Find the minimum MSE estimate of X given Y and its MSE.
5.16. Estimation vs. detection. Let the signal

X = \begin{cases} +1 & \text{with probability } 1/2, \\ −1 & \text{with probability } 1/2, \end{cases}

and the noise Z ∼ Unif[−2, 2] be independent random variables. Their sum Y = X + Z is observed.
(a) Find the best MSE estimate of X given Y and its MSE.
(b) Now suppose we use a decoder to decide whether X = +1 or X = −1 so that the probability of error is minimized. Find the optimal decoder and its probability of error. Compare the optimal decoder's MSE to the minimum MSE.
5.17. Linear estimator. Consider a channel with observation Y = XZ, where the signal X and the noise Z are uncorrelated Gaussian random variables. Let E[X] = 1, E[Z] = 2, σ_X^2 = 5, and σ_Z^2 = 8.
(a) Find the best linear MSE estimate of X given Y.
(b) Suppose your friend from Caltech tells you that he was able to derive an estimator with a lower MSE. Your friend from UCLA disagrees, saying that this is not possible because the signal and the noise are Gaussian, and hence the best linear MSE estimator is also the best MSE estimator. Could your UCLA friend be wrong?

5.18. Additive-noise channel with path gain. Consider the additive noise channel Y = b(aX + Z) shown in the figure below, where X and Z are zero mean and uncorrelated, and a and b are constants.

[Figure: the signal X is scaled by a, the noise Z is added, and the result is scaled by b to give Y = b(aX + Z).]

Find the linear MMSE estimate of X given Y and its MSE in terms only of σ_X, σ_Z, a, and b.
5.19. Worst noise distribution. Consider an additive noise channel Y = X + Z, where the signal X ∼ N(0, P) and the noise Z has zero mean and variance N. Assume X and Z are independent. Find a distribution of Z that maximizes the minimum MSE of estimating X given Y, i.e., the distribution of the worst noise Z that has the given mean and variance. Justify your answer.
5.20. Image processing. A pixel signal X ∼ Unif[−k, k] is digitized to obtain

X̃ = i + \frac{1}{2}, \quad \text{if } i < X ≤ i + 1, \quad i = −k, −k + 1, . . . , k − 2, k − 1.

To improve the visual appearance, the digitized value X̃ is dithered by adding an independent noise Z with mean E[Z] = 0 and variance Var(Z) = N to obtain Y = X̃ + Z.
(a) Find the correlation of X and Y.
(b) Find the best linear MSE estimate of X given Y. Your answer should be in terms only of k, N, and Y.
5.21. Orthogonality. Let X̂ be the minimum MSE estimate of X given Y.
(a) Show that for any function g(y), E[(X − X̂) g(Y)] = 0, i.e., the error X − X̂ and g(Y) are orthogonal.
(b) Show that

Var(X) = E[Var(X|Y)] + Var(X̂).

Provide a geometric interpretation of this result.


5.22. Difference from sum. Let X and Y be two random variables. Let Z = X + Y and W = X − Y. Find the best linear estimate of W given Z as a function of E[X], E[Y], σ_X, σ_Y, ρ_{X,Y}, and Z.

5.23. Nonlinear and linear estimation. Let X and Y be two random variables with joint pdf

f_{X,Y}(x, y) = \begin{cases} x + y & 0 ≤ x ≤ 1, \; 0 ≤ y ≤ 1, \\ 0 & \text{otherwise}. \end{cases}

(a) Find the MMSE estimator of X given Y.
(b) Find the corresponding MSE.
(c) Find the pdf of Z = E[X|Y].
(d) Find the linear MMSE estimator of X given Y.
(e) Find the corresponding MSE.
5.24. Additive-noise channel with signal-dependent noise. Consider the channel with correlated signal X and noise Z and observation Y = 2X + Z, where

μ_X = 1, \quad μ_Z = 0, \quad σ_X^2 = 4, \quad σ_Z^2 = 9, \quad ρ_{X,Z} = −\frac{3}{8}.

Find the best linear MSE estimate of X given Y.
