Lecture 5
Expectation
Let X ∈ 𝒳 be a discrete random variable with pmf p_X(x) and let g(x) be a function of x.
The expectation (or expected value or mean) of g(X) is defined as
\[
E[g(X)] = \sum_x g(x)\, p_X(x).
\]
For a continuous random variable X ∼ f_X(x), the expected value of g(X) is defined as
\[
E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx.
\]
By considering Y = g(X) as a random variable on its own, we can compute the same expectation:
\begin{align*}
E[Y] &= \sum_y y\, p_Y(y) \\
&= \sum_y y \sum_{x:\, g(x) = y} p_X(x) \\
&= \sum_y \sum_{x:\, g(x) = y} g(x)\, p_X(x) \\
&= \sum_x g(x)\, p_X(x).
\end{align*}
Thus, E[Y] = E[g(X)] can be found using either p_X(x) or p_Y(y). It is often much easier to
use p_X(x) than to first find p_Y(y) and then find E[Y].
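As a quick numerical illustration, the following minimal Python sketch computes E[g(X)] both ways for a small discrete example; the pmf and the function g(x) = x² are arbitrary choices made here for illustration.

```python
from collections import defaultdict

# An arbitrary discrete pmf p_X and function g, chosen only for illustration.
p_X = {-1: 0.25, 0: 0.25, 1: 0.25, 2: 0.25}
g = lambda x: x ** 2

# E[g(X)] computed directly from p_X (no need for p_Y).
E_direct = sum(g(x) * p for x, p in p_X.items())

# E[Y] computed by first building the pmf p_Y of Y = g(X).
p_Y = defaultdict(float)
for x, p in p_X.items():
    p_Y[g(x)] += p
E_via_pY = sum(y * p for y, p in p_Y.items())

print(E_direct, E_via_pY)  # both equal 1.5
```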
We already know that a random variable is completely specified by its cumulative distribution function (or by its pmf or pdf in the discrete and continuous cases, respectively); that is, the probability of any Borel set involving the random variable can be determined from it.
As a simple summary of the random variable, however, its expectation has several appli-
cations.
∙ Expectation can be used to bound or estimate probabilities of interesting events, as we will see in Section ..
∙ Expectation provides the optimal estimate of a random variable under the mean square error criterion, as we will see in Section ..
∙ It is far easier to estimate the expectation of a random variable from data than to estimate its distribution, as we will see in Lecture #.
For a nonnegative continuous random variable X, writing x = ∫₀ˣ du and exchanging the order of integration gives
\[
E[X] = \int_0^{\infty} x\, f_X(x)\, dx = \int_0^{\infty}\int_x^{\infty} f_X(u)\, du\, dx = \int_0^{\infty} \bigl(1 - F_X(x)\bigr)\, dx.
\]
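This identity is easy to check numerically; the sketch below assumes X ∼ Exp(1) (so that 1 − F_X(x) = e^{−x} and E[X] = 1) purely as an example.

```python
import numpy as np

# Numerically evaluate the tail integral for X ~ Exp(1), where 1 - F_X(x) = exp(-x).
x = np.linspace(0.0, 50.0, 200001)
dx = x[1] - x[0]
tail_integral = float(np.sum(np.exp(-x)) * dx)     # crude Riemann sum, ≈ 1

# Compare with a Monte Carlo estimate of E[X].
rng = np.random.default_rng(0)
sample_mean = rng.exponential(1.0, 10**6).mean()   # ≈ 1

print(tail_integral, sample_mean)
```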
Let
\[
\mathbf{1}_A(x) = \begin{cases} 1 & x \in A, \\ 0 & \text{otherwise.} \end{cases}
\]
Then, the expectation of the indicator variable is
\[
E[\mathbf{1}_A(X)] = \int_{-\infty}^{\infty} \mathbf{1}_A(x)\, f_X(x)\, dx = \int_A f_X(x)\, dx = P\{X \in A\}.
\]
The variance of X is
\begin{align*}
\operatorname{Var}(X) &= E[(X - E[X])^2] \\
&= E[X^2 - 2X\,E[X] + (E[X])^2] \\
&= E[X^2] - 2(E[X])^2 + (E[X])^2 \\
&= E[X^2] - (E[X])^2.
\end{align*}
Example .. Let (X, Y) ∼ f_{X,Y}(x, y) with
\[
f_{X,Y}(x, y) = \begin{cases} 2 & x \ge 0,\ y \ge 0,\ x + y \le 1, \\ 0 & \text{otherwise.} \end{cases}
\]
Consider
\begin{align*}
E[X] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\, f_{X,Y}(x, y)\, dx\, dy \\
&= \int_0^1 \int_0^{1-x} 2x\, dy\, dx \\
&= 2\int_0^1 (1 - x)x\, dx \\
&= 2\left(\frac{1}{2} - \frac{1}{3}\right) = \frac{1}{3}
\end{align*}
and
\[
E[X^2] = 2\int_0^1 (1 - x)x^2\, dx = 2\left(\frac{1}{3} - \frac{1}{4}\right) = \frac{1}{6}.
\]
Hence,
\[
\operatorname{Var}(X) = E[X^2] - (E[X])^2 = \frac{1}{6} - \left(\frac{1}{3}\right)^2 = \frac{1}{6} - \frac{1}{9} = \frac{1}{18}.
\]
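A quick Monte Carlo check of these two values; the sketch below uses the standard trick of folding a uniform point in the unit square across the line x + y = 1 to obtain a uniform point on the triangle.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Sample (X, Y) uniformly on the triangle {x >= 0, y >= 0, x + y <= 1} (density 2).
u, v = rng.random(n), rng.random(n)
flip = u + v > 1
x = np.where(flip, 1 - u, u)
y = np.where(flip, 1 - v, v)

print(x.mean())   # ≈ 1/3
print(x.var())    # ≈ 1/18 ≈ 0.0556
```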
The following useful identities are direct consequences of the linearity of expectation:
\[
E(aX + b) = a\,E(X) + b, \qquad \operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X).
\]
Table . summarizes the mean and variance of famous random variables.
Bern(p):       mean p,           variance p(1 − p)
Geom(p):       mean 1/p,         variance (1 − p)/p²
Binom(n, p):   mean np,          variance np(1 − p)
Poisson(λ):    mean λ,           variance λ
Unif[a, b]:    mean (a + b)/2,   variance (b − a)²/12
Exp(λ):        mean 1/λ,         variance 1/λ²
N(μ, σ²):      mean μ,           variance σ²
Example .. Let
\[
f_X(x) = \begin{cases} 1/x^2 & 1 \le x < \infty, \\ 0 & \text{otherwise.} \end{cases}
\]
Then
\[
E[X] = \int_1^{\infty} \frac{x}{x^2}\, dx = \infty.
\]
Remark 5.2. Expectation may not exist. To find conditions under which expectation ex-
ists, consider
\[
E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\, dx = -\int_{-\infty}^{0} |x|\, f_X(x)\, dx + \int_{0}^{\infty} |x|\, f_X(x)\, dx,
\]
so either ∫_{−∞}^{0} |x| f_X(x) dx or ∫_{0}^{∞} |x| f_X(x) dx must be finite.
Example .. The standard Cauchy random variable has the pdf
\[
f_X(x) = \frac{1}{\pi(1 + x^2)}.
\]
Since both ∫_{−∞}^{0} |x| f_X(x) dx and ∫_{0}^{∞} |x| f_X(x) dx are infinite, its mean does not exist!
5.3 INEQUALITIES
In many cases we do not know the distribution of a random variable X, but wish to find the
probability of an event such as {X > a} or {|X − E(X)| > a}. The Markov and Chebyshev
inequalities provide upper bounds on the probabilities of such events in terms of the mean
and variance of the random variable.
Markov inequality. Let X ≥ 0 be a random variable with finite mean. Then for any
a > 1,
\[
P\{X \ge a\,E[X]\} \le \frac{1}{a}.
\]
Example .. If the average age in San Diego is , then at most half of the population is  or older.
To prove the Markov inequality, let A = {x ≥ a E(X)} and consider the indicator func-
tion 1A (x). As illustrated in Figure .,
[Figure .: the indicator 1_A(x) lies below the line x/(a E[X]) for x ≥ 0.]
\[
\mathbf{1}_A(x) \le \frac{x}{a\,E[X]}, \qquad x \ge 0.
\]
Since E(1A (X)) = P{X ≥ a E[X]}, taking the expectations of both sides establishes the in-
equality.
The Markov inequality can be very loose. For example, if X ∼ Exp(1), then
\[
P\{X \ge 10\} = e^{-10} \approx 4.54 \times 10^{-5}.
\]
The Markov inequality yields
\[
P\{X \ge 10\} \le \frac{1}{10},
\]
which is very pessimistic. But it is the tightest possible inequality on P{X ≥ a E[X]} when
we are given only E[X]. To show this, note that the inequality is tight for
\[
X = \begin{cases} a\,E[X] & \text{w.p. } 1/a, \\ 0 & \text{w.p. } 1 - 1/a. \end{cases}
\]
In Example ., if half of the population is  years old and the other half is  years old, then the average age is  and the Markov inequality is tight.
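A small numerical comparison of the Markov bound with the exact tail in the Exp(1) example, together with the two-point distribution for which the bound is tight:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 10.0                          # threshold a * E[X], with E[X] = 1 for Exp(1)

x = rng.exponential(1.0, 10**6)
print((x >= a).mean())            # ≈ e^{-10} ≈ 4.5e-5
print(1 / a)                      # Markov bound: 0.1 (very loose here)

# Two-point distribution X = a*E[X] w.p. 1/a, 0 otherwise: the bound is tight.
z = np.where(rng.random(10**6) < 1 / a, a, 0.0)
print((z >= a).mean(), 1 / a)     # both ≈ 0.1
```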
Chebyshev inequality. Let X be a random variable with finite mean E[X] and variance σ_X². Then for any a > 1,
\[
P\{|X - E[X]| \ge a\sigma_X\} \le \frac{1}{a^2}.
\]
Example .. Let X be a device parameter in an integrated circuit (IC) with known mean
and variance. The IC is out-of-spec if X is more than, say, 3σX away from its mean. Then,
by the Chebyshev inequality, the fraction of out-of-spec ICs, namely, P{|X − E(X)| ≥ 3σX }
is no larger than 1/9.
The proof of the Chebyshev inequality uses the Markov inequality (a slight twist on the teacher–student relationship between Prof. Pafnuty Chebyshev and his student Andrey Markov at Saint Petersburg University in Russia). Define the random variable Y = (X − E[X])² ≥ 0. Since E[Y] = σ_X², the Markov inequality implies that
\[
P\{|X - E[X]| \ge a\sigma_X\} = P\{Y \ge a^2 E[Y]\} \le \frac{1}{a^2}.
\]
For a Gaussian X and a = 3, for instance, the Chebyshev bound gives 1/9 ≈ 0.11,
which is very pessimistic compared to the actual value 2Q(3) ≈ 2 × 10⁻³. But it is the tight-
est upper bound on P{|X − E(X)| ≥ aσX } given knowledge only of the mean and variance
of X. Indeed, the inequality holds with equality for the random variable
\[
X = \begin{cases} E(X) + a\sigma_X & \text{w.p. } 1/(2a^2), \\ E(X) - a\sigma_X & \text{w.p. } 1/(2a^2), \\ E(X) & \text{w.p. } 1 - 1/a^2. \end{cases}
\]
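A quick check of the Chebyshev bound against the exact Gaussian tail 2Q(3), and of the three-point distribution above; σ_X = 1 and a = 3 are assumed in the sketch.

```python
import math
import numpy as np

a = 3.0
cheb_bound = 1 / a**2                        # 1/9 ≈ 0.111
exact_gauss = math.erfc(a / math.sqrt(2))    # 2Q(3) ≈ 2.7e-3
print(cheb_bound, exact_gauss)

# Three-point distribution (with sigma_X = 1) for which the bound holds with equality.
rng = np.random.default_rng(0)
u = rng.random(10**6)
x = np.where(u < 1 / (2 * a**2), a, np.where(u < 1 / a**2, -a, 0.0))
print((np.abs(x) >= a).mean(), 1 / a**2)     # both ≈ 1/9
```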
We now discuss an extremely useful inequality that is named after the Danish mathematician Johan Jensen and is centered around the notion of convexity. A function g(x) is said to be convex if
\[
g(x) \le \frac{g(b) - g(a)}{b - a}\,(x - a) + g(a)
\]
for all x ∈ [a, b] and all a < b, that is, the function curve is below every chord across two points on the curve. If g(x) is twice differentiable, then g(x) is convex iff
\[
g''(x) \ge 0.
\]
If −g(x) is convex, then g(x) is called concave.
Example 5.5. The following functions are convex: (a) g(x) = ax + b. (b) g(x) = x². (c) g(x) = |x|^p, p ≥ 1. (d) g(x) = x log x, x > 0. (e) g(x) = 1/x, x > 0.
Example 5.6. The following functions are concave: (a) g(x) = ax + b. (b) g(x) = √x, x > 0. (c) g(x) = log x, x > 0.
Jensen’s inequality. Let X be a random variable with finite mean E[X] and let g(x) be a function such that E[g(X)] is finite. If g(x) is convex, then
\[
E[g(X)] \ge g(E[X]).
\]
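A minimal numerical check of Jensen's inequality, assuming X ∼ Exp(1), the convex function g(x) = x², and the concave function log x (for which the inequality reverses):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(1.0, 10**6)    # E[X] = 1, E[X^2] = 2

g = lambda t: t ** 2               # convex: E[g(X)] >= g(E[X])
print(g(x).mean(), g(x.mean()))    # ≈ 2 vs ≈ 1

h = np.log                         # concave: E[h(X)] <= h(E[X])
print(h(x).mean(), h(x.mean()))    # ≈ -0.577 vs ≈ 0
```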
Let (X, Y) ∼ f_{X,Y}(x, y) and let g(x, y) be a function of x and y. The expectation of g(X, Y) is
\[
E[g(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\, f_{X,Y}(x, y)\, dx\, dy.
\]
The correlation of X and Y is defined as E[XY], and the covariance of X and Y is defined as
\[
\operatorname{Cov}(X, Y) = E\bigl[(X - E[X])(Y - E[Y])\bigr] = E[XY] - E[X]\,E[Y].
\]
We say that X and Y are uncorrelated if Cov(X, Y ) = 0. Note that Cov(X, X) = Var(X).
The correlation coefficient of X and Y is defined as
\[
\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.
\]
It can be shown that
\[
|\rho_{X,Y}| \le 1.
\]
Note that ρ_{X,Y} = ±1 iff
\[
\frac{X - E[X]}{\sigma_X} = \pm\,\frac{Y - E[Y]}{\sigma_Y},
\]
that is, iff X − E[X] is a linear function of Y − E[Y ]. We shall see in Section . that ρ X ,Y is
a measure of how closely X − E[X] can be approximated or estimated by a linear function
of Y − E[Y ].
Example .. We find the correlation, covariance, and correlation coefficient for (X, Y) ∼ f_{X,Y}(x, y), where
\[
f_{X,Y}(x, y) = \begin{cases} 2 & x \ge 0,\ y \ge 0,\ x + y \le 1, \\ 0 & \text{otherwise.} \end{cases}
\]
Recall from Example . that E[X] = 1/3 and Var(X) = 1/18. By symmetry, E[Y ] = 1/3
and Var(Y ) = 1/18 as well. Consider
\begin{align*}
E[XY] &= 2\int_0^1\int_0^{1-x} xy\, dy\, dx \\
&= \int_0^1 x(1 - x)^2\, dx \\
&= \frac{1}{12},
\end{align*}
and
\[
\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y] = \frac{1}{12} - \frac{1}{9} = -\frac{1}{36}.
\]
Finally,
\[
\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = \frac{-\tfrac{1}{36}}{\tfrac{1}{18}} = -\frac{1}{2}.
\]
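The same fold-the-square sampler used earlier gives a Monte Carlo check of these three values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
u, v = rng.random(n), rng.random(n)
flip = u + v > 1
x, y = np.where(flip, 1 - u, u), np.where(flip, 1 - v, v)   # uniform on the triangle

exy = (x * y).mean()
cov = exy - x.mean() * y.mean()
rho = cov / np.sqrt(x.var() * y.var())
print(exy, cov, rho)   # ≈ 1/12, -1/36 ≈ -0.0278, -1/2
```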
If X and Y are independent, then
\begin{align*}
E[XY] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f_{X,Y}(x, y)\, dx\, dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f_X(x)\, f_Y(y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} x\, f_X(x)\, dx \int_{-\infty}^{\infty} y\, f_Y(y)\, dy \\
&= E[X]\,E[Y],
\end{align*}
and hence Cov(X, Y) = 0, that is, X and Y are uncorrelated.
However, that X and Y are uncorrelated does not necessarily imply that they are indepen-
dent.
Example .. Consider the pmf p_{X,Y}(x, y) described by the following table (rows indexed by y, columns by x):

            x = −1   x = 0   x = 1
y = −1       1/6      0       1/6
y =  0        0      1/3       0
y =  1       1/6      0       1/6
Clearly X and Y are not independent. But it can be readily checked that E[X] = E[Y ] =
E[XY ] = 0. Thus Cov(X, Y ) = 0, that is, X and Y are uncorrelated.
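A short sketch that evaluates the table directly and confirms that Cov(X, Y) = 0 while, e.g., p_{X,Y}(0, 0) ≠ p_X(0) p_Y(0):

```python
import numpy as np

# Joint pmf from the table: rows indexed by y in {-1, 0, 1}, columns by x in {-1, 0, 1}.
xs = np.array([-1, 0, 1])
ys = np.array([-1, 0, 1])
p = np.array([[1/6, 0, 1/6],
              [0, 1/3, 0],
              [1/6, 0, 1/6]])      # p[i, j] = P{Y = ys[i], X = xs[j]}

p_x = p.sum(axis=0)                # marginal of X: (1/3, 1/3, 1/3)
p_y = p.sum(axis=1)                # marginal of Y: (1/3, 1/3, 1/3)

E_X = (xs * p_x).sum()
E_Y = (ys * p_y).sum()
E_XY = (np.outer(ys, xs) * p).sum()
print(E_X, E_Y, E_XY)              # all 0, so Cov(X, Y) = 0
print(p[1, 1], p_x[1] * p_y[1])    # 1/3 vs 1/9: X and Y are not independent
```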
Let (X, Y ) ∼ f X ,Y (x, y). Recall that the conditional pdf of X given Y = y is
\[
f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)},
\]
if f_Y(y) > 0. Since f_{X|Y}(x|y) is a pdf for X (for each y), we can define the expectation of any function g(X, Y) w.r.t. f_{X|Y}(x|y) as
\[
E[g(X, Y) \mid Y = y] = \int_{-\infty}^{\infty} g(x, y)\, f_{X|Y}(x \mid y)\, dx,
\]
which is a function of y.
Example .. Let
\[
f_{X,Y}(x, y) = \begin{cases} 2 & \text{if } x \ge 0,\ y \ge 0,\ x + y \le 1, \\ 0 & \text{otherwise.} \end{cases}
\]
From Lecture #, we already know that X | {Y = y} ∼ Unif[0, 1 − y]. Thus, E[X|Y = y] =
(1 − y)/2.
Let φ(y) = E[g(X, Y) | Y = y]. We define the conditional expectation of g(X, Y) given Y as
\[
E[g(X, Y) \mid Y] = \varphi(Y).
\]
In other words, the random variable E[g(X, Y) | Y] is a function of Y that takes the value E[g(X, Y) | Y = y] when Y = y.
Law of iterated expectation. The following observation is very useful in computing ex-
pectation:
\begin{align*}
E\bigl[E[g(X, Y) \mid Y]\bigr] &= \int_{-\infty}^{\infty} E[g(X, Y) \mid Y = y]\, f_Y(y)\, dy \\
&= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} g(x, y)\, f_{X|Y}(x \mid y)\, dx\right) f_Y(y)\, dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\, f_{X,Y}(x, y)\, dx\, dy \\
&= E[g(X, Y)].
\end{align*}
Example .. In Example . above, consider the random variable
\[
E[X \mid Y] = \frac{1 - Y}{2} =: Z.
\]
The pdf of Z is
\[
f_Z(z) = 8z, \qquad 0 < z \le \tfrac{1}{2},
\]
which is illustrated in Figure .. Note that
\[
E[Z] = \int_0^{1/2} 8z^2\, dz = \frac{1}{3} = E[X],
\]
as is expected from the law of iterated expectation. Similarly,
\begin{align*}
E[XY] &= E\bigl[E[XY \mid Y]\bigr] \\
&= E\!\left[\frac{Y(1 - Y)}{2}\right] \\
&= \int_0^1 \frac{y(1 - y)}{2}\cdot 2(1 - y)\, dy = \frac{1}{12},
\end{align*}
which agrees with the direct integration computed in Example ..
Example 5.17. A coin has random bias P ∈ [0, 1] with pdf fP (p) = 2(1 − p). The coin is
flipped n times. Let N be the number of heads, that is, N | {P = p} ∼ Binom(n, p). Then,
by the law of iterated expectation, we can find
\begin{align*}
E[N] &= E\bigl[E[N \mid P]\bigr] \\
&= E[nP] \\
&= n\,E[P] \\
&= n\int_0^1 2(1 - p)\,p\, dp = \frac{n}{3},
\end{align*}
which is much simpler than finding the pmf of N and computing the expectation.
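A Monte Carlo check of E[N] = n/3; the inverse-CDF step P = 1 − √U for the pdf f_P(p) = 2(1 − p) and the choice n = 20 are illustrative assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 10**6

# Bias P has pdf f_P(p) = 2(1 - p) on [0, 1]; its inverse CDF gives P = 1 - sqrt(U).
p = 1 - np.sqrt(rng.random(trials))
N = rng.binomial(n, p)              # N | {P = p} ~ Binom(n, p)

print(N.mean(), n / 3)              # both ≈ 6.67
```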
Example 5.18. Let E[X | Y] = Y² and Y ∼ Unif[0, 1]. In this case, we cannot find the pdf of X, since we do not know f_{X|Y}(x|y). But using iterated expectation we can still find
\[
E[X] = E\bigl[E[X \mid Y]\bigr] = E[Y^2] = \int_0^1 y^2\, dy = \frac{1}{3}.
\]
We define the conditional variance of X given Y = y as the variance of X w.r.t. f_{X|Y}(x|y), i.e.,
\[
\operatorname{Var}(X \mid Y = y) = E\bigl[(X - E[X \mid Y = y])^2 \mid Y = y\bigr].
\]
The random variable Var(X | Y) is a function of Y that takes on the values Var(X | Y = y). Its expected value is
\[
E[\operatorname{Var}(X \mid Y)] = \int_{-\infty}^{\infty} \operatorname{Var}(X \mid Y = y)\, f_Y(y)\, dy.
\]
Consider the signal estimation system depicted in Figure ., where the original signal is
X ∼ f X (x) and its noisy observation is
Y | {X = x} ∼ fY|X (y|x).
[Figure .: the signal X ∼ f_X(x) is passed through a noisy channel f_{Y|X}(y|x); an estimator g(y) produces X̂ = g(Y) from the observation Y.]
The goal is to find the estimator g(y) that minimizes the mean square error (MSE)
\[
E[(X - g(Y))^2]. \qquad (.)
\]
The estimator g^*(y) that attains the smallest value of (.) is referred to as the minimum mean square error (MMSE) estimator of X given Y, and X̂ = g^*(Y) is referred to as the MMSE estimate.
Suppose that there is no observation and let a^* be the MMSE estimate of X, that is,
\[
a^* = \arg\min_a E[(X - a)^2].
\]
Then,
\[
a^* = E[X]. \qquad (.)
\]
In other words, the mean is the optimal summary of X under the mean square error criterion. To prove (.), note that for any estimate a of X,
\[
E[(X - a)^2] = E\bigl[((X - E[X]) + (E[X] - a))^2\bigr] = \operatorname{Var}(X) + (E[X] - a)^2,
\]
which is minimized iff a = E[X].
Given the observation Y = y, the same argument applied to the conditional pdf f_{X|Y}(x | y) shows that the MMSE estimator is g^*(y) = E[X | Y = y]. Consequently,
\[
E[(X - g^*(Y))^2] \le E[(X - g(Y))^2]
\]
for every function g(y). The MMSE estimate X̂ = E[X | Y] has the following properties.
∙ The estimation error has zero conditional mean:
\[
E[X - \hat{X} \mid Y = y] = 0.
\]
∙ Hence, the error is orthogonal to the estimate:
\[
E[(X - \hat{X})\hat{X}] = E\bigl[E[(X - \hat{X})\hat{X} \mid Y]\bigr] = E\bigl[\hat{X}\,E[X - \hat{X} \mid Y]\bigr] = 0.
\]
More generally, the error is orthogonal to every function of the observation:
\[
E[(X - \hat{X})\,g(Y)] = 0.
\]
∙ By the law of conditional variance Var(X) = Var(X̂) + E[Var(X | Y)], the sum of the variance of the estimate and its MSE is equal to the variance of the signal.
∙ If X and Y are independent, then X̂ = E[X], that is, the observation is ignored.
Example .. Let
\[
f_{X,Y}(x, y) = \begin{cases} 2 & x \ge 0,\ y \ge 0,\ x + y \le 1, \\ 0 & \text{otherwise.} \end{cases}
\]
We find the MMSE estimate of X given Y and its MSE. We already know that the MMSE estimate is
\[
E[X \mid Y] = \frac{1 - Y}{2},
\]
and, since X | {Y = y} ∼ Unif[0, 1 − y], that Var(X | Y = y) = (1 − y)²/12.
Hence, the MMSE is E[Var(X |Y )] = 1/24, compared to Var(X) = 1/18. The difference is
Var(E[X |Y ]) = 1/72, which is the variance of the estimate.
[Figure .: the conditional mean E[X | Y = y] = (1 − y)/2 and the conditional variance Var(X | Y = y) = (1 − y)²/12, plotted as functions of y for 0 ≤ y ≤ 1.]
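A Monte Carlo check of the three quantities in this example, using the same triangle sampler as before:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
u, v = rng.random(n), rng.random(n)
flip = u + v > 1
x, y = np.where(flip, 1 - u, u), np.where(flip, 1 - v, v)   # uniform on the triangle

x_hat = (1 - y) / 2                    # MMSE estimate E[X | Y]
print(((x - x_hat) ** 2).mean())       # ≈ 1/24 ≈ 0.0417 (the MMSE)
print(x.var())                         # ≈ 1/18 ≈ 0.0556 (MSE with no observation)
print(x_hat.var())                     # ≈ 1/72 ≈ 0.0139 (variance of the estimate)
```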
Example .. Consider the additive Gaussian noise channel Y = X + Z, where the signal X ∼ N(μ, P) and the noise Z ∼ N(0, N) are independent. We find the MMSE estimate of X given Y and its MSE, i.e., E[X | Y] and E[Var(X | Y)]. Recall that Y | {X = x} ∼ N(x, N) and Y ∼ N(μ, P + N), that is,
\[
f_{Y|X}(y \mid x) = f_Z(y - x) = \frac{1}{\sqrt{2\pi N}}\, e^{-\frac{(y - x)^2}{2N}}
\]
and
\[
f_Y(y) = \frac{1}{\sqrt{2\pi(P + N)}}\, e^{-\frac{(y - \mu)^2}{2(P + N)}}.
\]
Hence,
\begin{align*}
f_{X|Y}(x \mid y) &= \frac{f_X(x)\, f_{Y|X}(y \mid x)}{f_Y(y)} \\
&= \frac{\dfrac{1}{\sqrt{2\pi P}}\, e^{-\frac{(x - \mu)^2}{2P}} \cdot \dfrac{1}{\sqrt{2\pi N}}\, e^{-\frac{(y - x)^2}{2N}}}{\dfrac{1}{\sqrt{2\pi(P + N)}}\, e^{-\frac{(y - \mu)^2}{2(P + N)}}} \\
&= \frac{1}{\sqrt{2\pi\,\frac{PN}{P + N}}}\,
\exp\!\left(-\frac{\left(x - \frac{P}{P + N}\,y - \frac{N}{P + N}\,\mu\right)^2}{2\,\frac{PN}{P + N}}\right),
\end{align*}
or equivalently,
\[
X \mid \{Y = y\} \sim \mathrm{N}\!\left(\frac{P}{P + N}\,y + \frac{N}{P + N}\,\mu,\ \frac{PN}{P + N}\right).
\]
Thus,
\[
E[X \mid Y] = \frac{P}{P + N}\,Y + \frac{N}{P + N}\,\mu,
\]
which is a convex combination of the observation Y and the mean μ (MMSE estimate
without observation), and tends to Y as N → 0 and to μ as N → ∞. The corresponding
MSE is
\[
E[\operatorname{Var}(X \mid Y)] = E\!\left[\frac{PN}{P + N}\right] = \frac{PN}{P + N},
\]
which is less than P, the MSE without the observation Y . Note that the conditional vari-
ance Var(X |Y ) is independent of Y .
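A simulation sketch of this example; the parameter values μ = 1, P = 4, N = 1 (so PN/(P + N) = 0.8) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, P, N = 1.0, 4.0, 1.0
n = 10**6

x = mu + np.sqrt(P) * rng.standard_normal(n)    # X ~ N(mu, P)
z = np.sqrt(N) * rng.standard_normal(n)         # Z ~ N(0, N)
y = x + z

x_hat = P / (P + N) * y + N / (P + N) * mu      # MMSE estimate E[X | Y]
print(((x - x_hat) ** 2).mean())                # ≈ P*N/(P+N) = 0.8
print(P * N / (P + N))
```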
In the above two examples, the MMSE estimate turned out to be an affine function of
Y (i.e., of the form aY + b). This is not always the case.
Example .. Let
\[
f(x \mid y) = \begin{cases} y e^{-yx} & x \ge 0,\ y > 0, \\ 0 & \text{otherwise.} \end{cases}
\]
Then,
\[
E[X \mid Y] = \frac{1}{Y}.
\]
Remark .. There can be alternative criteria for measuring the goodness of estimators. For example, instead of the MSE criterion in (.), which was introduced in the 19th century by Legendre and Gauss, one may measure the mean absolute error (MAE)
\[
E\,|X - g(Y)|,
\]
which dates back to Boscovich and Laplace in the preceding century. It can be shown that the minimum MAE estimate is the conditional median, that is, any x̂(y) such that
\[
P\{X \le \hat{x}(y) \mid Y = y\} \ge \tfrac{1}{2} \quad \text{and} \quad P\{X \ge \hat{x}(y) \mid Y = y\} \ge \tfrac{1}{2}.
\]
To find the MMSE estimate, one needs to know the statistics of the signal and the channel,
namely, f X ,Y (x, y), or at least, f X|Y (x|y), which is rarely the case in practice. We typically
have estimates only of the first and second moments of the signal and the observation,
i.e., the means, variances, and covariance of X and Y . This is not, in general, sufficient
information for computing the MMSE estimate, but as we shall see is enough to compute
the linear MMSE (LMMSE) estimate of the signal X given the observation Y , i.e., the
estimate of the form
\[
\hat{X} = aY + b
\]
that minimizes the MSE
\[
E[(X - \hat{X})^2] = E[(X - aY - b)^2].
\]
The LMMSE estimate turns out to be
\begin{align*}
\hat{X} = a^* Y + b^* &= \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(Y)}\,(Y - E[Y]) + E[X] \\
&= \rho_{X,Y}\,\sigma_X\,\frac{Y - E[Y]}{\sigma_Y} + E[X] \qquad (.)
\end{align*}
with the corresponding MSE
\begin{align*}
E[(X - a^* Y - b^*)^2] &= \operatorname{Var}(X) - \frac{(\operatorname{Cov}(X, Y))^2}{\operatorname{Var}(Y)} \\
&= (1 - \rho_{X,Y}^2)\,\sigma_X^2.
\end{align*}
To show this, note that for a fixed a, the MSE E[(X − aY − b)²] is minimized by b^*(a) = E[X − aY] = E[X] − a E[Y]. Hence, under this choice, the MSE
can be written as a quadratic function in a as
\begin{align*}
E[(X - aY - b^*(a))^2] &= E\bigl[((X - E[X]) - a(Y - E[Y]))^2\bigr] \\
&= \operatorname{Var}(X) - 2a\operatorname{Cov}(X, Y) + a^2\operatorname{Var}(Y),
\end{align*}
which is minimized at
\[
a^* = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(Y)}
\]
with the minimum
\[
\operatorname{Var}(X) - \frac{(\operatorname{Cov}(X, Y))^2}{\operatorname{Var}(Y)}.
\]
Figure .. The MSE as a function of a and b.
Note the following special cases.
∙ If ρ_{X,Y} = 0, i.e., X and Y are uncorrelated, then the observation is ignored and X̂ = E[X].
∙ If ρ_{X,Y} = ±1, i.e., (X − E(X)) and (Y − E(Y)) are linearly dependent, then the linear estimate is perfect and X̂ = X.
The LMMSE estimate is not, in general, as good as the MMSE estimate.
Example .. Let Y ∼ Unif[−1, 1] and X = Y². The MMSE estimate of X given Y is Y², which is perfect. To find the LMMSE estimate we compute
\[
E[Y] = 0, \qquad E[X] = \int_{-1}^{1} \frac{1}{2}\,y^2\, dy = \frac{1}{3},
\]
and
\[
\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y] = E[Y^3] = \frac{1}{2}\int_{-1}^{1} y^3\, dy = 0.
\]
Thus, the LMMSE estimate X̂ = E(X) = 1/3, i.e., the observation Y is totally ignored, even though it completely determines X.
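A sketch that computes the LMMSE coefficients from sample moments for this example and compares the resulting MSE with Var(X) = 4/45 and with the perfect MMSE estimate Y²:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(-1.0, 1.0, 10**6)
x = y ** 2

# LMMSE coefficients from sample moments: a* = Cov(X, Y)/Var(Y), b* = E[X] - a* E[Y].
a = ((x - x.mean()) * (y - y.mean())).mean() / y.var()
b = x.mean() - a * y.mean()
x_lin = a * y + b

print(a, b)                          # a ≈ 0, b ≈ 1/3: the observation is ignored
print(((x - x_lin) ** 2).mean())     # ≈ Var(X) = 4/45 ≈ 0.089
print(((x - y ** 2) ** 2).mean())    # 0: the MMSE estimate is perfect
```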
For both the nonlinear and linear MMSE estimation problems discussed in the previous two sections, we found that the estimation error is orthogonal to the optimal estimate. This orthogonality property is a fundamental characteristic of an optimal estimator that minimizes the MSE among a class of estimators, and it can be used to find the optimal estimator via a simple geometric argument.
First, we introduce some mathematical background. A vector space V consists of a set
of vectors that are closed under two operations:
∙ Vector addition: if u, v ∈ V, then u + v ∈ V.
∙ Scalar multiplication: if a ∈ ℝ and u ∈ V, then au ∈ V.
An inner product is a real-valued operation u ⋅ v satisfying these three conditions:
∙ Commutativity: u ⋅ v = v ⋅ u.
∙ Linearity: (au + v) ⋅ w = a(u ⋅ w) + v ⋅ w.
∙ Nonnegativity: u ⋅ u ≥ 0, and u ⋅ u = 0 iff u = 0.
A vector space with an inner product is referred to as an inner product space. For example, the Euclidean space
\[
\mathbb{R}^n = \{x = (x_1, x_2, \ldots, x_n) : x_1, x_2, \ldots, x_n \in \mathbb{R}\}
\]
with the inner product x ⋅ y = Σ_{i=1}^{n} x_i y_i is an inner product space, in which the norm of x is
\[
\|x\| = \Bigl(\sum_{i=1}^{n} x_i^2\Bigr)^{1/2}
\]
and the distance between x and y is
\[
\|x - y\| = \Bigl(\sum_{i=1}^{n} (x_i - y_i)^2\Bigr)^{1/2}.
\]
Orthogonality principle. Let V be an inner product space and W be its subspace. Let v ∈ V. Suppose that there exists w^* ∈ W such that v − w^* is orthogonal to every w ∈ W. Then,
\[
w^* = \arg\min_{w \in W} \|v - w\|.
\]
As depicted in Figure ., the orthogonal projection w^* of v onto W (if it exists) is the closest vector to v in W. The proof is immediate from the Pythagorean theorem. For any w ∈ W, (v − w^*) ⊥ (w^* − w) by the orthogonality condition. Hence, ‖v − w^*‖² + ‖w^* − w‖² = ‖v − w‖², and thus ‖v − w^*‖ ≤ ‖v − w‖.
Figure .. Among all vectors w in W, the orthogonal projection w^* of v is the closest.
We now consider the inner product space V that consists of all random variables (with
finite second moment) on the same probability space, where
∙ the vector addition (sum) V + W of random variables V and W is a random variable,
∙ the scalar (constant) multiplication aV is a random variable, and
∙ the inner product V ⋅ W of V and W is their correlation E[VW] (which satisfies the three inner product axioms).
Fortuitously, two random variables V and W are orthogonal, i.e., E[VW] = 0, as defined in Section . iff V and W are orthogonal as two vectors, i.e., V ⋅ W = 0. Note that the norm of V is ‖V‖ = √(E[V²]).
The goal of MMSE estimation can now be rephrased as follows: given the vector space V of all random variables (or all random variables that are functions of X and Y) and a subspace W of estimators, find the X̂ ∈ W that is closest to X, that is, for which the mean square error ‖X̂ − X‖² is smallest.
Example 5.23 (MMSE estimator). Let W be the space of all functions g(Y) with finite second moment. It can be easily verified that it is an inner product space. We already know that the MMSE estimate X̂ = g^*(Y) = E[X|Y] we found in Section . has the property
that the error X̂ − X is orthogonal to every g(Y). Hence, it minimizes the MSE among all
functions of Y .
Example 5.24 (Mean). Let W be the set of all constants a ∈ ℝ. Once again it is a valid
subspace. Since X − E[X] is orthogonal to W, i.e., E[(X − E[X])a] = 0 for every a, X̂ =
a∗ = E[X] minimizes the MSE among all constants.
Example 5.25 (LMMSE estimator). Let W be the subspace that consists of all functions of
the form aY + b. Since X − (a∗ Y + b∗ ), where a∗ and b∗ are given in (.), is orthogonal
to any aY + b, X̂ = a∗ Y + b∗ minimizes the MSE among all affine functions of Y .
We shall later apply this orthogonality principle to find MMSE estimators in more
general subspaces such as linear combinations of multiple random variables and linear
filters of random processes.
We say that two random variables are jointly Gaussian if their joint pdf is of the form
\[
f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho_{X,Y}^2}}
\exp\!\left(-\frac{1}{2(1 - \rho_{X,Y}^2)}\left[\frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} - \frac{2\rho_{X,Y}(x - \mu_X)(y - \mu_Y)}{\sigma_X\sigma_Y}\right]\right).
\]
Note that this pdf is a function only of μ_X, μ_Y, σ_X², σ_Y², and ρ_{X,Y}. Consistent with our
notation, these parameters are indeed E[X], E[Y ], Var(X), Var(Y ), and the correlation
coefficient of X and Y . In Lecture #, we shall define jointly Gaussian random variables
in a more general way.
Example .. Consider the additive Gaussian noise channel in Example ., where X ∼
N(μ, P) and Z ∼ N(0, N) are independent and Y = X + Z. Then the pair X and Z, the
pair X and Y , and the pair Y and Z are jointly Gaussian.
If X and Y are jointly Gaussian, contours of equal joint pdf are ellipses defined by the
quadratic equation
\[
\frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} - 2\rho_{X,Y}\,\frac{(x - \mu_X)(y - \mu_Y)}{\sigma_X\sigma_Y} = c \ge 0.
\]
The orientation of the major axis of these ellipses is
\[
\theta = \frac{1}{2}\arctan\!\left(\frac{2\rho_{X,Y}\,\sigma_X\sigma_Y}{\sigma_X^2 - \sigma_Y^2}\right).
\]
Figure . shows a few examples of the joint pdf.
[Figure .: the joint pdf f(x, y) for (σ_X, σ_Y, ρ_{X,Y}) = (1, 1, 0); (1, 1, 0.4), for which θ = 45°; and (1, 3, 0.4), for which θ = 81.65°.]
Jointly Gaussian random variables X and Y satisfy the following properties.
∙ The marginals are Gaussian: X ∼ N(μ_X, σ_X²) and Y ∼ N(μ_Y, σ_Y²).
∙ The conditional distribution of X given Y = y is Gaussian:
\[
X \mid \{Y = y\} \sim \mathrm{N}\!\left(\frac{\rho_{X,Y}\,\sigma_X}{\sigma_Y}(y - \mu_Y) + \mu_X,\ (1 - \rho_{X,Y}^2)\,\sigma_X^2\right).
\]
∙ If X and Y are uncorrelated, then they are independent.
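A simulation sketch of the conditional mean and variance formulas; the parameter values (μ_X, μ_Y, σ_X, σ_Y, ρ) = (1, −2, 2, 3, 0.6), the conditioning value y₀ = 0, and the window width 0.05 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y, sx, sy, rho = 1.0, -2.0, 2.0, 3.0, 0.6
n = 10**6

# Generate jointly Gaussian (X, Y) with the given means, variances, and correlation.
u = rng.standard_normal(n)
w = rng.standard_normal(n)
x = mu_x + sx * u
y = mu_y + sy * (rho * u + np.sqrt(1 - rho**2) * w)

# Condition on Y falling in a small window around y0 and compare with the formula.
y0 = 0.0
sel = np.abs(y - y0) < 0.05
print(x[sel].mean(), rho * sx / sy * (y0 - mu_y) + mu_x)   # both ≈ 1.8
print(x[sel].var(), (1 - rho**2) * sx**2)                  # both ≈ 2.56
```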
The converse to the first property is not necessarily true, that is, Gaussian marginals
do not necessarily mean that the random variables are jointly Gaussian.
Example .. Let X ∼ N(0, 1) and
\[
Z = \begin{cases} +1 & \text{w.p. } 1/2, \\ -1 & \text{w.p. } 1/2 \end{cases}
\]
be independent and let Y = XZ. Clearly, Y ∼ N(0, 1). However, X and Y do not have
a joint pdf. Using delta functions, “ f X ,Y (x, y)” has the form shown in Figure .. Note
that X and Y are uncorrelated, but not independent. This does not contradict the third
property since X and Y are not jointly Gaussian.
[Figure .: “f_{X,Y}(x, y)” consists of delta functions concentrated on the lines y = x and y = −x.]
PROBLEMS
.. Inequalities. Label each of the following statements with =, ≤, or ≥. Justify each
answer.
(a) 1/E[X²] vs. E(1/X²).
You may keep the envelope you are given, or you can switch envelopes and receive
the amount in the other envelope.
(a) Find E[X] and E[Y ].
(b) Find E[X/Y ] and E[Y /X].
(c) Suppose you switch. What is the expected amount you receive?
.. Mean and variance. Let X and Y be random variables with joint pdf
Define the random variable Z = |X| + |Y | . Find the mean and variance of Z with-
out first finding the pdf of Z.
.. Tall trees. Suppose that the average height of trees on campus is feet. Argue that
no more than half of the tree population is taller than feet.
.. Let X and Y have correlation coefficient ρ X ,Y .
(a) What is the correlation coefficient between X and 3Y ?
(b) What is the correlation coefficient between 2X and −5Y ?
.. Random phase signal. Let Y (t) = sin(ωt + Θ) be a sinusoidal signal with random
phase Θ ∼ Unif[−π, π]. Assume here that ω and t are constants. Find the mean
and variance of Y (t). Do they depend on t?
.. Coin tosses. A coin with bias p is tossed independently until two heads or two tails
come up in a row. Find the expected value of the number of tosses X.
.. Iterated expectation. Let Λ and X be two random variables with
\[
\Lambda \sim f_\Lambda(\lambda) = \begin{cases} \frac{5}{3}\lambda^{2/3}, & 0 \le \lambda \le 1, \\ 0, & \text{otherwise,} \end{cases}
\]
and X|{Λ = λ} ∼ Exp(λ). Find E(X).
.. Sum of packet arrivals. Consider a network router with two types of incoming
packets, wireline and wireless. Let the random variable N1 (t) denote the number
of wireline packets arriving during time (0, t] and let the random variable N2 (t)
denote the number of wireless packets arriving during time (0, t]. Suppose N1 (t)
and N2 (t) are independent Poisson with pmfs
\[
P\{N_1(t) = n\} = \frac{(\lambda_1 t)^n}{n!}\, e^{-\lambda_1 t} \quad \text{for } n = 0, 1, 2, \ldots
\]
\[
P\{N_2(t) = k\} = \frac{(\lambda_2 t)^k}{k!}\, e^{-\lambda_2 t} \quad \text{for } k = 0, 1, 2, \ldots
\]
Let N(t) = N1 (t) + N2 (t) be the total number of packets arriving at the router dur-
ing time (0, t].
(a) Find the mean E(N(t)) and variance Var(N(t)) of the total number of packet
arrivals.
(b) Find the pmf of N(t).
(c) Let the random variable Y be the time to receive the first packet of either type.
Find the pdf of Y .
(d) What is the probability that the first received packet is wireless?
.. Conditioning on an event. Let X be a r.v. with pdf
\[
f_X(x) = \begin{cases} 2(1 - x) & \text{for } 0 \le x \le 1, \\ 0 & \text{otherwise,} \end{cases}
\]
and let the event A = {X ≥ 1/3}. Find f X|A (x), E(X|A), and Var(X|A).
.. Jointly Gaussian random variables. Let X and Y be jointly Gaussian random vari-
ables with pdf
π 3/4
Find E(X), E(Y ), Var(X), Var(Y ), and Cov(X, Y ).
.. Neural net. Let Y = X + Z, where the signal X ∼ U[−1, 1] and noise Z ∼ N (0, 1)
are independent.
(a) Find the function g(y) that minimizes
\[
\mathrm{MSE} = E\bigl[(\operatorname{sgn}(X) - g(Y))^2\bigr],
\]
where
\[
\operatorname{sgn}(x) = \begin{cases} -1 & x \le 0, \\ +1 & x > 0. \end{cases}
\]
.. Additive-noise channel with path gain. Consider the additive noise channel shown
in the figure below, where X and Z are zero mean and uncorrelated, and a and b
are constants.
[Figure: the channel output is Y = b(aX + Z).]
Find the MMSE linear estimate of X given Y and its MSE in terms only of σX , σZ ,
a, and b.
.. Worst noise distribution. Consider an additive noise channel Y = X + Z, where the
signal X ∼ N (0, P) and the noise Z has zero mean and variance N. Assume X and
Z are independent. Find a distribution of Z that maximizes the minimum MSE of
estimating X given Y , i.e., the distribution of the worst noise Z that has the given
mean and variance. You need to justify your answer.
.. Image processing. A pixel signal X ∼ U[−k, k] is digitized to obtain
\[
\tilde{X} = i + \tfrac{1}{2}, \quad \text{if } i < X \le i + 1,\ \ i = -k, -k + 1, \ldots, k - 2, k - 1.
\]
To improve the visual appearance, the digitized value X̃ is dithered by adding
an independent noise Z with mean E(Z) = 0 and variance Var(Z) = N to obtain
Y = X̃ + Z.
(a) Find the correlation of X and Y .
(b) Find the best linear MSE estimate of X given Y . Your answer should be in
terms only of k, N, and Y .
.. Orthogonality. Let X̂ be the minimum MSE estimate of X given Y.
(a) Show that for any function g(y), E[(X − X̂)g(Y)] = 0, i.e., the error (X − X̂) and g(Y) are orthogonal.
(b) Show that
\[
\operatorname{Var}(X) = E[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(\hat{X}).
\]
.. Nonlinear and linear estimation. Let X and Y be two random variables with joint
pdf
\[
f(x, y) = \begin{cases} x + y, & 0 \le x \le 1,\ 0 \le y \le 1, \\ 0, & \text{otherwise.} \end{cases}
\]
μ_X = 1, μ_Z = 0, σ_X² = 4, σ_Z² = 9, ρ_{X,Z} = −3/8.