tified. Estimates are modeled as random samples from distributions, and confidence in estimates is quantified in terms of the variances and covariances of these distributions.¹ Sections 3-5 develop the two key ideas behind Kalman filtering.

(1) How should estimates be fused optimally? Section 3 shows how to fuse scalar estimates such as temperatures optimally. It is also shown that the problem of fusing more than two estimates can be reduced to the problem of fusing two estimates at a time without any loss in the quality of the final estimate. Section 4 extends these results to estimates that are vectors, such as state vectors representing the estimated position and velocity of a robot.

(2) In some applications, estimates are vectors but only a part of the vector can be measured directly. For example, the state of a spacecraft may be represented by its position and velocity, but only its position may be observable. In such situations, how do we obtain a complete estimate from a partial estimate? Section 5 shows how the Best Linear Unbiased Estimator (BLUE) can be used for this. Intuitively, it is assumed that there is a linear relationship between the observable and hidden parts of the state vector, and this relationship is used to compute an estimate for the hidden part of the state, given an estimate for the observable part.

Section 6 uses these ideas to solve the state estimation problem for linear systems, which is the usual context for presenting Kalman filters. Section 7 briefly discusses extensions of Kalman filtering for nonlinear systems.

[Figure 1: Using pdfs to model devices with systematic and random errors. Ground truth is 60 °C. Dashed lines are means of pdfs. (The figure plots two pdfs p1(x) and p2(x) against temperature x, with marks at 58 °C, 60 °C, and 63 °C.)]

2 FORMALIZATION OF ESTIMATES
This section makes precise the notions of estimates and confidence in estimates.

2.1 Scalar estimates
To model the behavior of devices producing noisy measurements, we associate each device i with a random variable that has a probability density function (pdf) pi(x) such as the ones shown in Figure 1 (the x-axis in this figure represents temperature). Random variables need not be Gaussian.² Obtaining a measurement from device i corresponds to drawing a random sample from the distribution for that device. We write xi ∼ pi(µi, σi²) to denote that xi is a random variable with pdf pi whose mean and variance are µi and σi² respectively; following convention, we use xi to represent a random sample from this distribution as well.

Means and variances of distributions model different kinds of inaccuracies in measurements. Device i is said to have a systematic error or bias in its measurements if the mean µi of its distribution is not equal to the actual temperature xc (in general, to the value being estimated, which is known as ground truth); otherwise, the instrument is unbiased. Figure 1 shows pdfs for two devices that have different amounts of systematic error. The variance σi², on the other hand, is a measure of the random error in the measurements. The impact of random errors can be mitigated by taking many measurements with a given device and averaging their values, but this approach will not help to reduce systematic error.

In the usual formulation of Kalman filtering, it is assumed that measuring devices do not have systematic errors. However, we do not have the luxury of taking many measurements of a given state, so we must take into account the impact of random error on a single measurement. Therefore, confidence in a device is modeled formally by the variance of the distribution associated with that device; the smaller the variance, the higher our confidence in the measurements made by the device. In Figure 1, the fact that we have less confidence in the first device has been illustrated by making p1 more spread out than p2, giving it a larger variance.

The informal notion that noise should affect the two devices in "unrelated ways" is formalized by requiring that the corresponding random variables be uncorrelated. This is a weaker condition than requiring them to be independent, as explained in Appendix A. Suppose we are given the measurement made by one of the devices (say x1) and we have to guess what the other measurement (x2) might be. If knowing x1 does not give us any new information about what x2 might be, the random variables are independent. This is expressed formally by the equation p(x2|x1) = p(x2); intuitively, knowing the value of x1 does not change the pdf for the possible values of x2. If the random variables are only uncorrelated, knowing x1 might give us new information about x2, such as restricting its possible values, but the mean of x2|x1 will still be µ2. Using expectations, this can be written as E[x2|x1] = E[x2], which is equivalent to requiring that E[(x1−µ1)(x2−µ2)], the covariance between the two variables, be equal to zero. This is obviously a weaker condition than independence.

¹ Basic concepts including probability density function, mean, expectation, variance and covariance are introduced in Appendix A.
² The role of Gaussians in Kalman filtering is discussed in Section 6.5.
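To make the distinction between systematic and random error concrete, the following small simulation is a sketch of the two-thermometer scenario; the device means and spreads are read off Figure 1 and otherwise assumed, and Gaussians are used purely for convenience (the discussion above does not require them). Averaging many samples drives the random error toward zero, but the bias remains.

    # Systematic vs. random error (illustrative sketch; parameters assumed).
    import numpy as np

    rng = np.random.default_rng(0)
    ground_truth = 60.0  # degrees Celsius

    # (mean, std) per device: device 1 has a larger variance (less confidence).
    devices = [(58.0, 3.0), (63.0, 1.0)]

    for mean, std in devices:
        samples = rng.normal(mean, std, size=100_000)
        # Averaging shrinks random error; the residual is the systematic error.
        print(f"average = {samples.mean():.3f}, "
              f"residual error = {samples.mean() - ground_truth:+.3f}")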
Although the discussion in this section has focused on measurements, the same formalization can be used for estimates produced by an estimator. Lemma 2.1(i) shows how the mean and variance of a linear combination of pairwise uncorrelated random variables can be computed from the means and variances of the random variables [19]. The mean and variance can be used to quantify bias and random errors for the estimator, as in the case of measurements.

An unbiased estimator is one whose mean is equal to the unknown value being estimated; it is preferable to a biased estimator with the same variance. Only unbiased estimators are considered in this paper. Furthermore, an unbiased estimator with a smaller variance is preferable to one with a larger variance, since we would have more confidence in the estimates it produces. As a step towards generalizing this discussion to estimators that produce vector estimates, we refer to the variance of an unbiased scalar estimator as the Mean Square Error of that estimator, or MSE for short. Lemma 2.1(ii) asserts that if a random variable is pairwise uncorrelated with a set of random variables, it is uncorrelated with any linear combination of those variables.

Lemma 2.1. Let x1 ∼ p1(µ1, σ1²), ..., xn ∼ pn(µn, σn²) be a set of pairwise uncorrelated random variables. Let y = ∑_{i=1}^n αi xi be a random variable that is a linear combination of the xi's.

(i) The mean and variance of y are:

    µy = ∑_{i=1}^n αi µi    (1)

    σy² = ∑_{i=1}^n αi² σi²    (2)

(ii) If a random variable xn+1 is pairwise uncorrelated with x1, .., xn, it is uncorrelated with y.

2.2 Vector estimates
The covariance matrix Σxx of a random variable x is the matrix E[(x − µx)(x − µx)ᵀ], where µx is the mean of x. Intuitively, entry (i,j) of this matrix is the covariance between the i and j components of vector x; in particular, entry (i,i) is the variance of the i-th component of x. A random variable x with a pdf p whose mean is µx and covariance matrix is Σxx is written as x ∼ p(µx, Σxx). The inverse of the covariance matrix (Σxx⁻¹) is called the precision or information matrix.

Uncorrelated random variables: The cross-covariance matrix Σvw of two random variables v and w is the matrix E[(v − µv)(w − µw)ᵀ]. Intuitively, element (i,j) of this matrix is the covariance between elements v(i) and w(j). If the random variables are uncorrelated, all entries in this matrix are zero, which is equivalent to saying that every component of v is uncorrelated with every component of w. Lemma 2.2 generalizes Lemma 2.1.

Lemma 2.2. Let x1 ∼ p1(µ1, Σ1), ..., xn ∼ pn(µn, Σn) be a set of pairwise uncorrelated random variables of length m. Let y = ∑_{i=1}^n Ai xi.

(i) The mean and covariance matrix of y are the following:

    µy = ∑_{i=1}^n Ai µi    (3)

    Σyy = ∑_{i=1}^n Ai Σi Aiᵀ    (4)

(ii) If a random variable xn+1 is pairwise uncorrelated with x1, .., xn, it is uncorrelated with y.

The MSE of an unbiased estimator y is E[(y − µy)ᵀ(y − µy)], which is the sum of the variances of the components of y; if y has length 1, this reduces to variance, as expected. The MSE is also the sum of the diagonal elements of Σyy (this is called the trace of Σyy).
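A quick Monte Carlo check of Lemma 2.1 can build intuition; the following sketch uses arbitrary coefficients and distributions of our choosing (independent samples, which are in particular uncorrelated):

    # Numerical check of Equations (1) and (2) for a linear combination.
    import numpy as np

    rng = np.random.default_rng(1)
    mu = np.array([1.0, -2.0, 0.5])
    sigma = np.array([0.5, 1.5, 1.0])
    alpha = np.array([0.2, 0.3, 0.5])

    x = rng.normal(mu, sigma, size=(1_000_000, 3))  # uncorrelated samples
    y = x @ alpha

    print("mean:", y.mean(), "predicted:", alpha @ mu)             # Eq. (1)
    print("var: ", y.var(),  "predicted:", (alpha**2) @ sigma**2)  # Eq. (2)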
3 FUSING SCALAR ESTIMATES
Section 3.1 discusses the problem of fusing two scalar estimates. Section 3.2 generalizes this to the problem of fusing n>2 scalar estimates. Section 3.3 shows that fusing n>2 estimates can be done iteratively by fusing two estimates at a time without any loss of quality in the final estimate.
Substituting the optimal value of α into Equation 6, we get

    σy² = 1 / (1/σ1² + 1/σ2²)    (10)

The expressions for y and σy² are complicated because they contain the reciprocals of variances. If we let ν1 and ν2 denote the precisions of the two distributions, the expressions become simpler.

The minimal variance is given by the following expression:

    σyn² = 1 / ∑_{j=1}^n (1/σj²)    (13)

As in Section 3.1, these expressions are more intuitive if the variance is replaced with precision: the contribution of xi to the value of yn(x1, .., xn) is proportional to xi's confidence.

    yn(x1, .., xn) = ∑_{i=1}^n (νi / (ν1 + ... + νn)) xi    (14)

    νyn = ∑_{i=1}^n νi    (15)

Equations 14 and 15 generalize Equations 11 and 12.

[Figure 2: Dataflow graph for incremental fusion. The estimates x1 (precision ν1), ..., xn (precision νn) are fused two at a time: y2(x1, x2) has precision ν1+ν2, y2(y2(x1, x2), x3) has precision ν1+ν2+ν3, and so on, up to precision ν1+···+νn for the final estimate; the weight of each xi in the running estimate is νi divided by the accumulated precision.]
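The following sketch (function names are ours) implements Equations 14-15 and the incremental dataflow of Figure 2, and shows numerically that batch and incremental fusion coincide, which is the claim of Section 3.3:

    # Precision-weighted fusion of scalar estimates (Eqs. 14-15).
    import numpy as np

    def fuse_all(xs, nus):
        """Fuse n scalar estimates with precisions nus in one step."""
        xs, nus = np.asarray(xs), np.asarray(nus)
        return (nus * xs).sum() / nus.sum(), nus.sum()

    def fuse2(x1, nu1, x2, nu2):
        """Fuse two estimates; the n = 2 special case of Eqs. 14-15."""
        return (nu1 * x1 + nu2 * x2) / (nu1 + nu2), nu1 + nu2

    xs  = [59.2, 60.5, 61.0]        # estimates (e.g., temperatures)
    nus = [1/4.0, 1/1.0, 1/2.25]    # precisions = 1 / variance

    y_batch, _ = fuse_all(xs, nus)

    y_inc, nu_inc = xs[0], nus[0]
    for x, nu in zip(xs[1:], nus[1:]):   # dataflow of Figure 2
        y_inc, nu_inc = fuse2(y_inc, nu_inc, x, nu)

    print(y_batch, y_inc)  # identical up to floating-point rounding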
Theorem 4.1. Let xi ∼ pi(µi, Σi) for 1≤i≤n be a set of pairwise uncorrelated random variables. Consider the linear estimator yA(x1, .., xn) = ∑_{i=1}^n Ai xi, where ∑_{i=1}^n Ai = I. The value of MSE(yA) is minimized for

    Ai = (∑_{j=1}^n Σj⁻¹)⁻¹ Σi⁻¹    (22)

Therefore the optimal estimator is

    y(x1, ..., xn) = (∑_{j=1}^n Σj⁻¹)⁻¹ ∑_{i=1}^n Σi⁻¹ xi    (23)

5 BEST LINEAR UNBIASED ESTIMATOR (BLUE)
In some applications, the state of the system is represented by a vector, but only part of the state can be measured directly, so it is necessary to estimate the hidden portion of the state corresponding to a measured value of the visible state. This section describes an estimator called the Best Linear Unbiased Estimator (BLUE) [17, 20, 27] for doing this. The BLUE estimate ŷ of y, given an observed value x, satisfies

    ŷ − µy = Σyx Σxx⁻¹ (x − µx)
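The two estimators just introduced are both a few lines of linear algebra. The sketch below (function names are ours; explicit inverses are used for clarity, whereas a practical implementation would use linear solves) implements the optimal vector fusion of Equation 23 and the BLUE update:

    # Optimal fusion of vector estimates (Eq. 23) and BLUE.
    import numpy as np

    def fuse_vectors(estimates, covariances):
        """Optimal linear fusion of pairwise uncorrelated vector estimates."""
        precisions = [np.linalg.inv(S) for S in covariances]
        total = np.linalg.inv(sum(precisions))   # (sum_j Sigma_j^-1)^-1
        return total @ sum(P @ x for P, x in zip(precisions, estimates))

    def blue(mu_x, mu_y, Sigma_xx, Sigma_yx, x_obs):
        """BLUE of hidden y given observed x:
        y_hat = mu_y + Sigma_yx Sigma_xx^-1 (x_obs - mu_x)."""
        return mu_y + Sigma_yx @ np.linalg.inv(Sigma_xx) @ (x_obs - mu_x)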
[Figure 4: (b) Dynamical system with uncertainty: the states x1, x2, ..., xt are observed through measurements z1, z2, ..., zt with noise covariances R1, R2, ..., Rt. (d) Implementation of the dataflow diagram (b) for systems with partial observability.]
• The random variable x0 captures the likelihood of different initial states.
• The random variables at successive time steps are related by the following linear model:

    xt = Ft xt−1 + Bt ut + wt    (34)

  Here, ut is the control input, which is assumed to be deterministic, and wt is a zero-mean noise term that models all the uncertainty in the system. The covariance matrix of wt is denoted by Qt, and the noise terms in different time steps are assumed to be uncorrelated with each other (that is, E[wi wjᵀ] = 0 if i ≠ j) and with x0.

For estimation, we have a random variable x0|0 that captures our belief about the likelihood of different states at time t=0, and two random variables xt|t−1 and xt|t at each time step t = 1, 2, ... that capture our beliefs about the likelihood of different states at time t before and after fusion with the measurement, respectively. The mean and covariance matrix of a random variable xi|j are denoted by x̂i|j and Σi|j respectively. We assume E[x̂0|0] = E[x0] (no bias).

Prediction essentially uses xt−1|t−1 as a proxy for xt−1 in Equation 34 to determine xt|t−1, as shown in Equation 35:

    xt|t−1 = Ft xt−1|t−1 + Bt ut + wt    (35)

For state estimation, we need only the mean and covariance matrix of xt|t−1. The Predictor box in Figure 4 computes these values; the covariance matrix is obtained from Lemma 2.2 under the assumption that wt is uncorrelated with xt−1|t−1, which is justified in Section 6.2.
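In code, the Predictor box is two lines; the following sketch (names are ours) pushes the mean through the model of Equation 35 and obtains the covariance from Lemma 2.2, using the fact that wt is zero-mean and uncorrelated with xt−1|t−1:

    # Predictor box (Eqs. 34-35): Sigma_{t|t-1} = F Sigma_{t-1|t-1} F^T + Q.
    import numpy as np

    def predict(x_hat, Sigma, F, B, u, Q):
        x_pred = F @ x_hat + B @ u        # mean of x_{t|t-1}
        Sigma_pred = F @ Sigma @ F.T + Q  # covariance of x_{t|t-1}
        return x_pred, Sigma_pred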
[Figure 5: Pictorial representation of Kalman filtering. Confidence ellipses illustrate the a priori estimates (x̂t|t−1, x̂t+1|t), the measurements (zt, zt+1), the fusion steps (⊕), and the a posteriori estimates (x̂t|t, x̂t+1|t+1), alongside the evolving states xt, xt+1.]

6.2 Fusing complete observations of the state
If the entire state can be measured at each time step, the imprecise measurement at time t is modeled as follows:

    zt = xt + vt    (36)

where vt is a zero-mean noise term with covariance matrix Rt. The noise terms in different time steps are assumed to be uncorrelated with each other (that is, E[vi vjᵀ] is zero if i ≠ j) as well as with x0|0 and all wk. A subtle point here is that xt in this equation is the actual state of the system at time t (that is, a particular realization of the random variable xt), so variability in zt comes only from vt and its covariance matrix Rt.

Therefore, we have two imprecise estimates for the state at each time step t = 1, 2, ...: the a priori estimate from the predictor (x̂t|t−1) and the one from the measurement (zt). If zt is uncorrelated with xt|t−1, we can use Equations 30-32 to fuse the estimates, as shown in Figure 4c.

The assumptions that (i) xt−1|t−1 is uncorrelated with wt, which is used in prediction, and (ii) xt|t−1 is uncorrelated with zt, which is used in fusion, are easily proved correct by induction on t, using Lemma 2.2(ii). Figure 4b gives the intuition: xt|t−1, for example, is an affine function of the random variables x0|0, w1, v1, w2, v2, ..., wt, and is therefore uncorrelated with vt (by the assumption about vt and Lemma 2.2(ii)) and hence with zt.

Figure 5 shows the computation pictorially, using confidence ellipses to illustrate uncertainty. The dotted arrows at the bottom of the figure show the evolution of the state, and the solid arrows show the computation of the a priori estimates and their fusion with measurements.

6.3 Fusing partial observations of the state
In some problems, only a portion of the state can be measured directly. The observable portion of the state is specified formally using a full row-rank matrix Ht called the observation matrix: if the state is x, what is observable is Ht x. For example, if the state vector has two components and only the first component is observable, Ht can be [1 0]. In general, the Ht matrix can specify a linear relationship between the state and the observation, and it can be time-dependent. The imprecise measurement model introduced in Equation 36 becomes:

    zt = Ht xt + vt    (37)

The hidden portion of the state can be specified using a matrix Ct, which is an orthogonal complement of Ht. For example, if Ht = [1 0], one choice for Ct is [0 1].

Figure 4d shows the computation for this case. The fusion phase can be understood intuitively in terms of the following steps.

(i) The observable part of the a priori estimate of the state, Ht x̂t|t−1, is fused with the measurement zt, using the techniques developed in Sections 3-4. The quantity zt − Ht x̂t|t−1 is called the innovation. The result is the a posteriori estimate of the observable state, Ht x̂t|t.
(ii) The BLUE of Section 5 is used to obtain the a posteriori estimate of the hidden state Ct x̂t|t by adding to the a priori estimate of the hidden state Ct x̂t|t−1 a value obtained from the product of the covariance between Ht xt|t−1 and Ct xt|t−1 and the difference between Ht x̂t|t−1 and Ht x̂t|t.
(iii) The a posteriori estimates of the observable and hidden portions of the state are composed to produce the a posteriori estimate of the entire state, x̂t|t.

The actual implementation produces the final result directly, without going through these steps, as shown in Figure 4d, but the incremental steps are useful for understanding how all this works, and they are implemented as follows.

(i) The a priori estimate of the observable part of the state is Ht x̂t|t−1, and its covariance is Ht Σt|t−1 Htᵀ. The a posteriori estimate is obtained directly from Equation 31:

    Ht x̂t|t = Ht x̂t|t−1 + Ht Σt|t−1 Htᵀ (Ht Σt|t−1 Htᵀ + Rt)⁻¹ (zt − Ht x̂t|t−1)

Let Kt = Σt|t−1 Htᵀ (Ht Σt|t−1 Htᵀ + Rt)⁻¹. The a posteriori estimate of the observable state can be written in terms of Kt as follows:

    Ht x̂t|t = Ht x̂t|t−1 + Ht Kt (zt − Ht x̂t|t−1)    (38)

(ii) The a priori estimate of the hidden state is Ct x̂t|t−1. The covariance between the hidden portion Ct xt|t−1 and the observable portion Ht xt|t−1 is Ct Σt|t−1 Htᵀ. The difference between the a priori and a posteriori estimates of Ht x is Ht Kt (zt − Ht x̂t|t−1). Therefore the a posteriori estimate of the hidden portion of the state is
obtained directly from Equation 33:

    Ct x̂t|t = Ct x̂t|t−1 + (Ct Σt|t−1 Htᵀ)(Ht Σt|t−1 Htᵀ)⁻¹ Ht Kt (zt − Ht x̂t|t−1)
            = Ct x̂t|t−1 + Ct Kt (zt − Ht x̂t|t−1)    (39)

(iii) Putting the a posteriori estimates (38) and (39) together,

    [Ht]        [Ht]          [Ht]
    [Ct] x̂t|t = [Ct] x̂t|t−1 + [Ct] Kt (zt − Ht x̂t|t−1)

Since [Ht; Ct] is invertible, it can be canceled from the left- and right-hand sides, giving the equation

    x̂t|t = x̂t|t−1 + Kt (zt − Ht x̂t|t−1)    (40)

To compute Σt|t, Equation 40 can be rewritten as x̂t|t = (I − Kt Ht) x̂t|t−1 + Kt zt. Since xt|t−1 and zt are uncorrelated, it follows from Lemma 2.2 that

    Σt|t = (I − Kt Ht) Σt|t−1 (I − Kt Ht)ᵀ + Kt Rt Ktᵀ

Substituting the value of Kt and simplifying, we get

    Σt|t = (I − Kt Ht) Σt|t−1    (41)
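Equations 35, 38, 40, and 41 together define one predict-fuse step. The following sketch (function and variable names are ours; an explicit inverse is used for clarity) shows that step in a few lines:

    # One step of the dataflow of Figure 4d (Eqs. 35, 38-41).
    import numpy as np

    def kalman_step(x_hat, Sigma, z, F, B, u, Q, H, R):
        # Predict (Eq. 35): a priori mean and covariance.
        x_pred = F @ x_hat + B @ u
        S_pred = F @ Sigma @ F.T + Q
        # Kalman gain: K_t = Sigma_{t|t-1} H^T (H Sigma_{t|t-1} H^T + R)^-1.
        K = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + R)
        # Update (Eqs. 40-41): fuse with the innovation z - H x_pred.
        x_new = x_pred + K @ (z - H @ x_pred)
        S_new = (np.eye(len(x_hat)) - K @ H) @ S_pred
        return x_new, S_new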
Figure 4d puts all this together. In the literature, this dataflow is referred to as Kalman filtering. Unlike in Sections 3 and 4, the Kalman gain is not a dimensionless value here. If Ht = I, the computations in Figure 4d reduce to those of Figure 4c, as expected.

Equation 40 shows that the a posteriori state estimate is a linear combination of the a priori state estimate (x̂t|t−1) and the measurement (zt). The optimality of this linear unbiased estimator is shown in Appendix D. In Section 3.3, it was shown that incremental fusion of scalar estimates is optimal. The dataflow of Figures 4(c,d) computes the a posteriori state estimate at time t by incrementally fusing measurements from the previous time steps, and this incremental fusion can be shown to be optimal using a similar argument.

6.4 Example: falling body
To demonstrate the effectiveness of the Kalman filter, we consider an example in which an object falls from the origin at time t=0 with an initial speed of 0 m/s and an expected constant acceleration of 9.8 m/s² due to gravity. Note that the acceleration in reality may not be constant, due to factors such as wind, air friction, and so on. The state vector of the object contains two components, one for the distance from the origin s(t) and one for the velocity v(t). We assume that only the velocity can be measured at each time step. If time is discretized in steps of 0.25 seconds, the difference equation for the dynamics of the system is easily shown to be the following:

    [vn]   [ 1    0] [vn−1]   [0.25       ]
    [sn] = [0.25  1] [sn−1] + [0.5 × 0.25²] × 9.8    (42)

where we assume [v0; s0] = [0; 0] and Σ0 = [[80, 0], [0, 10]].

The gray lines in Figure 6 show the evolution of velocity and distance with time according to this model. Because of uncertainty in modeling the system dynamics, the actual evolution of the velocity and position will be different in practice. The red lines in Figure 6 show one trajectory for this evolution, corresponding to a Gaussian noise term with covariance [[2, 2.5], [2.5, 4]] in Equation 34 (because this noise term is random, there are many trajectories for the evolution, and we are just showing one of them). The red lines correspond to "ground truth" in our example.

The green points in Figure 6b show the noisy measurements of velocity at different time steps, assuming the noise is modeled by a Gaussian with variance 8. The blue lines show the a posteriori estimates of the velocity and position. It can be seen that the a posteriori estimates track the ground truth quite well even when the ideal system model (the gray lines) is inaccurate and the measurements are noisy. The cyan bars in the right figure show the variance of the velocity at different time steps. Although the initial variance is quite large, Kalman filtering is able to reduce it rapidly in a few time steps.

[Figure 6: (a) Evolution of state: distance (m) vs. time (s). (b) Evolution of state: velocity (m/s) and its variance vs. time (s). Both panels show the model-only trajectory, the ground truth, and the estimated values over 0-20 s; panel (b) also shows the measured values and the variance bars.]
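The experiment is easy to reproduce. The sketch below uses the parameters given in the text (dt = 0.25 s, g = 9.8 m/s², process noise covariance [[2, 2.5], [2.5, 4]], measurement variance 8, x0 = [0, 0], Σ0 = [[80, 0], [0, 10]]); the state is [velocity, distance], plotting is omitted, and the random trajectory will differ from the one shown in Figure 6:

    # Falling-body example (parameters from the text; trajectory is random).
    import numpy as np

    rng = np.random.default_rng(2)
    dt, g = 0.25, 9.8
    F = np.array([[1.0, 0.0], [dt, 1.0]])
    Bu = np.array([dt, 0.5 * dt**2]) * g       # B_t u_t of Eq. 42
    Q = np.array([[2.0, 2.5], [2.5, 4.0]])     # process noise covariance
    H = np.array([[1.0, 0.0]])                 # only velocity is measured
    R = np.array([[8.0]])                      # measurement noise covariance

    x_true = np.zeros(2)
    x_hat, Sigma = np.zeros(2), np.diag([80.0, 10.0])

    for _ in range(80):                        # 20 s of simulated time
        # Ground truth evolves with process noise ("red lines").
        x_true = F @ x_true + Bu + rng.multivariate_normal(np.zeros(2), Q)
        z = H @ x_true + rng.normal(0.0, np.sqrt(8.0), size=1)
        # Predict and update (Eqs. 35, 38-41).
        x_pred = F @ x_hat + Bu
        S_pred = F @ Sigma @ F.T + Q
        K = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + R)
        x_hat = x_pred + K @ (z - H @ x_pred)
        Sigma = (np.eye(2) - K @ H) @ S_pred

    print("true (v, s):", x_true, " estimated:", x_hat, " var(v):", Sigma[0, 0])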
6.5 Discussion
This section shows that Kalman filtering for state estimation in linear systems can be derived from two elementary ideas: optimal linear estimators for fusing uncorrelated estimates, and best linear unbiased estimators for correlated variables. This is a different approach to the subject than the standard presentations in the literature. One standard approach is to use Bayesian inference. The other approach is to assume that the a posteriori state estimator is a linear combination of the form At x̂t|t−1 + Bt zt, and then find the values of At and Bt that produce an unbiased estimator with minimum MSE. We believe that the advantage of the presentation given here is that it exposes the concepts and assumptions that underlie Kalman filtering.

Most presentations in the literature also begin by assuming that the noise terms wt in the state evolution equation and vt in the measurement are Gaussian. Some presentations [1, 10] use properties of Gaussians to derive the results in Sections 3 and 4, although, as we have seen, these results do not depend on the distributions being Gaussian. Gaussians, however, enter the picture in a deeper way if one considers nonlinear estimators. It can be shown that if the noise terms are not Gaussian, there may be nonlinear estimators whose MSE is lower than that of the linear estimator presented in Figure 4d. However, if the noise is Gaussian, this linear estimator is as good as any unbiased nonlinear estimator (that is, the linear estimator is a minimum variance unbiased estimator (MVUE)). This result is proved using the Cramer-Rao lower bound [25].

... h. This requires computing the following Jacobians⁴, which play the role of Ft and Ht in Figure 4d:

    Ft = ∂f/∂x evaluated at (x̂t−1|t−1, ut)        Ht = ∂h/∂x evaluated at x̂t|t−1
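Where analytic Jacobians of the model functions f(x, u) and h(x) are inconvenient, they can be approximated numerically. The following sketch (ours; the surrounding text does not prescribe this approach) computes a finite-difference Jacobian that an extended Kalman filter could use exactly where Figure 4d uses Ft and Ht:

    # Numerical Jacobian by central differences (illustrative sketch).
    import numpy as np

    def jacobian(fn, x, eps=1e-6):
        """Jacobian of fn at x, one column per input component."""
        x = np.asarray(x, dtype=float)
        cols = []
        for i in range(x.size):
            d = np.zeros_like(x)
            d[i] = eps
            cols.append((fn(x + d) - fn(x - d)) / (2 * eps))
        return np.stack(cols, axis=1)

    # F_t = jacobian(lambda x: f(x, u_t), x_hat_prev)   # at x_hat_{t-1|t-1}
    # H_t = jacobian(h, x_hat_pred)                     # at x_hat_{t|t-1}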