
An Elementary Introduction to Kalman Filtering

Yan Pei (University of Texas at Austin, ypei@cs.utexas.edu)
Swarnendu Biswas (Indian Institute of Technology Kanpur, swarnendu@cse.iitk.ac.in)
Donald S. Fussell (University of Texas at Austin, fussell@cs.utexas.edu)
Keshav Pingali (University of Texas at Austin, pingali@cs.utexas.edu)

arXiv:1710.04055v5 [eess.SY] 27 Jun 2019

ABSTRACT

Kalman filtering is a classic state estimation technique used in application areas such as signal processing and autonomous control of vehicles. It is now being used to solve problems in computer systems such as controlling the voltage and frequency of processors.

Although there are many presentations of Kalman filtering in the literature, they usually deal with particular systems like autonomous robots or linear systems with Gaussian noise, which makes it difficult to understand the general principles behind Kalman filtering. In this paper, we first present the abstract ideas behind Kalman filtering at a level accessible to anyone with a basic knowledge of probability theory and calculus, and then show how these concepts can be applied to the particular problem of state estimation in linear systems. This separation of concepts from applications should make it easier to understand Kalman filtering and to apply it to other problems in computer systems.

KEYWORDS

Kalman filtering, data fusion, uncertainty, noise, state estimation, covariance, BLUE, linear systems

1 INTRODUCTION

Kalman filtering is a state estimation technique invented in 1960 by Rudolf E. Kálmán [16]. Because of its ability to extract useful information from noisy data and its small computational and memory requirements, it is used in many application areas including spacecraft navigation, motion planning in robotics, signal processing, and wireless sensor networks [12, 21, 28-30]. Recent work has used Kalman filtering in controllers for computer systems [5, 13, 14, 24].

Although many introductions to Kalman filtering are available in the literature [1-4, 6-11, 18, 22, 26, 30], they are usually focused on particular applications like robot motion or state estimation in linear systems. This can make it difficult to see how to apply Kalman filtering to other problems. Other presentations derive Kalman filtering as an application of Bayesian inference assuming that noise is Gaussian. This leads to the common misconception that Kalman filtering can be applied only if noise is Gaussian [15]. The goal of this paper is to present the abstract concepts behind Kalman filtering in a way that is accessible to most computer scientists while clarifying the key assumptions, and then show how the problem of state estimation in linear systems can be solved as an application of these general concepts.

Abstractly, Kalman filtering can be seen as a particular approach to combining approximations of an unknown value to produce a better approximation. Suppose we use two devices of different designs to measure the temperature of a CPU core. Because devices are usually noisy, the measurements are likely to differ from the actual temperature of the core. Since the devices are of different designs, let us assume that noise affects the two devices in unrelated ways (this is formalized using the notion of correlation in Section 2). Therefore, the measurements x_1 and x_2 are likely to be different from each other and from the actual core temperature x_c. A natural question is the following: is there a way to combine the information in the noisy measurements x_1 and x_2 to obtain a good approximation of the actual temperature x_c?

One ad hoc solution is to use the formula 0.5*x_1 + 0.5*x_2 to take the average of the two measurements, giving them equal weight. Formulas of this sort are called linear estimators because they use a weighted sum to fuse values; for our temperature problem, their general form is β*x_1 + α*x_2. In this paper, we use the term estimate to refer to both a noisy measurement and to a value computed by an estimator, since both are approximations of unknown values of interest.

Suppose we have additional information about the two devices; say the second one uses more advanced temperature sensing. Since we would have more confidence in the second measurement, it seems reasonable that we should discard the first one, which is equivalent to using the linear estimator 0.0*x_1 + 1.0*x_2. Kalman filtering tells us that in general, this intuitively reasonable linear estimator is not "optimal"; paradoxically, there is useful information even in the measurement from the lower-quality device, and the optimal estimator is one in which the weight given to each measurement is proportional to the confidence we have in the device producing that measurement. Only if we have no confidence whatever in the first device should we discard its measurement.

Section 2 describes how these intuitive ideas can be quantified. Estimates are modeled as random samples from distributions, and confidence in estimates is quantified in terms of the variances and covariances of these distributions.¹ Sections 3-5 develop the two key ideas behind Kalman filtering.

(1) How should estimates be fused optimally? Section 3 shows how to fuse scalar estimates such as temperatures optimally. It is also shown that the problem of fusing more than two estimates can be reduced to the problem of fusing two estimates at a time without any loss in the quality of the final estimate. Section 4 extends these results to estimates that are vectors, such as state vectors representing the estimated position and velocity of a robot.

(2) In some applications, estimates are vectors but only a part of the vector can be measured directly. For example, the state of a spacecraft may be represented by its position and velocity, but only its position may be observable. In such situations, how do we obtain a complete estimate from a partial estimate? Section 5 shows how the Best Linear Unbiased Estimator (BLUE) can be used for this. Intuitively, it is assumed that there is a linear relationship between the observable and hidden parts of the state vector, and this relationship is used to compute an estimate for the hidden part of the state, given an estimate for the observable part.

Section 6 uses these ideas to solve the state estimation problem for linear systems, which is the usual context for presenting Kalman filters. Section 7 briefly discusses extensions of Kalman filtering for nonlinear systems.

¹Basic concepts including probability density function, mean, expectation, variance, and covariance are introduced in Appendix A.

2 FORMALIZATION OF ESTIMATES

This section makes precise the notions of estimates and confidence in estimates.

2.1 Scalar estimates

To model the behavior of devices producing noisy measurements, we associate each device i with a random variable that has a probability density function (pdf) p_i(x) such as the ones shown in Figure 1 (the x-axis in this figure represents temperature). Random variables need not be Gaussian.² Obtaining a measurement from device i corresponds to drawing a random sample from the distribution for that device. We write x_i ∼ p_i(µ_i, σ_i^2) to denote that x_i is a random variable with pdf p_i whose mean and variance are µ_i and σ_i^2 respectively; following convention, we use x_i to represent a random sample from this distribution as well.

[Figure 1: Using pdfs to model devices with systematic and random errors. Ground truth is 60 °C. Dashed lines are means of pdfs.]

²The role of Gaussians in Kalman filtering is discussed in Section 6.5.

Means and variances of distributions model different kinds of inaccuracies in measurements. Device i is said to have a systematic error or bias in its measurements if the mean µ_i of its distribution is not equal to the actual temperature x_c (in general, to the value being estimated, which is known as ground truth); otherwise, the instrument is unbiased. Figure 1 shows pdfs for two devices that have different amounts of systematic error. The variance σ_i^2, on the other hand, is a measure of the random error in the measurements. The impact of random errors can be mitigated by taking many measurements with a given device and averaging their values, but this approach will not help to reduce systematic error.

In the usual formulation of Kalman filtering, it is assumed that measuring devices do not have systematic errors. However, we do not have the luxury of taking many measurements of a given state, so we must take into account the impact of random error on a single measurement. Therefore, confidence in a device is modeled formally by the variance of the distribution associated with that device; the smaller the variance, the higher our confidence in the measurements made by the device. In Figure 1, the fact that we have less confidence in the first device has been illustrated by making p_1 more spread out than p_2, giving it a larger variance.

The informal notion that noise should affect the two devices in "unrelated ways" is formalized by requiring that the corresponding random variables be uncorrelated. This is a weaker condition than requiring them to be independent, as explained in Appendix A. Suppose we are given the measurement made by one of the devices (say x_1) and we have to guess what the other measurement (x_2) might be. If knowing x_1 does not give us any new information about what x_2 might be, the random variables are independent. This is expressed formally by the equation p(x_2|x_1) = p(x_2); intuitively, knowing the value of x_1 does not change the pdf for the possible values of x_2. If the random variables are only uncorrelated, knowing x_1 might give us new information about x_2, such as restricting its possible values, but the mean of x_2|x_1 will still be µ_2. Using expectations, this can be written as E[x_2|x_1] = E[x_2], which is equivalent to requiring that E[(x_1−µ_1)(x_2−µ_2)], the covariance between the two variables, be equal to zero. This is obviously a weaker condition than independence.

Although the discussion in this section has focused on measurements, the same formalization can be used for estimates produced by an estimator. Lemma 2.1(i) shows how the mean and variance of a linear combination of pairwise uncorrelated random variables can be computed from the means and variances of the random variables [19]. The mean and variance can be used to quantify bias and random errors for the estimator as in the case of measurements.

An unbiased estimator is one whose mean is equal to the unknown value being estimated, and it is preferable to a biased estimator with the same variance. Only unbiased estimators are considered in this paper. Furthermore, an unbiased estimator with a smaller variance is preferable to one with a larger variance, since we would have more confidence in the estimates it produces. As a step towards generalizing this discussion to estimators that produce vector estimates, we refer to the variance of an unbiased scalar estimator as the Mean Square Error of that estimator, or MSE for short. Lemma 2.1(ii) asserts that if a random variable is pairwise uncorrelated with a set of random variables, it is uncorrelated with any linear combination of those variables.

Lemma 2.1. Let x_1 ∼ p_1(µ_1, σ_1^2), ..., x_n ∼ p_n(µ_n, σ_n^2) be a set of pairwise uncorrelated random variables. Let y = Σ_{i=1}^{n} α_i x_i be a random variable that is a linear combination of the x_i's.

(i) The mean and variance of y are:

    µ_y = Σ_{i=1}^{n} α_i µ_i                                  (1)

    σ_y^2 = Σ_{i=1}^{n} α_i^2 σ_i^2                            (2)

(ii) If a random variable x_{n+1} is pairwise uncorrelated with x_1, .., x_n, it is uncorrelated with y.

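The following sketch is not from the paper; the weights, means, and variances are arbitrary illustrative values. It checks Lemma 2.1 numerically by drawing a large number of uncorrelated samples and comparing the empirical mean and variance of the linear combination with Equations 1 and 2.

```python
# Monte Carlo check of Lemma 2.1; the distributions are deliberately not all
# Gaussian, since the lemma only requires the variables to be uncorrelated.
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.2, 0.5, 0.3])   # weights alpha_i (arbitrary)
mus    = np.array([1.0, -2.0, 4.0])  # means mu_i (arbitrary)
sigmas = np.array([0.5, 1.5, 1.0])   # standard deviations sigma_i (arbitrary)

n = 1_000_000
x1 = rng.normal(mus[0], sigmas[0], n)
x2 = rng.uniform(mus[1] - np.sqrt(3) * sigmas[1], mus[1] + np.sqrt(3) * sigmas[1], n)
x3 = rng.normal(mus[2], sigmas[2], n)
y = alphas[0] * x1 + alphas[1] * x2 + alphas[2] * x3

print("empirical mean, variance:", y.mean(), y.var())
print("Equations 1 and 2:      ", alphas @ mus, (alphas ** 2) @ (sigmas ** 2))
```
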
2.2 Vector estimates

In some applications, estimates are vectors. For example, the state of a mobile robot might be represented by a vector containing its position and velocity. Similarly, the vital signs of a person might be represented by a vector containing his temperature, pulse rate and blood pressure. In this paper, we denote a vector by a boldfaced lowercase letter, and a matrix by an uppercase letter.

The covariance matrix Σ_xx of a random variable x is the matrix E[(x − µ_x)(x − µ_x)^T], where µ_x is the mean of x. Intuitively, entry (i,j) of this matrix is the covariance between the i and j components of vector x; in particular, entry (i,i) is the variance of the i-th component of x. A random variable x with a pdf p whose mean is µ_x and covariance matrix is Σ_xx is written as x ∼ p(µ_x, Σ_xx). The inverse of the covariance matrix (Σ_xx^{-1}) is called the precision or information matrix.

Uncorrelated random variables: The cross-covariance matrix Σ_vw of two random variables v and w is the matrix E[(v − µ_v)(w − µ_w)^T]. Intuitively, element (i,j) of this matrix is the covariance between elements v(i) and w(j). If the random variables are uncorrelated, all entries in this matrix are zero, which is equivalent to saying that every component of v is uncorrelated with every component of w. Lemma 2.2 generalizes Lemma 2.1.

Lemma 2.2. Let x_1 ∼ p_1(µ_1, Σ_1), ..., x_n ∼ p_n(µ_n, Σ_n) be a set of pairwise uncorrelated random variables of length m. Let y = Σ_{i=1}^{n} A_i x_i.

(i) The mean and covariance matrix of y are the following:

    µ_y = Σ_{i=1}^{n} A_i µ_i                                  (3)

    Σ_yy = Σ_{i=1}^{n} A_i Σ_i A_i^T                           (4)

(ii) If a random variable x_{n+1} is pairwise uncorrelated with x_1, .., x_n, it is uncorrelated with y.

The MSE of an unbiased estimator y is E[(y − µ_y)^T (y − µ_y)], which is the sum of the variances of the components of y; if y has length 1, this reduces to variance as expected. The MSE is also the sum of the diagonal elements of Σ_yy (this is called the trace of Σ_yy).

3 FUSING SCALAR ESTIMATES

Section 3.1 discusses the problem of fusing two scalar estimates. Section 3.2 generalizes this to the problem of fusing n>2 scalar estimates. Section 3.3 shows that fusing n>2 estimates can be done iteratively by fusing two estimates at a time without any loss of quality in the final estimate.

3.1 Fusing two scalar estimates

We now consider the problem of choosing the optimal values of the parameters α and β in the linear estimator β*x_1 + α*x_2 for fusing estimates x_1 and x_2 from uncorrelated random variables.

The first reasonable requirement is that if the two estimates x_1 and x_2 are equal, fusing them should produce the same value. This implies that α + β = 1. Therefore the linear estimators of interest are of the form

    y_α(x_1, x_2) = (1−α)*x_1 + α*x_2                          (5)

If x_1 and x_2 in Equation 5 are considered to be unbiased estimators of some quantity of interest, then y_α is an unbiased estimator for any value of α. How should optimality of such an estimator be defined? One reasonable definition is that the optimal value of α minimizes the variance of y_α, since this will produce the highest-confidence fused estimates as discussed in Section 2. The variance (MSE) of y_α can be determined from Lemma 2.1:

    σ_y^2(α) = (1−α)^2 * σ_1^2 + α^2 * σ_2^2                   (6)

Theorem 3.1. Let x_1 ∼ p_1(µ_1, σ_1^2) and x_2 ∼ p_2(µ_2, σ_2^2) be uncorrelated random variables. Consider the linear estimator y_α(x_1, x_2) = (1−α)*x_1 + α*x_2. The variance of the estimator is minimized for α = σ_1^2 / (σ_1^2 + σ_2^2).

This result can be proved by setting the derivative of σ_y^2(α) with respect to α to zero and solving the resulting equation for α.

Proof.

    d/dα σ_y^2(α) = −2(1−α)*σ_1^2 + 2α*σ_2^2
                  = 2α*(σ_1^2 + σ_2^2) − 2*σ_1^2 = 0

    α = σ_1^2 / (σ_1^2 + σ_2^2)                                (7)

The second-order derivative of σ_y^2(α), which is 2(σ_1^2 + σ_2^2), is positive, showing that σ_y^2(α) reaches a minimum at this point. □

In the literature, the optimal value of α is called the Kalman gain K. Substituting K into the linear fusion model, we get the optimal linear estimator y(x_1, x_2):

    y(x_1, x_2) = (σ_2^2 / (σ_1^2 + σ_2^2)) * x_1 + (σ_1^2 / (σ_1^2 + σ_2^2)) * x_2        (8)

As a step towards fusion of n>2 estimates, it is useful to rewrite this as follows:

    y(x_1, x_2) = ((1/σ_1^2) / (1/σ_1^2 + 1/σ_2^2)) * x_1 + ((1/σ_2^2) / (1/σ_1^2 + 1/σ_2^2)) * x_2        (9)

Substituting the optimal value of α into Equation 6, we get

    σ_y^2 = 1 / (1/σ_1^2 + 1/σ_2^2)                            (10)

The expressions for y and σ_y^2 are complicated because they contain the reciprocals of variances. If we let ν_1 and ν_2 denote the precisions of the two distributions, the expressions for y and ν_y can be written more simply as follows:

    y(x_1, x_2) = (ν_1 / (ν_1 + ν_2)) * x_1 + (ν_2 / (ν_1 + ν_2)) * x_2                    (11)

    ν_y = ν_1 + ν_2                                            (12)

These results say that the weight we should give to an estimate is proportional to the confidence we have in that estimate, and that we have more confidence in the fused estimate than in the individual estimates, which is intuitively reasonable. To use these results, we need only the variances of the distributions. In particular, the pdfs p_i, which are usually not available in applications, are not needed, and the proof of Theorem 3.1 does not require these pdfs to have the same mean.

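As an illustration only (the readings and variances in the driver code are made up, not taken from the paper, and fuse_two is our own helper name), the fusion rule of Equations 8 and 10-12 can be written as a few lines of Python:

```python
def fuse_two(x1, var1, x2, var2):
    """Optimally fuse two uncorrelated scalar estimates (Equations 7, 8 and 10)."""
    K = var1 / (var1 + var2)                 # Kalman gain (Equation 7)
    y = (1 - K) * x1 + K * x2                # fused estimate (Equation 8)
    var_y = 1.0 / (1.0 / var1 + 1.0 / var2)  # Equation 10; precisions add (Equation 12)
    return y, var_y

# Hypothetical temperature readings: a noisy device and a more precise one.
print(fuse_two(59.0, 4.0, 61.0, 1.0))  # fused value lies closer to the precise reading
```
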
3.2 Fusing multiple scalar estimates

The approach in Section 3.1 can be generalized to optimally fuse multiple pairwise uncorrelated estimates x_1, x_2, ..., x_n. Let y_{n,α}(x_1, .., x_n) denote the linear estimator for fusing the n estimates given parameters α_1, .., α_n, which we denote by α. The notation y_α(x_1, x_2) introduced in the previous section can be considered to be an abbreviation of y_{2,α}(x_1, x_2).

Theorem 3.2. Let x_i ∼ p_i(µ_i, σ_i^2) for (1≤i≤n) be a set of pairwise uncorrelated random variables. Consider the linear estimator y_{n,α}(x_1, .., x_n) = Σ_{i=1}^{n} α_i x_i, where Σ_{i=1}^{n} α_i = 1. The variance of the estimator is minimized for

    α_i = (1/σ_i^2) / (Σ_{j=1}^{n} 1/σ_j^2)

Proof. From Lemma 2.1, σ_y^2(α) = Σ_{i=1}^{n} α_i^2 σ_i^2. To find the values of α_i that minimize the variance σ_y^2 under the constraint that the α_i's sum to 1, we use the method of Lagrange multipliers. Define

    f(α_1, ..., α_n) = Σ_{i=1}^{n} α_i^2 σ_i^2 + λ(Σ_{i=1}^{n} α_i − 1)

where λ is the Lagrange multiplier. Taking the partial derivatives of f with respect to each α_i and setting these derivatives to zero, we find α_1 σ_1^2 = α_2 σ_2^2 = ... = α_n σ_n^2 = −λ/2. From this, and the fact that the sum of the α_i's is 1, the result follows. □

The minimal variance is given by the following expression:

    σ_{y_n}^2 = 1 / (Σ_{j=1}^{n} 1/σ_j^2)                      (13)

As in Section 3.1, these expressions are more intuitive if the variance is replaced with precision: the contribution of x_i to the value of y_n(x_1, .., x_n) is proportional to x_i's confidence.

    y_n(x_1, .., x_n) = Σ_{i=1}^{n} (ν_i / (ν_1 + ... + ν_n)) * x_i        (14)

    ν_{y_n} = Σ_{i=1}^{n} ν_i                                   (15)

Equations 14 and 15 generalize Equations 11 and 12.

3.3 Incremental fusing is optimal

In many applications, the estimates x_1, x_2, ..., x_n become available successively over a period of time. While it is possible to store all the estimates and use Equations 14 and 15 to fuse all the estimates from scratch whenever a new estimate becomes available, it is possible to save both time and storage if one can do this fusion incrementally. We show that just as a sequence of numbers can be added by keeping a running sum and adding the numbers to this running sum one at a time, a sequence of n>2 estimates can be fused by keeping a "running estimate" and fusing estimates from the sequence one at a time into this running estimate without any loss in the quality of the final estimate. In short, we want to show that y_n(x_1, x_2, ..., x_n) = y_2(y_2(y_2(x_1, x_2), x_3)..., x_n).

A little bit of algebra shows that if n>2, Equations 14 and 15 for the optimal linear estimator and its precision can be expressed as shown in Equations 16 and 17.

    y_n(x_1, .., x_n) = (ν_{y_{n−1}} / (ν_{y_{n−1}} + ν_n)) * y_{n−1}(x_1, ..., x_{n−1}) + (ν_n / (ν_{y_{n−1}} + ν_n)) * x_n        (16)

    ν_{y_n} = ν_{y_{n−1}} + ν_n                                 (17)

This shows that y_n(x_1, .., x_n) = y_2(y_{n−1}(x_1, .., x_{n−1}), x_n). Using this argument recursively gives the required result.³

³We thank Mani Chandy for showing us this approach to proving the result.

To make the connection to Kalman filtering, it is useful to derive the same result using a pictorial argument. Figure 2 shows the process of incrementally fusing the n estimates. In this picture, time progresses from left to right, the precision of each estimate is shown in parentheses next to it, and the weights on the edges are the weights from Equation 11. The contribution of each x_i to the final value y_2(y_2(y_2(x_1, x_2), x_3)..., x_n) is given by the product of the weights on the path from x_i to the final value, and this product is obviously equal to the weight of x_i in Equation 14, showing that incremental fusion is optimal.

[Figure 2: Dataflow graph for incremental fusion.]

3.4 Summary

The results in this section can be summarized informally as follows. When using a linear estimator to fuse uncertain scalar estimates, the weight given to each estimate should be inversely proportional to the variance of the random variable from which that estimate is obtained. Furthermore, when fusing n>2 estimates, estimates can be fused incrementally without any loss in the quality of the final result. These results are often expressed formally in terms of the Kalman gain K, as shown below; the equations can be applied recursively to fuse multiple estimates. Note that if ν_1 ≫ ν_2, K ≈ 0 and y(x_1, x_2) ≈ x_1; conversely, if ν_1 ≪ ν_2, K ≈ 1 and y(x_1, x_2) ≈ x_2.

    x_1 ∼ p_1(µ_1, σ_1^2),  x_2 ∼ p_2(µ_2, σ_2^2)

    K = σ_1^2 / (σ_1^2 + σ_2^2) = ν_2 / (ν_1 + ν_2)            (18)

    y(x_1, x_2) = x_1 + K(x_2 − x_1)                            (19)

    σ_y^2 = (1 − K)σ_1^2   or   ν_y = ν_1 + ν_2                 (20)

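The equivalence of batch fusion (Equations 14-15) and incremental fusion (Equations 18-20) is easy to check numerically. The estimates and variances in this sketch are arbitrary illustrative values, not data from the paper:

```python
import numpy as np

xs        = np.array([10.2, 9.7, 10.5, 9.9])   # hypothetical estimates
variances = np.array([4.0, 1.0, 2.0, 0.5])     # their variances

# Batch fusion: precision-weighted average (Equations 14 and 15).
prec = 1.0 / variances
y_batch, var_batch = (prec / prec.sum()) @ xs, 1.0 / prec.sum()

# Incremental fusion: fold in one estimate at a time (Equations 18, 19 and 20).
y, v = xs[0], variances[0]
for xi, vi in zip(xs[1:], variances[1:]):
    K = v / (v + vi)
    y, v = y + K * (xi - y), (1 - K) * v

print(y_batch, y)          # identical up to floating-point rounding
print(var_batch, v)
```
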
4 FUSING VECTOR ESTIMATES

The results in Section 3 for fusing scalar estimates can be extended to vectors by replacing variances with covariance matrices.

4.1 Fusing multiple vector estimates

For vectors, the linear estimator is

    y_{n,A}(x_1, x_2, .., x_n) = Σ_{i=1}^{n} A_i x_i   where   Σ_{i=1}^{n} A_i = I        (21)

Here A stands for the matrix parameters (A_1, ..., A_n). All the vectors x_i are assumed to be of the same length. To simplify notation, we omit the subscript n in y_{n,A} in the discussion below since it is obvious from the context.

Optimality: The parameters A_1, ..., A_n in the linear data fusion model are chosen to minimize MSE(y_A), which is E[(y_A − µ_{y_A})^T (y_A − µ_{y_A})], as explained in Section 2.

Theorem 4.1 generalizes Theorem 3.2 to the vector case. The proof of this theorem uses matrix derivatives [23] (see Appendix B) and is given in Appendix C since it is not needed for understanding the rest of this paper. Comparing Theorems 4.1 and 3.2, we see that the expressions are similar, the main difference being that the role of variance in the scalar case is played by the covariance matrix in the vector case.

Theorem 4.1. Let x_i ∼ p_i(µ_i, Σ_i) for (1≤i≤n) be a set of pairwise uncorrelated random variables. Consider the linear estimator y_A(x_1, .., x_n) = Σ_{i=1}^{n} A_i x_i, where Σ_{i=1}^{n} A_i = I. The value of MSE(y_A) is minimized for

    A_i = (Σ_{j=1}^{n} Σ_j^{-1})^{-1} Σ_i^{-1}                  (22)

Therefore the optimal estimator is

    y(x_1, ..., x_n) = (Σ_{j=1}^{n} Σ_j^{-1})^{-1} Σ_{i=1}^{n} Σ_i^{-1} x_i                (23)

The covariance matrix of y can be computed by using Lemma 2.2.

    Σ_yy = (Σ_{j=1}^{n} Σ_j^{-1})^{-1}                          (24)

In the vector case, precision is the inverse of a covariance matrix, denoted by N. Equations 25-26 use precision to express the optimal estimator and its variance, and generalize Equations 14-15 to the vector case.

    y(x_1, ..., x_n) = N_y^{-1} Σ_{i=1}^{n} N_i x_i             (25)

    N_y = Σ_{j=1}^{n} N_j                                       (26)

As in the scalar case, fusion of n>2 vector estimates can be done incrementally without loss of precision. The proof is similar to the scalar case, and is omitted.

There are several equivalent expressions for the Kalman gain for the fusion of two estimates. The following one, which is easily derived from Equation 22, is the vector analog of Equation 18:

    K = Σ_1 (Σ_1 + Σ_2)^{-1}                                    (27)

The covariance matrix of the optimal estimator y(x_1, x_2) is the following:

    Σ_yy = Σ_1 (Σ_1 + Σ_2)^{-1} Σ_2                             (28)
         = K Σ_2 = Σ_1 − K Σ_1                                  (29)

4.2 Summary

The results in this section can be summarized in terms of the Kalman gain K as follows.

    x_1 ∼ p_1(µ_1, Σ_1),  x_2 ∼ p_2(µ_2, Σ_2)

    K = Σ_1 (Σ_1 + Σ_2)^{-1} = (N_1 + N_2)^{-1} N_2             (30)

    y(x_1, x_2) = x_1 + K(x_2 − x_1)                            (31)

    Σ_yy = (I − K)Σ_1   or   N_y = N_1 + N_2                    (32)

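A direct transcription of Equations 30-32 looks as follows; the estimates and covariance matrices are made-up values, used purely for illustration:

```python
import numpy as np

x1, S1 = np.array([1.0, 2.0]), np.array([[4.0, 0.5], [0.5, 2.0]])
x2, S2 = np.array([1.5, 1.2]), np.array([[1.0, 0.0], [0.0, 0.5]])

K  = S1 @ np.linalg.inv(S1 + S2)     # Kalman gain (Equation 30)
y  = x1 + K @ (x2 - x1)              # fused estimate (Equation 31)
Sy = (np.eye(2) - K) @ S1            # its covariance (Equation 32)

print(y)
print(Sy)                            # the precision of the result is N1 + N2
```
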
5 BEST LINEAR UNBIASED ESTIMATOR (BLUE)

In some applications, the state of the system is represented by a vector but only part of the state can be measured directly, so it is necessary to estimate the hidden portion of the state corresponding to a measured value of the visible state. This section describes an estimator called the Best Linear Unbiased Estimator (BLUE) [17, 20, 27] for doing this.

Consider the general problem of determining a value for a vector y given a value for a vector x. If there is a functional relationship between x and y (say y = F(x) and F is given), it is easy to compute y given a value for x (say x_1).

In our context however, x and y are random variables, so such a precise functional relationship will not hold. Figure 3 shows an example in which x and y are scalar-valued random variables. The gray ellipse in this figure, called a confidence ellipse, is a projection of the joint distribution of x and y onto the (x, y) plane that shows where some large proportion of the (x, y) values are likely to be. Suppose x takes the value x_1. Even within the confidence ellipse, there are many points (x_1, y), so we cannot associate a single value of y with x_1. One possibility is to compute the mean of the y values associated with x_1 (that is, the expectation E[y|x=x_1]), and return this as the estimate for y if x=x_1. This requires knowing the joint distribution of x and y, which may not always be available.

[Figure 3: BLUE line corresponding to Equation 33: (ŷ − µ_y) = Σ_yx Σ_xx^{-1} (x − µ_x). The line passes through (µ_x, µ_y), and ŷ_1 is the estimate it returns for x = x_1.]

In some problems, we can assume that there is an unknown linear relationship between x and y and that uncertainty comes from noise. Therefore, we can use a technique similar to the ordinary least squares (OLS) method to estimate this linear relationship, and use it to return the best estimate of y for any given value of x. In Figure 3, we see that although there are many points (x_1, y), the y values are clustered around the line shown in the figure, so the value ŷ_1 is a reasonable estimate for the value of y corresponding to x_1. This line, called the best linear unbiased estimator (BLUE), is the analog of ordinary least squares (OLS) for distributions.

Computing BLUE. Consider the estimator ŷ_{A,b}(x) = Ax + b. We choose A and b so that this is an unbiased estimator with minimal MSE. The "^" over the y is notation that indicates that we are computing an estimate for y.

Theorem 5.1. Let [x; y] ∼ p([µ_x; µ_y], [Σ_xx, Σ_xy; Σ_yx, Σ_yy]). The estimator ŷ_{A,b}(x) = Ax + b for estimating the value of y for a given value of x is an unbiased estimator with minimal MSE if

    b = µ_y − A µ_x
    A = Σ_yx Σ_xx^{-1}

The proof of Theorem 5.1 is straightforward. For an unbiased estimator, E[ŷ] = E[y]. This implies that b = µ_y − A µ_x, so an unbiased estimator is of the form ŷ_A(x) = µ_y + A(x − µ_x). Note that this is equivalent to asserting that the BLUE line must pass through the point (µ_x, µ_y). Setting the derivative of MSE_A(ŷ_A) with respect to A to zero [23] and solving for A, we find that the best linear unbiased estimator is

    ŷ = µ_y + Σ_yx Σ_xx^{-1} (x − µ_x)                          (33)

This equation can be understood intuitively as follows. If we have no information about x and y, the best we can do is the estimate (µ_x, µ_y), which lies on the BLUE line. However, if we know that x has a particular value x_1, we can use the correlation between y and x to estimate a better value for y from the difference (x_1 − µ_x). Note that if Σ_yx = 0 (that is, x and y are uncorrelated), the best estimate of y is just µ_y, so knowing the value of x does not give us any additional information about y, as one would expect. In Figure 3, this corresponds to the case when the BLUE line is parallel to the x-axis. At the other extreme, suppose that y and x are functionally related so y = Cx. In that case, it is easy to see that Σ_yx = CΣ_xx, so ŷ = Cx as expected. In Figure 3, this corresponds to the case when the confidence ellipse shrinks down to the BLUE line.

Equation 33 is a generalization of ordinary least squares in the sense that if we compute the relevant means and variances of a set of discrete data (x_i, y_i) and substitute into Equation 33, we get the same line that is obtained by using OLS.

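The sketch below evaluates the BLUE of Equation 33 for a toy scalar example with invented moments; in the Kalman filter of Section 6.3, these moments come from the a priori covariance matrix, but here mu_x, mu_y, Sxx and Syx are simply assumed values:

```python
import numpy as np

mu_x, mu_y = 2.0, 5.0
Sxx = np.array([[1.0]])    # covariance of the observed part x
Syx = np.array([[0.8]])    # cross-covariance between y and x

def blue(x):
    """y_hat = mu_y + Sigma_yx Sigma_xx^{-1} (x - mu_x)   (Equation 33)."""
    A = Syx @ np.linalg.inv(Sxx)      # the optimal A from Theorem 5.1
    return mu_y + A[0, 0] * (x - mu_x)

print(blue(2.0))   # x at its mean: the best guess for y is just mu_y
print(blue(3.5))   # x above its mean pulls y upward through the correlation
```
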
6 KALMAN FILTERS FOR LINEAR SYSTEMS

In this section, we apply the algorithms developed in Sections 3-5 to the particular problem of state estimation in linear systems, which is the classical application of Kalman filtering.

Figure 4a shows how the evolution of the state of such a system over time can be computed if the initial state x_0 and the model of the system dynamics are known precisely. Time advances in discrete steps. The state of the system at any time step is a function of the state of the system at the previous time step and the control inputs applied to the system during that interval. This is usually expressed by an equation of the form x_t = f_t(x_{t−1}, u_t), where u_t is the control input. The function f_t is nonlinear in the general case, and can be different for different steps. If the system is linear, the relation for state evolution over time can be written as x_t = F_t x_{t−1} + B_t u_t, where F_t and B_t are time-dependent matrices that can be determined from the physics of the system. Therefore, if the initial state x_0 is known exactly and the system dynamics are modeled perfectly by the F_t and B_t matrices, the evolution of the state over time can be computed precisely as shown in Figure 4a.

In general however, we may not know the initial state exactly, and the system dynamics and control inputs may not be known precisely. These inaccuracies may cause the state computed by the model to diverge unacceptably from the actual state over time. To avoid this, we can make measurements of the state after each time step. If these measurements were exact, there would of course be no need to model the system dynamics. However, in general, the measurements themselves are imprecise.

Kalman filtering was invented to solve the problem of state estimation in such systems. Figure 4b shows the dataflow of the computation, and we use it to introduce standard terminology. An estimate of the initial state, denoted by x̂_{0|0}, is assumed to be available. At each time step t = 1, 2, .., the system model is used to provide an estimate of the state at time t using information from time t−1. This step is called prediction and the estimate that it provides is called the a priori estimate and denoted by x̂_{t|t−1}. The a priori estimate is then fused with z_t, the state estimate obtained from the measurement at time t, and the result is the a posteriori state estimate at time t, denoted by x̂_{t|t}. This a posteriori estimate is used by the model to produce the a priori estimate for the next time step, and so on. As described below, the a priori and a posteriori estimates are the means of certain random variables; the covariance matrices of these random variables are shown within parentheses above each estimate in Figure 4b, and these are used to weight estimates when fusing them.

Section 6.1 presents the state evolution model and a priori state estimation. Section 6.2 discusses how state estimates are fused if an estimate of the entire state can be obtained by measurement; Section 6.3 addresses this problem when only a portion of the state can be measured directly.

[Figure 4: State estimation using Kalman filtering.
(a) Discrete-time dynamical system: the states x_0, x_1, ..., x_t evolve through the functions f_1, f_2, ..., f_t.
(b) Dynamical system with uncertainty: each a priori estimate x̂_{t|t−1} (covariance Σ_{t|t−1}) is fused with the measurement z_t (covariance R_t) to produce the a posteriori estimate x̂_{t|t} (covariance Σ_{t|t}).
(c) Implementation of the dataflow diagram (b):
    Predictor: x̂_{t|t−1} = F_t x̂_{t−1|t−1} + B_t u_t,  Σ_{t|t−1} = F_t Σ_{t−1|t−1} F_t^T + Q_t
    Fusion:    K_t = Σ_{t|t−1} (Σ_{t|t−1} + R_t)^{-1},  x̂_{t|t} = x̂_{t|t−1} + K_t (z_t − x̂_{t|t−1}),  Σ_{t|t} = (I − K_t) Σ_{t|t−1}
(d) Implementation of the dataflow diagram (b) for systems with partial observability:
    Predictor: x̂_{t|t−1} = F_t x̂_{t−1|t−1} + B_t u_t,  Σ_{t|t−1} = F_t Σ_{t−1|t−1} F_t^T + Q_t
    Fusion:    K_t = Σ_{t|t−1} H_t^T (H_t Σ_{t|t−1} H_t^T + R_t)^{-1},  x̂_{t|t} = x̂_{t|t−1} + K_t (z_t − H_t x̂_{t|t−1}),  Σ_{t|t} = (I − K_t H_t) Σ_{t|t−1}]

6.1 State evolution model and prediction

The evolution of the state over time is described by a series of random variables x_0, x_1, x_2, ...

• The random variable x_0 captures the likelihood of different initial states.
• The random variables at successive time steps are related by the following linear model:

    x_t = F_t x_{t−1} + B_t u_t + w_t                           (34)

  Here, u_t is the control input, which is assumed to be deterministic, and w_t is a zero-mean noise term that models all the uncertainty in the system. The covariance matrix of w_t is denoted by Q_t, and the noise terms in different time steps are assumed to be uncorrelated with each other (that is, E[w_i w_j] = 0 if i≠j) and with x_0.

For estimation, we have a random variable x_{0|0} that captures our belief about the likelihood of different states at time t=0, and two random variables x_{t|t−1} and x_{t|t} at each time step t = 1, 2, ... that capture our beliefs about the likelihood of different states at time t before and after fusion with the measurement, respectively. The mean and covariance matrix of a random variable x_{i|j} are denoted by x̂_{i|j} and Σ_{i|j} respectively. We assume E[x_{0|0}] = E[x_0] (no bias).

Prediction essentially uses x_{t−1|t−1} as a proxy for x_{t−1} in Equation 34 to determine x_{t|t−1}, as shown in Equation 35.

    x_{t|t−1} = F_t x_{t−1|t−1} + B_t u_t + w_t                 (35)

For state estimation, we need only the mean and covariance matrix of x_{t|t−1}. The Predictor box in Figure 4 computes these values; the covariance matrix is obtained from Lemma 2.2 under the assumption that w_t is uncorrelated with x_{t−1|t−1}, which is justified in Section 6.2.

6.2 Fusing complete observations of the state

If the entire state can be measured at each time step, the imprecise measurement at time t is modeled as follows:

    z_t = x_t + v_t                                             (36)

where v_t is a zero-mean noise term with covariance matrix R_t. The noise terms in different time steps are assumed to be uncorrelated with each other (that is, E[v_i v_j] is zero if i≠j) as well as with x_{0|0} and all w_k. A subtle point here is that x_t in this equation is the actual state of the system at time t (that is, a particular realization of the random variable x_t), so variability in z_t comes only from v_t, and its covariance matrix is R_t.

Therefore, we have two imprecise estimates for the state at each time step t = 1, 2, ..., the a priori estimate from the predictor (x̂_{t|t−1}) and the one from the measurement (z_t). If z_t is uncorrelated with x_{t|t−1}, we can use Equations 30-32 to fuse the estimates as shown in Figure 4c.

The assumptions that (i) x_{t−1|t−1} is uncorrelated with w_t, which is used in prediction, and (ii) x_{t|t−1} is uncorrelated with z_t, which is used in fusion, are easily proved to be correct by induction on t, using Lemma 2.2(ii). Figure 4b gives the intuition: x_{t|t−1}, for example, is an affine function of the random variables x_{0|0}, w_1, v_1, w_2, v_2, ..., w_t, and is therefore uncorrelated with v_t (by the assumption about v_t and Lemma 2.2(ii)) and hence with z_t.

Figure 5 shows the computation pictorially, using confidence ellipses to illustrate uncertainty. The dotted arrows at the bottom of the figure show the evolution of the state, and the solid arrows show the computation of the a priori estimates and their fusion with measurements.

[Figure 5: Pictorial representation of Kalman filtering.]

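The following is a minimal sketch of one predict/fuse step from Figure 4c, assuming the entire state is measured; the function names and the matrices F, B, Q, R and the numbers in the driver code are illustrative placeholders rather than values from the paper:

```python
import numpy as np

def predict(x, S, F, B, u, Q):
    """A priori estimate and covariance (Predictor box of Figure 4c)."""
    return F @ x + B @ u, F @ S @ F.T + Q

def fuse(x_pred, S_pred, z, R):
    """Fuse the a priori estimate with a full-state measurement (Equations 30-32)."""
    K = S_pred @ np.linalg.inv(S_pred + R)
    return x_pred + K @ (z - x_pred), (np.eye(len(x_pred)) - K) @ S_pred

F, B = np.array([[1.0, 0.25], [0.0, 1.0]]), np.eye(2)
Q, R = 0.1 * np.eye(2), 0.5 * np.eye(2)
x, S = np.zeros(2), 10.0 * np.eye(2)

x, S = predict(x, S, F, B, u=np.array([0.3, 0.0]), Q=Q)
x, S = fuse(x, S, z=np.array([0.4, 0.1]), R=R)
print(x, S, sep="\n")
```
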
6.3 Fusing partial observations of the state

In some problems, only a portion of the state can be measured directly. The observable portion of the state is specified formally using a full row-rank matrix H_t called the observation matrix: if the state is x, what is observable is H_t x. For example, if the state vector has two components and only the first component is observable, H_t can be [1 0]. In general, the H_t matrix can specify a linear relationship between the state and the observation, and it can be time-dependent. The imprecise measurement model introduced in Equation 36 becomes:

    z_t = H_t x_t + v_t                                         (37)

The hidden portion of the state can be specified using a matrix C_t, which is an orthogonal complement of H_t. For example, if H_t = [1 0], one choice for C_t is [0 1].

Figure 4d shows the computation for this case. The fusion phase can be understood intuitively in terms of the following steps.

(i) The observable part of the a priori estimate of the state, H_t x̂_{t|t−1}, is fused with the measurement z_t, using the techniques developed in Sections 3-4. The quantity z_t − H_t x̂_{t|t−1} is called the innovation. The result is the a posteriori estimate of the observable state, H_t x̂_{t|t}.
(ii) The BLUE in Section 5 is used to obtain the a posteriori estimate of the hidden state C_t x̂_{t|t} by adding to the a priori estimate of the hidden state C_t x̂_{t|t−1} a value obtained from the product of the covariance between H_t x_{t|t−1} and C_t x_{t|t−1} and the difference between H_t x̂_{t|t−1} and H_t x̂_{t|t}.
(iii) The a posteriori estimates of the observable and hidden portions of the state are composed to produce the a posteriori estimate of the entire state, x̂_{t|t}.

The actual implementation produces the final result directly without going through these steps, as shown in Figure 4d, but these incremental steps are useful for understanding how all this works, and they are implemented as follows.

(i) The a priori estimate of the observable part of the state is H_t x̂_{t|t−1} and its covariance is H_t Σ_{t|t−1} H_t^T. The a posteriori estimate is obtained directly from Equation 31:

    H_t x̂_{t|t} = H_t x̂_{t|t−1} + H_t Σ_{t|t−1} H_t^T (H_t Σ_{t|t−1} H_t^T + R_t)^{-1} (z_t − H_t x̂_{t|t−1})

Let K_t = Σ_{t|t−1} H_t^T (H_t Σ_{t|t−1} H_t^T + R_t)^{-1}. The a posteriori estimate of the observable state can be written in terms of K_t as follows:

    H_t x̂_{t|t} = H_t x̂_{t|t−1} + H_t K_t (z_t − H_t x̂_{t|t−1})        (38)

(ii) The a priori estimate of the hidden state is C_t x̂_{t|t−1}. The covariance between the hidden portion C_t x_{t|t−1} and the observable portion H_t x_{t|t−1} is C_t Σ_{t|t−1} H_t^T. The difference between the a priori estimate and the a posteriori estimate of H_t x is H_t K_t (z_t − H_t x̂_{t|t−1}). Therefore the a posteriori estimate of the hidden portion of the state is obtained directly from Equation 33:

    C_t x̂_{t|t} = C_t x̂_{t|t−1} + (C_t Σ_{t|t−1} H_t^T)(H_t Σ_{t|t−1} H_t^T)^{-1} H_t K_t (z_t − H_t x̂_{t|t−1})
                = C_t x̂_{t|t−1} + C_t K_t (z_t − H_t x̂_{t|t−1})        (39)

(iii) Putting the a posteriori estimates (38) and (39) together,

    [H_t; C_t] x̂_{t|t} = [H_t; C_t] x̂_{t|t−1} + [H_t; C_t] K_t (z_t − H_t x̂_{t|t−1})

Since [H_t; C_t] is invertible, it can be canceled from the left and right hand sides, giving the equation

    x̂_{t|t} = x̂_{t|t−1} + K_t (z_t − H_t x̂_{t|t−1})                    (40)

To compute Σ_{t|t}, Equation 40 can be rewritten as x̂_{t|t} = (I − K_t H_t) x̂_{t|t−1} + K_t z_t. Since x_{t|t−1} and z_t are uncorrelated, it follows from Lemma 2.2 that

    Σ_{t|t} = (I − K_t H_t) Σ_{t|t−1} (I − K_t H_t)^T + K_t R_t K_t^T

Substituting the value of K_t and simplifying, we get

    Σ_{t|t} = (I − K_t H_t) Σ_{t|t−1}                           (41)

Figure 4d puts all this together. In the literature, this dataflow is referred to as Kalman filtering. Unlike in Sections 3 and 4, the Kalman gain is not a dimensionless value here. If H_t = I, the computations in Figure 4d reduce to those of Figure 4c as expected.

Equation 40 shows that the a posteriori state estimate is a linear combination of the a priori state estimate (x̂_{t|t−1}) and the measurement (z_t). The optimality of this linear unbiased estimator is shown in Appendix D. In Section 3.3, it was shown that incremental fusion of scalar estimates is optimal. The dataflow of Figures 4(c,d) computes the a posteriori state estimate at time t by incrementally fusing measurements from the previous time steps, and this incremental fusion can be shown to be optimal using a similar argument.

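Equations 40 and 41 translate directly into code. The following sketch uses a hypothetical two-component state in which only the first component is observed (H_t = [1 0]); kalman_update is our own helper name and all numbers are made up for illustration:

```python
import numpy as np

def kalman_update(x_pred, S_pred, z, H, R):
    """A posteriori estimate from a partial measurement (Equations 40 and 41)."""
    K = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + R)   # Kalman gain
    x_post = x_pred + K @ (z - H @ x_pred)                   # Equation 40
    S_post = (np.eye(len(x_pred)) - K @ H) @ S_pred          # Equation 41
    return x_post, S_post

x_pred = np.array([12.0, 3.0])
S_pred = np.array([[2.0, 0.8], [0.8, 1.0]])
x_post, S_post = kalman_update(x_pred, S_pred,
                               z=np.array([11.4]),
                               H=np.array([[1.0, 0.0]]),
                               R=np.array([[0.5]]))
print(x_post)   # the hidden component also moves, through the cross-covariance
print(S_post)
```
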
6.4 Example: falling body

To demonstrate the effectiveness of the Kalman filter, we consider an example in which an object falls from the origin at time t=0 with an initial speed of 0 m/s and an expected constant acceleration of 9.8 m/s^2 due to gravity. Note that the acceleration in reality may not be constant, due to factors such as wind, air friction, and so on.

The state vector of the object contains two components, one for the distance from the origin s(t) and one for the velocity v(t). We assume that only the velocity can be measured at each time step. If time is discretized in steps of 0.25 seconds, the difference equation for the dynamics of the system is easily shown to be the following:

    [v_n; s_n] = [1, 0; 0.25, 1] [v_{n−1}; s_{n−1}] + [0.25; 0.5 × 0.25^2] × 9.8        (42)

where we assume [v_0; s_0] = [0; 0] and Σ_0 = [80, 0; 0, 10].

The gray lines in Figure 6 show the evolution of velocity and distance with time according to this model. Because of uncertainty in modeling the system dynamics, the actual evolution of the velocity and position will be different in practice. The red lines in Figure 6 show one trajectory for this evolution, corresponding to a Gaussian noise term with covariance [2, 2.5; 2.5, 4] in Equation 34 (because this noise term is random, there are many trajectories for the evolution, and we are just showing one of them). The red lines correspond to "ground truth" in our example.

The green points in Figure 6b show the noisy measurements of velocity at different time steps, assuming the noise is modeled by a Gaussian with variance 8. The blue lines show the a posteriori estimates of the velocity and position. It can be seen that the a posteriori estimates track the ground truth quite well even when the ideal system model (the gray lines) is inaccurate and the measurements are noisy. The cyan bars in the right figure show the variance of the velocity estimate at different time steps. Although the initial variance is quite large, application of Kalman filtering is able to reduce it rapidly in a few time steps.

[Figure 6: Estimates of the object's state over time. (a) Evolution of state: distance (m) versus time (s). (b) Evolution of state: velocity (m/s) versus time (s), with the measured values and the variance of the velocity estimate. Each panel compares the model-only trajectory, the ground truth, and the estimated state.]

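The following sketch reproduces the structure of this example: the dynamics of Equation 42, the process noise covariance [2, 2.5; 2.5, 4], a velocity-only measurement with variance 8, and the stated initial covariance. The random seed, the number of steps, and the omission of plotting are our own choices, so the printed numbers are only indicative:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, g, steps = 0.25, 9.8, 80
F  = np.array([[1.0, 0.0], [dt, 1.0]])        # state is [velocity, distance]
Bu = np.array([dt * g, 0.5 * dt ** 2 * g])    # effect of gravity over one step (Equation 42)
Q  = np.array([[2.0, 2.5], [2.5, 4.0]])       # process noise covariance
H  = np.array([[1.0, 0.0]])                   # only the velocity is measured
R  = np.array([[8.0]])                        # measurement noise variance

x_true = np.zeros(2)
x_est, S = np.zeros(2), np.array([[80.0, 0.0], [0.0, 10.0]])
for _ in range(steps):
    x_true = F @ x_true + Bu + rng.multivariate_normal(np.zeros(2), Q)   # ground truth
    z = H @ x_true + rng.normal(0.0, np.sqrt(R[0, 0]), 1)                # noisy velocity
    x_est, S = F @ x_est + Bu, F @ S @ F.T + Q                           # predict
    K = S @ H.T @ np.linalg.inv(H @ S @ H.T + R)                         # fuse (Figure 4d)
    x_est, S = x_est + K @ (z - H @ x_est), (np.eye(2) - K @ H) @ S

print("ground truth [v, s]:", x_true)
print("a posteriori [v, s]:", x_est)
```
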

6.5 Discussion

This section shows that Kalman filtering for state estimation in linear systems can be derived from two elementary ideas: optimal linear estimators for fusing uncorrelated estimates, and best linear unbiased estimators for correlated variables. This is a different approach to the subject than the standard presentations in the literature. One standard approach is to use Bayesian inference. The other approach is to assume that the a posteriori state estimator is a linear combination of the form A_t x̂_{t|t−1} + B_t z_t, and then find the values of A_t and B_t that produce an unbiased estimator with minimum MSE. We believe that the advantage of the presentation given here is that it exposes the concepts and assumptions that underlie Kalman filtering.

Most presentations in the literature also begin by assuming that the noise terms w_t in the state evolution equation and v_t in the measurement are Gaussian. Some presentations [1, 10] use properties of Gaussians to derive the results in Section 3, although, as we have seen, these results do not depend on the distributions being Gaussian. Gaussians however enter the picture in a deeper way if one considers nonlinear estimators. It can be shown that if the noise terms are not Gaussian, there may be nonlinear estimators whose MSE is lower than that of the linear estimator presented in Figure 4d. However, if the noise is Gaussian, this linear estimator is as good as any unbiased nonlinear estimator (that is, the linear estimator is a minimum variance unbiased estimator (MVUE)). This result is proved using the Cramér-Rao lower bound [25].

7 EXTENSION TO NONLINEAR SYSTEMS

The Extended Kalman Filter (EKF) and Unscented Kalman Filter (UKF) are heuristic approaches to using Kalman filtering for nonlinear systems. The state evolution and measurement equations for nonlinear systems with additive noise can be written as follows; in these equations, f and h are nonlinear functions.

    x_t = f(x_{t−1}, u_t) + w_t                                 (43)

    z_t = h(x_t) + v_t                                          (44)

Intuitively, the EKF constructs linear approximations to the nonlinear functions f and h and applies the Kalman filter equations, while the UKF constructs approximations to probability distributions and propagates these through the nonlinear functions to construct approximations to the posterior distributions.

EKF. Examining Figure 4d, we see that the a priori state estimate in the predictor can be computed using the system model: x̂_{t|t−1} = f(x̂_{t−1|t−1}, u_t). However, since the system dynamics and measurement equations are nonlinear, it is not clear how to compute the covariance matrices for the a priori estimate and the measurement. In the EKF, these matrices are computed by linearizing Equations 43 and 44 using the Taylor series expansions for the nonlinear functions f and h. This requires computing the following Jacobians,⁴ which play the role of F_t and H_t in Figure 4d.

    F_t = ∂f/∂x |_{x̂_{t−1|t−1}, u_t}        H_t = ∂h/∂x |_{x̂_{t|t−1}}

⁴The Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function.

The EKF performs well in some applications such as navigation systems and GPS [29].

UKF. When the system dynamics and observation models are highly nonlinear, the Unscented Kalman Filter (UKF) [15] can be an improvement over the EKF. The UKF is based on the unscented transformation, which is a method for computing the statistics of a random variable x that undergoes a nonlinear transformation (y = g(x)). The random variable x is sampled using a carefully chosen set of sigma points and these sample points are propagated through the nonlinear function g. The statistics of y are estimated using a weighted sample mean and covariance of the posterior sigma points. The UKF tends to be more robust and accurate than the EKF but has higher computational overhead due to the sampling process.

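To make the EKF recipe concrete, here is a minimal sketch for a toy nonlinear system of our own invention; the functions f and h and their hand-coded Jacobians are illustrative assumptions, not part of the paper. The step simply applies the Figure 4d equations with F_t and H_t replaced by the Jacobians:

```python
import numpy as np

def f(x, u):                       # nonlinear state evolution (toy example)
    return np.array([x[0] + 0.1 * np.sin(x[1]), 0.95 * x[1] + u])

def h(x):                          # nonlinear observation (toy example)
    return np.array([x[0] ** 2])

def F_jac(x, u):                   # df/dx evaluated at (x_hat_{t-1|t-1}, u_t)
    return np.array([[1.0, 0.1 * np.cos(x[1])], [0.0, 0.95]])

def H_jac(x):                      # dh/dx evaluated at x_hat_{t|t-1}
    return np.array([[2.0 * x[0], 0.0]])

def ekf_step(x, S, u, z, Q, R):
    x_pred, Ft = f(x, u), F_jac(x, u)          # predict with the nonlinear model
    S_pred = Ft @ S @ Ft.T + Q                 # propagate covariance with the Jacobian
    Ht = H_jac(x_pred)                         # linearize the observation model
    K = S_pred @ Ht.T @ np.linalg.inv(Ht @ S_pred @ Ht.T + R)
    return x_pred + K @ (z - h(x_pred)), (np.eye(len(x)) - K @ Ht) @ S_pred

x, S = np.array([1.0, 0.5]), np.eye(2)
x, S = ekf_step(x, S, u=0.1, z=np.array([1.2]), Q=0.01 * np.eye(2), R=np.array([[0.1]]))
print(x)
```
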
8 CONCLUSION

In this paper, we have shown that two concepts, optimal linear estimators for fusing uncorrelated estimates and best linear unbiased estimators for correlated variables, provide the underpinnings for Kalman filtering. By combining these ideas, standard results on Kalman filtering for linear systems can be derived in an intuitive and straightforward way that is simpler than other presentations of this material in the literature. This approach makes clear the assumptions that underlie the optimality results associated with Kalman filtering, and should make it easier to apply Kalman filtering to problems in computer systems.

ACKNOWLEDGMENTS

This research was supported by NSF grants 1337281, 1406355, and 1618425, and by DARPA contracts FA8750-16-2-0004 and FA8650-15-C-7563. The authors would like to thank K. Mani Chandy (Caltech), Ivo Babuska (UT Austin) and Augusto Ferrante (Padova) for their feedback on this paper.

REFERENCES

[1] Tim Babb. 2018. How a Kalman filter works, in pictures | Bzarg. https://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/. Accessed: 2018-11-30.
[2] A. V. Balakrishnan. 1987. Kalman Filtering Theory. Optimization Software, Inc., Los Angeles, CA, USA.
[3] Allen L. Barker, Donald E. Brown, and Worthy N. Martin. 1994. Bayesian Estimation and the Kalman Filter. Technical Report. Charlottesville, VA, USA.
[4] Alex Becker. 2018. Kalman Filter Overview. https://www.kalmanfilter.net/default.aspx. Accessed: 2018-11-08.
[5] K. Bergman. 2009. Nanophotonic Interconnection Networks in Multicore Embedded Computing. In 2009 IEEE/LEOS Winter Topicals Meeting Series. 6-7. https://doi.org/10.1109/LEOSWT.2009.4771628
[6] Liyu Cao and Howard M. Schwartz. 2004. Analysis of the Kalman Filter Based Estimation Algorithm: An Orthogonal Decomposition Approach. Automatica 40, 1 (Jan. 2004), 5-19. https://doi.org/10.1016/j.automatica.2003.07.011
[7] Charles K. Chui and Guanrong Chen. 2017. Kalman Filtering: With Real-Time Applications (5th ed.). Springer Publishing Company, Incorporated.
[8] R. L. Eubank. 2005. A Kalman Filter Primer (Statistics: Textbooks and Monographs). Chapman & Hall/CRC.
[9] Geir Evensen. 2006. Data Assimilation: The Ensemble Kalman Filter. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[10] Rodney Faragher. 2012. Understanding the Basis of the Kalman Filter Via a Simple and Intuitive Derivation. IEEE Signal Processing Magazine 29, 5 (September 2012), 128-132.
[11] Mohinder S. Grewal and Angus P. Andrews. 2014. Kalman Filtering: Theory and Practice with MATLAB (4th ed.). Wiley-IEEE Press.
[12] Anne-Kathrin Hess and Anders Rantzer. 2010. Distributed Kalman Filter Algorithms for Self-localization of Mobile Devices. In Proceedings of the 13th ACM International Conference on Hybrid Systems: Computation and Control (HSCC '10). ACM, New York, NY, USA, 191-200. https://doi.org/10.1145/1755952.1755980
[13] Connor Imes and Henry Hoffmann. 2016. Bard: A Unified Framework for Managing Soft Timing and Power Constraints. In International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS).
[14] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann. 2015. POET: A Portable Approach to Minimizing Energy Under Soft Real-time Constraints. In 21st IEEE Real-Time and Embedded Technology and Applications Symposium. 75-86. https://doi.org/10.1109/RTAS.2015.7108419
[15] Simon J. Julier and Jeffrey K. Uhlmann. 2004. Unscented Filtering and Nonlinear Estimation. Proc. IEEE 92, 3 (2004), 401-422. https://doi.org/10.1109/JPROC.2003.823141
[16] Rudolph Emil Kalman. 1960. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME, Journal of Basic Engineering 82, Series D (1960), 35-45.
[17] Peter K. Kitanidis. 1987. Unbiased Minimum-variance Linear State Estimation. Automatica 23, 6 (Nov. 1987), 775-778. https://doi.org/10.1016/0005-1098(87)90037-9
[18] Anders Lindquist and Giorgio Picci. 2017. Linear Stochastic Systems. Springer-Verlag.
[19] Peter S. Maybeck. 1982. Stochastic Models, Estimation, and Control. Vol. 3. Academic Press.
[20] Jerry M. Mendel. 1995. Lessons in Estimation Theory for Signal Processing, Communications, and Control. Pearson Education.
[21] Kaushik Nagarajan, Nicholas Gans, and Roozbeh Jafari. 2011. Modeling Human Gait Using a Kalman Filter to Measure Walking Distance. In Proceedings of the 2nd Conference on Wireless Health (WH '11). ACM, New York, NY, USA, Article 34, 2 pages. https://doi.org/10.1145/2077546.2077584
[22] Eduardo F. Nakamura, Antonio A. F. Loureiro, and Alejandro C. Frery. 2007. Information Fusion for Wireless Sensor Networks: Methods, Models, and Classifications. ACM Comput. Surv. 39, 3, Article 9 (Sept. 2007). https://doi.org/10.1145/1267070.1267073
[23] Kaare Brandt Petersen and Michael Syskind Pedersen. 2012. The Matrix Cookbook. http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3274. (Nov. 2012). Version 20121115.
[24] Raghavendra Pradyumna Pothukuchi, Amin Ansari, Petros Voulgaris, and Josep Torrellas. 2016. Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 658-670.
[25] C. R. Rao. 1945. Information and the Accuracy Attainable in the Estimation of Statistical Parameters. Bulletin of the Calcutta Mathematical Society 37 (1945), 81-89.
[26] Matthew B. Rhudy, Roger A. Salguero, and Keaton Holappa. 2017. A Kalman Filtering Tutorial for Undergraduate Students. International Journal of Computer Science & Engineering Survey 8, 1 (Feb. 2017).
[27] Sailes K. Sengupta. 1995. Fundamentals of Statistical Signal Processing: Estimation Theory. Technometrics 37, 4 (1995), 465-466. https://doi.org/10.1080/00401706.1995.10484391
[28] Éfren L. Souza, Eduardo F. Nakamura, and Richard W. Pazzi. 2016. Target Tracking for Sensor Networks: A Survey. ACM Comput. Surv. 49, 2, Article 30 (June 2016), 31 pages. https://doi.org/10.1145/2938639
[29] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. 2005. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press.
[30] Greg Welch and Gary Bishop. 1995. An Introduction to the Kalman Filter. Technical Report. Chapel Hill, NC, USA.
[31] Yan Pei, Swarnendu Biswas, Donald S. Fussell, and Keshav Pingali. 2017. An Elementary Introduction to Kalman Filtering. ArXiv e-prints (Oct. 2017). arXiv:1710.04055

Appendices

A BASIC PROBABILITY THEORY AND STATISTICS TERMINOLOGY

Probability density function. For a continuous random variable x, a probability density function (pdf) is a function p(x) whose value provides a relative likelihood that the value of the random variable will equal x. The integral of the pdf within a range of values is the probability that the random variable will take a value within that range.

If g(x) is a function of x with pdf p(x), the expected value or expectation of g(x) is E[g(x)], defined as the following integral:

    E[g(x)] = ∫_{−∞}^{∞} g(x) p(x) dx

By definition, the mean µ_x of a random variable x is E[x]. The variance of a random variable x measures the variability of the distribution. For the set of possible values of x, the variance (denoted by σ_x^2) is defined by σ_x^2 = E[(x − µ_x)^2]. The variance of a continuous random variable x can be written as the following integral:

    σ_x^2 = ∫_{−∞}^{∞} (x − µ_x)^2 p(x) dx

If x is discrete and all outcomes are equally likely, then σ_x^2 = Σ_i (x_i − µ_x)^2 / n. The standard deviation σ_x is the square root of the variance.

Covariance. The covariance of two random variables is a measure of their joint variability. The covariance between random variables x_1 ∼ p_1(µ_1, σ_1^2) and x_2 ∼ p_2(µ_2, σ_2^2) is the expectation E[(x_1 − µ_1)∗(x_2 − µ_2)]. Two random variables are uncorrelated or not correlated if their covariance is zero. This is not the same concept as independence of random variables.

Two random variables are independent if knowing the value of one of the variables does not give us any information about the possible values of the other one. This is written formally as p(x_1|x_2) = p(x_1); intuitively, knowing the value of x_2 does not change the probability that x_1 takes a particular value.

Independent random variables are uncorrelated, but random variables can be uncorrelated even if they are not independent. It can be shown that if x_1 and x_2 are not correlated, E[x_1|x_2] = E[x_1]; intuitively, knowing the value of x_2 may change the probability that x_1 takes a particular value, but the mean of the resulting distribution remains the same as the mean of x_1. A special case of this that is easy to understand is one in which knowing x_2 restricts the possible values of x_1 without changing the mean. Consider a random variable u : U that is uniformly distributed over the unit circle, and consider random variables x_1 : [−1, 1] and x_2 : [−1, 1] that are the projections of u on the x and y axes respectively. Given a value for x_2, there are only two possible values for x_1, so x_1 and x_2 are not independent. However, the mean of these values is 0, which is the mean of x_1, so x_1 and x_2 are not correlated.

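A quick numerical illustration of this unit-circle example (our own sketch, using a large random sample; the seed and tolerance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.uniform(0.0, 2.0 * np.pi, 1_000_000)
x1, x2 = np.cos(theta), np.sin(theta)     # projections of a point on the unit circle

print(np.cov(x1, x2)[0, 1])               # ~0: x1 and x2 are uncorrelated
# Dependence: given x2 near 0.6, x1 is pinned near +/- 0.8, so its conditional
# spread differs from its marginal spread even though the covariance is zero.
mask = np.abs(x2 - 0.6) < 0.01
print(np.std(x1), np.std(x1[mask]))
```
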
B MATRIX DERIVATIVES

If f(X) is a scalar function of a matrix X, the matrix derivative ∂f(X)/∂X is defined as the matrix

    [ ∂f(X)/∂X(1,1)  ...  ∂f(X)/∂X(1,n) ]
    [      ...       ...       ...      ]
    [ ∂f(X)/∂X(n,1)  ...  ∂f(X)/∂X(n,n) ]

Lemma B.1. Let X be an m × n matrix, a be an m × 1 vector, and b be an n × 1 vector.

    ∂(a^T X b)/∂X = a b^T                                       (45)

    ∂(a^T X^T X b)/∂X = X(a b^T + b a^T)                        (46)

    ∂(trace(X B X^T))/∂X = X B^T + X B                          (47)

See Petersen and Pedersen for a proof [23].

C PROOF OF THEOREM 4.1

Theorem 4.1, which is reproduced below for convenience, can be proved using matrix derivatives.

Theorem. Let pairwise uncorrelated estimates x_i (1≤i≤n) drawn from distributions p_i(x) = (µ_i, Σ_i) be fused using the linear model y_A(x_1, .., x_n) = Σ_{i=1}^{n} A_i x_i, where Σ_{i=1}^{n} A_i = I. The MSE(y_A) is minimized for

    A_i = (Σ_{j=1}^{n} Σ_j^{-1})^{-1} Σ_i^{-1}.

Proof. To use the Lagrange multiplier approach, we can convert the constraint Σ_{i=1}^{n} A_i = I into a set of m^2 scalar equations (for example, the first equation would be A_1(1,1) + A_2(1,1) + .. + A_n(1,1) = 1), and then introduce m^2 Lagrange multipliers, which can be denoted by λ(1,1), ..., λ(m,m).

This obscures the matrix structure of the problem, so it is better to implement this idea implicitly. Let Λ be an m×m matrix in which each entry is one of the scalar Lagrange multipliers we would have introduced in the approach described above. Analogous to the inner product of vectors, we can define the inner product of two matrices as <A, B> = trace(A^T B) (it is easy to see that <A, B> is Σ_{i=1}^{m} Σ_{j=1}^{m} A(i,j)B(i,j)). Using this notation, we can formulate the optimization problem using Lagrange multipliers as follows:

    f(A_1, ..., A_n) = E[ Σ_{i=1}^{n} (x_i − µ_i)^T A_i^T A_i (x_i − µ_i) ] + <Λ, Σ_{i=1}^{n} A_i − I>

Taking the matrix derivative of f with respect to each A_i and setting each derivative to zero to find the optimal values of A_i gives us the equation E[2 A_i (x_i − µ_i)(x_i − µ_i)^T] + Λ = 0. This equation can be written as 2 A_i Σ_i + Λ = 0, which implies

    A_1 Σ_1 = A_2 Σ_2 = ... = A_n Σ_n = −Λ/2

Using the constraint that the sum of all A_i equals the identity matrix I gives us the desired expression for A_i:

    A_i = (Σ_{j=1}^{n} Σ_j^{-1})^{-1} Σ_i^{-1}

D PROOF OF THE OPTIMALITY OF EQUATION 40

We show that x̂_{t|t} = x̂_{t|t−1} + K_t (z_t − H_t x̂_{t|t−1}) (Equation 40) is an optimal unbiased linear estimator for fusing the a priori state estimate with the measurement at each step. The proof has two steps: we show that this estimator is unbiased, and then show it is optimal.

Unbiased condition: We prove a more general result that characterizes unbiased linear estimators for this problem, assuming that the prediction stage (Figure 4(d)) is unchanged. The general form of the linear estimator for computing the a posteriori state estimate is

    ŷ_{t|t} = A_t ∗ ŷ_{t|t−1} + B_t ∗ z_t                       (48)

It is unbiased if E[ŷ_{t|t}] = E[x_t], and we show that this is true if A_t = I − B_t ∗ H_t.

The proof is by induction on t. By assumption, E[ŷ_{0|0}] = E[x_0]. Assume inductively that E[ŷ_{t−1|t−1}] = E[x_{t−1}].

(a) We first prove that the predictor is unbiased.

    ŷ_{t|t−1} = F_t ∗ ŷ_{t−1|t−1} + B_t ∗ u_t                   (Predictor in Figure 4)
    E[ŷ_{t|t−1}] = F_t ∗ E[ŷ_{t−1|t−1}] + B_t ∗ u_t
                 = F_t ∗ E[x_{t−1}] + B_t ∗ u_t                  (By the inductive assumption)
                 = E[F_t ∗ x_{t−1} + B_t ∗ u_t]
                 = E[F_t ∗ x_{t−1} + B_t ∗ u_t + w_t]            (w_t is zero mean)
                 = E[x_t]                                        (From state evolution Equation 34)

(b) We prove that the estimator in Equation 48 is unbiased if A_t = I − B_t ∗ H_t.

    E[ŷ_{t|t}] = E[A_t ∗ ŷ_{t|t−1} + B_t ∗ z_t]                 (From Equation 48)
              = A_t ∗ E[ŷ_{t|t−1}] + B_t ∗ E[z_t]
              = A_t ∗ E[x_t] + B_t ∗ E[H_t x_t + v_t]            (Equation 37 for z_t)
              = A_t ∗ E[x_t] + B_t ∗ H_t ∗ E[x_t]                (Because v_t is zero mean)
              = (A_t + B_t ∗ H_t) ∗ E[x_t]

The estimator is unbiased if (A_t + B_t ∗ H_t) = I, which is equivalent to requiring that A_t = I − B_t ∗ H_t. Therefore the general unbiased linear estimator is of the form

    ŷ_{t|t} = (I − B_t ∗ H_t) ∗ ŷ_{t|t−1} + B_t ∗ z_t
            = ŷ_{t|t−1} + B_t ∗ (z_t − H_t ∗ ŷ_{t|t−1})          (49)

Since Equation 40 is of this form, it is an unbiased linear estimator.

Optimality: We now show that using B_t = K_t at each step is optimal, assuming that this is done at all time steps before t. Since z_t and y_{t|t−1} are uncorrelated, we can use Lemma 2.2 to compute the covariance matrix of ŷ_{t|t}, denoted by Ξ_{t|t}. This gives Ξ_{t|t} = (I − B_t H_t) Σ_{t|t−1} (I − B_t H_t)^T + B_t R_t B_t^T. The MSE is the trace of this matrix, and we need to find B_t that minimizes this trace. Using matrix derivatives (Equation 47), we see that

    ∂(trace(Ξ_{t|t}))/∂B_t = −2(I − B_t H_t) Σ_{t|t−1} H_t^T + 2 ∗ B_t R_t

Setting this to zero and solving for B_t gives B_t = Σ_{t|t−1} H_t^T [H_t Σ_{t|t−1} H_t^T + R_t]^{-1}. This is exactly K_t, proving that Equation 40 is an optimal unbiased linear estimator.

Comment: This proof of optimality provides another way of deriving Equation 40. We believe the constructive approach described in Section 6 provides more insight into how and why Kalman filtering works. The ideas in that construction can be used to provide a different proof of optimality.
