Regression vs. Kalman Filter
Simo Särkkä
Lecture 2: From Linear Regression to Kalman Filter and Beyond
Contents
2 Multidimensional Models
3 Non-Linear Models
6 Dynamic Models
7 Summary
Introduction: Correlation of Time Series
[Figure: two time series plotted over t = 0, ..., 100; the lower panel shows y(t).]
Introduction: Correlation of Time Series (cont.)
[Figure: two scatter plots of y versus x for x = 1, ..., 10; the fitted line is y = 0.11 x + 0.17 with R² = 0.9731.]
Least Squares Solution to Linear Regression [1/3]
[Figure: scatter plot of the data points (x(i), y(i)) for x = 1, ..., 10.]
The data are modeled with the linear model
$$y^{(i)} = a\,x^{(i)} + b, \qquad i = 1, \ldots, n.$$
Least Squares Solution to Linear Regression [2/3]
The least squares solution minimizes the mean squared error
$$S(a,b) = \frac{1}{n} \sum_i \bigl(y^{(i)} - a\,x^{(i)} - b\bigr)^2 = \mathrm{E}\bigl[(y - a\,x - b)^2\bigr].$$
Setting the derivatives with respect to a and b to zero gives
$$a = \frac{\mathrm{E}[x\,y] - \mathrm{E}[x]\,\mathrm{E}[y]}{\mathrm{E}[x^2] - \mathrm{E}[x]^2}, \qquad b = \mathrm{E}[y] - a\,\mathrm{E}[x],$$
Least Squares Solution to Linear Regression [3/3]
where
$$\mathrm{E}[x\,y] = \frac{1}{n} \sum_i x^{(i)} y^{(i)}, \qquad \mathrm{E}[x^2] = \frac{1}{n} \sum_i x^{(i)} x^{(i)},$$
$$\mathrm{E}[x] = \frac{1}{n} \sum_i x^{(i)}, \qquad \mathrm{E}[y] = \frac{1}{n} \sum_i y^{(i)}.$$
Correlation coefficient R
Substituting b = E[y] − a E[x] into the model gives
$$y - \mathrm{E}[y] = a\,(x - \mathrm{E}[x]),$$
that is, the fitted line passes through the point (E[x], E[y]). The correlation coefficient of x and y is
$$R = \frac{\mathrm{E}[x\,y] - \mathrm{E}[x]\,\mathrm{E}[y]}{\sqrt{(\mathrm{E}[x^2] - \mathrm{E}[x]^2)\,(\mathrm{E}[y^2] - \mathrm{E}[y]^2)}}.$$
Correlation coefficient R (cont)
Coefficient of determination R²
At the least squares solution, the mean squared error becomes
$$S(a,b) = \frac{1}{n} \sum_i \bigl(y^{(i)} - a\,x^{(i)} - b\bigr)^2
= \mathrm{E}[y^2] - \mathrm{E}[y]^2 - \frac{(\mathrm{E}[x\,y] - \mathrm{E}[x]\,\mathrm{E}[y])^2}{\mathrm{E}[x^2] - \mathrm{E}[x]^2}.$$
Coefficient of determination R² (cont)
The proportion of the variance of y that the fit explains, that is, the coefficient of determination, can be computed as
$$\frac{\mathrm{Var}[y] - S(a,b)}{\mathrm{Var}[y]}
= 1 - \frac{S(a,b)}{\mathrm{E}[y^2] - \mathrm{E}[y]^2}
= \frac{(\mathrm{E}[x\,y] - \mathrm{E}[x]\,\mathrm{E}[y])^2}{(\mathrm{E}[x^2] - \mathrm{E}[x]^2)\,(\mathrm{E}[y^2] - \mathrm{E}[y]^2)}.$$
Comparing to the correlation coefficient expression reveals that
$$\frac{\mathrm{Var}[y] - S(a,b)}{\mathrm{Var}[y]} = R^2.$$
That is, the coefficient of determination is the square of the correlation coefficient.
This definition of the coefficient of determination also works with non-linear and multidimensional models.
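As a concrete illustration of the formulas above, the fit and R² can be computed directly from the sample moments. A minimal Python sketch; the data, noise level, and coefficients are illustrative assumptions rather than the lecture's data:

```python
import numpy as np

# Illustrative data (assumed): a noisy linear relationship.
rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)
y = 0.11 * x + 0.17 + 0.05 * rng.standard_normal(x.size)

# Empirical expectations E[.] (sample means) as defined above.
Ex, Ey = x.mean(), y.mean()
Exy, Exx, Eyy = (x * y).mean(), (x * x).mean(), (y * y).mean()

# Least squares slope and intercept.
a = (Exy - Ex * Ey) / (Exx - Ex ** 2)
b = Ey - a * Ex

# Mean squared error of the fit and coefficient of determination.
S = np.mean((y - a * x - b) ** 2)
var_y = Eyy - Ey ** 2
R2 = (var_y - S) / var_y   # equals the squared correlation coefficient
print(a, b, R2)
```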
Cautions on Interpretation of Correlations
Effect of Delay on Correlations [1/2]
[Figure: two time series plotted over t = 0, ..., 100, one delayed relative to the other; the lower panel shows y(t).]
Effect of Delay on Correlations [2/2]
[Figure: left, scatter plot of y versus x with fitted line y = 0.06 x + 0.41 and R² = 0.3016; right, the coefficient of determination R² as a function of the delay (0 to 20).]
Multidimensional linear models [1/3]
Multidimensional linear models [2/3]
The model and the mean squared error can now be written as
$$Y = H\,\theta,$$
$$S(\theta) = \frac{1}{n}\,(Y - H\theta)^T (Y - H\theta).$$
Multidimensional linear models [3/3]
$$\nabla S(\theta) = \frac{1}{n}\,\bigl[-2\,H^T Y + 2\,H^T H\,\theta\bigr] = 0.$$
The resulting least squares estimate is
$$\theta = (H^T H)^{-1} H^T Y.$$
The coefficient of determination is again
$$R^2 = \frac{\mathrm{Var}[y] - S(\theta)}{\mathrm{Var}[y]}.$$
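A minimal Python sketch of this matrix formulation; the two-input data set and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
Y = 0.10 * x1 + 0.30 * x2 + 0.01 * rng.standard_normal(n)

# Design matrix H: one row per measurement, one column per parameter.
H = np.column_stack([x1, x2])

# Least squares estimate; lstsq solves theta = (H^T H)^{-1} H^T Y stably.
theta, *_ = np.linalg.lstsq(H, Y, rcond=None)

# Coefficient of determination R^2 = (Var[y] - S(theta)) / Var[y].
S = np.mean((Y - H @ theta) ** 2)
R2 = (np.var(Y) - S) / np.var(Y)
print(theta, R2)
```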
Example: Dependence of Three Signals [1/3]
[Figure: three time series x1(t), x2(t), and y(t) plotted over t = 0, ..., 1500.]
Example: Dependence of Three Signals [2/3]
[Figure: the data y(t) and the fitted model y = 0.10 x1 + 0.30 x2, with R² = 0.9918.]
Example: Dependence of Three Signals [3/3]
The scatter plot of (x1, x2, y) reveals that the time series lie on a plane in 3D:
[Figure: 3D scatter plot of (x1, x2, y); the points lie on a plane.]
Linear-in-Parameters Models
$$y = a_1\,x + a_2\,x^2 + \cdots + a_d\,x^d + b,$$
which is still linear in the parameters (a_1, ..., a_d, b).
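Because the model is linear in the parameters, the same least squares machinery applies once the powers of x are collected into a design matrix. A short Python sketch, with the degree and the data chosen purely for illustration:

```python
import numpy as np

def polynomial_lsq(x, y, d):
    """Least squares fit of y = a_1 x + ... + a_d x^d + b."""
    # Columns x, x^2, ..., x^d plus a constant column for the intercept b.
    H = np.column_stack([x ** k for k in range(1, d + 1)] + [np.ones_like(x)])
    theta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return theta  # (a_1, ..., a_d, b)

x = np.linspace(0.0, 6.0, 50)
y = np.sin(x) + 0.1 * np.random.default_rng(0).standard_normal(x.size)
print(polynomial_lsq(x, y, d=6))
```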
Nonlinear Models [1/3]
The advantage of linearity is that it leads to easy mathematics.
But linearity can be a restriction from the modeling point of view.
We can also use general non-linear models of the form
$$y = f(x;\, \theta),$$
Nonlinear Models [2/3]
Nonlinear Models [3/3]
Example: Approximation of Sine
[Figure: polynomial fit to sine data on x = 0, ..., 6; fitted polynomial y = −0.00032 x⁶ + 0.00072 x⁵ + 0.04 x⁴ − 0.22 x³ − 0.046 x² + 1.1 x + 0.16 with R² = 0.9960.]
Over-fitting and Regularization [1/3]
[Figure: two polynomial fits to data on x = 0, ..., 10.]
Over-fitting and Regularization [2/3]
$$S_r(\theta) = \frac{1}{n}\,(Y - H\theta)^T (Y - H\theta) + \lambda\,|\theta|^2.$$
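Setting the gradient of S_r(θ) to zero gives the closed-form solution θ = (HᵀH + nλI)⁻¹HᵀY. A minimal Python sketch, assuming a degree-10 polynomial design matrix and an illustrative λ; as in the formula above, the constant term is penalized together with the other parameters:

```python
import numpy as np

def ridge_lsq(H, Y, lam):
    """Minimize (1/n)||Y - H theta||^2 + lam * |theta|^2 in closed form."""
    n, d = H.shape
    return np.linalg.solve(H.T @ H + n * lam * np.eye(d), H.T @ Y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = np.sin(x) + 0.2 * rng.standard_normal(x.size)
# Degree-10 polynomial design matrix; scaling x to [0, 1] keeps it well conditioned.
H = np.column_stack([(x / 10.0) ** k for k in range(0, 11)])
theta_reg = ridge_lsq(H, y, lam=1e-3)   # lam tunes the effective polynomial order
```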
Over-fitting and Regularization [3/3]
The parameter λ can be used for tuning the effective order of the polynomial from, say, 10 down to 0 (λ = 0, ..., ∞).
It is also possible to optimize the parameter λ using information criteria (AIC, BIC, DIC, ...) or by cross-validation.
The polynomial order itself can also be used as a regularization parameter and estimated by information criteria or cross-validation.
In the case of MLPs, the number of hidden units (and parameters) can similarly be used as a regularization parameter.
A general class of cost terms is given by the Tikhonov regularizers:
$$C(\theta) = \int \bigl|L\,f(x;\, \theta)\bigr|^2\, dx,$$
where L is a linear operator (for example, a differential operator).
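As an illustration (not spelled out above), for a linear-in-parameters model f(x; θ) = Σ_j θ_j φ_j(x) the Tikhonov cost is again a quadratic penalty in θ:
$$C(\theta) = \int \Bigl|\, L \sum_j \theta_j\, \varphi_j(x) \Bigr|^2 dx = \theta^T R\, \theta, \qquad R_{ij} = \int (L\varphi_i)(x)\,(L\varphi_j)(x)\, dx,$$
so the penalty λ|θ|² used above corresponds to the special case R = λI.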
Example: Regularization of Polynomial Fit
[Figure: regularized polynomial fit to the data on x = 0, ..., 10, R² = 0.8039.]
Example: Regularization of Polynomial Fit (cont)
[Figure: a second regularized polynomial fit, R² = 0.8077.]
Cautions on Practical Use of Non-Linear Models
Input Selection in Multi-Linear Models
Model Selection [1/2]
Model Selection [2/2]
Maximum Likelihood
The linear regression can be equivalently formulated as the stochastic model
$$y^{(i)} \sim \mathrm{N}\bigl(a\,x^{(i)} + b,\, \sigma^2\bigr),$$
where
$$\mathrm{N}(y \mid m, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-m)^2/(2\sigma^2)}.$$
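To connect this formulation with the least squares fit, note that the negative log-likelihood of the n independent measurements is
$$-\log p(Y \mid a, b) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_i \bigl(y^{(i)} - a\,x^{(i)} - b\bigr)^2 = \mathrm{const} + \frac{n}{2\sigma^2}\, S(a, b),$$
so maximizing the likelihood is equivalent to minimizing the least squares cost S(a, b).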
Maximum Likelihood (cont)
Bayesian Data Analysis
In Bayesian analysis, the parameters are also considered as random variables with a prior distribution:
$$\theta \sim p(\theta).$$
This prior distribution can be, for example, a multidimensional Gaussian:
$$p(\theta) = \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\Bigl(-\frac{1}{2}\,(\theta - \mu)^T \Sigma^{-1} (\theta - \mu)\Bigr).$$
The measurements are modeled in the same manner as in ML estimation, e.g.:
$$p(y^{(i)} \mid \theta) = \mathrm{N}\bigl(y^{(i)} \mid f(x^{(i)};\, \theta),\, \sigma^2\bigr).$$
The joint distribution of all the measurements is now
$$p(Y \mid \theta) = \prod_i p(y^{(i)} \mid \theta).$$
In the linear Gaussian case the posterior distribution is also Gaussian, p(θ | Y) = N(θ | m, Σ), where:
The posterior mean m is the L2-regularized least squares solution.
The posterior covariance Σ is the covariance of the error in the mean.
Posterior Distribution [3/3]
Hierarchical Models
Hierarchical Models (cont)
Marginalization of Hyper-parameters
Watch out for the notation: this looks the same as the posterior with fixed variances, but it is not the same!
In the linear regression case, this marginal posterior distribution is a Student's t-distribution.
Typically Used Distribution Models
Gaussian, Monte Carlo and Other Approximations
Gaussian, Monte Carlo and Other Approximations (cont)
Batch Bayesian Estimation
1 Collect the measurement data y1, ..., yT.
2 Specify the measurement model (likelihood): p(yk | θ).
3 Specify the prior distribution: p(θ).
4 Compute the posterior distribution:
$$p(\theta \mid Y) = \frac{1}{Z}\, p(\theta) \prod_k p(y_k \mid \theta).$$
Bayesian Batch Linear Regression [1/4]
[Figure: the data as a time series y(t) over t = 0, ..., 100 and as a scatter plot of y versus x, x = 1, ..., 10.]
Bayesian Batch Linear Regression [2/4]
Measurement data: (xk, yk) for k = 1, ..., T.
Likelihood (σ² given):
$$p(y_k \mid \theta) = \mathrm{N}\bigl(y_k \mid a\,x_k + b,\, \sigma^2\bigr), \qquad \theta = (a, b).$$
Prior:
$$p(\theta) = \mathrm{N}(\theta \mid m_0, P_0).$$
Bayesian Batch Linear Regression [3/4]
Because the prior p(θ) and the likelihood p(Y | θ) are Gaussian, the posterior is also Gaussian:
$$p(\theta \mid Y) = \mathrm{N}(\theta \mid m_T, P_T).$$
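The expressions for m_T and P_T follow from the standard Gaussian conjugate update; a Python sketch of that standard result, where the data, noise level, and prior values are illustrative assumptions:

```python
import numpy as np

def bayes_linreg_batch(x, y, sigma2, m0, P0):
    """Posterior N(theta | mT, PT) for y_k = a x_k + b + noise, theta = (a, b)."""
    H = np.column_stack([x, np.ones_like(x)])          # rows H_k = [x_k, 1]
    P0_inv = np.linalg.inv(P0)
    # Standard Gaussian conjugate update:
    #   PT = (P0^{-1} + H^T H / sigma2)^{-1}
    #   mT = PT (P0^{-1} m0 + H^T y / sigma2)
    PT = np.linalg.inv(P0_inv + H.T @ H / sigma2)
    mT = PT @ (P0_inv @ m0 + H.T @ y / sigma2)
    return mT, PT

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)
y = 0.11 * x + 0.17 + 0.05 * rng.standard_normal(x.size)
mT, PT = bayes_linreg_batch(x, y, sigma2=0.05 ** 2,
                            m0=np.zeros(2), P0=np.eye(2))
```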
Bayesian Batch Linear Regression [4/4]
[Figure: the resulting linear fit y = 0.11 x + 0.17 plotted against the data, with R² = 0.9731.]
Bayesian Recursive Linear Regression [1/4]
Bayesian Recursive Linear Regression [2/4]
Bayesian Recursive Linear Regression [3/4]
[Figure: estimated parameters a and b as a function of step number, recursive estimates compared with the batch estimates.]
Bayesian Recursive Linear Regression [4/4]
[Figure: log-scale plot (values 10⁻³ to 10⁰) as a function of step number, 0, ..., 100.]
Kalman Filtering [1/7]
If the regression parameters θk = (ak, bk) are allowed to change in time according to a Gaussian random walk, the model becomes
$$\theta_k = \theta_{k-1} + w_k$$
$$p(y_k \mid \theta_k) = \mathrm{N}(y_k \mid a_k\,x_k + b_k,\, \sigma^2)$$
$$p(\theta_k \mid \theta_{k-1}) = \mathrm{N}(\theta_k \mid \theta_{k-1},\, Q)$$
$$p(\theta_0) = \mathrm{N}(\theta_0 \mid m_0, P_0).$$
Kalman Filtering [2/7]
Kalman Filtering [3/7]
Prediction:
$$m_k^- = m_{k-1}$$
$$P_k^- = P_{k-1} + Q.$$
Update:
$$S_k = H_k\, P_k^-\, H_k^T + \sigma^2$$
$$K_k = P_k^-\, H_k^T\, S_k^{-1}$$
$$m_k = m_k^- + K_k\,\bigl[y_k - H_k\, m_k^-\bigr]$$
$$P_k = P_k^- - K_k\, S_k\, K_k^T.$$
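A compact Python sketch of these prediction and update steps for the drifting regression parameters θk = (ak, bk); the simulated data, Q, σ², and prior values are illustrative assumptions:

```python
import numpy as np

def kalman_regression(xs, ys, sigma2, Q, m0, P0):
    """Kalman filter for theta_k = theta_{k-1} + w_k, y_k = a_k x_k + b_k + noise."""
    m, P, means = m0.copy(), P0.copy(), []
    for xk, yk in zip(xs, ys):
        # Prediction step (random walk dynamics).
        m_pred, P_pred = m, P + Q
        # Update step with time-varying measurement matrix H_k = [x_k, 1].
        Hk = np.array([[xk, 1.0]])
        Sk = Hk @ P_pred @ Hk.T + sigma2          # innovation variance (1x1)
        Kk = P_pred @ Hk.T / Sk                   # Kalman gain (2x1)
        m = m_pred + (Kk * (yk - Hk @ m_pred)).ravel()
        P = P_pred - Kk @ Sk @ Kk.T
        means.append(m)
    return np.array(means)

rng = np.random.default_rng(0)
n = 200
xs = rng.uniform(0.0, 20.0, n)
a_true = 0.1 + 0.002 * np.arange(n)               # slowly drifting slope
ys = a_true * xs + 0.2 + 0.5 * rng.standard_normal(n)
theta_hist = kalman_regression(xs, ys, sigma2=0.5 ** 2, Q=1e-4 * np.eye(2),
                               m0=np.zeros(2), P0=np.eye(2))
```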
Kalman Filtering [4/7]
[Figure: the data as a time series y(t) over t = 0, ..., 200 and as a scatter plot of y versus x, x = 0, ..., 20.]
Kalman Filtering [5/7]
[Figure: estimated parameters a and b as a function of step number, Kalman filter estimates compared with the batch estimates.]
Kalman Filtering [6/7]
[Figure: two signal panels plotted over steps 0, ..., 200.]
Kalman Filtering [7/7]
The general linear Gaussian state space model is
$$x_k = A\,x_{k-1} + q_k, \qquad q_k \sim \mathrm{N}(0, Q)$$
$$y_k = H\,x_k + r_k, \qquad r_k \sim \mathrm{N}(0, R)$$
$$x_0 \sim \mathrm{N}(m_0, P_0),$$
or, in probabilistic notation,
$$p(y_k \mid x_k) = \mathrm{N}(y_k \mid H\,x_k,\, R)$$
$$p(x_k \mid x_{k-1}) = \mathrm{N}(x_k \mid A\,x_{k-1},\, Q).$$
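To connect back to the regression example above: the drifting linear regression is a special case of this form, with the state taken to be θk = (ak, bk), transition matrix A = I, process noise covariance Q, a time-varying measurement matrix Hk = (xk  1), and measurement noise variance R = σ².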
Probabilistic Non-Linear Filtering [1/2]
$$x_k = f(x_{k-1}, q_k)$$
$$y_k = h(x_k, r_k).$$
Probabilistic Non-Linear Filtering [2/2]
$$\frac{dx}{dt} = f(x, t) + w(t),$$
where w(t) is a continuous-time Gaussian white noise process.
Approximation methods: Extended Kalman filters,
Unscented Kalman filters, sequential Monte Carlo, particle
filters.
Summary