Nonlinear System Identification

System identification is a method of building a mathematical model of a system from
measurements of the system inputs and outputs. It applies to any system whose inputs and outputs
can be measured, including industrial processes, control systems,
economic data, biology and the life sciences, medicine, social systems and many more.

A nonlinear system is defined as any system that is not linear, that is, any system that does not satisfy the
superposition principle. This negative definition tends to obscure the fact that there are very many different types of
nonlinear systems. Historically, system identification for nonlinear systems[1][2] has developed by focusing
on specific classes of system and can be broadly categorized into five basic approaches, each defined by a
model class:

1. Volterra series models,
2. Block-structured models,
3. Neural network models,
4. NARMAX models, and
5. State-space models.

System identification proceeds in four steps: data gathering, model postulation, parameter
identification, and model validation. Data gathering comes first and is essential, since the data are the
input to the model that is prepared later. It consists of selecting an appropriate data set, pre-processing,
and processing, and it involves implementing known algorithms
together with the transcription of flight tapes, data storage and data management, calibration, processing,
analysis, and presentation. Model validation is necessary to gain confidence in, or reject, a
particular model. Parameter estimation and model validation in particular are integral parts of
system identification. Validation refers to the process of confirming the conceptual model and
demonstrating an adequate correspondence between the computational results of the model and the actual
process.

Volterra series methods

The early work was dominated by methods based on the Volterra series, which in the discrete-time case can
be expressed as

$$y(k) = h_0 + \sum_{m_1=1}^{M} h_1(m_1)\,u(k-m_1) + \sum_{m_1=1}^{M}\sum_{m_2=1}^{M} h_2(m_1,m_2)\,u(k-m_1)\,u(k-m_2) + \cdots$$

where u(k), y(k); k = 1, 2, 3, ... are the measured input and output respectively and $h_\ell(m_1, \ldots, m_\ell)$ is the
lth-order Volterra kernel, or lth-order nonlinear impulse response. The Volterra series is an extension of the
linear convolution integral. Most of the earlier identification algorithms assumed that just the first two, linear
and quadratic, Volterra kernels are present and used special inputs such as Gaussian white noise and
correlation methods to identify the two Volterra kernels. In most of these methods the input has to be
Gaussian and white, which is a severe restriction for many real processes. These results were later extended
to include the first three Volterra kernels, to allow different inputs, and to cover related developments including
the Wiener series. A very important body of work was developed by Wiener, Lee, Bose and colleagues at
MIT from the 1940s to the 1960s, including the famous Lee and Schetzen method.[4][5] While these
methods are still actively studied today, they have several basic restrictions. These include the necessity of
knowing the number of Volterra series terms a priori, the use of special inputs, and the large number of
estimates that have to be identified. For example, for a system where the first-order Volterra kernel is
described by, say, 30 samples, 30x30 points will be required for the second-order kernel, 30x30x30 for the
third-order kernel, and so on, and hence the amount of data required to provide good estimates becomes excessively
large.[6] These numbers can be reduced by exploiting certain symmetries, but the requirements are still
excessive irrespective of what algorithm is used for the identification.
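
The growth in kernel size can be made concrete with a minimal Python sketch. It assumes a series truncated after the quadratic kernel, lags running from 1 to M, and zero input before the start of the record; the function names `volterra2_output` and `kernel_sizes` are illustrative and not taken from any library.

```python
import numpy as np
from math import comb


def volterra2_output(u, h0, h1, h2):
    """Output of a Volterra series truncated after the quadratic kernel.

    h1 has shape (M,), h2 has shape (M, M). Lags run from 1 to M and
    inputs before the start of the record are assumed to be zero.
    """
    u = np.asarray(u, dtype=float)
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    M = len(h1)
    y = np.full(len(u), float(h0))
    for k in range(len(u)):
        # past[m - 1] holds u(k - m) for m = 1..M
        past = np.array([u[k - m] if k - m >= 0 else 0.0 for m in range(1, M + 1)])
        y[k] += h1 @ past + past @ h2 @ past
    return y


def kernel_sizes(M, order):
    """Full and symmetry-reduced coefficient counts for one kernel."""
    full = M ** order
    # A symmetric kernel only needs one value per multiset of `order` lags.
    unique = comb(M + order - 1, order)
    return full, unique
```

With M = 30, `kernel_sizes(30, 2)` gives 900 full and 465 symmetry-reduced coefficients, and `kernel_sizes(30, 3)` gives 27000 and 4960, which illustrates why symmetry helps but does not remove the data requirement.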

Block-structured systems
Because of the problems of identifying Volterra models, other model forms were investigated as a basis for
system identification for nonlinear systems. Various forms of block-structured nonlinear models have been
introduced or re-introduced.[6][7] The Hammerstein model consists of a static single-valued nonlinear
element followed by a linear dynamic element.[8] The Wiener model is the reverse of this combination, so
that the linear element occurs before the static nonlinear characteristic.[9] The Wiener-Hammerstein model
consists of a static nonlinear element sandwiched between two dynamic linear elements.
The Hammerstein-Wiener model consists of a linear dynamic block sandwiched
between two static nonlinear blocks.[10] Several other model forms are available. The Urysohn model [11][12] differs
from the other block models: it does not consist of a sequence of linear and nonlinear blocks, but describes both
dynamic and static nonlinearities in the expression of the kernel of an operator.[13] All these models can be
represented by a Volterra series, but the Volterra kernels take on a special form in each case. Identification
methods are either correlation-based or based on parameter estimation. The correlation methods exploit certain
properties of these systems, which means that if specific inputs are used, often white Gaussian noise, the
individual elements can be identified one at a time. This results in manageable data requirements, and the
individual blocks can sometimes be related to components in the system under study.
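
The difference between the Hammerstein and Wiener structures is only the order of the two blocks, which a short Python sketch can show. It assumes, purely for illustration, that the linear dynamic element is a causal FIR filter; the function names are hypothetical.

```python
import numpy as np


def fir_filter(x, b):
    """Linear dynamic block (assumed FIR): y(k) = sum_i b[i] * x(k - i)."""
    return np.convolve(x, b)[: len(x)]


def hammerstein(u, nonlinearity, b):
    """Static nonlinearity followed by the linear dynamic block."""
    return fir_filter(nonlinearity(np.asarray(u, dtype=float)), b)


def wiener(u, b, nonlinearity):
    """Linear dynamic block followed by the static nonlinearity."""
    return nonlinearity(fir_filter(np.asarray(u, dtype=float), b))
```

Feeding the same input, filter and nonlinearity to both functions generally gives different outputs, because a static nonlinearity does not commute with a dynamic linear block.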

More recent results are based on parameter estimation and neural network based solutions. Many results
have been introduced and these systems continue to be studied in depth. One problem is that these methods
are only applicable to a very special form of model in each case and usually this model form has to be
known prior to identification.

Neural networks
Artificial neural networks try loosely to imitate the network of neurons in the brain where computation
takes place through a large number of simple processing elements. A typical neural network consists of a
number of simple processing units interconnected to form a complex network. Layers of such units are
arranged so that data is entered at the input layer and passes through either one or several intermediate
layers before reaching the output layer. In supervised learning the network is trained by operating on the
difference between the actual output and the desired output of the network, the prediction error, to change
the connection strengths between the nodes. The weights are modified iteratively until the output error
reaches an acceptable level. This process is called machine learning because the network adjusts the
weights so that the output pattern is reproduced. Neural networks have been extensively studied and there
are many excellent textbooks devoted to this topic in general,[1][14] and more focused textbooks which
emphasise control and systems applications.[1][15] There are two main problem types that can be studied
using neural networks: static problems, and dynamic problems. Static problems include pattern recognition,
classification, and approximation. Dynamic problems involve lagged variables and are more appropriate for
system identification and related applications. Depending on the architecture of the network the training
problem can be either nonlinear-in-the-parameters which involves optimisation or linear-in-the-parameters
which can be solved using classical approaches. The training algorithms can be categorised into supervised,
unsupervised, or reinforcement learning. Neural networks have excellent approximation properties but
these are usually based on standard function approximation results using for example the Weierstrass
Theorem that applies equally well to polynomials, rational functions, and other well-known models. Neural
networks have been applied extensively to system identification problems which involve nonlinear and
dynamic relationships. However, classical neural networks are purely static approximating machines.
There are no dynamics within the network. Hence when fitting dynamic models all the dynamics arise by
allocating lagged inputs and outputs to the input layer of the network. The training procedure then produces
the best static approximation that relates the lagged variables assigned to the input nodes to the output.
There are more complex network architectures, including recurrent networks,[1] that produce dynamics by
introducing increasing orders of lagged variables to the input nodes. But in these cases it is very easy to
over-specify the lags, and this can lead to overfitting and poor generalisation properties. Neural networks
have several advantages: they are conceptually simple, easy to train and to use, and have excellent
approximation properties, and their local and parallel processing provides
integrity and fault-tolerant behaviour. The biggest criticism of the classical neural network models is that the
models produced are completely opaque and usually cannot be written down or analysed. It is therefore
very difficult to know what is causing what, to analyse the model, or to compute dynamic characteristics
from the model. Some of these points will not be relevant to all applications but they are for dynamic
modelling.
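
The way dynamics enter a static network through lagged variables can be sketched in Python. The helper below is hypothetical and assumes lags start at 1; it only builds the input rows and targets that a static approximator would then be trained on.

```python
import numpy as np


def lagged_regressors(u, y, n_u, n_y):
    """Build rows [y(k-1..k-n_y), u(k-1..k-n_u)] with targets y(k)."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    start = max(n_u, n_y)  # first k for which every lag is available
    rows = []
    for k in range(start, len(y)):
        past_y = [y[k - i] for i in range(1, n_y + 1)]
        past_u = [u[k - i] for i in range(1, n_u + 1)]
        rows.append(past_y + past_u)
    return np.array(rows), y[start:]
```

Each row is one static input pattern, so increasing `n_u` or `n_y` widens the input layer; this is where over-specified lags inflate the model.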

NARMAX methods
The nonlinear autoregressive moving average model with exogenous inputs (NARMAX model) can
represent a wide class of nonlinear systems,[2] and is defined as

$$y(k) = F[y(k-1), y(k-2), \ldots, y(k-n_y), u(k-d), u(k-d-1), \ldots, u(k-d-n_u), e(k-1), e(k-2), \ldots, e(k-n_e)] + e(k)$$

where y(k), u(k) and e(k) are the system output, input, and noise sequences respectively; $n_y$, $n_u$, and $n_e$ are
the maximum lags for the system output, input and noise; F[•] is some nonlinear function; and d is a time delay
typically set to d = 1. The model is essentially an expansion of past inputs, outputs and noise terms. Because
the noise is modelled explicitly, unbiased estimates of the system model can be obtained in the presence of
unobserved highly correlated and nonlinear noise. The Volterra, the block structured models and many
neural network architectures can all be considered as subsets of the NARMAX model. Since NARMAX
was introduced, by proving what class of nonlinear systems can be represented by this model, many results
and algorithms have been derived based around this description. Most of the early work was based on
polynomial expansions of the NARMAX model. These are still the most popular methods today but other
more complex forms based on wavelets and other expansions have been introduced to represent severely
nonlinear and highly complex nonlinear systems. A significant proportion of nonlinear systems can be
represented by a NARMAX model, including systems with exotic behaviours such as chaos, bifurcations,
and subharmonics. While NARMAX started as the name of a model, it has now developed into a
philosophy of nonlinear system identification.[2] The NARMAX approach consists of several steps:

Structure detection: which terms are in the model
Parameter estimation: determine the model coefficients
Model validation: is the model unbiased and correct
Prediction: what is the output at some future time
Analysis: what are the dynamical properties of the system

Structure detection forms the most fundamental part of NARMAX. For example, a NARMAX model
which consists of one lagged input and one lagged output term, three lagged noise terms, expanded as a
cubic polynomial would consist of eighty two possible candidate terms. This number of candidate terms
arises because the expansion by definition includes all possible combinations within the cubic expansion.
Naively proceeding to estimate a model which includes all these terms and then pruning will cause
numerical and computational problems and should always be avoided. However, only a few terms are often
important in the model. Structure detection, which aims to select terms one at a time, is therefore critically
important. These objectives can easily be achieved by using the Orthogonal Least Squares [2] algorithm and
its derivatives to select the NARMAX model terms one at a time. These ideas can also be adapted for
pattern recognition and feature selection and provide an alternative to principal component analysis but with
the advantage that the features are revealed as basis functions that are easily related back to the original
problem.
NARMAX methods are designed to do more than find the best approximating model. System identification
can be divided into two aims. The first involves approximation where the key aim is to develop a model
that approximates the data set such that good predictions can be made. There are many applications where
this approach is appropriate, for example in time series prediction of the weather, stock prices, speech,
target tracking, pattern classification etc. In such applications the form of the model is not that important.
The objective is to find an approximation scheme which produces the minimum prediction errors. A second
objective of system identification, which includes the first objective as a subset, involves much more than
just finding a model to achieve the best mean squared errors. This second aim is why the NARMAX
philosophy was developed and is linked to the idea of finding the simplest model structure. The aim here is
to develop models that reproduce the dynamic characteristics of the underlying system, to find the simplest
possible model, and if possible to relate this to components and behaviours of the system under study. The
core aim of this second approach to identification is therefore to identify and reveal the rule that represents
the system. These objectives are relevant to model simulation and control systems design, but increasingly
also to applications in medicine, neuroscience, and the life sciences. Here the aim is to identify models, often
nonlinear, that can be used to understand the basic mechanisms of how these systems operate and behave
so that we can manipulate and utilise them. NARMAX methods have also been developed in the frequency
and spatio-temporal domains.

Stochastic nonlinear models

In a general situation, it might be the case that some exogenous uncertain disturbance passes through the
nonlinear dynamics and influences the outputs. A model class that is general enough to capture this situation
is the class of stochastic nonlinear state-space models. A state-space model is usually obtained using
first-principles laws,[16] such as mechanical, electrical, or thermodynamic physical laws, and the parameters to be
identified usually have some physical meaning or significance.

A discrete-time state-space model may be defined by the difference equations:

$$x_{t+1} = f(x_t, u_t, w_t; \theta)$$
$$y_t = g(x_t, u_t, v_t; \theta)$$

in which $t$ is a positive integer referring to time. The functions $f$ and $g$ are general nonlinear functions. The
first equation is known as the state equation and the second is known as the output equation. All the signals
are modeled using stochastic processes. The process $x_t$ is known as the state process; $w_t$ and $v_t$ are usually
assumed to be independent over time and mutually independent, such that $w_t \perp v_s$ for all $t, s$. The parameter $\theta$ is
usually a finite-dimensional (real) parameter to be estimated (using experimental data). Observe that the
state process does not have to be a physical signal, and it is normally unobserved (not measured). The data
set is given as a set of input-output pairs $(u_t, y_t)$ for $t = 1, \ldots, N$, for some finite positive integer value $N$.
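
A data set of this kind can be generated with a minimal Python simulation sketch. It assumes a scalar state, a state update of the form x_next = f(x, u, w, theta), an output of the form y = g(x, u, v, theta), and zero-mean Gaussian disturbances w and v; these call signatures and the function name are illustrative choices, not part of the model definition.

```python
import numpy as np


def simulate_state_space(f, g, theta, u, x0, w_std, v_std, seed=0):
    """Simulate a scalar stochastic state-space model over the inputs u."""
    rng = np.random.default_rng(seed)
    N = len(u)
    x = float(x0)
    xs, ys = np.empty(N), np.empty(N)
    for t in range(N):
        w = w_std * rng.standard_normal()  # process disturbance (assumed Gaussian)
        v = v_std * rng.standard_normal()  # measurement disturbance (assumed Gaussian)
        xs[t] = x
        ys[t] = g(x, u[t], v, theta)       # output equation
        x = f(x, u[t], w, theta)           # state equation
    return xs, ys
```

Only the inputs and `ys` would be available to an identification method; `xs` is returned here purely for inspection, since the state is normally unobserved.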

Unfortunately, due to the nonlinear transformation of unobserved random variables, the likelihood function
of the outputs is analytically intractable; it is given in terms of a multidimensional marginalization integral.
Consequently, commonly used parameter estimation methods such as the Maximum Likelihood Method or
the Prediction Error Method based on the optimal one-step ahead predictor[16] are analytically intractable.
Recently, algorithms based on sequential Monte Carlo methods have been used to approximate the
conditional mean of the outputs or, in conjunction with the Expectation-Maximization algorithm, to
approximate the maximum likelihood estimator.[17] These methods, albeit asymptotically optimal, are
computationally demanding and their use is limited to specific cases where the fundamental limitations of
the employed particle filters can be avoided. An alternative solution is to apply the prediction error method
using a sub-optimal predictor.[18][19][20] The resulting estimator can be shown to be strongly consistent and
asymptotically normal and can be evaluated using relatively simple algorithms.[21][20]
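
To give a feel for the sequential Monte Carlo idea, the following Python sketch estimates the log-likelihood of the outputs with a bootstrap particle filter. It is a simplified illustration and not the algorithm of the cited works: it assumes a scalar state, additive Gaussian process and measurement noise, a known initial state of zero, and multinomial resampling at every step. The function name and signatures are hypothetical.

```python
import numpy as np


def bootstrap_loglik(y, u, f, h, theta, w_std, v_std, n_particles=500, seed=0):
    """Particle estimate of log p(y_1..y_N | theta).

    Assumed model: x_next = f(x, u, w, theta) and y = h(x, u, theta) + v,
    with w ~ N(0, w_std^2) and v ~ N(0, v_std^2).
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_particles)  # assumed known initial state
    loglik = 0.0
    for t in range(len(y)):
        # Weight each particle by the Gaussian measurement density
        resid = y[t] - h(x, u[t], theta)
        logw = -0.5 * (resid / v_std) ** 2 - np.log(v_std * np.sqrt(2.0 * np.pi))
        m = logw.max()
        w = np.exp(logw - m)  # shifted for numerical stability
        loglik += m + np.log(w.mean())
        # Multinomial resampling, then propagation through the state equation
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        noise = w_std * rng.standard_normal(n_particles)
        x = f(x[idx], u[t], noise, theta)
    return loglik
```

Maximising this estimate over `theta` is one route to an approximate maximum likelihood estimate, but the estimate is noisy and each evaluation costs on the order of N times `n_particles` operations, which reflects the computational burden noted above.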

