EP1101359A1

EP1101359A1 - A video coding system

Info

Publication number: EP1101359A1
Application number: EP99938387A
Authority: EP
Inventors: Marta Karczewicz; Levent Oktem
Original assignee: Nokia Mobile Phones Ltd; Nokia Inc
Current assignee: Nokia Oyj
Priority date: 1998-07-29
Filing date: 1999-07-27
Publication date: 2001-05-23
Also published as: GB9816540D0; AU5290299A; WO2000007375A1; GB2340327A

Abstract

A motion estimation method and system for a video coder are disclosed. The system comprises an input for a video image (6) to be coded. It also comprises a series of motion estimators of varying complexity, for estimating a motion vector field between the received image (6) and a reference image (5). The subsequent motion estimator in the series is selected by a control means (50) if a prediction error associated with the motion vector field estimated by the currently selected motion estimator exceeds a predetermined threshold. Smoothing means (10) and sub-sampling means (20) are provided. Preferably the two are inter-dependent.

Description

A VIDEO CODING SYSTEM

The present invention relates to a video coding system. In particular, it relates to a system for the compression of video sequences using motion compensated prediction.

The schematic diagram of a system using motion compensated prediction is shown in Figure 1 and Figure 2 of the accompanying drawings.

Figure 1 illustrates an encoder having a motion estimation block and figure 2 illustrates a corresponding decoder. Motion compensated prediction in such a system is outlined below.

In typical video sequences the change of the content of successive frames is to a great extent the result of the motion in the scene. This motion may be due to camera motion or due to motion of the objects depicted in the scene. Therefore typical video sequences are characterized by significant temporal correlation, which is highest along the trajectory of the motion. Efficient compression of video sequences requires exploitation of this property of video sequences.

Motion Compensated (MC) prediction is a widely recognized technique for compression of video. It utilizes the fact that in a typical video sequence, image intensity value in a particular frame can be predicted using image intensities of some other already coded and transmitted frame, given the motion trajectory between these two frames.

The operating principle of motion compensated video coders is to minimize the prediction error E_n(x,y) , i.e., the difference between the frame being coded I„(χ,y) called the current frame and the prediction frame P_n(χ,y) (Figure 1 ):

E_n(x,y) = I_n(x,y) - P_n(x,y) (1 )

The prediction error E_n(χ,y) is compressed and the compression process typically introduces some loss of information. The compressed prediction error denoted by E_n(x,y) is sent to the decoder. The prediction frame P„(χ,y) is constructed by the motion compensated prediction block in Figure 1 and Figure 2. The prediction frame is built using pixel values of the reference frame denoted by R„(^χ,y) and the motion vectors of pixels between the current frame and the reference frame using the formula

P_n (x, y) = R_n[x + Ax(x, y), y + Ay(x, y)]. (2)

The reference frame is one of the previously coded and transmitted frames (e.g. a frame preceding the one being coded) which at a given instant is available in the Frame Memory of the encoder and of the decoder. The pair of numbers is called the motion vector of the pixel in location (x,y) in the current frame. Δx(χ,y) and Ay(χ,y) are the values of horizontal and vertical displacements of this pixel, respectively. Motion vectors are calculated by the motion estimation block in the encoder shown in Figure 1. The set of motion vectors of all pixels of the current frame [Δx(-),ΔyQ] is called the motion vector field and is transmitted to the decoder.

In the decoder, pixels of the coded current frame I„(χ,y) are reconstructed by finding the prediction pixels in the reference frame R_n(-) using the received motion vectors and by adding the received prediction error E_n(x,y) , i.e.,

(χ,y) = R„[χ + x(χ,y),y + Ay(χ,y)] + E_n(x,y) (3)

For example, if the transmission channel available for the compressed video bit stream is very narrow, it is possible to reject the effect of prediction errors. Then it is not necessary to compress and transmit the prediction error, and the spare bits from the transmission channel and spare calculation power can be used for other purposes, e.g., to improve the frame rate of the video signal. The rejection of prediction errors leads to defective pixel elements in the visible video picture, but depending on the demands of the application in use it may be acceptable.

Due to the very large number of pixels in the frame it is not efficient to transmit a separate motion vector for each pixel. Instead, in most video coding schemes the current frame is divided into larger image segments so that all motion vectors of the segment can be described by fewer coefficients. Depending on the way the current frame is divided into the segments two types of motion compensated coders can be distinguished:

1. Block based coders where the current frame is divided into fixed and a priori known blocks, e.g., 16x16 pixels blocks in international standard

ISO/IEC MPEG-1 or ITU-T H.261 codecs (Figure 3a) .

2 . Segmentation based (region based) coders where the current frame is divided into arbitrarily shaped segments, e.g., obtained by a segmentation algorithm (Figure 3b). (For examples refer to Centre de Morphologie Mathematique (CMM), "Segmentation algorithm by multicriteria region merging," Document SIM(95)19, COST 211ter Project Meeting, May 1995 and P. Cicconi and H. Nicolas, "Efficient region-based motion estimation and symmetry oriented segmentation for image sequence coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 4, No. 3, June 1994, pp. 357-364)

In practice segments include at least a few tens of pixels. In order to represent the motion vectors of these pixels compactly it is desirable that their values are described by a function of a few coefficients. Such a function is called a motion vector field model. Almost all of the motion vector field models commonly used are additive motion models. Motion compensated video coding schemes may define the motion vectors of image segments by the following general formula:

W-l x(x,y) = ∑α_if,(x,y) (4) ι=0 M-\

Ay(x,y) = ∑b_lg_l(x,y) (5)

/=0 where coefficients a_l and b_t are called motion coefficients and are transmitted to the decoder. Functions f_t and g, are called motion field basis functions and have to be known both to the encoder and decoder. Polynomial motion models are a widely used family of models. (See, for example H. Nguyen and E. Dubois, "Representation of motion information for image coding," in Proc. Picture Coding Symposium '90, Cambridge, Massachusetts, March 26-18, 1990, pp. 841-845 and Centre de Morphologie Mathematique (CMM), "Segmentation algorithm by multicriteria region merging," Document SIM(95)19, COST 211ter Project Meeting, May 1995). The values of motion vectors are described by functions which are linear combinations of 2D polynomial functions. The translational motion model is the simplest model and requires only two coefficients to describe motion vectors of each segment. The values of motion vectors are given by the formulae:

Ax(x,y) = a_n y(^χ,y) = b₀ This model is used in international standards (ISO MPEG-1 , ITU-T Recommendation H.261 ) to describe motion of fixed 16x16 blocks. Two other widely used models are the affine motion model given by the equation:

Ax(x, y) = a_n + a,x + a₇y (7)

Ay(x,y) = b₀ + b_lx + b₂y and the quadratic motion model given by the equation:

Ax(x,y) = a₀ + a_lx + a₂y + a₃xy + a₄x + a₅y

Ay(x,y) = b₀ + b_lx + b₂y + b₃xy + b₄x 2 + b₅y 2 W

The affine motion model presents a very convenient trade-off between the number of motion coefficients and prediction performance. It is capable of representing some of the common real-life motion types such as translation, rotation, zoom and shear by only a few coefficients.

The quadratic motion model has a good prediction performance, but it is less popular in coding than the affine model, since it uses more motion coefficients, while the prediction performance is not much better. Furthermore, it is computationally more costly to estimate the quadratic motion than to estimate affine motion.

The Motion Estimation block calculates motion vectors [Ax(x,y),Ay(x,y)] of the pixels of a given segment S_k which minimize some measure of prediction error in this segment. A meaningful additive measure of prediction error has the form:

∑Ahfa&y) - R„(x + Ax(x,y),y + Ay(x,y))\) (9)

where p^s are scalar constants, |.| denotes absolute value, and h is a non- decreasing function. A very popular measure is the square prediction error, in which case p. = 1 , and h(.) = (.)² :

∑(i_n(^χ,y) - R_n(^χ + ^χ(^χ,y),y+ y(^χ,y))f (10)

(»,Λ⁾

(9) is a highly non-linear function of many variables and there is thus no practical technique which is capable of always finding the absolute minimum of (9) in finite time. Accordingly, motion estimation techniques differ depending on the algorithm for minimization of the chosen measure of prediction error.

Previously known techniques for motion estimation are discussed below.

One technique is the full search. In this technique the value of the cost function is calculated for all the possible combinations of allowed values of the motion coefficients (restricted by the range and precision with which motion coefficients can be represented). The values of motion coefficients for which the cost function is minimized are chosen to represent the motion vector field. The full search technique is usually used only to estimate motion coefficients of the translational motion model and cannot be straightforwardly generalized for other motion models, due to computational burden. In a straight-forward generalization, the computational complexity of the algorithm is exponentially increased by the number of motion coefficients used to represent the motion vector field.

Motion estimation using Gauss-newton iterations (or differential optimization schemes) is an alternative. These are outlined in H. Sanson, "Region based motion estimation and compensation for digital TV sequence coding," in Proc. Picture Coding Symposium '93, Lausanne, Switzerland, March 17-19, 1993 and C. A. Papadoupoulos, and T. G. Clarkson, "Motion Compensation Using Second-Order Geometric Transformations", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 4, August 1995, pp. 319-331. Such techniques, use the well-known Gauss- Newton function minimization algorithm, to minimize the cost function (9), i.e. the chosen measure of prediction error, as a function of motion coefficients.

This algorithm assumes that the function to be minimized can be locally approximated by a quadratic function of the arguments. Then, the n^th iteration step consists of: 1. computing the approximate quadratic function using first and second derivatives of the actual function using the motion coefficient resulting from (n-1 step, 2. computing the coefficient vector minimizing the approximate function, and assigning the result to the motion coefficient of n^th step. The main problem associated with the Gauss-Newton algorithm is that it converges only towards local minima, unless the initial motion coefficients lie in the attraction domain of the global minimum. Thus it is necessary to provide the Gauss-Newton algorithm with a sufficiently good initial guess of the actual optimum. Two different techniques are usually used to improve the convergence of the Gauss-Newton algorithm: 1. motion estimation using multiresolution image pyramids,

2. motion estimation using hierarchically increasing levels of the motion model.

The technique of motion estimation using multiresolution image pyramids is based on the assumption that low-pass filtering the current frame and the reference frame will erase the local minima and help the algorithm to converge to the global minimum. Motion estimation is performed first on the low-pass filtered (smoothed) versions of the reference and current frames, and the result is fed as input to the motion estimation stages using less smoothed images. The final estimation is performed on non-smoothed images. Some variants of this class additionally down-sample the low-pass filtered images, to reduce the number of computations. (For examples of this technique, see H. Sanson, "Region based motion estimation and compensation for digital TV sequence coding," in Proc Picture Coding Symposium '93, Lausanne, Switzerland, March 17-19, 1993 and P. J. Burt, "The Pyramid as a Structure for Efficient Computation", in: Multiresolution Image Processing and Analysis, ed. Rosenfeld, Springer Verlag, 1984, pp. 6- 35).

However, low-pass filtering of the images does not necessarily erase local minima. Furthermore, this may shift the location of global minimum.

Down-sampling can be applied only when the low-pass filtering is sufficient to prevent aliasing. Moreover, convergence becomes more difficult due to the reduction of number of pixels in the region.

By contrast, the technique of motion estimation using hierarchically increasing levels of the motion model makes use of the following assumptions:

1. A complex motion model can be approximated by a lower order motion model.

2. This approximation is a good initial guess for the iterative search for more complex motion model coefficients. The most common hierarchy starts with the translational model (2 coefficients), then continues with a simplified linear model (corresponding to the physical motion of translation, rotation and zoom, having 4 coefficients), and ends with the complete linear model (6 coefficients), etc. (Such hierarchy can be seen in P. Cicconi and H. Nicolas, "Efficient region-based motion estimation and symmetry oriented segmentation for image sequence coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 4, No. 3, June 1994, pp. 357-364 and H. Nicolas and C. Labit, "Region-based motion estimation using deterministic relaxation schemes for image sequence coding," in Proc 1994 International Conference on Acoustics, Speech and Signal Processing, pp. III265-268.)

These assumptions can work very well under certain conditions. However, convergence to a local minimum is often a problem, especially in the case when the approximation turns out to be a poor one. Present systems, such as those outlined above, suffer from disadvantages resulting from the relationship between computational complexity and video compression performance. That is, on the one hand, an encoder will require a motion estimation block having high computational complexity in order to determine motion coefficients which minimize the chosen measure of prediction error (9) and thus achieve the appropriate video compression performance. Usually, in this case such a motion estimation block poses as a bottleneck for computational complexity of the overall encoder, due to the huge number of computations required to achieve the solution. On the other hand, if computational complexity is reduced, it will result in a reduction in prediction performance, and thus video compression performance. Since the prediction performance of motion estimation heavily affects the overall compression performance of the encoder, it is crucial for a motion estimation algorithm to have high prediction performance (i.e. low prediction error) with relatively low complexity. To keep the complexity low, motion estimation algorithms have to make assumptions about the image data, motion, and prediction error. The more these assumptions hold statistically, and the stronger the assumptions are, the better the algorithm is. Different sets of assumptions usually result in different minimization techniques.

The system according to the present invention achieves statistically low prediction error with relatively little complexity by dynamically switching between statistically valid assumptions varying in strength.

According to one aspect of the present invention, there is provided a motion estimation system for a video coder comprising:means for receiving a video frame to be coded; means for smoothing the received frame; means for subsampling the smoothed frame and for forwarding the subsampled frame to a series of motion estimators; a series of motion estimators of varying complexity, for estimating a motion vector field between the said received frame and a reference frame; and control means for selecting the subsequent motion estimator in the series only if a prediction error associated with the motion vector field estimated by the currently selected motion estimator exceeds a predetermined threshold.

Preferably, the control means is arranged to activate the smoothing means in dependence on the state of the sub-sampling means. Advantageously the control means is arranged to activate the smoothing means only if the sub-sampling means is also activated.

Preferably, the smoothing means comprises a single low-pass filter and the sub-sampling means comprises a single sub-sampler. If the control means causes no smoothing to occur, the control means instructs the sub- sampling means not to sub-sample the image. If the control means sets the smoothing means to smooth the image (i.e. to pass it through the low-pass filter), the control means also instructs the sub-sampling means to sub-sample the image. The smoothing means may take the form of a series of low pass filters. Preferably, the control means selects the level of smoothing during minimisation depending on the change in prediction error. For example, if the change is below a threshold for a certain level of smoothing, then at least the next level of smoothing may be jumped. Alternatively, if the change is at, or above, the threshold, then the next level of smoothing may be selected.

The series of motion estimators may comprise a series of motion models, a series of minimisers, or a combination of both. For example, it may comprise a hierarchy of motion models in order of complexity (e.g. zero motion model, translational motion model, affine motion model, and quadratic motion model). Alternatively, it may comprise a series of minimisers, for example for one particular model (e.g. for a linear model, the series of minimisers could be Quasi-Newton and Gauss-Newton. Another option is a combination of motion models and minimisers. The predetermined threshold may differ, at different stages of minimisation and/or for different models.

The invention also relates to a video coder comprising a motion estimation system according to the present invention.

According to another aspect of the present invention, there is provided a motion estimation method for coding a video frame, comprising:receiving a video frame to be coded; smoothing the received frame; subsampling the smoothed frame and for forwarding the subsampled frame to a series of motion estimators; estimating a motion vector field between the said received frame and a reference frame, using a motion estimator from a series of motion estimators of varying complexity; determining whether a prediction error associated with the motion vector field estimated by the currently selected motion estimator exceeds a predetermined threshold and, only if so, selecting the subsequent motion estimator in the series.

Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, of which: Figure 1 illustrates an encoder for the motion compensated coding of video;

Figure 2 illustrates a decoder for the motion compensated coding of video; Figure 3(a) illustrates the division of the current frame for block based motion compensation;

Figure 3(b) illustrates the division of the current frame for segmentation based motion compensation;

Figure 4 illustrates a motion estimation system according to an embodiment of the invention;

Figure 5 illustrates the low pass filtering block shown in Figure 4; Figure 6 illustrates the sub-sampling block shown in Figure 4; Figure 7 is a block diagram of the motion model selector shown in Figure 4; Figure 8 is a block diagram of the cost function minimiser shown in

Figure 4;

Figure 9 is a flow diagram of a preferred implementation of the result verification block shown in Figure 4;

Figure 10 is a flow diagram of a Gauss-Newton minimisation stage according to an embodiment of the present invention; and

Figure 11 is a flow diagram of a Quasi-Newton minimisation stage for use in an embodiment of the present invention.

A motion estimation system of a preferred embodiment of the present invention is shown in Figure 4. It consists of five main building blocks, namely a low-pass filtering block, a sub-sampling block 20, a motion model selector block 30, a cost function minimiser block 40 and a result verification block 50. Each of these blocks is described below.

The system works iteratively. At each iteration the motion model, the minimization method, and the amount of required low-pass filtering is determined. This decision is made in the result verification block 50. Figure 5 is a block diagram of a preferred low-pass filtering block 10. The inputs to this block are the reference frame 5, the current frame 6, and the information about the required level of filtering (smoothness switch update

7). The low-pass filtering block 10 consists of a bank 12 of low-pass filters and a multiplexer 14. The filters in the bank must be designed in such a way that the cut-off frequencies of low pass filter 1 , low pass filter 2, ..., low pass filter n form a decreasing sequence.

The multiplexer is in the form of a smoothness selector 14. The inputs to the smoothness selector 14 are reference frame 5 and current frame 6, and their low-pass filtered versions at various levels of smoothness. The smoothness switch update 7 chooses which image pair will be the output.

Figure 6 shows the block diagram of the Subsampling block 20. The inputs to this block are reference frame 5 (possibly smoothed), the current frame 6 (possibly smoothed), segmentation information, and the information about the required level of subsampling (subsampling switch update 26).

The Subsampling block consists of a bank of subsamplers 22 and a multiplexer 24. Each subsampler in the bank (denoted 'Subsample by mxm') subsamples the input images by taking every other m'th pel in both horizontal and vertical directions, where m is an integer power of 2.

The multiplexer 24 is a Subsampling Selector. The inputs to the Subsampling Selector are the reference and current frames (5,6) (possibly smoothed) and segmentation information; and their subsampled versions at various levels. The Subsampling Switch update 26 chooses which image pair will be the output. Aliasing might occur if Subsampling Switch 26 is not consistent with Smoothness Switch 7. It is the responsibility of Result Verification block 50 to generate consistent switch signals 26 and 7.

Figure 7 shows a preferred motion model selector block 30. The motion model selector block comprises a motion model multiplexer 32. The input to this block is motion model switch signal 9. Model selection block 32 is a multiplexer, which makes a selection among a bank 34 of motion models, via motion model switch signal 9. The motion models in the bank 34 vary in order of complexity. The order of complexity refers to the number of basis functions used in representing the motion. There is a close relation between the order of complexity of the motion model and computational complexity: as the order of complexity of the motion model increases, the computational complexity of estimation of model coefficients increases. As a special case, there is a model 'no motion' 34a for which all the motion coefficients are set to zero. The motion models 34 in the bank have a hierarchy: 'no motion' 34a can be represented in the translational model and in the affine model by setting all the motion coefficients to zero. Similarly, 'translational motion' 34b can be fully represented in the affine model by setting α, = a₂ = b_x = b₂ = 0 in (7).

Figure 8 shows the block diagram of a preferred cost function minimizer 40. The cost function minimizer block 40 is the place where minimization of the cost function is actually performed.

The inputs to this block are the smoothed reference image 5 and current image 6, segmentation information from the subsampling block 20, selected motion model from the motion model selector 30, and current motion coefficients, current cost function value and minimization method switch signal 17 from the result verification block 50.

This cost function minimiser block 40 performs minimization by one of the methods in its bank 43. There are three methods in the bank, and the selection of the method is performed by minimization method switch signal 17. These three methods are segment matching 44, Gauss-Newton minimisation 46 and Quasi-Newton minimisation 48, and are further described below.

Segment matching can be selected only for the translational motion model. The value of the cost function is calculated for the selected set of values of motion coefficients a₀ and b₀. The values of a₀ and b₀ which yield the smallest value of the prediction error in (6) are chosen as the solution. The search is performed on a quantized grid, i.e. only the scalar quantized values of a₀ and b₀ are used. The search is done by evaluating some or all of the candidate pairs on the grid, and choosing the one yielding the least cost function value. The Gauss-Newton step can be used for any motion model except 'No

Motion'. It uses the well-known Gauss-Newton iteration step to reduce the cost function. (R. Fletcher/Practical Methods of Optimization", Second

Edition, John Wiley & Sons, 1987, Chapter 3 and Chapter 6 gives an overview of Gauss-Newton minimization). The Gauss-Newton method is a specific form of Newton's method, commonly used when the cost function to be minimized is a sum of squares. Newton's method is summarized below. Let e(a) be the cost function as a function of parameter vector a . Let a^f be the current parameter vector, being input to the minimization iteration. Then, this cost function can be quadratically approximated around a^f as follows :

e(a) e(a^c) + g^T(a - a^c) + a - a^c)^TG(a - a^c) (1 1 )

where g denotes the gradient of e with respect to a, at the point a = a^c ; and G is the approximation of second derivative matrix of e with respect to a at the point a = a^c .

Then, the output parameter vector a⁰ of the minimization step of Newton's method is calculated by the formula

a⁰ = -G-* g + a^f (12) Note that this is the point at which the quadratic approximation to the cost function around a^f becomes a minimum. Let e(a) have the following form: m e(a) = ∑(d, (a))² = d^τd (12) ι=l Then, the Gauss-Newton method approximates the second derivative matrix G by:

G « 2(Vd)(Vd)^r (13) where Vd denotes the Jacobian matrix, the columns of which are the first derivative vectors Vd, of the components of d . Considering that the first derivative vector g is given by g = 2(Vd)d (14) the iteration step of the Gauss-Newton method takes the form a" = -[(Vd)(Vd)^r]^~' ((Vd)d) + a^r (15)

In the context of minimization of prediction error (10), the entries d, of the vector d are given by d, = i (^χ, >y,) - (^χ _l + ^χ„y,),y, + A_y(x_[ ,y_l)) (16)

This block is not computationally very costly, since it performs only one iteration step, hence does not require multiple evaluations of the cost function as in the Quasi-Newton method described next. Quasi-Newton minimization is a well known iterative procedure which continues iterations until a convergence to a minimum is achieved. (Again, see R. Fletcher, "Practical Methods of Optimization", Second Edition, John Wiley & Sons, 1987, Chapter 3 and Chapter 6).

It uses a similar quadratic approximation of the cost function as in Newton's method. The basic difference from Newton's method is that instead of calculating the second derivative matrix at each step, it uses a positive definite approximation of its inverse, and updates this approximation by simple computations at each step. It is a very robust and high performance technique. However, it is computationally costly for high order motion models. The cost function minimiser 40 outputs a minimised cost function value 45 and a motion coefficient vector 46. Turning now to the result verification block 50 shown in Figure 4, this controls the other blocks in the system by generating switch signals. These switch signals are smoothness switch update 7 (determining the smoothness level to be used in the current iteration), motion model switch 9 (determining the motion model to be used in the current iteration), and minimization method switch 17 (determining the minimization method to be used in the current iteration). Any combination of switch signals generated by this block has an underlying set of assumptions. The purpose of this block 50 is to find the strongest set of assumptions that is valid for the current state. Once the set of assumptions is determined, a new set of switch signals is generated, and new motion coefficients and a new cost function value result. By comparing these to the previous ones, the result verification block determines:

^• Whether the new coefficients will be accepted (verified)

^• Whether the cost function reached an acceptably low value(termination decision) * If iterations will continue, what the next switch signals will be. These switch signals control other blocks in the system. So, determination of the switch signals practically means the decisions about the motion model, minimization method, and low pass filtering level to be used in the next iterations. The result verification block 50 keeps the 'current motion coefficients', which are the motion coefficients yielding the smallest value of the cost function so far, and a 'current cost function value' which is the smallest value of the cost function so far. If the iterations are continued, current motion coefficients and current cost function value are fed to cost function minimizer 40. The generation of the new switching signals depending on the comparison of results can be done in numerous ways. A preferred way will be described below.

Upon convergence, a convergence flag signal 51 is set, current motion coefficients are output as 'final motion coefficients' 52, and current cost function value is output as 'final cost function value' 53.

A major advantage of this invention over previously known solutions is its ability to estimate the motion of a region that minimizes the prediction error, while keeping the computational complexity low. The result verification block 50 is a powerful tool for keeping the computational complexity of the system at a minimum. By context-dependent switching between techniques, this block allows high performance motion estimation, involving computationally complex techniques only when necessary. The system can find motion coefficients of a segment for:

• Any arbitrarily shaped segment

^• Any desired model of the motion field in the segment

• Any meaningfully chosen cost function of the prediction error (mean square error, mean absolute error, rate-distortion Lagrangian, etc.). The system can be implemented in a variety of ways. For example, the following issues may vary:

1. different basis functions can be used in (4) and (5),

2. different cost functions can be selected as a measure of prediction error, 3. different types of low-pass filters can be used,

4. different methods can be used for segment matching,

5. different variants of Quasi-Newton minimization can be used,

6. different strategies can be used for result verification and method switching. 7. different methods can be used for subpixel evaluation and gradient calculation of the images.

A preferred embodiment of the invention will now be described. The filter bank 12 includes only one low-pass filter 12a. This is a separable filter, with 10 taps in each direction. The filter coefficients are as follows:

0.038089 0.003496 -0.139455 0.044002 055387 055387 0.04400 -0.139455 0.003496 0.038089

(18)

This filter is designed with subsampling in mind: thus, a low-pass filter which minimizes (in a least squares sense) the loss incorporated by the cascade operation of "subsampling by 2x2 and then upsampling by cubic convolution interpolation" is implemented. The filter obtained this way is then truncated to provide 10 taps. The purpose of this truncation is to reduce the computational complexity of the filtering operation. Cubic convolution interpolation is preferred for upsampling since it is the chosen subpixel evaluation strategy. Further details can be found in R. G. Keys, "Cubic convolution interpolation for digital image processing," IEEE Trans. Acoust, Speech, Signal Processing, Vol.29, No.6, pp.1153-1160, 1981 Two image pairs 16a, 16b (non-filtered & filtered) are input to the multiplexer 14. Hereinafter, selection of the filtered pair by smoothness switch signal 7 will be denoted 'smooth', whereas selection of the non-filtered pair will be denoted as 'none'.

Only one subsampler 22 is provided in the bank of the sub-sampling block 20. This sub-sampler causes the signal to be subsampled by 2x2. The subsampling switch signal 26 either chooses this subsampler, or no subsampling.

As will be described below, in a preferred mode of implementation, the status of smoothness switch 7 and subsampling switch 26 are dependent on eah other: smoothness switch 7 is set to 'None' by the result verification block 50 if and only if subsampling switch 26 is set to 'None'; and smoothness switch 7 is set to 'Smooth' if and only if subsampling switch 26 is set to 'Subsample by 2x2'. This allows an efficient joint implementation of low-pass filtering and subsampling blocks: when low-pass filtering is performed, only the values of pixels which will survive subsampling are computed.

Note that low-pass filtering and subsampling of the Reference Frame 5 and the Current Frame 6 needs to be performed only once per frame, no matter how many segments there are in the current frame. Hence, the low- pass filtering and subsampling operations constitute a small percentage of the total computational complexity of the scheme.

In this preferred embodiment of the invention there are three motion models in the motion model selector's bank 34: "no motion" model 34a, translational model 34b, affine model 34c.

In the cost function minimiser 40 of this preferred embodiment, the segment matching performed is 'full-pel segment matching with full search'. This operation requires testing all integer-valued motion coefficient pairs in a certain range to find the pair minimizing the cost function.

With regard to Gauss-Newton minimisation 46, equation (15) can be rewritten as follows: G = g (19) where δ = a° - a^c , G = (Vd)(Vd)^r, g = -(Vd)d . Equation (19) is solved by Cholesky decomposition (W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, "Numerical Recipes in C", Second Edition, Cambridge University Press, 1992, pp. 96-98) and the output motion coefficient vector a" is computed by a° = a^f + . In the preferred mode of implementation, the Gauss-Newton step is employed only when the motion model is affine. For this case, G is a 6x6 positive semi-definite symmetric matrix with the following entries: G(m,n) = ∑c_m(i)c_π(i) (20)

where c(ι) = [dx, y, dx, x, dx, dy, y, dy, x, dy, ] ; dx, denoting the horizontal image derivative at i'th pixel, and dy, denoting the vertical image derivative at i'th pixel. Similarly, g is a vector of size 6 with the entries as follows:

g(n) = ∑d, c_n (i) (21 )

where d, is as defined in (16).

Cubic convolution interpolation is used for subpixel evaluation. Image derivatives are also computed using cubic convolution: the derivatives of the continuous function obtained by cubic convolution interpolation are computed, and they are interpreted as the image derivatives.

The Quasi-Newton minimisation 48 used in this preferred embodiment is known as 'Quasi-Newton minimization with BFGS update formula and inexact line minimization' (See R. Fletcher, "Practical Methods of

Optimization", Second Edition, John Wiley & Sons, 1987, Chapter 3 and

Chapter 6).

Figure 9 shows the flow diagram of the operation of the result verification block 50 of this preferred embodiment. A first assumption (90) of "no motion" is set by switching the smoothness level to "None", the sub-sampling level to

0 and model to 'No motion'. If the resulting cost function is below a specified threshold (91 ) (named TH1' in the diagram), the search is terminated (99) with zero motion coefficients.

If the value of the cost function is larger or equal to TH1 , the next assumptions (92) are 'translational model can approximate the actual motion field' and 'smoothing + subsampling does not shift the global minimum very much'. Following these assumptions, the smoothness switch 7 is set to

'Smooth', the subsampling switch 26 to 'Subsampling by 2x2', the model is switched to Translational', and the method to 'Segment Matching' (92) (subsampling is performed to reduce the complexity of operation, and smoothing is performed to prevent aliasing that would occur otherwise in subsampling). If the cost function resulting from this step is below the cost function corresponding to zero-motion, the best-so-far coefficients are updated (93) by setting the values of a₀ and b₀ in (7) to the values computed by segment matching, and setting the other coefficients to zero.

If the threshold is exceeded, the next assumptions are 'Affine motion model satisfactorily approximates the motion field', 'Smoothing the image will erase the local minima', 'Subsampling by 2x2 does not distort the results when images are smoothed', and 'the Gauss-Newton step reduces the cost function'. Hence, the motion model is switched to 'Affine', the smoothness switch 7 to 'Smooth', the subsampling switch 26 to 'Subsample by 2x2', and the minimization method to 'Gauss-Newton step' (100). With these settings, iterations (101-104) are performed (Figure 10) until one of the following occurs:

^• the drop in cost function in the last iteration is below a specified threshold TH2 (102),

^• a certain number of iteration counts, MAX_ITER_1 , is reached (104). After each iteration, the cost function associated with the outcome of the iteration is compared (101 ) to the best-so-far cost function. If the resulting cost function is less than the best-so-far cost function, the motion coefficients and best-so-far cost function value are updated by the outcomes of the iteration. The next set of assumptions are 'Affine motion model satisfactorily approximates the motion field', and 'Gauss-Newton step reduces the cost function'. Hence, the motion model is switched (105) to 'Affine', the smoothness switch 7 to 'None', the subsampling switch 26 to 'None', and the minimization method to 'Gauss-Newton step'. With these settings, iterations (106-109) are performed until one of the following occurs: • the drop in cost function in the last iteration is below a specified threshold TH3 (107),

^• a certain number of iteration counts, MAX_ITER_2, is reached (109). Again, after each iteration, the cost function associated with the outcome of the iteration is compared (106) to best-so-far cost function. If the resulting cost function is less than the best-so-far cost function, the motion coefficients and best-so-far cost function value are updated (106) by the outcomes of the iteration.

If the best-so-far cost function value is less than a specified threshold TH4 (95) (see Figure 9), the search is terminated (99) at this point. Otherwise, Quasi-Newton minimization stages (96) are entered. TH4 must be chosen large enough so that Quasi-Newton minimization stages are seldom entered.

In this phase (96) (see Figure 11 ), motion coefficients are initialized (110) with the coefficients obtained from segment matching and three Quasi- Newton minimization stages (112, 114, 116) are performed:

• Translational model, Smoothness Switch set to 'Smooth', Subsampling Switch set to 'Subsample by 2x2' (112)

^• Affine model, Smoothness Switch set to 'Smooth', Subsampling Switch set to 'Subsample by 2x2' (114)

• Affine model, Smoothness Switch set to 'None', Subsampling Switch set to 'None' (116)

After each minimization stage, the value of the cost function is compared to the smallest value so far (113). If a decrease in cost function value is obtained, the motion coefficients and the value of the cost function are saved. When these stages (112-116) are completed, the Result Verification block 50 terminates the search (99), assigning the smallest cost function value obtained as the final cost function value 53 and the corresponding motion coefficients as the final motion coefficients 52. As outlined above, the system for motion estimation according to the present invention achieves motion estimation with significantly less prediction error than that of previous solutions, while keeping the computational complexity at a statistically low level. This is achieved by the arrangement enabling the dynamic switching between statistically valid assumptions varying in strength, via assumption verification at each iteration.

Further, the present invention specifically provides an improved system for motion estimation of an image segment, using a polynomial motion vector field model. Motion estimation is an element of video coding using motion compensated prediction. The system finds motion coefficients which yield lower prediction error than the prior art solutions, while keeping the computational complexity low. This low prediction error results in better performance in terms of video compression. The low prediction error is achieved by incorporating the general characteristics of image data and motion into a combination of well known function minimization techniques.

From the foregoing description it will be evident to a person skilled in the art that various modifications can be made within the scope of the claims.

Claims

1. A motion estimation system for a video coder comprising: means for receiving a video frame to be coded; means for smoothing the received frame; means for subsampling the smoothed frame and for forwarding the subsampled frame to a series of motion estimators; a series of motion estimators of varying complexity, for estimating a motion vector field between the said received frame and a reference frame; and control means for selecting the subsequent motion estimator in the series only if a prediction error associated with the motion vector field estimated by the currently selected motion estimator exceeds a predetermined threshold.

2. A motion estimation system as claimed in claim 1 , wherein the control means is operative to activate the sub-sampling means in dependence on the state of the smoothing means.

3. A motion estimation system as claimed in claim 1 wherein the control means is arranged to activate the smoothing means only if the sub-sampling means is also activated.

4. A motion estimation system as claimed in claim 1 , 2 or 3 wherein the subsampling means comprises a series of subsamplers varying in subsampling factors.

5. A motion estimation system as claimed in any of claims 1 to 4, wherein the control means selects the amount of subsampling during minimisation depending on the change in prediction error.

6. A motion estimation system as claimed in any preceding claim, wherein the smoothing means comprises a single low pass filter.

7. A motion estimation system as claimed in any preceding claim, wherein the control means selects the level of smoothing during minimisation depending on the change in prediction error.

8. A motion estimation system as claimed in any preceding claim, wherein the series of motion models comprises a hierarchy including a zero motion model, translational motion model, and an affine motion model.

9. A motion estimation system as claimed in any preceding claim, wherein the series of motion estimators comprises a series of minimisers, including a segment matching minimiser, a Gauss-Newton minimiser and a Quasi- Newton minimiser.

10. A video coder comprising a motion estimation system as claimed in any preceding claim.

11. A motion estimation method for coding a video frame, comprising: receiving a video frame to be coded; smoothing the received frame; means for subsampling the smoothed frame and for forwarding the subsampled frame to a series of motion estimators; estimating a motion vector field between the said received frame and a reference frame, using a motion estimator from a series of motion estimators of varying complexity; determining whether a prediction error associated with the motion vector field estimated by the currently selected motion estimator exceeds a predetermined threshold and, only if so, selecting the subsequent motion estimator in the series.