Visual Tracking Using Multi-layer CNN Features Based Discriminant Correlation Filters with Foreground Mask

Tao Yang¹⁷,
Cindy Cappelle¹⁷,
Yassine Ruichek¹⁷ &
Mohammed El Bagdouri¹⁷

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10884)

Included in the following conference series:

International Conference on Image and Signal Processing

Abstract

This work deals with visual object tracking. The well-known discriminant correlation filter (DCF) based approach is improved with multi-layer CNN features, spatial reliability (through a foreground mask) and a conditional model-updating strategy. In the training stage, a foreground mask is calculated using color histograms and, for each chosen CNN layer, a correlation filter is trained under the foreground constraint to construct a weak tracker. In the next frame, the tracking position is obtained as a weighted combination of the weak trackers, whose weights are computed by the Hedge method. The response peak and oscillation are both used to estimate the tracking confidence. The model and weight of each weak tracker are updated only when the tracking confidence is high. We analyze and evaluate our system on the OTB-13 dataset, and show that our approach outperforms several state-of-the-art methods.

1 Introduction

Visual tracking is an important research topic in computer vision. It has wide applications and remains a challenging problem. Discriminative methods, which train a classifier to distinguish the target from the background, have shown good performance in recent years. They face two main challenges [14]: (1) developing an efficient learning algorithm to train the classifier; (2) finding an effective feature representation of the object. In addition, the fusion of different trackers has a strong influence on performance, especially when the trackers are highly diverse.

Thanks to the dense sampling allowed by the circulant matrix, trackers based on the discriminant correlation filter (DCF) have shown state-of-the-art performance. The circulant structure also allows learning and tracking to be implemented efficiently with the discrete Fourier transform (DFT). The DCF was successfully introduced into visual tracking with the minimum output sum of squared error (MOSSE) tracker [2], which used a gray-scale template. Henriques et al. employed the kernel method [6] and multi-channel HOG features [7]. Several approaches addressed the windowing problem caused by the wrap-around of circularly shifted target samples. Galoogahi et al. [5] proposed a zero-padded filter and Danelljan et al. [4] ignored the boundary pixels of all the shifted samples. Lukezic et al. [10] used a color histogram based foreground/background model and the Bayesian method to segment the target mask, and used the spatial reliability constraint to train the correlation filter and reduce the boundary effect, which achieved competitive results with hand-crafted features. For the multi-scale problem, Li and Zhu [9] used multi-scale search, and Danelljan et al. [3] proposed discriminative scale space tracking (DSST) based on HOG features.

On the other hand, inspired by the success of deep learning in computer vision, more and more CNN-based methods have been developed for visual tracking. CNN features generally outperform hand-crafted features. Furthermore, the features in the earlier layers carry more spatial information, while the features in the later layers contain more semantic information. Ma et al. [11] first used multi-layer CNN features to construct weak trackers, and combined the results hierarchically from the last layers to the first layers. However, the precise positioning relies on the coarse tracking in the last layers, which leads to tracking drift once the coarse positioning fails. Qi et al. [12] used the Hedge method to adaptively weight several weak trackers based on CNN features into a single stronger one.

In this work, we consider the natural idea of introducing spatial reliability into the multi-layer CNN feature based approach. Section 2 describes our approach in detail. In Sect. 3, we analyze and evaluate our system on the OTB-13 dataset [15], which shows that our approach outperforms several state-of-the-art methods. Section 4 summarizes the paper and outlines future improvements.

The contributions of this paper are summarized below:

We introduce the spatial reliability constrained correlation filter into multi-layer CNN features based tracker.
We improve the object model update strategy using the weighted sums of the response peaks and oscillation indicators of weak trackers.

2 Proposed Method

In this section, we present our method in detail. The main framework is illustrated in Fig. 1. In the training stage, a foreground mask of the object is calculated using the color histogram based model. For each chosen CNN layer, a correlation filter is trained under the foreground constraint to construct a weak tracker. In the tracking stage on the next frame, the response maps are obtained from the feature maps of the region of interest at the different layers and the corresponding correlation filters. The final tracking position is a weighted combination of the positions given by the response maps. The scale change is then estimated by the DSST method [3]. The weights are calculated by the Hedge method [12]. The response peak and oscillation are both used to estimate the tracking confidence. The model and weight of each weak tracker are updated only when the tracking confidence is high.

2.1 Masked Correlation Filter

A general correlation tracker trains a linear classifier from samples generated by circulant matrix and constructs a response map during the tracking. In this work, we use the CNN to extract feature maps. Each layer feature map trains an individual classifier which is called a weak tracker. $\mathbf x ^k \in \mathbb {R}^{M \times N \times D}$ is the k-th layer feature map, where M, N, D are the width, height and the number of channels, respectively. We use $\mathbf x $ to denote concisely $\mathbf x ^k$, the other variables use the similar representation. Through circulant matrix, $\mathbf X _d = C(\mathbf x _d)$ is generated by the d-th channel shifting along the M and N dimensions. Given the Gaussian shaped label $\mathbf y $, a correlation filter $\mathbf{h } = \{\mathbf{h }_d\}_{d = 1:D}$ can be obtained by solving the minimization problem:

$$\begin{aligned} \mathbf h = \arg \min _\mathbf{h } \sum \limits _{d=1}^{D} (||\mathbf x _d \circledast \mathbf h _d - \mathbf y ||^{2} + \lambda ||\mathbf h _d ||^{2}) \end{aligned}$$

(1)

where $\mathbf x _d \circledast \mathbf h _d$ is the convolution of d-th channel feature map $\mathbf x _d$ with the filter $\mathbf h _d$, which is equal to the inner product between each shifted matrix in $\mathbf X _d$ and $\mathbf h _d$. $\lambda $ is the regularization parameter.

Diagonalizing the circulant matrix with the DFT, the closed-form solution can be efficiently computed in the Fourier domain [2]. When we introduce the foreground mask $\mathbf m \in \mathbb {R}^{M \times N}$ as a constraint $\mathbf h = \mathbf h \odot \mathbf m $, a closed-form solution is no longer available. We express the DFT as $\hat{\mathbf{a }} = \mathscr {F}(\mathbf a )$. Using the ADMM method [10], we can obtain an approximate solution of the filter $\hat{\mathbf{h }}$.
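As an illustration of the unconstrained case only, the following minimal NumPy sketch trains a single-channel filter in closed form and evaluates its response. It assumes the correlation convention of Eq. (2), a label whose peak sits at index (0, 0), and hypothetical function names and parameter values (`gaussian_label`, `train_filter`, `respond`, `sigma`, `lam`); it does not implement the masked filter, which requires the ADMM iterations of [10].

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Gaussian-shaped regression target with its peak at index (0, 0)."""
    rows = np.fft.fftfreq(shape[0]) * shape[0]  # 0, 1, ..., -2, -1 (wrap-around offsets)
    cols = np.fft.fftfreq(shape[1]) * shape[1]
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.exp(-(rr ** 2 + cc ** 2) / (2 * sigma ** 2))

def train_filter(x, y, lam=1e-2):
    """Closed-form ridge-regression filter for one channel, without any mask."""
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    # Per-frequency minimizer of |conj(h) * x - y|^2 + lam * |h|^2.
    return np.conj(y_hat) * x_hat / (np.abs(x_hat) ** 2 + lam)

def respond(h_hat, z):
    """Response map of a test patch z, following the form of Eq. (2) for one channel."""
    return np.real(np.fft.ifft2(np.conj(h_hat) * np.fft.fft2(z)))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))           # stand-in for one feature channel
h_hat = train_filter(x, gaussian_label(x.shape))
shifted = np.roll(x, shift=(3, 5), axis=(0, 1))
response = respond(h_hat, shifted)
print(np.unravel_index(np.argmax(response), response.shape))
```

With this convention, a circular shift of the training patch moves the response peak by the same offset, which is what the tracker reads off as the target displacement.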

Given the testing data $\mathbf z $ from the output of k-th layer, we have $\hat{\mathbf{z }}_d = \mathscr {F}(\mathbf z _d)$, the response map can be calculated by:

$$\begin{aligned} \mathbf f = \mathscr {F^{-1}} (\sum \limits _{d=1}^{D} \hat{\mathbf{h }}_d^{*} \odot \hat{\mathbf{z }}_d) \end{aligned}$$

(2)

where $\mathscr {F^{-1}} (\hat{\mathbf{a }})$ is the inverse Fourier transform, $\hat{\mathbf{a }} \odot \hat{\mathbf{b }}$ is the element-wise product, and $\hat{\mathbf{a }}^{*}$ denotes the complex conjugate. $\mathbf f $ denotes concisely the response map $\mathbf f ^k$ of the k-th weak tracker. The object position estimated by this layer is the location of the highest response. The final target position (x, y) is the weighted sum of each layer’s output:

$$\begin{aligned} (x, y) = \sum \limits _{k=1}^{K} (w^k x^k, w^k y^k) \end{aligned}$$

(3)

where $(x^k, y^k)$ and $w^k$ are the estimated position and the weight of the k-th weak tracker, respectively.
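The fusion in Eq. (3) can be sketched as follows. This is a minimal example on synthetic response maps, not the authors' implementation; the function names are hypothetical, positions are expressed as (row, column) indices, and the weights are assumed to be normalized so that they sum to one.

```python
import numpy as np

def peak_position(response):
    """Return the (row, col) index of the highest value in a 2-D response map."""
    return np.unravel_index(np.argmax(response), response.shape)

def fuse_positions(responses, weights):
    """Weighted sum of the per-layer peak positions, in the spirit of Eq. (3)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # assumption: weights are normalized to sum to 1
    peaks = np.array([peak_position(r) for r in responses], dtype=float)
    return tuple(weights @ peaks)

# Two synthetic weak trackers that disagree on the target location.
r1 = np.zeros((10, 10))
r1[2, 4] = 1.0
r2 = np.zeros((10, 10))
r2[6, 8] = 1.0
print(fuse_positions([r1, r2], weights=[3.0, 1.0]))
```

The fused position lies between the two peaks and closer to the peak of the tracker with the larger weight.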

2.2 Foreground Mask Generation

The foreground mask is the binary segmentation of the object. For the pixel $\mathbf p $ in the training patch, which has the appearance $\mathbf c $, the posterior probability of the mask element $m \in \{0, 1\}$ on this pixel is calculated by [10]:

$$\begin{aligned} p(m = 1|\mathbf p , \mathbf c ) \propto p(\mathbf c |\mathbf p , m=1) p(\mathbf p |m=1) p(m=1) \end{aligned}$$

(4)

The likelihood of foreground mask is expressed as $p(\mathbf c |\mathbf p , m=1) p(\mathbf p |m=1)$. $p(\mathbf c |\mathbf p , m=1)$ can be calculated using back projection of the foreground color histogram. $p(\mathbf p |m=1)$ is defined by the Epanechnikov distribution which is a quadratic function about the distance from the pixel to the training patch center. The prior $p(m=1)$ is calculated by the sizes of the target and padding patch considering the likelihood of foreground and background.
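A rough sketch of how Eq. (4) could be turned into a binary mask is given below. Several choices here are assumptions that the text does not specify: pixel colors are already quantized into integer bin indices, the background uses a uniform spatial prior, the foreground and background scores are normalized against each other, and the mask is obtained by thresholding the posterior at 0.5. All function names are hypothetical.

```python
import numpy as np

def epanechnikov_prior(shape):
    """Quadratic spatial prior: 1 at the patch center, falling to 0 at the corners."""
    rows = np.linspace(-1.0, 1.0, shape[0])
    cols = np.linspace(-1.0, 1.0, shape[1])
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.clip(1.0 - (rr ** 2 + cc ** 2) / 2.0, 0.0, 1.0)

def color_histogram(bin_idx, region, n_bins):
    """Normalized histogram of the quantized colors inside a boolean region."""
    counts = np.bincount(bin_idx[region], minlength=n_bins).astype(float)
    return counts / max(counts.sum(), 1.0)

def foreground_posterior(bin_idx, fg_region, n_bins, prior_fg):
    """Per-pixel posterior p(m = 1 | p, c) from histograms, spatial prior and prior_fg."""
    hist_fg = color_histogram(bin_idx, fg_region, n_bins)
    hist_bg = color_histogram(bin_idx, ~fg_region, n_bins)
    # Back projection: look up each pixel's color bin in the histograms.
    score_fg = hist_fg[bin_idx] * epanechnikov_prior(bin_idx.shape) * prior_fg
    score_bg = hist_bg[bin_idx] * (1.0 - prior_fg)  # assumption: uniform spatial prior
    return score_fg / (score_fg + score_bg + 1e-12)

# Toy patch: the target (color bin 1) occupies the central 9 x 9 block.
bin_idx = np.zeros((21, 21), dtype=int)
bin_idx[6:15, 6:15] = 1
fg_region = bin_idx == 1
prior_fg = fg_region.sum() / fg_region.size      # ratio of target area to patch area
mask = foreground_posterior(bin_idx, fg_region, n_bins=2, prior_fg=prior_fg) > 0.5
print(int(mask.sum()))
```

In this toy case the colors separate the target perfectly, so the mask recovers the central block; on real images the histograms overlap and the spatial prior decides the ambiguous pixels.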

Scale estimation is necessary for segmentation. Because a direct multi-scale search fails once the background pixels are disabled, we use the DSST method [3] on the whole training patch to estimate the scale after the position prediction in each frame.

2.3 Updating Strategy

Most trackers update the appearance model of the target in each frame, which will lead to model contamination when blur or occlusion occurs. In our method, the models and weights are updated only when the tracking is high-confident. We use the response peak $f_{max}$ and response oscillation indicator APCE [13] to determine the tracking confidence. When the tracking fails or occlusion occurs these criteria will be significantly reduced. We calculate $f_{max}^k$ and $APCE^k$ for each weak tracker and then weight them by $w^k$:

$$\begin{aligned} f_{max} = \sum \limits _{k=1}^{K} w^k f_{max}^k \qquad \qquad \quad \end{aligned}$$

(5)

$$\begin{aligned} APCE = \sum \limits _{k=1}^{K} w^k \frac{(f_{max}^k - f_{min}^k)^2}{\mathrm {mean}\left( \sum \limits _{m, n} (f_{m,n}^k - f_{min}^k)^2\right) } \end{aligned}$$

(6)

where $f_{min}^k$ is the minimum response in k-th response map, and $f_{m,n}^k$ is the response at (m, n). K is the number of the weak trackers.

Finally, we use the ratios of $f_{max}$ and APCE to their respective means over all the frames as the final criteria $\beta _{fmax}$ and $\beta _{apce}$. When both are larger than their thresholds, we consider the frame to be high-confident.
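The confidence test built on Eqs. (5) and (6) might look like the sketch below. The class and function names are hypothetical, the default thresholds are the values reported in Sect. 3.1, and the running means are assumed to include every frame processed so far, the current one included.

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy of one response map (the ratio inside Eq. 6)."""
    f_max, f_min = response.max(), response.min()
    energy = np.mean((response - f_min) ** 2)
    return (f_max - f_min) ** 2 / max(energy, 1e-12)  # guard against a constant map

class ConfidenceGate:
    """Keeps the history of f_max and APCE and thresholds their ratios to the means."""

    def __init__(self, thr_fmax=0.7, thr_apce=0.6):
        self.thr_fmax, self.thr_apce = thr_fmax, thr_apce
        self.fmax_history, self.apce_history = [], []

    def is_confident(self, responses, weights):
        f_max = sum(w * r.max() for w, r in zip(weights, responses))   # Eq. (5)
        score = sum(w * apce(r) for w, r in zip(weights, responses))   # Eq. (6)
        self.fmax_history.append(f_max)
        self.apce_history.append(score)
        beta_fmax = f_max / np.mean(self.fmax_history)
        beta_apce = score / np.mean(self.apce_history)
        return bool(beta_fmax > self.thr_fmax and beta_apce > self.thr_apce)

sharp = np.zeros((20, 20))
sharp[10, 10] = 1.0                      # one clean peak, as in reliable tracking
gate = ConfidenceGate()
for _ in range(5):
    gate.is_confident([sharp, sharp], [0.5, 0.5])
print(gate.is_confident([0.2 * sharp, 0.2 * sharp], [0.5, 0.5]))
```

Because APCE is invariant to a global rescaling of the response map, the weak frame in this example is rejected by the peak criterion alone, which is one reason for using the two criteria together.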

The weights are then updated by the Hedge method [12], again only when the tracking is high-confident. The loss of the k-th weak tracker is computed by $l^k = f^k_{max} - f^k(x,y)$, where $f^k(x,y)$ is the response at the final tracking position on the k-th response map. The weights at $t+1$ are updated by minimizing the regret:

$$\begin{aligned} \mathbf w _{t+1} = \arg \min _\mathbf{w } \sum \limits _{t=1}^{T} (\sum \limits _{k=1}^{K} w^k l^k_t - l^k_t) \end{aligned}$$

(7)
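For intuition, the sketch below computes the per-tracker losses defined above and applies one step of the classic exponential-weights form of Hedge. This is a deliberately simpler rule than the adaptive variant of [12], which is not reproduced here; the learning rate `eta` and the function names are hypothetical.

```python
import numpy as np

def tracker_losses(responses, position):
    """Loss of each weak tracker: its peak response minus its response at the fused position."""
    row, col = position
    return np.array([r.max() - r[row, col] for r in responses])

def hedge_update(weights, losses, eta=1.0):
    """One step of the classic exponential-weights Hedge rule (not the variant of [12])."""
    new_w = np.asarray(weights, dtype=float) * np.exp(-eta * np.asarray(losses, dtype=float))
    return new_w / new_w.sum()

# Tracker 1 agrees with the fused position (2, 2); tracker 2 peaks elsewhere.
r1 = np.zeros((5, 5))
r1[2, 2] = 1.0
r2 = np.zeros((5, 5))
r2[0, 0] = 1.0
r2[2, 2] = 0.2
losses = tracker_losses([r1, r2], position=(2, 2))
print(hedge_update([0.5, 0.5], losses))
```

A tracker whose own peak coincides with the fused position has zero loss and gains weight relative to one that disagrees.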

3 Experimental Results

In this section, we present the experimental results of our proposed method. We first discuss the implementation details. Then we analyze our approach on OTB-13 dataset. Finally, we compare the performance to several advanced trackers.

3.1 Implementation Details

For feature extraction, we adopt the pre-trained VGG-Net-19 CNN and remove the fully-connected layers so that it accepts inputs of any size. Six convolutional layers (10, 11, 12, 14, 15, 16) are selected to output the feature maps, with initial weights of [1, 0.2, 0.2, 0.02, 0.03, 0.01] as in the HDT tracker [12]. We use a padding patch 2.2 times the size of the object bounding box as the input of the VGG-Net. All the feature maps are linearly interpolated to the same size as the padding patch. The thresholds of the tracking confidence criteria $\beta _{fmax}$ and $\beta _{apce}$ are set to 0.7 and 0.6, respectively. The object model updating rate is set to 0.01.
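The text gives the updating rate but not the update formula. A common choice in correlation filter trackers, assumed here, is linear interpolation between the previous model and the newly trained one, applied only on high-confident frames. The function name and arguments below are hypothetical.

```python
import numpy as np

def update_model(model, new_estimate, confident, rate=0.01):
    """Blend the newly trained model into the old one, but only on confident frames."""
    if not confident:
        return model  # keep the old model to avoid contamination
    # Assumption: linear interpolation with the updating rate of Sect. 3.1.
    return (1.0 - rate) * model + rate * new_estimate

old = np.zeros((2, 2))
new = np.ones((2, 2))
print(update_model(old, new, confident=True))
```

With a rate of 0.01 each confident frame contributes one percent of the new estimate, so the model adapts slowly and a single unreliable frame has little effect even if it passes the confidence test.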

We analyze and evaluate our approach on the OTB-13 dataset [15], which contains 50 image sequences covering a variety of challenging factors. The success plot shows the percentage of frames in which the overlap between the tracking box and the ground truth is greater than a threshold. The precision plot is similar but is based on the center pixel error. Both the success and precision scores are measured by the area under the curve (AUC). We implement our tracker in Matlab with the MatConvNet toolbox. It runs at about 2 fps on average on a laptop with an Intel i7-7700HQ CPU and an NVIDIA GTX 1060 GPU, which is used only to extract the CNN features.
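The success score described above can be sketched as follows. Boxes are assumed to be given as (x, y, width, height), the overlap is the intersection over union, and the AUC is approximated by averaging the success rate over 21 evenly spaced thresholds in [0, 1]; the threshold grid and the function names are assumptions, not taken from the text.

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap ratio of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(pred_boxes, gt_boxes, thresholds=None):
    """Fraction of frames whose overlap exceeds each threshold."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)  # assumed threshold grid
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    rates = np.array([(overlaps > t).mean() for t in thresholds])
    return thresholds, rates

def success_auc(pred_boxes, gt_boxes):
    """Area under the success curve, approximated by the mean success rate."""
    _, rates = success_curve(pred_boxes, gt_boxes)
    return float(rates.mean())

gt = [(0, 0, 10, 10), (5, 5, 10, 10)]
pred = [(2, 0, 10, 10), (5, 5, 10, 10)]
print(success_auc(pred, gt))
```

The precision plot follows the same pattern with the center distance in pixels in place of the overlap and a "less than threshold" test.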

3.2 Analysis of Our Approach

To analyze the effect of the modules composing our tracker, we disable components of our full model (SRHDT) one by one. SRHDT1 denotes the model without the updating strategy. SRHDT2 additionally disables the foreground segmentation. SRHDT3, like SRHDT2, has neither the segmentation nor the updating strategy, but uses multi-scale search for scale estimation instead of the DSST method. We use one-pass evaluation (OPE) for the analysis.

As shown in Fig. 2, the full model achieves the best accuracy in both average success rate and precision. The updating strategy leads to an improvement, mainly because it handles occlusions better. When we disable the segmentation, regardless of which scale search method is used, both plots drop significantly. This is because, while scale prediction can fit scale changes better, it deteriorates the performance under occlusion, illumination variation, deformation, etc.

3.3 Comparison with Other Trackers

We compare our tracker with several state-of-the-art trackers. TLD [8] is a traditional baseline method without the correlation filter. CSK [6] and KCF [7] are correlation filter based methods with hand-crafted features. MEEM [16] fuses multiple trackers with regression. SiamFC [1] is a recent CNN based method without the correlation filter. HDT [12] is our baseline method, which ensembles several DCF based trackers using different CNN layers.

Figure 3 illustrates the OPE results on the OTB-13 dataset. Our approach outperforms all these advanced trackers in average success rate and precision. The results of these trackers are taken from the authors’ released data or code. Note that for the HDT tracker, we used the first frame for initialization instead of the second frame, which leads to a gap from the results reported in the original paper. Our method also uses the first frame to initialize the tracking.

Figure 4 gives the average overlap success rates of each tracker under different challenging situations. It can be seen that our method performs better under most conditions. For scale changes, our tracker performs slightly worse than SiamFC, which depends on the scale predictor used. For deformation, our method does not perform as well as HDT because the correlation filter does not handle deformation well, and adding the scale estimation worsens this shortcoming.

4 Conclusions

In this paper, we introduced spatial reliability into the multi-layer CNN features based tracker. In the training stage, a foreground mask is calculated using color histograms. For each chosen CNN layer, a correlation filter is trained under the foreground constraint to construct a weak tracker. In the next frame, the final tracking position is obtained as a weighted combination of the weak trackers, and the scale change is then estimated by the DSST method. The weights are updated by the Hedge method. The response peak and oscillation are both used to estimate the tracking confidence. The model and weight of each weak tracker are updated only when the tracking confidence is high. The evaluation on the OTB-13 dataset shows that our approach outperforms several state-of-the-art methods. In the future, we envision extending our approach to multi-object tracking.

References

1. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 850–865. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_56
2. Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M.: Visual object tracking using adaptive correlation filters. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2544–2550. IEEE (2010)
3. Danelljan, M., Häger, G., Khan, F., Felsberg, M.: Accurate scale estimation for robust visual tracking. In: British Machine Vision Conference, Nottingham, 1–5 September 2014. BMVA Press (2014)
4. Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M.: Learning spatially regularized correlation filters for visual tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4310–4318 (2015)
5. Galoogahi, H.K., Sim, T., Lucey, S.: Correlation filters with limited boundaries. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4630–4638. IEEE (2015)
6. Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: Exploiting the circulant structure of tracking-by-detection with kernels. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 702–715. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_50
7. Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2015)
8. Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1409–1422 (2012)
9. Li, Y., Zhu, J.: A scale adaptive kernel correlation filter tracker with feature integration. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8926, pp. 254–265. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16181-5_18
10. Lukezic, A., Vojir, T., Cehovin, L., Matas, J., Kristan, M.: Discriminative correlation filter with channel and spatial reliability. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2 (2017)
11. Ma, C., Huang, J.B., Yang, X., Yang, M.H.: Hierarchical convolutional features for visual tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3074–3082 (2015)
12. Qi, Y., Zhang, S., Qin, L., Yao, H., Huang, Q., Lim, J., Yang, M.H.: Hedged deep tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4303–4311 (2016)
13. Wang, M., Liu, Y., Huang, Z.: Large margin object tracking with circulant feature maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 21–26 (2017)
14. Wang, N., Shi, J., Yeung, D.Y., Jia, J.: Understanding and diagnosing visual tracking systems. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3101–3109. IEEE (2015)
15. Wu, Y., Lim, J., Yang, M.H.: Online object tracking: a benchmark. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2411–2418. IEEE (2013)
16. Zhang, J., Ma, S., Sclaroff, S.: MEEM: robust tracking via multiple experts using entropy minimization. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 188–203. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_13

Acknowledgments

The authors gratefully acknowledge financial support from the China Scholarship Council.

Author information

Authors and Affiliations

Le2i EA7508, CNRS, Arts et Métiers, Univ. Bourgogne Franche-Comté, UTBM, 90010, Belfort, France
Tao Yang, Cindy Cappelle, Yassine Ruichek & Mohammed El Bagdouri

Corresponding authors

Correspondence to Tao Yang, Cindy Cappelle or Yassine Ruichek.

Editor information

Editors and Affiliations

Université de Bourgogne, Dijon, France
Alamin Mansouri
Université de Caen Normandie, Caen, France
Abderrahim El Moataz
Université du Québec à Trois-Rivières, Trois-Rivieres, Québec, Canada
Fathallah Nouboud
Université Ibn Zohr, Agadir, Morocco
Driss Mammass

About this paper

Cite this paper

Yang, T., Cappelle, C., Ruichek, Y., El Bagdouri, M. (2018). Visual Tracking Using Multi-layer CNN Features Based Discriminant Correlation Filters with Foreground Mask. In: Mansouri, A., El Moataz, A., Nouboud, F., Mammass, D. (eds) Image and Signal Processing. ICISP 2018. Lecture Notes in Computer Science, vol 10884. Springer, Cham. https://doi.org/10.1007/978-3-319-94211-7_37

DOI: https://doi.org/10.1007/978-3-319-94211-7_37
Published: 30 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94210-0
Online ISBN: 978-3-319-94211-7
eBook Packages: Computer Science, Computer Science (R0)
