1 Introduction

Visual tracking is an important research topic in computer vision. It has wide applications and remains a challenging problem. Discriminative methods, which train a classifier to distinguish the target from the background, have shown good performance in recent years. They face two main challenges [14]: (1) developing an efficient learning algorithm to train the classifier; (2) finding an effective feature representation of the object. In addition, the fusion of different trackers has a strong influence on performance, especially when the trackers are highly diverse.

Thanks to the dense sampling allowed by the circulant matrix, trackers based on the discriminative correlation filter (DCF) have shown state-of-the-art performance. The circulant structure also allows learning and tracking to be implemented efficiently with the discrete Fourier transform (DFT). The correlation filter was successfully introduced into visual tracking by the minimum output sum of squared error (MOSSE) tracker [2], which used a gray-scale template. Henriques et al. employed the kernel method [6] and multi-channel HOG features [7]. Several approaches addressed the windowing problem caused by the wrap-around circular shifting of the target. Galoogahi et al. [5] proposed a zero-padded filter, and Danelljan et al. [4] ignored the boundary pixels of all the shifted samples. Lukezic et al. [10] used a color histogram based foreground/background model and the Bayesian method to segment the target mask, and used the resulting spatial reliability constraint to train the correlation filter and reduce the boundary effect, which achieved competitive results with hand-crafted features. For the multi-scale problem, Li and Zhu [9] used multi-scale search, and Danelljan et al. [3] proposed discriminative scale space tracking (DSST) based on HOG features.

On the other hand, inspired by the success of deep learning in computer vision, more and more CNN methods have been developed for visual tracking. CNN features generally outperform hand-crafted features. Further, taking into account that the CNN features in the earlier layers express more spatial information while the features in the later layers contain more semantic information, Ma et al. [11] were the first to use multi-layer CNN features to construct weak trackers, combining the results hierarchically from the last layers to the first layers. However, the precise positioning relies on the coarse tracking of the last layers, which leads to tracking drift once the coarse positioning fails. Qi et al. [12] used the Hedge method to adaptively weight several weak trackers based on CNN features into a single stronger one.

In this work, we consider the natural idea of introducing spatial reliability into the multi-layer CNN feature based approach. Section 2 describes our approach in detail. In Sect. 3, we analyze and evaluate our system on the OTB-13 dataset [15], which shows that our approach performs favorably against several state-of-the-art methods. Section 4 summarizes the paper and discusses future improvements.

The contributions of this paper are summarized below:

  • We introduce the spatial reliability constrained correlation filter into a tracker based on multi-layer CNN features.

  • We improve the object model update strategy using the weighted sums of the response peaks and oscillation indicators of weak trackers.

2 Proposed Method

In this section, we present our method in detail. The main framework is illustrated in Fig. 1. In the training stage, a foreground mask of the object is calculated using the color histogram based model. For each chosen CNN layer, a correlation filter is trained under the foreground constraint to construct a weak tracker. In the tracking stage on the next frame, the response maps are obtained from the feature maps of the region of interest at the different layers and the corresponding correlation filters. The final tracking position is the weighted combination of the positions given by the response maps. The scale change is then estimated by the DSST method [3]. The weights are calculated by the Hedge method [12]. Both the response peak and the response oscillation are considered to estimate the tracking confidence. The model and weight of each weak tracker are updated only when the tracking confidence is high.

Fig. 1. The main flowchart of our approach.

2.1 Masked Correlation Filter

A general correlation tracker trains a linear classifier from samples generated by a circulant matrix and constructs a response map during tracking. In this work, we use a CNN to extract feature maps. The feature map of each layer trains an individual classifier, which is called a weak tracker. \(\mathbf x ^k \in \mathbb {R}^{M \times N \times D}\) is the k-th layer feature map, where M, N and D are the width, height and number of channels, respectively. We use \(\mathbf x \) to denote \(\mathbf x ^k\) concisely; the other variables follow the same convention. Through the circulant matrix, \(\mathbf X _d = C(\mathbf x _d)\) is generated by shifting the d-th channel along the M and N dimensions. Given the Gaussian shaped label \(\mathbf y \), a correlation filter \(\mathbf{h } = \{\mathbf{h }_d\}_{d = 1:D}\) can be obtained by solving the minimization problem:

$$\begin{aligned} \mathbf h = \arg \min _\mathbf{h } \sum \limits _{d=1}^{D} (||\mathbf x _d \circledast \mathbf h _d - \mathbf y ||^{2} + \lambda ||\mathbf h _d ||^{2}) \end{aligned}$$
(1)

where \(\mathbf x _d \circledast \mathbf h _d\) is the convolution of d-th channel feature map \(\mathbf x _d\) with the filter \(\mathbf h _d\), which is equal to the inner product between each shifted matrix in \(\mathbf X _d\) and \(\mathbf h _d\). \(\lambda \) is the regularization parameter.

By diagonalizing the circulant matrix with the DFT, the closed-form solution can be computed efficiently in the Fourier domain [2]. When we introduce the foreground mask \(\mathbf m \in \mathbb {R}^{M \times N}\) as a constraint \(\mathbf h = \mathbf h \odot \mathbf m \), the closed-form solution is no longer available. We express the DFT as \(\hat{\mathbf{a }} = \mathscr {F}(\mathbf a )\). Through the ADMM method [10], we can obtain an approximate solution for the filter \(\hat{\mathbf{h }}\).
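For reference, the unconstrained single-channel case that does admit a closed form can be sketched as below. This is only an illustration of the ridge-regression filter in the Fourier domain, not the masked ADMM solver of [10]; the function names, the label width `sigma` and the value of `lam` are assumptions made for the example.

```python
import numpy as np

def gaussian_label(M, N, sigma=2.0):
    """Gaussian-shaped label y with its peak wrapped to index (0, 0)."""
    m = np.minimum(np.arange(M), M - np.arange(M))  # circular distance from 0
    n = np.minimum(np.arange(N), N - np.arange(N))
    return np.exp(-(m[:, None] ** 2 + n[None, :] ** 2) / (2 * sigma ** 2))

def train_filter(x, y, lam=1e-2):
    """Closed-form ridge solution for one channel, kept in the Fourier domain."""
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    return x_hat * np.conj(y_hat) / (np.abs(x_hat) ** 2 + lam)

def respond(h_hat, z):
    """Single-channel response: f = F^{-1}(conj(h_hat) * z_hat)."""
    return np.real(np.fft.ifft2(np.conj(h_hat) * np.fft.fft2(z)))

# Usage: train on a patch, then locate a circularly shifted copy of it.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
h_hat = train_filter(x, gaussian_label(32, 32))
f = respond(h_hat, np.roll(x, (3, 5), axis=(0, 1)))
peak = np.unravel_index(np.argmax(f), f.shape)  # location of the response peak
```

The peak of the response map follows the shift applied to the test patch, which is the property the tracker relies on to localize the target.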

Given the testing data \(\mathbf z \) from the output of the k-th layer, with \(\hat{\mathbf{z }}_d = \mathscr {F}(\mathbf z _d)\), the response map can be calculated by:

$$\begin{aligned} \mathbf f = \mathscr {F^{-1}} (\sum \limits _{d=1}^{D} \hat{\mathbf{h }}_d^{*} \odot \hat{\mathbf{z }}_d) \end{aligned}$$
(2)

where \(\mathscr {F^{-1}} (\hat{\mathbf{a }})\) is the inverse Fourier transform, \(\hat{\mathbf{a }} \odot \hat{\mathbf{b }}\) is the element-wise product, and \(\mathbf a ^{*}\) denotes the Hermitian transpose. \(\mathbf f \) concisely denotes the response map \(\mathbf f ^k\) of the k-th weak tracker. The object position estimated by this layer is the location of the highest response. The final target position \((x, y)\) is the weighted sum of each layer’s output:

$$\begin{aligned} (x, y) = \sum \limits _{k=1}^{K} (w^k x^k, w^k y^k) \end{aligned}$$
(3)

where \((x^k, y^k)\) and \(w^k\) are the estimated position and the weight of the k-th weak tracker, respectively.
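A minimal sketch of the fusion in Eq. (3) is given below. It assumes that the weights are normalised to sum to one before fusion and returns the position as (row, column); both choices, as well as the function name, are assumptions made for illustration.

```python
import numpy as np

def fuse_positions(responses, weights):
    """Eq. (3): weighted sum of the peak locations of the K weak trackers."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # assumption: weights are normalised to sum to 1
    peaks = np.array(
        [np.unravel_index(np.argmax(f), f.shape) for f in responses],
        dtype=float,
    )
    return tuple(w @ peaks)  # fused (row, col)

# Usage with two toy response maps peaking at (1, 1) and (3, 3).
f1 = np.zeros((5, 5)); f1[1, 1] = 1.0
f2 = np.zeros((5, 5)); f2[3, 3] = 1.0
position = fuse_positions([f1, f2], weights=[3.0, 1.0])
```

With these weights the fused position lies closer to the peak of the first map, since that weak tracker carries three quarters of the total weight.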

2.2 Foreground Mask Generation

The foreground mask is the binary segmentation of the object. For the pixel \(\mathbf p \) in the training patch, which has the appearance \(\mathbf c \), the posterior probability of the mask element \(m \in \{0, 1\}\) on this pixel is calculated by [10]:

$$\begin{aligned} p(m = 1|\mathbf p , \mathbf c ) \propto p(\mathbf c |\mathbf p , m=1) p(\mathbf p |m=1) p(m=1) \end{aligned}$$
(4)

The likelihood of foreground mask is expressed as \(p(\mathbf c |\mathbf p , m=1) p(\mathbf p |m=1)\). \(p(\mathbf c |\mathbf p , m=1)\) can be calculated using back projection of the foreground color histogram. \(p(\mathbf p |m=1)\) is defined by the Epanechnikov distribution which is a quadratic function about the distance from the pixel to the training patch center. The prior \(p(m=1)\) is calculated by the sizes of the target and padding patch considering the likelihood of foreground and background.
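The computation of Eq. (4) can be sketched as follows, under several simplifying assumptions: the pixels are already quantised into histogram bin indices, the posterior is normalised against a background term with a uniform spatial prior, and the prior \(p(m=1)\) is passed in as a constant. The function names and the 0.5 threshold are hypothetical.

```python
import numpy as np

def epanechnikov_prior(M, N):
    """Spatial term p(p | m=1): quadratic fall-off from the patch centre."""
    r = np.linspace(-1.0, 1.0, M)[:, None]
    c = np.linspace(-1.0, 1.0, N)[None, :]
    return np.clip(1.0 - (r ** 2 + c ** 2) / 2.0, 0.0, None)

def foreground_posterior(bins, hist_fg, hist_bg, prior_fg=0.5):
    """Per-pixel posterior p(m=1 | p, c) following Eq. (4)."""
    spatial = epanechnikov_prior(*bins.shape)
    # Back projection of the foreground histogram times spatial term and prior.
    fg = hist_fg[bins] * spatial * prior_fg
    # Assumed background counterpart with a uniform spatial term.
    bg = hist_bg[bins] * (1.0 - prior_fg)
    return fg / (fg + bg + 1e-12)

# Usage: a 5x5 patch whose central 3x3 block has the "object" colour (bin 1).
bins = np.zeros((5, 5), dtype=int)
bins[1:4, 1:4] = 1
hist_fg = np.array([0.1, 0.9])  # foreground colour histogram
hist_bg = np.array([0.9, 0.1])  # background colour histogram
mask = foreground_posterior(bins, hist_fg, hist_bg) > 0.5
```

In this toy patch the thresholded posterior recovers the central block as foreground and rejects the surrounding pixels.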

Scale estimation is necessary for the segmentation. Since a direct multi-scale search fails once the background pixels are disabled, we use the DSST method [3] on the whole training patch to estimate the scale after the position prediction in each frame.

2.3 Updating Strategy

Most trackers update the appearance model of the target in every frame, which leads to model contamination when blur or occlusion occurs. In our method, the models and weights are updated only when the tracking confidence is high. We use the response peak \(f_{max}\) and the response oscillation indicator, the average peak-to-correlation energy (APCE) [13], to determine the tracking confidence. When the tracking fails or occlusion occurs, both criteria drop significantly. We calculate \(f_{max}^k\) and \(APCE^k\) for each weak tracker and then weight them by \(w^k\):

$$\begin{aligned} f_{max} = \sum \limits _{k=1}^{K} w^k f_{max}^k \end{aligned}$$
(5)
$$\begin{aligned} APCE = \sum \limits _{k=1}^{K} w^k \frac{(f_{max}^k - f_{min}^k)^2}{\mathrm {mean}\left( \sum \limits _{m, n} (f_{m,n}^k - f_{min}^k)^2\right) } \end{aligned}$$
(6)

where \(f_{min}^k\) is the minimum response in the k-th response map, \(f_{m,n}^k\) is the response at \((m, n)\), and K is the number of weak trackers.

Finally, we use the ratios of \(f_{max}\) and APCE to their respective means over all the frames as the final criteria \(\beta _{fmax}\) and \(\beta _{apce}\). When both are larger than their thresholds, we consider the frame to be tracked with high confidence.
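The confidence test built from Eqs. (5) and (6) can be sketched as below. The default thresholds are the values 0.7 and 0.6 reported in the implementation details; keeping the history of past values in plain lists, and all function names, are assumptions made for this example.

```python
import numpy as np

def apce(f):
    """APCE of one response map (the per-tracker term inside Eq. 6)."""
    f_min = f.min()
    return (f.max() - f_min) ** 2 / np.mean((f - f_min) ** 2)

def fused_confidence(responses, weights):
    """Eqs. (5) and (6): weight the per-layer peaks and APCE values by w^k."""
    f_max = sum(w * f.max() for f, w in zip(responses, weights))
    a = sum(w * apce(f) for f, w in zip(responses, weights))
    return f_max, a

def is_confident(f_max, a, hist_f_max, hist_apce, th_fmax=0.7, th_apce=0.6):
    """Compare the current values with their means over the stored frames."""
    beta_fmax = f_max / np.mean(hist_f_max)
    beta_apce = a / np.mean(hist_apce)
    return bool(beta_fmax > th_fmax and beta_apce > th_apce)

# Usage: a sharp single-peak map versus a map with a broad plateau.
sharp = np.zeros((4, 4)); sharp[1, 2] = 1.0
broad = np.zeros((4, 4)); broad[:2] = 1.0
f_max, a = fused_confidence([sharp, broad], weights=[0.5, 0.5])
update_model = is_confident(f_max, a, hist_f_max=[1.0], hist_apce=[16.0])
```

A sharp, unimodal response yields a much larger APCE than a plateau, so the fused score falls when some of the weak trackers produce ambiguous maps.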

The weights are then updated by the Hedge method [12], again only when the tracking confidence is high. The loss of the k-th weak tracker is computed as \(l^k = f^k_{max} - f^k(x,y)\), where \(f^k(x,y)\) is the response at the final tracking position on the k-th response map. The weights at \(t+1\) are updated by minimizing the regret:

$$\begin{aligned} \mathbf w _{t+1} = \arg \min _\mathbf{w } \sum \limits _{t=1}^{T} (\sum \limits _{k=1}^{K} w^k l^k_t - l^k_t) \end{aligned}$$
(7)
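To illustrate the idea, the sketch below uses the standard exponential-weights form of Hedge together with the loss \(l^k\) defined above. It is a simplified stand-in and not the adaptive variant of [12]; the learning rate `eta` and the function names are assumptions.

```python
import numpy as np

def tracker_losses(responses, position):
    """l^k = f^k_max - f^k(x, y) for every weak tracker."""
    r, c = position
    return np.array([f.max() - f[r, c] for f in responses])

def hedge_update(weights, losses, eta=1.0):
    """Standard Hedge step: down-weight trackers in proportion to exp(-eta * loss)."""
    w = np.asarray(weights, dtype=float) * np.exp(-eta * np.asarray(losses, dtype=float))
    return w / w.sum()

# Usage: the first tracker agrees with the fused position, the second does not.
g1 = np.zeros((3, 3)); g1[1, 1] = 1.0
g2 = np.zeros((3, 3)); g2[0, 0] = 1.0; g2[1, 1] = 0.25
losses = tracker_losses([g1, g2], position=(1, 1))
new_weights = hedge_update([0.5, 0.5], losses)
```

A weak tracker whose own peak is far from the fused position incurs a larger loss and therefore loses weight in the next frame.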

3 Experimental Results

In this section, we present the experimental results of our proposed method. We first discuss the implementation details. Then we analyze our approach on OTB-13 dataset. Finally, we compare the performance to several advanced trackers.

3.1 Implementation Details

For feature extraction, we adopt the pre-trained VGG-Net-19 CNN and remove the fully-connected layers so that the network accepts inputs of any size. Six convolutional layers (10, 11, 12, 14, 15, 16) are selected to output the feature maps, with initial weights of [1, 0.2, 0.2, 0.02, 0.03, 0.01] as in the HDT tracker [12]. We use a padding patch 2.2 times the size of the object bounding box, which constitutes the input of the VGG-Net. All the feature maps are linearly interpolated to the same size as the padding patch. The thresholds of the tracking confidence criteria \(\beta _{fmax}\) and \(\beta _{apce}\) are set to 0.7 and 0.6, respectively. The object model updating rate is set to 0.01.

We analyze and evaluate our approach on the OTB-13 dataset [15], which contains 50 image sequences covering a variety of challenging factors. The success plot shows the percentage of frames in which the overlap between the tracking box and the ground truth is greater than a threshold. The precision plot is similar but is based on the center pixel error. Both the success and precision scores are measured by the area under the curve (AUC). We implement our tracker in Matlab with the MatConvNet toolbox. The tracker runs at an average of 2 fps on a laptop with an Intel i7-7700HQ CPU and an NVIDIA GTX 1060 GPU, which is only used to extract the CNN features.
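The success score can be sketched as follows. The sketch assumes boxes in (x, y, width, height) format, 21 evenly spaced overlap thresholds in [0, 1], and the AUC approximated by the mean of the success curve; these choices and the function names are assumptions, not details taken from the benchmark toolkit.

```python
import numpy as np

def iou(a, b):
    """Overlap ratio of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def success_auc(pred, gt, n_thresholds=21):
    """Fraction of frames above each overlap threshold, averaged over thresholds."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred, gt)])
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    curve = np.array([(overlaps > t).mean() for t in thresholds])
    return curve.mean()

# Usage: one perfectly tracked frame and one frame shifted by half the box width.
gt = [(0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 10, 10)]
score = success_auc(pred, gt)
```

The precision score can be computed in the same way by replacing the overlap with the center pixel error and counting the frames that fall below each distance threshold.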

3.2 Analysis of Our Approach

To analyze the effect of the modules composing our tracker, we disable the components of our full model (SRHDT) one by one. SRHDT1 denotes the model without the updating strategy. SRHDT2 denotes the model in which the foreground segmentation is disabled as well. SRHDT3, like SRHDT2, has neither the segmentation nor the updating strategy module, but uses multi-scale search for the scale estimation instead of the DSST method. We use one-pass evaluation (OPE) for the analysis.

Fig. 2. Success and precision plots of our full model and the models disabling different components.

As shown in Fig. 2, the full model achieves the best accuracy in both average success rate and precision. The updating strategy brings an improvement, mainly because it handles occlusions better. When we disable the segmentation, both plots drop significantly, regardless of which scale search method is used. This is because scale prediction, although a better fit for scale changes, deteriorates the performance under occlusion, illumination variation, deformation, etc.

3.3 Comparison with Other Trackers

We compare our tracker with several state-of-the-art trackers. TLD [8] is a traditional baseline method without a correlation filter. CSK [6] and KCF [7] are correlation filter based methods with hand-crafted features. MEEM [16] fuses multiple trackers with regression. SiamFC [1] is a recent CNN based method without a correlation filter. HDT [12] is our baseline method, which ensembles several DCF based trackers using different CNN layers.

Figure 3 illustrates the OPE results on the OTB-13 dataset. Our approach outperforms all these advanced trackers in average success rate and precision. The results of these trackers are taken from the authors’ released data or code. Note that for the HDT tracker, we initialize with the first frame instead of the second frame, which leads to a gap from the results reported in the original paper. Our method also uses the first frame to initialize the tracking.

Fig. 3. Comparisons with the state-of-the-art trackers.

Fig. 4. Comparison under different challenging situations.

Figure 4 gives the average overlap success rates of each tracker under different challenging situations. It can be seen that our method performs better under most conditions. For scale changes, our tracker performs slightly below SiamFC, which depends on the scale predictor used. For deformation, our method does not perform as well as HDT, because the correlation filter does not handle deformation well and adding the scale estimation worsens this shortcoming.

4 Conclusions

In this paper, we introduced spatial reliability into the multi-layer CNN feature based tracker. In the training stage, a foreground mask is calculated using the color histograms. For each chosen CNN layer, a correlation filter is trained under the foreground constraint to construct a weak tracker. In the next frame, the final tracking position is obtained by weighting the weak trackers, and the scale change is then estimated by the DSST method. The weights are updated by the Hedge method. Both the response peak and the response oscillation are considered to estimate the tracking confidence. The model and weight of each weak tracker are updated only when the tracking confidence is high. The evaluation on the OTB-13 dataset shows that our approach performs favorably against several state-of-the-art methods. In the future, we envision extending our approach to multi-object tracking.