


2017 International Conference on Image and Vision Computing New Zealand (IVCNZ), 2017

Effect of Contextual Information on Object Tracking

Mohammad Hedayati, Michael J. Cree, Jonathan B. Scott
School of Engineering, University of Waikato, Hamilton, New Zealand
mh267@students.waikato.ac.nz, cree@waikato.ac.nz, scottj@waikato.ac.nz

Abstract—Local object information, such as the appearance and motion features of the object, is useful for object tracking in videos provided the object is not occluded by other elements in the scene. During occlusion, however, the local object information in the video frame does not properly represent the true properties of the object, which leads to tracking failure. We propose a framework that combines multiple cues, including the local object information, the background characteristics and group motion dynamics, to improve object tracking in challenging cluttered environments. The performance of the proposed tracking model is compared with the kernelised correlation filter (KCF) tracker. In the tested video sequences the proposed tracking model correctly tracked objects even when the KCF tracker failed because of occlusion and background noise.

Index Terms—video analysis, object tracking, occlusion

I. INTRODUCTION

Over the last few decades an enormous amount of study has been dedicated to object tracking [1]. Object tracking remains a challenging topic in computer vision due to problems caused by changes in size or pose of the object, noise produced by the image acquisition, variation of light, occlusion and background clutter [2, 3]. Moreover, the complexity of tracking increases when multiple moving objects are tracked, because locating targets and maintaining their identities through a video sequence is a highly challenging problem in crowded environments.

Wu et al. [4] performed experiments to evaluate the performance of recent online tracking algorithms and identified three important components that improve tracking performance. First, background information is necessary, mainly to separate background clutter from the object of interest. Second, local models are particularly useful when the appearance of the target has partially changed. Third, the motion model is crucial for object tracking, especially when the motion of the target is abrupt.

In addition to the above components, when objects move in a group they tend to move relative to each other, following a similar motion pattern. This group dynamic often provides an important cue for approximating the location of an object, especially when the local information is poor or unreliable. This research proposes a framework that combines the group dynamics with local object information to improve object tracking in challenging cluttered environments.

II. LITERATURE REVIEW

Assuming the foreground pixels in the video represent the foreground object, primitive tracking systems used background subtraction to separate the foreground from the background, and tracking was then performed by enforcing spatial continuity using Kalman filtering [5, 6]. Colour-based models such as mean shift [7] and particle filtering [8] have also achieved considerable success in many tracking applications. Particle tracking is a process of propagating the posterior distribution of the reference target according to a system dynamic model. Pérez et al. [9] and Nummiaro et al. [10] proposed two independent solutions that couple the colour information of objects with the dynamic model of the system.
Mean shift is a non-parametric technique for finding the mode of a probability density function by gradient ascent (or descent) [7]: the local maxima (or minima) of a distribution are found by iteratively climbing (or descending) the density gradients until convergence. In computer vision applications, mean shift was originally employed by Comaniciu and Meer [11] for segmentation, and later Bradski [12] used the mean shift framework for tracking. The mean shift tracker calculates the centroid of the colour probability distribution within its 2D tracking window and then moves the window centre to that centroid. Although the mean shift tracker gives reasonable accuracy in a wide range of environments, it is prone to failure when 1) the object and background have similar features, causing the gradient search to get stuck in a local optimum, and 2) the object is completely or partially occluded, reducing the object likelihood and leading to convergence to the wrong point.

Recently, tracking-by-detection algorithms have become popular for object tracking [4]. The methodology behind these models is similar to discriminative object detection. Given an initial object location, the goal of tracking-by-detection is to train a classifier on-line to distinguish the tracked object from the background. During tracking the sample space is continually updated and the classifier is retrained; at time instant $t$ the sampling space can be written as $\{x_0^+, x_1^+, \ldots, x_t^+, x_0^-, x_1^-, \ldots, x_t^-\}$, where $x_t^+$ and $x_t^-$ are the positive and negative samples at time $t$.

Various classifiers have been integrated into the tracking-by-detection framework. Support vector tracking [13] used a support vector machine (SVM) classifier to distinguish foreground motion from the background. Kalal et al. [14] proposed a long-term tracking approach based on a boosting classifier; the classifier is updated using all extracted appearances up to the current frame that pass a variance filter. Hare et al. [15] employ a structured SVM to directly link the target's location space with the training samples and thereby reduce the training time. The kernelised correlation filter (KCF) tracker proposed by Henriques et al. [16] achieves the fastest and highest performance among recent top-performing tracking-by-detection algorithms [17]. The key to the KCF tracker is that augmentation of negative samples is employed to enhance the discriminative ability of the tracking-by-detection scheme while exploiting the structure of the circulant matrix [18] for high efficiency.

The reviewed tracking algorithms mostly focus on local object information to track objects of interest. During occlusion, however, the local object information does not properly represent the true properties of the object, which leads to tracking failure. In contrast to these methods, we propose a framework that combines contextual information with local object properties to improve tracking in cluttered environments.
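For reference, the sketch below (not from the paper) shows one way to run the KCF baseline of Henriques et al. [16] through OpenCV's built-in implementation. The video file name and the initial bounding box are placeholders, and, depending on the OpenCV build, the factory function may live under cv2.legacy instead.

    # Minimal sketch of running a KCF baseline with OpenCV (requires opencv-contrib-python).
    import cv2

    cap = cv2.VideoCapture("waikato-1.avi")        # hypothetical file name
    ok, frame = cap.read()
    init_bbox = (100, 150, 40, 80)                 # (x, y, w, h) of the manually selected object

    # Some builds expose this as cv2.legacy.TrackerKCF_create instead.
    tracker = cv2.TrackerKCF_create()
    tracker.init(frame, init_bbox)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, bbox = tracker.update(frame)        # found is False when the target is lost
        if found:
            x, y, w, h = map(int, bbox)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 0), 2)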
III. PROPOSED MODEL

The proposed tracking model contains two main modules, namely point level processing and localisation (see Figure 1). The point processing block is based on the assumption that in reality the sample points are rarely independent: they are parts of bigger units, namely the objects being tracked [19], and therefore the sample points should share the same motion and a similar colour distribution. Consequently, the location of the object can be estimated by tracking the points sampled from the same object. The point processing block aims to find the best points from a noisy sample space by a series of filtering stages. In object localisation, two different strategies are used, namely object based and group based localisation. Object based localisation is applied when the sample points correctly represent the local object motion and appearance. Group based localisation is applied when the local information does not properly represent the object, mainly due to occlusion and background clutter.

Fig. 1. Block diagram of the proposed tracking model. The point processing block aims to find the points that best represent the object's properties. For localisation, two strategies are used: object based localisation is applied when the sample points properly represent the object's local properties; when the local information does not properly represent the object, localisation switches to the group based model.

A. Feature Extraction

To overcome appearance ambiguities and to handle occlusion, the object features are extracted at three sampling levels: point level, object level and group level. The definitions of these features are as follows (a sketch of how the point level cues can be computed is given after the list):

1) Object template ($w$) refers to the rectangular window around the object; it is also referred to as the tracking window.

2) Point level motion cues ($U_p$) are the flow of the sample points extracted from the object template, where $U_p = (u_{p,x}, u_{p,y})$ is the motion cue for a given point $p$. Specifically, given the point $p = (p_x, p_y)$ on the selected template in frame $I$, we estimate its corresponding location $p' = (p_x + u_{p,x}, p_y + u_{p,y})$ in frame $I+1$ using the iterative pyramidal Lucas-Kanade method [20]. The sample points are extracted using Shi and Tomasi corner detection [21].

3) Point level colour cues ($H_{p'}$) refer to the colour distribution of a 15 × 15 rectangular patch around each sample point, calculated from the histogram of the hue and saturation channels in HSV colour space.

4) Object motion model, $U_o = (u_{o,x}, u_{o,y})$, refers to the tracking window displacement. It is estimated by averaging the $J$ point level motion vectors of the given object $o$:

$$U_o = \frac{1}{J} \sum_{i=1}^{J} U_{p,i} \qquad (1)$$

5) Object colour model ($H_o$) refers to the colour distribution of the tracking window, calculated from the histogram of the hue and saturation channels in HSV colour space.

6) Group motion model, $U_g = (u_{g,x}, u_{g,y})$, is estimated by averaging the motion models of all $m$ objects:

$$U_g = \frac{1}{m} \sum_{i=1}^{m} U_{o,i} \qquad (2)$$

7) Object relative speed, $U_v = (u_{v,x}, u_{v,y})$, refers to the speed of the individual object relative to the group motion model, viz

$$U_v = \frac{U_o}{U_g} \qquad (3)$$

8) Background motion, $U_b = (u_{b,x}, u_{b,y})$, refers to the dominant motion in the frame, estimated using the algorithm described by Hedayati et al. [22].
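The following sketch is our illustration, not the authors' code, of how the point level cues defined above might be obtained with OpenCV: Shi and Tomasi corners [21] inside the tracking window, their pyramidal Lucas-Kanade flow [20], and a hue-saturation histogram of the 15 × 15 patch around each tracked point. The corner-detector parameters and the 16-bin histograms are illustrative choices of our own.

    import cv2
    import numpy as np

    def point_features(frame, next_frame, track_win):
        """Return sample points, their flow U_p, LK status, and patch histograms H_p."""
        x, y, w, h = track_win
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray_next = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)

        # Shi-Tomasi corners restricted to the object template (tracking window).
        mask = np.zeros_like(gray)
        mask[y:y + h, x:x + w] = 255
        pts = cv2.goodFeaturesToTrack(gray, maxCorners=50, qualityLevel=0.01,
                                      minDistance=5, mask=mask)

        # Pyramidal Lucas-Kanade flow: U_p = p' - p.
        pts_next, status, _ = cv2.calcOpticalFlowPyrLK(gray, gray_next, pts, None)
        flow = (pts_next - pts).reshape(-1, 2)

        # Hue-saturation histogram of a 15x15 patch around each tracked point.
        hsv = cv2.cvtColor(next_frame, cv2.COLOR_BGR2HSV)
        hists = []
        for (px, py) in pts_next.reshape(-1, 2).astype(int):
            patch = hsv[max(py - 7, 0):py + 8, max(px - 7, 0):px + 8]
            hist = cv2.calcHist([patch], [0, 1], None, [16, 16], [0, 180, 0, 256])
            hists.append(cv2.normalize(hist, hist).flatten())
        return pts.reshape(-1, 2), flow, status.ravel(), hists

The object motion model of Equation 1 is then simply the mean of the returned flow vectors, and the group motion model of Equation 2 the mean over all tracked objects.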
B. Point Processing

The point processing block is based on the assumption that the sample points are rarely independent: they are parts of bigger units, the objects, and therefore should have a motion and colour distribution similar to that of the object. Thus the location of the object can be estimated by tracking the sample points distributed over the object surface. The point processing block aims to find the sample points that best represent the object using three filtering stages, namely the cross validation filter, the motion filter and the ambiguity filter (a sketch of the three filters is given at the end of this subsection).

1) Cross Validation Filter: The cross validation filter uses the forward–backward error described by Kalal et al. [19] to estimate the stability of the motion cues at point level. Given the sample point $p$ in frame $I$ and its corresponding location $p'$ in frame $I+1$, the backward flow of point $p'$ to frame $I$ is computed. The forward–backward error $\varepsilon_{FB}$ of a point $p$ is defined as the Euclidean distance between the original point and the forward–backward prediction. In the filtering stage the points are removed if their forward–backward error is larger than a threshold $\alpha$, that is

$$p' = \begin{cases} 0 & \varepsilon_{FB} \geq \alpha \\ 1 & \text{otherwise.} \end{cases} \qquad (4)$$

2) Motion Filter: Knowing the background motion ($U_b$), the object motion ($U_o$) and the motion of the sample points ($U_p$), the probability that each sample point belongs to the background is estimated, and sample points more likely to be background are filtered out by

$$p' = \begin{cases} \text{Background} & d(U_p, U_b) < d(U_p, U_o) \\ \text{Foreground} & \text{otherwise,} \end{cases} \qquad (5)$$

where $d$ is the Euclidean distance function.

3) Ambiguity Filter: When the tracked object is occluded by another, some sample points that belong to one object might move to the other, which eventually causes tracking drift. For the occlusion problem of $k$ objects, this task can be formulated as maximising a posterior by

$$k^* = \arg\max_k S. \qquad (6)$$

The vector $S$ indicates how likely the sample points $p'$ are to have been generated from object $O$. To measure the similarity $s$, the histogram intersection proposed by Swain and Ballard [23] is used. It is especially suited to comparing histograms for recognition in our case, because it does not require accurate separation of the object from its background or from occluding objects in the foreground. Given the object colour distribution ($H_O$) and the point level colour distribution ($H_{p'}$), the similarity score is found by intersection using

$$s = \sum_i \min(H_{p'}(i), H_O(i)), \qquad (7)$$

where $i$ is the bin number of the histogram.
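A minimal sketch of the three filters, under our interpretation of Equations 4, 5 and 7, follows; the threshold value and helper names are ours, not the paper's.

    import cv2
    import numpy as np

    def forward_backward_keep(gray, gray_next, pts, alpha=2.0):
        """Cross validation filter (Eq. 4): keep points whose forward-backward error is below alpha pixels."""
        fwd, st1, _ = cv2.calcOpticalFlowPyrLK(gray, gray_next, pts, None)
        bwd, st2, _ = cv2.calcOpticalFlowPyrLK(gray_next, gray, fwd, None)
        err = np.linalg.norm((pts - bwd).reshape(-1, 2), axis=1)
        return (err < alpha) & (st1.ravel() == 1) & (st2.ravel() == 1)

    def motion_keep(point_flow, obj_motion, bg_motion):
        """Motion filter (Eq. 5): keep points whose flow is closer to the object motion than to the background motion."""
        d_obj = np.linalg.norm(point_flow - obj_motion, axis=1)
        d_bg = np.linalg.norm(point_flow - bg_motion, axis=1)
        return d_bg >= d_obj          # background if d(U_p, U_b) < d(U_p, U_o)

    def histogram_intersection(h_point, h_object):
        """Similarity score s of Eq. 7: sum of bin-wise minima of two normalised histograms."""
        return np.minimum(h_point, h_object).sum()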
C. Localisation of Objects

We use two different strategies to locate the objects in the frame, namely object based localisation and group based localisation. Object based localisation predicts the new location of the tracking window by finding the centre of mass of the weighted sample points. However, when the object is obscured by another element in the video or has unpredictable motion, the sample points no longer represent the object template; in this case the group motion flow is used to estimate the location of the object. Which strategy is used is determined by estimating the quality of the object template inside the tracking window: the number of sample points before and after the point processing block are compared and, if more than 60% of the sample points are filtered out in the filtering stages, the object template is judged not valid and group based localisation is triggered.

1) Object based localisation: In object based localisation, the quality of the points (after the point processing stages) is estimated by finding the colour similarity between each sample point and the object template using Equation 7, and the 50% of points with the lowest similarity are removed. From the remaining points the centre of mass for the new tracking window is calculated by

$$C_x = \frac{\sum_{i=1}^{n} s_i\, p'_{x,i}}{\sum_{i=1}^{n} s_i}, \qquad C_y = \frac{\sum_{i=1}^{n} s_i\, p'_{y,i}}{\sum_{i=1}^{n} s_i}, \qquad (8)$$

where $n$ is the number of remaining sample points and $s$ is the colour distribution similarity.

2) Group based localisation: To find the approximate location of an object using group information, three values are estimated: the last valid object motion, the group motion flow, and the object's relative speed with respect to the group. The last valid object motion is the last estimated motion vector of the object that does not suffer from occlusion or unpredicted motion (see Equation 1). The group motion model is estimated by averaging the motion models of all valid objects using Equation 2, and the object relative speed is estimated by means of Equation 3. Given these values, the new location of the object is approximated by moving the tracking window by the relative speed of the object with respect to the group:

$$C'_x = C_x + u_{v,x}\, u_{g,x}, \qquad C'_y = C_y + u_{v,y}\, u_{g,y}, \qquad (9)$$

where $C'$ is the new centre of the tracking window.

IV. EVALUATION METHOD

The main purpose of this evaluation is to show how group properties and background motion information improve tracking performance under occlusion and background clutter. The robustness of the proposed tracking model is compared with a state-of-the-art tracking algorithm, namely the kernelised correlation filter (KCF), which achieves the highest performance among recent top-performing trackers [16]. For this evaluation three entities are defined: the tracker output ($T$), the correct result or ground truth ($GT$), and a distance function ($d$) which measures the similarity between the tracker output and the ground truth [24]. The tracker output and the ground truth are delimited by bounding boxes. The relative overlap of the ground truth and the tracker output determines the tracking accuracy according to

$$d(T, GT) = \frac{|T \cap GT|}{|T \cup GT|}. \qquad (10)$$

When $d = 0$ there is no overlap between the ground truth and tracking output bounding boxes, whereas $d = 1$ occurs when the two bounding boxes are identical. An object is considered correctly tracked if the tracking output is within a distance threshold of the ground truth; the most common threshold for correct tracking is 0.5 [24].

Two challenging videos are used for this evaluation. The videos are of a group of five people walking together, passing obstacles such as trees and other persons in the scene. The five people are the objects to track. The ground truth was built manually by extracting the bounding box of each object in the group for every frame of video. It should be noted that manual selection was used to initialise the trackers with the location of the objects in the first frame of each video, as it is the tracking of an initially located object that is under investigation here.
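To make the localisation rules and the evaluation measure concrete, the sketch below implements Equations 8, 9 and 10 under variable names of our own choosing; it is an illustration, not the released code.

    import numpy as np

    def object_based_centre(points, sims):
        """Weighted centre of mass of the surviving sample points (Eq. 8),
        keeping only the 50% of points with the highest colour similarity."""
        order = np.argsort(sims)[len(sims) // 2:]
        p, s = points[order], sims[order]
        return (s @ p) / s.sum()                     # (C_x, C_y)

    def group_based_centre(centre, rel_speed, group_motion):
        """Shift the last window centre by the object's relative speed times
        the current group motion (Eq. 9)."""
        return centre + rel_speed * group_motion     # (C'_x, C'_y)

    def overlap(box_a, box_b):
        """Relative overlap d(T, GT) = |T ∩ GT| / |T ∪ GT| for (x, y, w, h) boxes (Eq. 10)."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0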
The tracking performance for individual objects per video is shown in Figures 2 and 3. As shown in Figure 2, the performance of both models is identical until just before the objects walk behind the tree. It is clear from Figure 4 that the KCF tracker failed to track four of the five objects when they are occluded by the tree. This poor performance is due to two main reasons. First, the KCF algorithm does not encode background motion information, and therefore does not distinguish between the background element (the tree) and the tracked object. This leads to the second and bigger problem, which exists in almost all tracking-by-detection algorithms: as highlighted above (Section II), the goal of tracking-by-detection is to train an on-line classifier to distinguish the tracked object from the background, but each training update can introduce error. Specifically, at the point of occlusion the tree is treated as the tracked object and the classifier is trained with the wrong features, which leads to tracking drift. A trace of this drawback is also seen in Figure 3, when the tracked object is occluded by other objects in the scene.

Fig. 2. Relative overlap of individual objects for Waikato-1. The red lines are the result of the proposed model, the black lines show the result of the KCF tracker and the blue lines indicate the distance threshold of 0.5.

Fig. 3. Relative overlap of individual objects for Waikato-2. The red lines are the result of the proposed model, the black lines show the result of the KCF tracker and the blue lines indicate the distance threshold of 0.5.

Fig. 4. The black rectangular box shows the result of the KCF tracker and the red box illustrates the proposed tracker's output just before (top image) and after (bottom image) occlusion in Waikato-1.

Fig. 5. The black rectangular box shows the result of the KCF tracker and the red box illustrates the proposed tracker's output just before (top image) and after (bottom image) occlusion in Waikato-2.

V. CONCLUSION

This paper combined background motion information and group motion dynamics with local object information to improve tracking under occlusion and background clutter. The performance of the proposed tracking model was compared with the KCF tracker. The comparison indicates that the KCF tracker performs poorly in comparison with the proposed model, particularly in the presence of occlusion and background noise.

REFERENCES

[1] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, "Recent advances and trends in visual tracking: A review," Neurocomputing, vol. 74, no. 18, pp. 3823–3831, 2011.
[2] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys (CSUR), vol. 38, no. 4, p. 13, 2006.
[3] E. Maggio and A. Cavallaro, Video Tracking: Theory and Practice, 1st ed. Wiley Publishing, 2011.
[4] Y. Wu, J. Lim, and M.-H. Yang, "Object tracking benchmark," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1834–1848, 2015.
[5] Y. Seo, S. Choi, H. Kim, and K.-S. Hong, "Where are the ball and players? Soccer game analysis with color-based tracking and image mosaick," in International Conference on Image Analysis and Processing. Springer, 1997, pp. 196–203.
[6] J. Han, D. Farin, W. Lao et al., "Automatic tracking method for sports video analysis," in Proc. Symposium on Information Theory in the Benelux, Brussels, Belgium, 2005.
[7] D. Comaniciu and P. Meer, "Mean shift analysis and applications," in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2. IEEE, 1999, pp. 1197–1203.
[8] A. Blake and M. Isard, "The condensation algorithm—conditional density propagation and applications to visual tracking," in Advances in Neural Information Processing Systems, 1997, pp. 361–367.
[9] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, "Color-based probabilistic tracking," in Computer Vision—ECCV, 2002, pp. 661–675.
[10] K. Nummiaro, E. Koller-Meier, and L. Van Gool, "An adaptive color-based particle filter," Image and Vision Computing, vol. 21, no. 1, pp. 99–110, 2003.
[11] D. Comaniciu and P. Meer, "Robust analysis of feature spaces: Color image segmentation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 1997, pp. 750–755.
[12] G. R. Bradski, "Computer vision face tracking for use in a perceptual user interface," Intel Technology Journal, 1998, pp. 214–219.
[13] S. Avidan, "Support vector tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 8, pp. 1064–1072, 2004.
[14] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1409–1422, 2012.
[15] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr, "Struck: Structured output tracking with kernels," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2096–2109, 2016.
[16] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
[17] Y. Li and J. Zhu, "A scale adaptive kernel correlation filter tracker with feature integration," in ECCV Workshops (2), 2014, pp. 254–265.
[18] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in European Conference on Computer Vision. Springer, 2012, pp. 702–715.
[19] Z. Kalal, K. Mikolajczyk, and J. Matas, "Forward-backward error: Automatic detection of tracking failures," in 20th International Conference on Pattern Recognition (ICPR). IEEE, 2010, pp. 2756–2759.
[20] J.-Y. Bouguet, "Pyramidal implementation of the affine Lucas Kanade feature tracker: Description of the algorithm," Intel Corporation, vol. 5, no. 1-10, p. 4, 2001.
[21] J. Shi and C. Tomasi, "Good features to track," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1994, pp. 593–600.
[22] M. Hedayati, M. J. Cree, and J. Scott, "Scene structure analysis for sprint sports," in International Conference on Image and Vision Computing New Zealand (IVCNZ), 2016, pp. 1–5.
[23] M. J. Swain and D. H. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.
[24] A. Milan, K. Schindler, and S. Roth, "Challenges of ground truth evaluation of multi-target tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 735–742.