Abstract
The Vision Meets Drone (VisDrone2020) Multiple Object Tracking (MOT) challenge is the third annual UAV MOT evaluation activity organized by the VisDrone team, in conjunction with the European Conference on Computer Vision (ECCV 2020). The VisDrone-MOT2020 benchmark consists of 79 challenging video sequences, including 56 videos (\(\sim \)24K frames) for training, 7 videos (\(\sim \)3K frames) for validation and 17 videos (\(\sim \)6K frames) for evaluation. All frames in these sequences are manually annotated with high-quality bounding boxes. Results of 12 participating MOT algorithms are presented and analyzed in detail. The challenge results, video sequences, and the evaluation toolkit are made available at http://aiskyeye.com/. By holding the VisDrone-MOT2020 challenge, we hope to facilitate future research and applications of MOT algorithms on drone videos.
References
Al-Shakarji, N.M., Bunyak, F., Seetharaman, G., Palaniappan, K.: Multi-object tracking cascade with multi-step data association and occlusion handling. In: AVSS (2018)
Al-Shakarji, N.M., Seetharaman, G., Bunyak, F., Palaniappan, K.: Robust multi-object tracking with semantic color correlation. In: AVSS (2017)
Bergmann, P., Meinhardt, T., Leal-Taixe, L.: Tracking without bells and whistles. In: ICCV (2019)
Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: AVSS (2017)
Brasó, G., Leal-Taixé, L.: Learning a neural solver for multiple object tracking. In: CVPR (2020)
Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: CVPR (2018)
Chang, Z., et al.: Weighted bilinear coding over salient body parts for person re-identification. Neurocomputing 407, 454–464 (2020)
Chen, B., Deng, W., Hu, J.: Mixed high-order attention network for person re-identification. In: ICCV (2019)
Chen, K., et al.: Hybrid task cascade for instance segmentation. In: CVPR (2019)
Chu, P., Fan, H., Tan, C.C., Ling, H.: Online multi-object tracking with instance-aware tracker and dynamic model refreshment. In: WACV (2019)
Chu, P., Ling, H.: FAMNet: joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In: ICCV (2019)
Dave, A., Khurana, T., Tokmakov, P., Schmid, C., Ramanan, D.: TAO: a large-scale benchmark for tracking any object. arXiv (2020)
Dendorfer, P., et al.: MOT20: a benchmark for multi object tracking in crowded scenes. arXiv (2020)
Du, D., et al.: The unmanned aerial vehicle benchmark: object detection and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 375–391. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_23
Evangelidis, G.D., Psarakis, E.Z.: Parametric image alignment using enhanced correlation coefficient maximization. PAMI 30(10), 1858–1865 (2008)
Fan, H., et al.: LaSOT: a high-quality benchmark for large-scale single object tracking. In: CVPR (2019)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)
Girshick, R.: Fast R-CNN. In: ICCV (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hsieh, M.R., Lin, Y.L., Hsu, W.H.: Drone-based object counting by spatially regularized regional proposal network. In: ICCV (2017)
Keuper, M., Tang, S., Andres, B., Brox, T., Schiele, B.: Motion segmentation & multiple object tracking by correlation co-clustering. PAMI 42(1), 140–153 (2018)
Kim, C., Li, F., Rehg, J.M.: Multi-object tracking with neural gating using bilinear LSTM. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11212, pp. 208–224. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01237-3_13
Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2(1–2), 83–97 (1955)
Li, J., Wang, J., Tian, Q., Gao, W., Zhang, S.: Global-local temporal representations for video person re-identification. In: ICCV (2019)
Li, J., Zhang, S., Huang, T.: Multi-scale 3D convolution network for video based person re-identification. In: AAAI (2019)
Li, S., Yu, H., Hu, H.: Appearance and motion enhancement for video-based person re-identification. In: AAAI (2020)
Li, W., Zhao, R., Xiao, T., Wang, X.: DeepReID: deep filter pairing neural network for person re-identification. In: CVPR (2014)
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Luo, H., Gu, Y., Liao, X., Lai, S., Jiang, W.: Bag of tricks and a strong baseline for deep person re-identification. In: CVPRW (2019)
Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: MOT16: a benchmark for multi-object tracking. arXiv (2016)
Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 445–461. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_27
Pan, S., Tong, Z., Zhao, Y., Zhao, Z., Su, F., Zhuang, B.: Multi-object tracking hierarchically in visual data taken from drones. In: ICCVW (2019)
Park, E., Liu, W., Russakovsky, O., Deng, J., Li, F.F., Berg, A.: Large Scale Visual Recognition Challenge 2017. http://image-net.org/challenges/LSVRC/2017
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv (2018)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
Robicquet, A., Sadeghian, A., Alahi, A., Savarese, S.: Learning social etiquette: human trajectory understanding in crowded scenes. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 549–565. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_33
Wang, G., Wang, Y., Zhang, H., Gu, R., Hwang, J.: Exploit the connectivity: multi-object tracking with trackletnet. In: ACM MM, pp. 482–490 (2019)
Wang, G., Yuan, Y., Chen, X., Li, J., Zhou, X.: Learning discriminative features with multiple granularities for person re-identification. In: ACM MM (2018)
Wang, J., et al.: Deep high-resolution representation learning for visual recognition. PAMI (2020)
Wen, L., et al.: UA-DETRAC: a new benchmark and protocol for multi-object detection and tracking. Comput. Vis. Image Underst. 193, 102907 (2020)
Wen, L., Du, D., Li, S., Bian, X., Lyu, S.: Learning non-uniform hypergraph for multi-object tracking. In: AAAI, pp. 8981–8988 (2019)
Wen, L., Li, W., Yan, J., Lei, Z., Yi, D., Li, S.Z.: Multiple target tracking based on undirected hierarchical relation hypergraph. In: CVPR (2014)
Wen, L., Zhang, Y., Bo, L., Shi, H., Zhu, R., et al.: VisDrone-MOT2019: the vision meets drone multiple object tracking challenge results. In: ICCVW, pp. 189–198 (2019)
Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: ICIP (2017)
Wu, Y., Lim, J., Yang, M.H.: Online object tracking: a benchmark. In: CVPR (2013)
Wu, Y., He, K.: Group normalization. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_1
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
Yang, Y., Wen, L., Lyu, S., Li, S.Z.: Unsupervised learning of multi-level descriptors for person re-identification. In: AAAI (2017)
Zhan, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: A simple baseline for multi-object tracking. arXiv (2020)
Zhao, L., Li, X., Zhuang, Y., Wang, J.: Deeply-learned part-aligned representations for person re-identification. In: ICCV (2017)
Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: CVPR (2013)
Zhou, K., Yang, Y., Cavallaro, A., Xiang, T.: Omni-scale feature learning for person re-identification. In: ICCV (2019)
Zhou, Q., et al.: Graph correspondence transfer for person re-identification. In: AAAI (2018)
Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. arXiv (2020)
Zhu, J., Yang, H., Liu, N., Kim, M., Zhang, W., Yang, M.-H.: Online multi-object tracking with dual matching attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 379–396. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_23
Zhu, P., Wen, L., Du, D., Bian, X., Hu, Q., Ling, H.: Vision meets drones: past, present and future. CoRR abs/2001.06303 (2020)
Zhu, P., et al.: VisDrone-VDT2018: the vision meets drone video detection and tracking challenge results. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11133, pp. 496–518. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11021-5_29
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61876127 and Grant 61732011, in part by Natural Science Foundation of Tianjin under Grant 17JCZDJC30800.
A Descriptions of Submitted Trackers
In the appendix, we summarize the 12 trackers submitted to the VisDrone-MOT2020 Challenge, ordered by the submission time of their final results.
1.1 A.1 Coarse-to-Fine Multi-Class Multi-Object Tracking (COFE)
Yuhang He, Wentao Yu, Jie Han, Xiaopeng Hong, Xing Wei and Yihong Gong
{hyh1379478,yu1034397129,hanjie1997}@stu.xjtu.edu.cn,
{hongxiaopeng,weixing,ygong}@mail.xjtu.edu.cn
COFE is proposed to track multiple targets of different categories under different scenarios. As shown in Fig. 1, the proposed method contains three major modules: 1) multi-class object detection, 2) coarse-category multi-object tracking, and 3) fine-grained trajectory finetuning. First, we use a Deep Convolutional Neural Network (DCNN) based object detector [6] to detect targets of interest in the image plane, where each detection is denoted by a bounding box with a class label and a confidence score. Second, we track multiple targets in coarse categories, where fine-grained classes (such as van, bus, car) are grouped into coarse categories (e.g., vehicle). For each coarse category, we perform multi-object tracking by exploiting the appearance and motion information of targets, where the appearance feature is extracted using a DCNN feature extractor [54] and the motion pattern of each target is modeled by a Kalman filter. Finally, for each obtained trajectory, we finetune its fine-grained class label by simple voting and refine the tracking results by post-processing (i.e., bounding box smoothing).
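The coarse-category grouping and the trajectory voting step above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the exact class lists in `COARSE_OF` and the dictionary format are assumptions based on the examples in the text.

```python
from collections import Counter

# Coarse grouping applied before tracking: fine-grained classes are
# collapsed into coarse categories (class lists are assumptions).
COARSE_OF = {"car": "vehicle", "van": "vehicle", "bus": "vehicle",
             "truck": "vehicle", "pedestrian": "person", "people": "person"}

def finetune_label(trajectory_labels):
    """Fine-grained trajectory finetuning by simple voting: the whole
    trajectory takes the fine-grained class predicted most often by
    its per-frame detections."""
    return Counter(trajectory_labels).most_common(1)[0][0]

# A trajectory whose detections were labelled car/van/car votes "car".
label = finetune_label(["car", "van", "car"])
```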
1.2 A.2 Simple Online Multi-Object Tracker (SOMOT)
Zhipeng Luo, Yuehan Yao, Zhenyu Xu, Bin Dong and Wang Sai
{luozp,yaoyh,xuzy,dongb,wangs}@deepblueai.com
Following the separate detection and embedding paradigm, we build a strong detector based on Cascade R-CNN [6] and an embedding model based on the Multiple Granularity Network (MGN) [40]. For the association step, we build a simple online multi-object tracker inspired by DeepSORT [46] and FairMOT [51]. The detector is Cascade R-CNN [6] pretrained on COCO [29]. For the embedding model, a bag of tricks is used to improve the performance of MGN [40]. For association, we initialize a number of tracklets from the estimated boxes in the first frame. In the subsequent frames, we associate the boxes to the existing tracklets (all activated tracklets) according to their distances measured by embedding features, updating the appearance features of the trackers at each time step to handle appearance variations. Then, unmatched activated tracklets and estimated boxes are associated by their Intersection over Union (IoU) distance. Finally, inactivated tracklets and estimated boxes are likewise associated by their IoU distance.
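The first, embedding-based stage of the association cascade can be sketched as below. This is a simplified greedy matcher under assumed data structures (`feat` fields, a cosine-distance gate `thresh`); the actual tracker may use an optimal assignment, and unmatched pairs would continue to the IoU stages described above.

```python
import math

def cosine_dist(u, v):
    """Cosine distance between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def associate(tracklets, detections, thresh=0.4):
    """Greedily pair activated tracklets with detections in ascending
    order of embedding distance, gated by `thresh` (assumed value)."""
    cand = sorted((cosine_dist(t["feat"], d["feat"]), i, j)
                  for i, t in enumerate(tracklets)
                  for j, d in enumerate(detections))
    used_t, used_d, matches = set(), set(), []
    for dist, i, j in cand:
        if dist <= thresh and i not in used_t and j not in used_d:
            used_t.add(i)
            used_d.add(j)
            matches.append((i, j))
    return matches
```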
1.3 A.3 Position-, Appearance- and Size-aware Tracker (PAS tracker)
Daniel Stadler, Lars Wilko Sommer and Arne Schumann
daniel.stadler@kit.edu,{lars.sommer,arne.schumann}@iosb.fraunhofer.de
The PAS algorithm follows the tracking-by-detection paradigm. As detectors, we train two Cascade R-CNN [6] models with FPN [28] on the VisDrone2020 MOT train and val sets, using ResNeXt-101 [49] and HRNetV2p-W32 [41] as backbones, respectively. Training is performed on randomly sampled image crops (\(608 \times 608\) pixels), and the SSD [30] data augmentation pipeline is used. To improve the quality of the detections, we utilize test-time strategies such as horizontal flipping and multi-scale testing. Additionally, we generate category-specific expert models using weights from different epochs and from the two detectors with different backbones. For associating detections, we build a similarity measure that integrates position, appearance and size information of objects. A constant-velocity model is assumed for the motion prediction of objects, and a camera motion compensation model based on Enhanced Correlation Coefficient Maximization [15] is also applied. The appearance of an object is represented by a feature vector computed with a re-identification model from [31] based on a ResNet-50 [19]. The association of tracks and new detections is solved by the Hungarian method [23]. Additionally, to remove false positive detections in crowded scenarios, a simple filtering approach considering the overlap of existing tracks and new detections is proposed. Finally, we remove short tracks with fewer than 10 frames and small tracks with a mean size of less than 100 pixels, as most of them are false positives.
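One plausible way to combine the three cues into a single similarity is sketched below. The paper only states that position, appearance and size are integrated; the multiplicative form, the Gaussian position term, and the bandwidth constant here are all assumptions for illustration.

```python
import math

def pas_similarity(track, det, pos_sigma=50.0):
    """Toy position/appearance/size similarity in [0, 1].
    The functional form and `pos_sigma` are assumptions, not the
    authors' exact formulation."""
    # Position cue: Gaussian falloff with centre distance.
    dx = track["cx"] - det["cx"]
    dy = track["cy"] - det["cy"]
    s_pos = math.exp(-(dx * dx + dy * dy) / (2.0 * pos_sigma ** 2))
    # Size cue: ratio of smaller to larger box area.
    s_size = min(track["area"], det["area"]) / max(track["area"], det["area"])
    # Appearance cue: cosine similarity of re-ID feature vectors.
    dot = sum(a * b for a, b in zip(track["feat"], det["feat"]))
    na = math.sqrt(sum(a * a for a in track["feat"]))
    nb = math.sqrt(sum(b * b for b in det["feat"]))
    s_app = dot / (na * nb)
    return s_pos * s_size * s_app
```

A similarity matrix built this way (negated, since the Hungarian method minimizes cost) can be fed directly to an optimal assignment solver such as `scipy.optimize.linear_sum_assignment`.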
1.4 A.4 Simple Online and Realtime Tracking with a Deep Association (Deepsort)
Zhaoze Zhao
hanjie@smail.swufe.edu.cn
Simple Online and Realtime Tracking (SORT) [46] is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms. In this submission, we integrate appearance information to improve the performance of SORT. This extension allows us to track objects through longer periods of occlusion, effectively reducing the number of identity switches. In the spirit of the original framework, we place much of the computational complexity in an offline pre-training stage, where we learn a deep association metric on a large-scale person re-identification dataset. During online application, we establish measurement-to-track associations using nearest-neighbour queries in visual appearance space.
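The nearest-neighbour query in appearance space can be sketched as follows. The gallery layout and names are illustrative assumptions; DeepSORT additionally gates associations with a Mahalanobis motion distance, which is omitted here.

```python
import math

def cosine_dist(u, v):
    """Cosine distance between two appearance feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def nearest_track(det_feat, gallery):
    """Measurement-to-track association: `gallery` maps a track id to
    the feature vectors stored for that track, and a track's distance
    to the detection is the smallest cosine distance over its stored
    features. Returns the closest track id and its distance."""
    best_id, best_d = None, float("inf")
    for tid, feats in gallery.items():
        d = min(cosine_dist(det_feat, f) for f in feats)
        if d < best_d:
            best_id, best_d = tid, d
    return best_id, best_d
```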
1.5 A.5 YOLOv5 based V-IOU tracker (YOLO-TRAC)
Zhizhao Duan, Xi Wu, Duo Xu and Zhen Xie
{Duanai,21725018}@zju.edu.cn,wuxi9410@gmail.com,zjutxz@hotmail.com
YOLO-TRAC is a tracking-by-detection framework. We use YOLOv5 as our detection network, and the V-IOU tracker [4] is used for tracking.
1.6 A.6 An improved multi-object tracking method for the VisDrone videos based on CenterTrack (VDCT)
Shengwen Li and Yanyun Zhao
{2019140337,zyy}@bupt.edu.cn
VDCT is improved from CenterTrack, a point-based framework that combines detection and tracking [56]. Its inputs are the current frame, the previous frame, and the tracked objects in the previous frame; it outputs the displacements of tracked objects. Our improvements are as follows. (1) Tracked objects that remain unmatched for up to 20 frames are still allowed to associate with objects detected in the current frame, by properly extending the survival time of the tracked objects. (2) Due to the continuity of object motion, the motion direction of an object usually does not change abruptly between adjacent frames, so we compute the dot product of the displacements in adjacent frames to decide whether to associate the objects. (3) We use the NIOU method [34] to perform non-maximum suppression on vehicle objects. (4) We adopt the hierarchical matching strategy of DeepSORT [46] to solve the long-occlusion problem. (5) OSNet [54] is used to extract each trajectory's appearance feature and measure its distance to the others; we simply merge two trajectories if their distance is close enough. The experimental results show the effectiveness of our improved method.
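The direction-consistency check of improvement (2) can be sketched as a dot-product test on successive displacements. The non-negativity threshold below is an assumed parameter; the text states only that the dot product is used to accept or reject the association.

```python
def direction_consistent(prev_disp, cur_disp, min_dot=0.0):
    """Accept an association only if the object's displacement
    direction does not flip between adjacent frames, i.e. the dot
    product of the two displacement vectors stays above `min_dot`
    (an assumed threshold)."""
    dot = prev_disp[0] * cur_disp[0] + prev_disp[1] * cur_disp[1]
    return dot >= min_dot
```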
1.7 A.7 Cascade RCNN based IOU tracker (Cascade RCNN+IOU)
Ting Sun and Xingjie Zhao
sunting9999@stu.xjtu.edu.cn,1243273854@qq.com
We use Cascade R-CNN [6] as the detector with four improvements: (1) we use Group Normalization [48] instead of Batch Normalization; (2) we use online hard example mining to select positive and negative samples; (3) we test at multiple scales; (4) we train models with two stronger backbones and integrate them. Then, we perform detection association using the IOU tracker [4].
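The core of the IOU tracker is a per-frame greedy extension of tracks by overlap. The sketch below illustrates the idea under assumed box and track representations; the published tracker [4] adds track-termination and minimum-length logic omitted here.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def iou_track_step(tracks, detections, sigma_iou=0.5):
    """One frame of a simplified IOU tracker: extend each track (a
    list of boxes) with the unclaimed detection overlapping its last
    box most, if that overlap exceeds sigma_iou; leftover detections
    start new tracks."""
    detections = list(detections)
    for track in tracks:
        if not detections:
            break
        best = max(detections, key=lambda d: iou(track[-1], d))
        if iou(track[-1], best) >= sigma_iou:
            track.append(best)
            detections.remove(best)
    tracks.extend([d] for d in detections)
    return tracks
```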
1.8 A.8 Hybrid task cascade based IOU tracker (HTC+IOU)
Ting Sun, Xingjie Zhao and Guizhong Liu
sunting9999@stu.xjtu.edu.cn,1243273854@qq.com
We use Hybrid Task Cascade for instance segmentation [9] as the detector with four improvements: (1) we use Group Normalization [48] instead of Batch Normalization; (2) we use online hard example mining to select positive and negative samples; (3) we test at multiple scales; (4) we train models with two stronger backbones and integrate them. Then, we perform detection association using the IOU tracker [4].
1.9 A.9 Multi-object Tracking based on HRNet (HR-GNN)
Zheng Yang and Kaojin Zhu
151776257@qq.com,1320531351@qq.com
HR-GNN is built on a detector that uses HRNet [41] as its backbone. The tracking results are then generated by using a graph neural network (GNN) to analyze the detection results.
1.10 A.10 Multi-object tracking with TrackletNet (TNT)
Haritha V, Melvin Kuriakose, Hrishikesh PS and Linu Shine
vakkatharitha@gmail.com
TNT is based on the work of [39], merging temporal and appearance information in a unified framework. We learn appearance similarity among tracklets with a graph model, using CNN features and intersection-over-union (IoU) with epipolar constraints to compensate for camera movement between adjacent frames. Finally, the tracklets are clustered into groups, resulting in trajectories with individual object IDs.
1.11 A.11 A simple baseline for one-shot multi-object tracking (anchor-free_mot)
Min Yao and Libo Zhang
libo@iscas.ac.cn
The anchor-free_mot method is based on FairMOT [51]. Specifically, we use the encoder-decoder network to extract feature maps. Then, two simple parallel heads are used to predict the bounding box and re-ID features of the targets, respectively. Notably, the targets are represented by points from the anchor-free object detection method.
1.12 A.12 Semantic Color Correlation Tracker (SCTrack)
Noor M. Al-Shakarji, Filiz Bunyak, Guna Seetharaman and Kannappan Palaniappan
{nmahyd,bunyak,palaniappank}@mail.missouri.edu,
gunasekaran.seetharaman@rl.af.mil
SCTrack is a time-efficient detection-based multi-object tracking method. Specifically, we use a three-step cascaded data association scheme that combines a fast, spatial-distance-only short-term data association, a robust tracklet-linking step using discriminative object appearance models, and an explicit occlusion-handling unit relying not only on tracked objects' motion patterns but also on environmental constraints such as the presence of potential occluders in the scene. The details are described in [1, 2].
© 2020 Springer Nature Switzerland AG
Fan, H. et al. (2020). VisDrone-MOT2020: The Vision Meets Drone Multiple Object Tracking Challenge Results. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science(), vol 12538. Springer, Cham. https://doi.org/10.1007/978-3-030-66823-5_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66822-8
Online ISBN: 978-3-030-66823-5