Abstract
Multiple Object Tracking (MOT) in the wild has a wide range of applications in surveillance retrieval and autonomous driving. Tracking-by-detection, which consists of feature extraction and data association, has become a mainstream solution in MOT. Most existing methods focus on extracting the individual features of each target and on optimizing the association with hand-crafted algorithms. In this paper, we specifically consider the interrelation cue between targets and propose the Human-Interaction Model (HIM) to extract interaction features between a tracked target and its surroundings. The interaction model yields more discriminative features for distinguishing objects, especially in crowded (dense) scenes. We also propose an efficient end-to-end model, the Deep Association Network (DAN), which optimizes the association with a graph-based learning mechanism. Both HIM and DAN are built from three kinds of deep networks: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and Graph Neural Networks (GNNs). The CNNs extract appearance features from bounding-box images, and the RNNs encode motion features from the historical positions of each trajectory. The GNNs then extract interaction features and optimize the graph structure to associate objects across frames. In addition, we present a novel end-to-end training strategy for DAN and HIM. Experimental results demonstrate that our method reaches state-of-the-art performance on the MOT15, MOT16 and DukeMTMCT datasets.
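The data flow described in the abstract (per-target features, interaction features gathered from surrounding targets, then association across frames) can be illustrated with a minimal sketch. This is not the authors' HIM or DAN: the learned CNN and RNN features are replaced by fixed vectors, the GNN by a single round of mean aggregation over spatial neighbours, and the learned association by greedy cosine matching. All function names, the `radius` parameter and the `min_affinity` threshold are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def interaction_features(features, positions, radius):
    """One round of mean-aggregation message passing (assumed stand-in for a GNN).

    Each node's output is its own feature concatenated with the mean feature
    of all other nodes within `radius` (zeros if it has no neighbours).
    """
    if not features:
        return []
    dim = len(features[0])
    out = []
    for i, (f_i, p_i) in enumerate(zip(features, positions)):
        neigh = [features[j] for j, p_j in enumerate(positions)
                 if j != i and math.dist(p_i, p_j) <= radius]
        if neigh:
            mean = [sum(col) / len(neigh) for col in zip(*neigh)]
        else:
            mean = [0.0] * dim
        out.append(list(f_i) + mean)
    return out


def greedy_associate(track_feats, det_feats, min_affinity=0.5):
    """Match tracks to detections in descending order of cosine affinity.

    Greedy matching is an assumption made for brevity; it is not the
    graph-based learned association proposed in the paper.
    """
    pairs = sorted(
        ((cosine(t, d), ti, di)
         for ti, t in enumerate(track_feats)
         for di, d in enumerate(det_feats)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), {}
    for score, ti, di in pairs:
        if score < min_affinity:
            break  # remaining pairs are even less similar
        if ti in used_tracks or di in used_dets:
            continue
        matches[ti] = di
        used_tracks.add(ti)
        used_dets.add(di)
    return matches
```

In use, the feature vectors of the tracks in one frame and of the detections in the next would each be passed through `interaction_features`, and the two resulting lists handed to `greedy_associate`, which returns a dictionary mapping track indices to detection indices. An optimal assignment solver or a learned affinity could replace the greedy step without changing the surrounding structure.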
Additional information
Communicated by Mei Chen.
Cite this article
Ma, C., Yang, F., Li, Y. et al. Deep Human-Interaction and Association by Graph-Based Learning for Multiple Object Tracking in the Wild. Int J Comput Vis 129, 1993–2010 (2021). https://doi.org/10.1007/s11263-021-01460-0
DOI: https://doi.org/10.1007/s11263-021-01460-0