Self-supervised Multi-view Multi-Human Association and Tracking

Authors:

Song Wang

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 282 - 290

https://doi.org/10.1145/3474085.3475177

Published: 17 October 2021

Abstract

Multi-view multi-human association and tracking (MvMHAT) aims to track a group of people over time in each view and to identify the same person across different views at the same time. The problem is relatively new but important for video surveillance of multi-person scenes. Unlike previous multiple object tracking (MOT) and multi-target multi-camera tracking (MTMCT) tasks, which consider only over-time human association, MvMHAT requires jointly achieving both cross-view and over-time data association. In this paper, we model this problem with a self-supervised learning framework and use an end-to-end network to tackle it. Specifically, we propose a spatial-temporal association network with two self-supervised learning losses, a symmetric-similarity loss and a transitive-similarity loss, applied at each time to associate multiple humans over time and across views. In addition, to promote research on MvMHAT, we build a new large-scale benchmark for training and testing different algorithms. Extensive experiments on the proposed benchmark verify the effectiveness of our method. We have released the benchmark and code to the public.
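
The abstract names the two losses but does not define them, so the following is only an illustrative sketch of the general idea, not the paper's formulation. It assumes each detected person has an appearance embedding, builds a soft assignment matrix between two sets of detections (two views or two time steps), and then penalizes (a) a round trip A -> B -> A that does not return to the starting person and (b) a two-hop match A -> B -> C that disagrees with the direct match A -> C. The function names, the softmax temperature, and the squared-error form are all assumptions made for illustration.

```python
import numpy as np


def soft_assignment(feats_a, feats_b, temperature=0.05):
    """Row-wise softmax over cosine similarities between two sets of embeddings."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def symmetric_loss(s_ab, s_ba):
    """Penalize the round trip A -> B -> A for deviating from the identity matrix."""
    round_trip = s_ab @ s_ba
    return float(np.mean((round_trip - np.eye(round_trip.shape[0])) ** 2))


def transitive_loss(s_ab, s_bc, s_ac):
    """Penalize the two-hop match A -> B -> C for disagreeing with the direct match A -> C."""
    return float(np.mean((s_ab @ s_bc - s_ac) ** 2))


# Hypothetical data: 4 people with 64-d embeddings, seen in three views in shuffled order.
rng = np.random.default_rng(0)
people = rng.normal(size=(4, 64))
view_a = people + 0.01 * rng.normal(size=(4, 64))
view_b = people[[2, 0, 3, 1]] + 0.01 * rng.normal(size=(4, 64))
view_c = people[[1, 3, 0, 2]] + 0.01 * rng.normal(size=(4, 64))

s_ab = soft_assignment(view_a, view_b)
s_ba = soft_assignment(view_b, view_a)
s_bc = soft_assignment(view_b, view_c)
s_ac = soft_assignment(view_a, view_c)

consistent_sym = symmetric_loss(s_ab, s_ba)
consistent_trans = transitive_loss(s_ab, s_bc, s_ac)

# Deliberately inconsistent inputs, to show that both penalties grow.
broken_sym = symmetric_loss(s_ab, s_ab)
broken_trans = transitive_loss(s_ab, s_bc, s_ab)
```

Neither penalty needs identity labels: both are computed from the assignment matrices alone, which is what makes constraints of this kind usable as self-supervision. In an actual training setup the embeddings would come from a network and the losses would be written in a differentiable framework; NumPy is used here only to keep the sketch self-contained.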

Supplementary Material

MP4 File (MM21-fp99.mp4)

Presentation video.



Index Terms

Self-supervised Multi-view Multi-Human Association and Tracking
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Matching
        Object identification
        Tracking
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning

Information & Contributors

Information

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, FACEBOOK, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics

Total Citations: 25
Total Downloads: 585
Downloads (last 12 months): 117
Downloads (last 6 weeks): 6

Reflects downloads up to 22 Feb 2025

Citations

Cited By

Wu P, Li Y, Li Z, Yang X, Xue D (2025). Multi-View, Multi-Target Tracking in Low-Altitude Scenes with UAV Involvement. Drones 9:2 (138). Online publication date: 13-Feb-2025.
https://doi.org/10.3390/drones9020138
Feng W, Wang F, Han R, Gan Y, Qian Z, Hou J, Wang S (2025). Unveiling the Power of Self-Supervision for Multi-View Multi-Human Association and Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 47:1 (351-368). Online publication date: Jan-2025.
https://doi.org/10.1109/TPAMI.2024.3463966
Qiao Y, Fan H, Wang Q, Zhao T, Tang Y (2024). STCA: High-Altitude Tracking via Single-Drone Tracking and Cross-Drone Association. Remote Sensing 16:20 (3861). Online publication date: 17-Oct-2024.
https://doi.org/10.3390/rs16203861
Aung S, Park H, Jung H, Cho J (2024). Enhancing Multi-view Pedestrian Detection Through Generalized 3D Feature Pulling. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (1185-1194). Online publication date: 3-Jan-2024.
https://doi.org/10.1109/WACV57701.2024.00123
Sørensen S, Kjærgaard M (2024). Quantifying the Accuracy of Collaborative IoT and Robot Sensing in Indoor Settings of Rigid Objects. 2024 21st International Conference on Ubiquitous Robots (UR) (550-557). Online publication date: 24-Jun-2024.
https://doi.org/10.1109/UR61395.2024.10597533
Wang S, Sheng H, Zhang Y, Yang D, Shen J, Chen R (2024). Blockchain-Empowered Distributed Multicamera Multitarget Tracking in Edge Computing. IEEE Transactions on Industrial Informatics 20:1 (369-379). Online publication date: Jan-2024.
https://doi.org/10.1109/TII.2023.3261890
Huang B, Ju J, Shu Y, Wang Y (2024). Simultaneously Recovering Multi-Person Meshes and Multi-View Cameras With Human Semantics. IEEE Transactions on Circuits and Systems for Video Technology 34:6 (4229-4242). Online publication date: Jun-2024.
https://doi.org/10.1109/TCSVT.2023.3328371
Bilakeri S, Kotegar K (2024). Learning to Track With Dynamic Message Passing Neural Network for Multi-Camera Multi-Object Tracking. IEEE Access 12 (63317-63333). Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3383138
Li C, Bian N, Zhao Z, Wang H, Schuller B (2024). Multi-view domain-adaptive representation learning for EEG-based emotion recognition. Information Fusion 104:C. Online publication date: 12-Apr-2024.
https://dl.acm.org/doi/10.1016/j.inffus.2023.102156
Yang Y, Xu M, Ralph J, Ling Y, Pan X (2024). An end-to-end tracking framework via multi-view and temporal feature aggregation. Computer Vision and Image Understanding 249 (104203). Online publication date: Dec-2024.
https://doi.org/10.1016/j.cviu.2024.104203
