
Multi-Level Fusion for Robust RGBT Tracking via Enhanced Thermal Representation

Published: 29 October 2024

Abstract

Due to the limitations of visible (RGB) sensors in challenging scenarios such as nighttime and foggy environments, the thermal infrared (TIR) modality has drawn increasing attention as an auxiliary source for robust tracking systems. Existing methods extract both the RGB and TIR (RGBT) cues in a similar manner, i.e., by utilising RGB-pretrained models with or without finetuning, and then aggregate the multi-modal information through a fusion block embedded at a single level. However, the different imaging principles of RGB and TIR data call into question the suitability of RGB-pretrained models for thermal data. In this article, it is argued that this modality gap has been overlooked, and an alternative training paradigm is proposed for TIR data to ensure consistency between the training and test data, achieved by optimising the TIR feature extractor with only TIR data involved. Furthermore, to make better use of the enhanced thermal representations, a multi-level fusion strategy is proposed, inspired by the observation that different fusion strategies at different levels can each contribute to better performance. Specifically, fusion modules are derived at both the feature and decision levels for a comprehensive fusion procedure, while pixel-level fusion is not considered owing to the misalignment of multi-modal image pairs. The effectiveness of our method is demonstrated by extensive qualitative and quantitative experiments on several challenging benchmarks. Code will be released at https://github.com/Zhangyong-Tang/MELT.
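The abstract only names the two fusion levels it uses. As a rough illustration of the distinction, the following NumPy sketch fuses RGB and TIR inputs once at the feature level and once at the decision level; the fixed convex weight and averaging scheme are illustrative assumptions, not the learned fusion modules of the paper's MELT method:

```python
import numpy as np

def feature_level_fusion(rgb_feat, tir_feat, alpha=0.5):
    """Feature-level fusion: combine modality feature maps before the
    tracking head. Here a fixed convex combination stands in for a
    learned fusion module (illustrative only)."""
    return alpha * rgb_feat + (1.0 - alpha) * tir_feat

def decision_level_fusion(rgb_score, tir_score, fused_score):
    """Decision-level fusion: combine per-branch response maps and take
    the argmax as the predicted target location (illustrative only)."""
    combined = (rgb_score + tir_score + fused_score) / 3.0
    return tuple(int(i) for i in np.unravel_index(np.argmax(combined),
                                                  combined.shape))

# Toy example: 8x8 response maps with a shared peak injected at (3, 5).
rng = np.random.default_rng(0)
rgb = rng.random((8, 8)) * 0.1
tir = rng.random((8, 8)) * 0.1
rgb[3, 5] = 1.0
tir[3, 5] = 1.0
fused = feature_level_fusion(rgb, tir)
print(decision_level_fusion(rgb, tir, fused))  # prints the peak coordinates
```

Pixel-level fusion would instead blend the raw RGB and TIR images before any feature extraction, which, as the abstract notes, breaks down when the image pairs are not spatially aligned.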


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 10
October 2024
729 pages
EISSN:1551-6865
DOI:10.1145/3613707
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 October 2024
Online AM: 15 July 2024
Accepted: 08 July 2024
Revised: 03 May 2024
Received: 29 November 2023
Published in TOMM Volume 20, Issue 10


Author Tags

  1. Visual object tracking
  2. RGBT tracking
  3. thermal enhancement
  4. multi-modal multi-level fusion

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Program of China
  • National Natural Science Foundation of China
  • 111 Project of Ministry of Education of China
  • UK EPSRC
