Research Article · DOI: 10.1145/3581783.3611799

Alleviating Spatial Misalignment and Motion Interference for UAV-based Video Recognition

Published: 27 October 2023

Abstract

Recognizing activities with Unmanned Aerial Vehicles (UAVs) is essential for many applications, yet existing video recognition methods are designed mainly for ground cameras and do not account for a UAV's changing attitude and fast flight. These factors cause spatial misalignment of small objects between frames, leading to inaccurate visual movement in drone videos. In addition, camera motion relative to the objects in a scene introduces relative movements that distort the apparent motion of objects and can lead to misinterpretation of video content. To address these issues, we present a novel framework named Attentional Spatial and Adaptive Temporal Relations Modeling. First, to mitigate the spatial misalignment of small objects between frames, we design an Attentional Patch-level Spatial Enrichment (APSE) module that models dependencies among patches and enhances patch-level features. Second, we propose a Multi-scale Temporal and Spatial Mixer (MTSM) module that adapts to disturbances caused by UAV flight and models diverse temporal cues. By integrating APSE and MTSM into a single model, our network effectively and accurately captures spatiotemporal relations in UAV videos. Extensive experiments on several benchmarks demonstrate the superiority of our method over state-of-the-art approaches; for instance, our network achieves a classification accuracy of 68.1% on the ERA dataset, an absolute gain of 1.3% over FuTH-Net.
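The abstract does not specify the internals of the APSE module; purely as an illustrative sketch, patch-level spatial enrichment can be thought of as self-attention over per-frame patch embeddings, where each patch is augmented with an attention-weighted mixture of all patches. All names below, and the use of identity Q/K/V projections, are our assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_attention_enrich(patches, d_k=None):
    """Enrich patch features with attention-weighted context.

    patches: (N, D) array of N patch embeddings from one frame.
    Returns an (N, D) array where each patch is combined with a
    weighted sum over all patches (residual connection).
    Identity Q/K/V projections are used purely for illustration.
    """
    N, D = patches.shape
    d_k = d_k or D
    scores = patches @ patches.T / np.sqrt(d_k)  # (N, N) pairwise affinities
    weights = softmax(scores, axis=-1)           # rows sum to 1
    context = weights @ patches                  # attention-weighted mixture
    return patches + context                     # residual enrichment

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))  # 16 patches, 8-dim features
y = patch_attention_enrich(x)
print(y.shape)  # (16, 8)
```

In a real model the Q/K/V projections would be learned and the enrichment applied per frame before temporal modeling; this sketch only shows the attention mechanics.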

References

[1]
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6836--6846.
[2]
Mohammadamin Barekatain, Miquel Martí, Hsueh-Fu Shih, Samuel Murray, Kotaro Nakayama, Yutaka Matsuo, and Helmut Prendinger. 2017. Okutama-action: An aerial view video dataset for concurrent human action detection. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 28--35.
[3]
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding?. In ICML, Vol. 2. 4.
[4]
Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, and Changhong Fu. 2022. TCTrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14798--14808.
[5]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299--6308.
[6]
Guilhem Chéron, Ivan Laptev, and Cordelia Schmid. 2015. P-cnn: Pose-based cnn features for action recognition. In Proceedings of the IEEE international conference on computer vision. 3218--3226.
[7]
Jinwoo Choi, Gaurav Sharma, Manmohan Chandraker, and Jia-Bin Huang. 2020. Unsupervised and semi-supervised domain adaptation for action recognition from drones. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1717--1726.
[8]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248--255. https://doi.org/10.1109/CVPR.2009.5206848
[9]
Meng Ding, Ning Li, Ziang Song, Ruixing Zhang, Xiaxia Zhang, and Huiyu Zhou. 2020. A Lightweight Action Recognition Method for Unmanned-Aerial-Vehicle Video. In 2020 IEEE 3rd International Conference on Electronics and Communication Engineering (ICECE). IEEE, 181--185.
[10]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[11]
Milan Erdelj, Enrico Natalizio, Kaushik R Chowdhury, and Ian F Akyildiz. 2017. Help from the sky: Leveraging UAVs for disaster management. IEEE Pervasive Computing 16, 1 (2017), 24--32.
[12]
Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. 2021. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6824--6835.
[13]
Christoph Feichtenhofer. 2020. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 203--213.
[14]
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision. 6202--6211.
[15]
Guoqiang Gong, Liangfeng Zheng, Wenhao Jiang, and Yadong Mu. 2021. Self-Supervised Video Action Localization with Adversarial Temporal Transforms. In IJCAI. 693--699.
[16]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770--778.
[17]
Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7132--7141.
[18]
Yuko Iinuma and Shin'ichi Satoh. 2021. Video Action Retrieval Using Action Recognition Model. In Proceedings of the 2021 International Conference on Multimedia Retrieval. 603--606.
[19]
Boyuan Jiang, MengMeng Wang, Weihao Gan, Wei Wu, and Junjie Yan. 2019. Stm: Spatiotemporal and motion encoding for action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2000--2009.
[20]
Pu Jin, Lichao Mou, Yuansheng Hua, Gui-Song Xia, and Xiao Xiang Zhu. 2022. FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification. IEEE Transactions on Geoscience and Remote Sensing 60 (2022), 1--13.
[21]
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1725--1732.
[22]
Divya Kothandaraman, Tianrui Guan, Xijun Wang, Shuowen Hu, Ming Lin, and Dinesh Manocha. 2022. FAR: Fourier Aerial Video Recognition. In Computer Vision-ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXVII. Springer, 657--676.
[23]
Christos Kyrkou and Theocharis Theocharides. 2019. Deep-Learning-Based Aerial Image Classification for Emergency Response Applications Using Unmanned Aerial Vehicles. In CVPR workshops. 517--525.
[24]
Christos Kyrkou and Theocharis Theocharides. 2020. EmergencyNet: Efficient aerial image classification for drone-based emergency monitoring using atrous convolutional feature fusion. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13 (2020), 1687--1699.
[25]
Bing Li, Jiaxin Chen, Dongming Zhang, Xiuguo Bao, and Di Huang. 2022. Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement. arXiv preprint arXiv:2205.03569 (2022).
[26]
Dong Li, Jiaying Zhu, Menglu Wang, Jiawei Liu, Xueyang Fu, and Zheng-Jun Zha. 2023. Edge-Aware Regional Message Passing Controller for Image Forgery Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8222--8232.
[27]
Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, and Jiebo Luo. 2016. Action recognition by learning deep multi-granular spatio-temporal video representation. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. 159--166.
[28]
Tianjiao Li, Jun Liu, Wei Zhang, Yun Ni, Wenqian Wang, and Zhiheng Li. 2021. UAV-Human: A Large Benchmark for Human Behavior Understanding With Unmanned Aerial Vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16266--16275.
[30]
Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. 2020. TEA: Temporal Excitation and Aggregation for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[31]
Ji Lin, Chuang Gan, and Song Han. 2019. TSM: Temporal Shift Module for Efficient Video Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
[32]
Jiawei Liu, Zheng-Jun Zha, Wei Wu, Kecheng Zheng, and Qibin Sun. 2021. Spatial-temporal correlation and topology learning for person re-identification in videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4370--4379.
[33]
Shuai Liu, Xin Li, Huchuan Lu, and You He. 2022. Multi-object tracking meets moving UAV. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8876--8885.
[34]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012--10022.
[35]
Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3202--3211.
[36]
Wenyang Luo, Yufan Liu, Bing Li, Weiming Hu, Yanan Miao, and Yangxi Li. [n. d.]. Long-Short Term Cross-Transformer in Compressed Domain for Few-Shot Video Classification. ([n. d.]).
[37]
Murari Mandal, Lav Kush Kumar, and Santosh Kumar Vipparthi. 2020. MOR-UAV: A benchmark dataset and baselines for moving object recognition in UAV videos. In Proceedings of the 28th ACM international conference on multimedia. 2626--2635.
[38]
Shaobo Min, Hantao Yao, Hongtao Xie, Zheng-Jun Zha, and Yongdong Zhang. 2020. Multi-objective matrix normalization for fine-grained visual recognition. IEEE Transactions on Image Processing 29 (2020), 4996--5009.
[39]
Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, and Aude Oliva. 2020. Moments in Time Dataset: One Million Videos for Event Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 2 (2020), 502--508. https://doi.org/10.1109/TPAMI.2019.2901464
[40]
Lichao Mou, Yuansheng Hua, Pu Jin, and Xiao Xiang Zhu. 2020. Era: A data set and deep learning benchmark for event recognition in aerial videos [software and data sets]. IEEE Geoscience and Remote Sensing Magazine 8, 4 (2020), 125--133.
[41]
Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and Haifeng Shen. 2020. Single image super-resolution via a holistic attention network. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XII 16. Springer, 191--207.
[42]
Asanka G Perera, Yee Wei Law, and Javaan Chahl. 2019. Drone-action: An outdoor recorded drone video dataset for action recognition. Drones 3, 4 (2019), 82.
[43]
Asanka G Perera, Yee Wei Law, Titilayo T Ogunwa, and Javaan Chahl. 2020. A multiviewpoint outdoor dataset for human action recognition. IEEE Transactions on Human-Machine Systems 50, 5 (2020), 405--413.
[44]
Asanka G Perera, Yee Wei Law, and Javaan Chahl. 2018. UAV-GESTURE: A dataset for UAV control and gesture recognition. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 0-0.
[45]
Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You Only Look Once: Unified, Real-Time Object Detection. CoRR abs/1506.02640 (2015). arXiv:1506.02640 http://arxiv.org/abs/1506.02640
[46]
Wenqi Ren, Sifei Liu, Lin Ma, Qianqian Xu, Xiangyu Xu, Xiaochun Cao, Junping Du, and Ming-Hsuan Yang. 2019. Low-light image enhancement via a deep hybrid network. IEEE Transactions on Image Processing 28, 9 (2019), 4364--4375.
[47]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115 (2015), 211--252.
[48]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4510--4520.
[49]
Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision. 618--626.
[50]
Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems 27 (2014).
[51]
Waqas Sultani and Mubarak Shah. 2021. Human action recognition in drone videos using a few aerial training examples. Computer Vision and Image Understanding 206 (2021), 103186.
[52]
Ganchao Tan, Daqing Liu, Meng Wang, and Zheng-Jun Zha. 2020. Learning to Discretely Compose Reasoning Module Networks for Video Captioning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization.
[53]
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489--4497.
[54]
Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 6450--6459.
[55]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[56]
Kunyu Wang, Xueyang Fu, Yukun Huang, Chengzhi Cao, Gege Shi, and Zheng-Jun Zha. 2023. Generalized UAV Object Detection via Frequency Domain Disentanglement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1064--1073.
[57]
Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. 2021. Tdn: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1895--1904.
[58]
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision. Springer, 20--36.
[59]
H. Wu, J. Liu, Z. J. Zha, Z. Chen, and X. Sun. 2019. Mutually Reinforced Spatio-Temporal Convolutional Tube for Human Action Recognition. In Twenty-Eighth International Joint Conference on Artificial Intelligence IJCAI-19.
[60]
Haoze Wu, Jiawei Liu, Xierong Zhu, Meng Wang, and Zheng-Jun Zha. 2021. Multi-scale spatial-temporal integration convolutional tube for human action recognition. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence. 753--759.
[61]
Shuo Yang and Xinxiao Wu. [n. d.]. Entity-aware and Motion-aware Transformers for Language-driven Action Localization. ([n. d.]).
[62]
Junjie Ye, Changhong Fu, Guangze Zheng, Danda Pani Paudel, and Guang Chen. 2022. Unsupervised domain adaptation for nighttime aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8896--8905.
[63]
Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4694--4702.
[64]
Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. 2018. Temporal relational reasoning in videos. In Proceedings of the European conference on computer vision (ECCV). 803--818.
[65]
Xuelin Zhu, Jiuxin Cao, Jiawei Ge, Weijia Liu, and Bo Liu. 2022. Two-Stream Transformer for Multi-Label Image Classification. In Proceedings of the 30th ACM International Conference on Multimedia. 3598--3607.

Cited By

  • (2024) HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing. Proceedings of the 32nd ACM International Conference on Multimedia. https://doi.org/10.1145/3664647.3681382, 9155--9164. Online publication date: 28-Oct-2024.


    Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. action recognition and understanding
    2. attention mechanism
    3. deep neural network
4. unmanned aerial vehicles (UAVs)
    5. video recognition

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
