DOI: 10.1145/3512527.3531392

Source-free Temporal Attentive Domain Adaptation for Video Action Recognition

Published: 27 June 2022

Abstract

With the rapid growth of video data, many video analysis techniques have been developed and achieved success in recent years. To mitigate the distribution bias of video data across domains, unsupervised video domain adaptation (UVDA) has been proposed and has become an active research topic. Nevertheless, existing UVDA methods need to access source domain data during training, which can violate privacy policies and make transfer inefficient. To address this issue, we propose a novel source-free temporal attentive domain adaptation (SFTADA) method for video action recognition under this more challenging UVDA setting, in which source domain data is not required for learning the target domain. In our method, an innovative Temporal Attentive aGgregation (TAG) module combines frame-level features with varying importance weights to generate video-level representations. Because source domain data and target label information are unavailable during adaptation and testing, an MLP-based attention network is trained to approximate the centroid-based attentive aggregation function. By minimizing frame-level and video-level loss functions, both the temporal and spatial domain shifts in cross-domain video data are reduced. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method in solving the challenging source-free UVDA task.
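To make the TAG idea above concrete, here is a minimal PyTorch-style sketch of centroid-based temporal attentive aggregation together with an MLP attention network that approximates it when labels are unavailable. Everything here (the class name TAGModule, the layer sizes, the dot-product similarity) is an illustrative assumption, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAGModule(nn.Module):
    """Illustrative sketch of Temporal Attentive aGgregation (hypothetical names)."""

    def __init__(self, feat_dim: int, num_classes: int, hidden_dim: int = 256):
        super().__init__()
        # Class centroids in feature space; used to score frame importance
        # when label information is available.
        self.centroids = nn.Parameter(torch.randn(num_classes, feat_dim))
        # MLP attention network trained to mimic the centroid-based weights
        # once labels (and source data) are no longer accessible.
        self.attn_net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def centroid_weights(self, frames: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame-level features; labels: (B,) class indices.
        # Weight each frame by its similarity to the video's class centroid,
        # normalized over the temporal axis.
        c = self.centroids[labels]                   # (B, D)
        sim = torch.einsum("btd,bd->bt", frames, c)  # (B, T)
        return F.softmax(sim, dim=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Label-free path: the MLP predicts per-frame importance directly.
        w = F.softmax(self.attn_net(frames).squeeze(-1), dim=1)  # (B, T)
        return torch.einsum("bt,btd->bd", w, frames)  # video-level feature
```

In this sketch, the centroid-based weights would drive source-domain training, while the MLP is distilled to reproduce them (e.g., via an L2 loss between the two weight vectors), so that the label-free forward path remains usable on the unlabeled target domain and at test time.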

Supplementary Material

MP4 File (icmr22-modfp187.mp4)
Despite the inspiring progress made in unsupervised video domain adaptation, the assumption that source domain data is available for training the target model may not always hold in practice. We therefore propose Source-free Temporal Attentive Domain Adaptation for Video Action Recognition, the first work to solve this interesting but challenging source-free unsupervised domain adaptation task. An innovative Temporal Attentive aGgregation (TAG) module is proposed to generate video-level representations from frame-level features. In the TAG module, attention weights are determined by class centroids. When label information is not available, the attentive aggregation function is replaced by an approximated attention network for feature aggregation. Frame-level and video-level loss functions are designed based on class centroids, attention approximation, supervised classification in the source domain, and information maximization in the target domain. By minimizing these loss functions, both the temporal and spatial domain shifts can be mitigated. Extensive experiments show that our proposed method outperforms the state-of-the-art.
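The information-maximization term mentioned above is not spelled out on this page; as a rough sketch, the formulation commonly used in source-free adaptation encourages each target prediction to be confident (low per-sample entropy) while keeping the batch-averaged prediction diverse (high entropy of the mean), which prevents collapse onto a single class. The function below illustrates that standard form; the paper's exact loss and weighting may differ.

```python
import torch
import torch.nn.functional as F

def information_maximization_loss(logits: torch.Tensor) -> torch.Tensor:
    # logits: (B, C) classifier outputs for a batch of target-domain videos.
    probs = F.softmax(logits, dim=1)
    # Per-sample entropy: minimized so each prediction becomes confident.
    ent = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    # Entropy of the batch-mean prediction: maximized (hence subtracted)
    # so predictions stay spread across classes instead of collapsing.
    mean_p = probs.mean(dim=0)
    div = -(mean_p * torch.log(mean_p + 1e-8)).sum()
    return ent - div
```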




Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages
ISBN:9781450392389
DOI:10.1145/3512527
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. action recognition
  2. source-free domain adaptation
  3. temporal attentive aggregation

Qualifiers

  • Research-article


Conference

ICMR '22

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%



Bibliometrics & Citations


Article Metrics

  • Downloads (Last 12 months): 21
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 12 Nov 2024


Cited By

  • (2024) Source-free unsupervised domain adaptation. Neural Networks 174:C. DOI: 10.1016/j.neunet.2024.106230. Online publication date: 1-Jun-2024.
  • (2023) Multi-Source Video Domain Adaptation With Temporal Attentive Moment Alignment Network. IEEE Transactions on Circuits and Systems for Video Technology 33(8): 3860-3871. DOI: 10.1109/TCSVT.2023.3234307. Online publication date: 1-Aug-2023.
