DOI: 10.1145/3343031.3350896

What I See Is What You See: Joint Attention Learning for First and Third Person Video Co-analysis

Published: 15 October 2019

Abstract

In recent years, a growing number of videos have been captured from the first-person viewpoint by wearable cameras. Such first-person video provides information beyond what traditional third-person video offers, and thus supports a wide range of applications. However, techniques for analyzing first-person video can be fundamentally different from those for third-person video, and it is even more difficult to exploit the information shared between the two viewpoints. In this paper, we propose a novel method for first- and third-person video co-analysis. At the core of our method is the notion of "joint attention", a learnable representation that corresponds to the shared attention regions in different viewpoints and thereby links the two. To this end, we develop a multi-branch deep network with a triplet loss to extract the joint attention from first- and third-person videos via self-supervised learning. We evaluate our method on a public dataset with cross-viewpoint video matching tasks; our method outperforms the state of the art both qualitatively and quantitatively. We also demonstrate, through a set of additional experiments, how the learned joint attention can benefit various applications.
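The training signal described in the abstract (a triplet loss that pulls a first-person clip's embedding toward the embedding of its synchronized third-person clip and pushes it away from an unrelated clip) can be illustrated with a minimal sketch. The `triplet_margin_loss` helper and the toy feature vectors below are illustrative stand-ins, not the authors' actual network, features, or training setup:

```python
import math

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: encourage the anchor (e.g. a first-person
    clip's embedding) to lie closer to the positive (the synchronized
    third-person clip) than to the negative (a mismatched clip) by at
    least `margin`."""
    d_pos = math.dist(anchor, positive)  # distance to the matching view
    d_neg = math.dist(anchor, negative)  # distance to the mismatched view
    return max(d_pos - d_neg + margin, 0.0)

# Toy embeddings standing in for learned joint-attention features.
first_person = [0.9, 0.1, 0.0]   # anchor: first-person clip
third_sync   = [1.0, 0.0, 0.0]   # positive: same moment, third-person view
third_other  = [0.0, 1.0, 0.5]   # negative: a different moment

loss = triplet_margin_loss(first_person, third_sync, third_other)
```

Here the toy triplet already satisfies the margin, so the loss is zero; in training, such a loss would be averaged over many sampled triplets and minimized by gradient descent, shaping an embedding space in which cross-viewpoint matching reduces to nearest-neighbor search.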





Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. co-analysis
  2. cross-view
  3. deep learning
  4. first-person video
  5. joint attention
  6. shared representation
  7. third-person video

Qualifiers

  • Research-article

Funding Sources

  • the Fundamental Research Funds for the Central Universities
  • the National Natural Science Foundation of China (NSFC)

Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%).
Overall acceptance rate: 995 of 4,171 submissions (24%).



Article Metrics

  • Downloads (last 12 months): 33
  • Downloads (last 6 weeks): 2
Reflects downloads up to 30 Nov 2024


Cited By

  • Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives. CVPR 2024, 19383-19400. DOI: 10.1109/CVPR52733.2024.01834 (16 Jun 2024)
  • Synchronization Is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs. ECCV 2024, 253-270. DOI: 10.1007/978-3-031-73220-1_15 (3 Nov 2024)
  • Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos. ECCV 2024, 407-425. DOI: 10.1007/978-3-031-72920-1_23 (1 Oct 2024)
  • EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding. ECCV 2024, 363-382. DOI: 10.1007/978-3-031-72661-3_21 (27 Nov 2024)
  • Multi-Dataset, Multitask Learning of Egocentric Vision Tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 6618-6630. DOI: 10.1109/TPAMI.2021.3061479 (1 Jun 2023)
  • First- and Third-Person Video Co-Analysis by Learning Spatial-Temporal Joint Attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 6631-6646. DOI: 10.1109/TPAMI.2020.3030048 (1 Jun 2023)
  • EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding. ICCV 2023, 5250-5261. DOI: 10.1109/ICCV51070.2023.00486 (1 Oct 2023)
  • Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition. ICCV 2023, 3284-3294. DOI: 10.1109/ICCV51070.2023.00306 (1 Oct 2023)
  • Holographic Feature Learning of Egocentric-Exocentric Videos for Multi-Domain Action Recognition. IEEE Transactions on Multimedia 24, 2273-2286. DOI: 10.1109/TMM.2021.3078882 (2022)
  • Co-Saliency Detection With Co-Attention Fully Convolutional Network. IEEE Transactions on Circuits and Systems for Video Technology 31(3), 877-889. DOI: 10.1109/TCSVT.2020.2992054 (Mar 2021)
