DOI: 10.1145/3474085.3475263
Research article · Open access

Video Visual Relation Detection via Iterative Inference

Published: 17 October 2021

Abstract

The core problem of video visual relation detection (VidVRD) lies in accurately classifying relation triplets, which comprise the classes of the subject and object entities and the predicate classes of the various relationships between them. Existing VidVRD approaches classify these three relation components either independently or in a cascaded manner, and thus fail to fully exploit the inter-dependency among them. To utilize this inter-dependency in tackling the challenges of visual relation recognition in videos, we propose a novel iterative relation inference approach for VidVRD. We derive our model from the viewpoint of joint relation classification, which is lightweight yet effective, and propose a training approach that better learns the dependency knowledge from likely correct triplet combinations. As such, the proposed inference approach is able to gradually refine each component based on its learnt dependency and the predictions of the other two. Our ablation studies show that this iterative relation inference empirically converges in a few steps and consistently boosts performance over the baselines. Further, we incorporate it into a newly designed VidVRD architecture, named VidVRD-II (Iterative Inference), which generalizes well across different datasets. Experiments show that VidVRD-II achieves state-of-the-art performance on both the ImageNet-VidVRD and VidOR benchmark datasets.
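To make the refinement loop concrete, here is a minimal PyTorch sketch of the idea, under stated assumptions: the module name, the linear dependency projections (pred_from_so, subj_from_po, obj_from_sp), the residual update, the fixed step count, and the class counts in the usage lines are all illustrative, not the paper's exact formulation. Only the overall mechanism follows the abstract: each component's scores are repeatedly refined from the current predictions of the other two through learned dependency mappings.

```python
import torch
import torch.nn as nn

class IterativeRelationInference(nn.Module):
    """Hypothetical sketch of iterative relation-triplet refinement.

    Each step refines the logits of every component (subject,
    predicate, object) using the softmax predictions of the other
    two, passed through learned dependency projections.
    """

    def __init__(self, n_subj: int, n_pred: int, n_obj: int, n_steps: int = 3):
        super().__init__()
        self.n_steps = n_steps
        # Learned mappings between the three label spaces; these carry
        # the dependency knowledge (illustrative parameterization).
        self.subj_from_po = nn.Linear(n_pred + n_obj, n_subj, bias=False)
        self.pred_from_so = nn.Linear(n_subj + n_obj, n_pred, bias=False)
        self.obj_from_sp = nn.Linear(n_subj + n_pred, n_obj, bias=False)

    def forward(self, subj_logits, pred_logits, obj_logits):
        s, p, o = subj_logits, pred_logits, obj_logits
        for _ in range(self.n_steps):
            ps, pp, po = s.softmax(-1), p.softmax(-1), o.softmax(-1)
            # Each component is re-estimated from its initial evidence
            # plus the dependency signal carried by the other two.
            s = subj_logits + self.subj_from_po(torch.cat([pp, po], dim=-1))
            p = pred_logits + self.pred_from_so(torch.cat([ps, po], dim=-1))
            o = obj_logits + self.obj_from_sp(torch.cat([ps, pp], dim=-1))
        return s, p, o

# Usage: refine initial logits for a batch of 8 candidate subject-object
# pairs (class counts here are arbitrary placeholders).
model = IterativeRelationInference(n_subj=80, n_pred=50, n_obj=80)
s, p, o = model(torch.randn(8, 80), torch.randn(8, 50), torch.randn(8, 80))
```

In this sketch the loop tends to saturate after a few iterations, since each update starts from the same initial logits and only the dependency term changes; this is consistent with the abstract's observation that the inference empirically converges in a few steps.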




    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. knowledge
    2. relation inference
    3. video understanding
    4. visual relation

    Qualifiers

    • Research-article

    Funding Sources

    • Sea-NExT Joint Lab Singapore

    Conference

MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 345
    • Downloads (last 6 weeks): 65
    Reflects downloads up to 24 Nov 2024

    Cited By

    • (2024) The 2nd International Workshop on Deep Multi-modal Generation and Retrieval. Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval, 1-6. DOI: 10.1145/3689091.3690093. Online publication date: 28-Oct-2024.
    • (2024) Open-Vocabulary Video Scene Graph Generation via Union-aware Semantic Alignment. Proceedings of the 32nd ACM International Conference on Multimedia, 8566-8575. DOI: 10.1145/3664647.3681061. Online publication date: 28-Oct-2024.
    • (2024) VrdONE: One-stage Video Visual Relation Detection. Proceedings of the 32nd ACM International Conference on Multimedia, 1437-1446. DOI: 10.1145/3664647.3680833. Online publication date: 28-Oct-2024.
    • (2024) In Defense of Clip-Based Video Relation Detection. IEEE Transactions on Image Processing, 33, 2759-2769. DOI: 10.1109/TIP.2024.3379935. Online publication date: 2024.
    • (2024) Video Visual Relation Detection Based on Trajectory Fusion. 2024 International Joint Conference on Neural Networks (IJCNN), 1-9. DOI: 10.1109/IJCNN60899.2024.10650663. Online publication date: 30-Jun-2024.
    • (2024) Leveraging spatial residual attention and temporal Markov networks for video action understanding. Neural Networks, 169(C), 378-387. DOI: 10.1016/j.neunet.2023.10.047. Online publication date: 4-Mar-2024.
    • (2024) Visual Relationship Transformation. Computer Vision – ECCV 2024, 251-272. DOI: 10.1007/978-3-031-73650-6_15. Online publication date: 21-Nov-2024.
    • (2023) Redundancy-aware Transformer for Video Question Answering. Proceedings of the 31st ACM International Conference on Multimedia, 3172-3180. DOI: 10.1145/3581783.3612577. Online publication date: 26-Oct-2023.
    • (2023) Partial Annotation-based Video Moment Retrieval via Iterative Learning. Proceedings of the 31st ACM International Conference on Multimedia, 4330-4339. DOI: 10.1145/3581783.3612088. Online publication date: 26-Oct-2023.
    • (2023) Triple Correlations-Guided Label Supplementation for Unbiased Video Scene Graph Generation. Proceedings of the 31st ACM International Conference on Multimedia, 5153-5163. DOI: 10.1145/3581783.3612024. Online publication date: 27-Oct-2023.
