DOI: 10.1145/3474085.3475263
Research article · Open access

Video Visual Relation Detection via Iterative Inference

Published: 17 October 2021

Abstract

The core problem of video visual relation detection (VidVRD) lies in accurately classifying relation triplets, which comprise the classes of the subject and object entities and the predicate classes of the various relationships between them. Existing VidVRD approaches classify these three relation components either independently or in a cascaded manner, and thus fail to fully exploit the inter-dependency among them. To utilize this inter-dependency in tackling the challenges of visual relation recognition in videos, we propose a novel iterative relation inference approach for VidVRD. We derive our model from the viewpoint of joint relation classification, which is lightweight yet effective, and propose a training approach that better learns the dependency knowledge from likely correct triplet combinations. As such, the proposed inference approach is able to gradually refine each component based on its learnt dependency and the predictions of the other two. Our ablation studies show that this iterative relation inference empirically converges in a few steps and consistently boosts performance over the baselines. Further, we incorporate it into a newly designed VidVRD architecture, named VidVRD-II (Iterative Inference), which generalizes well across different datasets. Experiments show that VidVRD-II achieves state-of-the-art performance on both the ImageNet-VidVRD and VidOR benchmark datasets.
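To make the refinement loop concrete, here is a minimal PyTorch sketch of the idea, under stated assumptions: the module name, the linear dependency projections (pred_from_so, subj_from_po, obj_from_sp), the residual update, the fixed step count, and the class counts in the usage lines are all illustrative, not the paper's exact formulation. Only the overall mechanism follows the abstract: each component's scores are repeatedly refined from the current predictions of the other two through learned dependency mappings.

```python
import torch
import torch.nn as nn

class IterativeRelationInference(nn.Module):
    """Hypothetical sketch of iterative relation-triplet refinement.

    Each step refines the logits of every component (subject,
    predicate, object) using the softmax predictions of the other
    two, passed through learned dependency projections.
    """

    def __init__(self, n_subj: int, n_pred: int, n_obj: int, n_steps: int = 3):
        super().__init__()
        self.n_steps = n_steps
        # Learned mappings between the three label spaces; these carry
        # the dependency knowledge (illustrative parameterization).
        self.subj_from_po = nn.Linear(n_pred + n_obj, n_subj, bias=False)
        self.pred_from_so = nn.Linear(n_subj + n_obj, n_pred, bias=False)
        self.obj_from_sp = nn.Linear(n_subj + n_pred, n_obj, bias=False)

    def forward(self, subj_logits, pred_logits, obj_logits):
        s, p, o = subj_logits, pred_logits, obj_logits
        for _ in range(self.n_steps):
            ps, pp, po = s.softmax(-1), p.softmax(-1), o.softmax(-1)
            # Each component is re-estimated from its initial evidence
            # plus the dependency signal carried by the other two.
            s = subj_logits + self.subj_from_po(torch.cat([pp, po], dim=-1))
            p = pred_logits + self.pred_from_so(torch.cat([ps, po], dim=-1))
            o = obj_logits + self.obj_from_sp(torch.cat([ps, pp], dim=-1))
        return s, p, o

# Usage: refine initial logits for a batch of 8 candidate subject-object
# pairs (class counts here are arbitrary placeholders).
model = IterativeRelationInference(n_subj=80, n_pred=50, n_obj=80)
s, p, o = model(torch.randn(8, 80), torch.randn(8, 50), torch.randn(8, 80))
```

In this sketch the loop tends to saturate after a few iterations, since each update starts from the same initial logits and only the dependency term changes; this is consistent with the abstract's observation that the inference empirically converges in a few steps.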




    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. knowledge
    2. relation inference
    3. video understanding
    4. visual relation

    Qualifiers

    • Research-article

    Funding Sources

    • Sea-NExT Joint Lab Singapore

    Conference

MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 345
    • Downloads (last 6 weeks): 65
    Reflects downloads up to 24 Nov 2024

    Cited By

    • (2024) The 2nd International Workshop on Deep Multi-modal Generation and Retrieval. Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval, 1-6. DOI: 10.1145/3689091.3690093. Online publication date: 28-Oct-2024.
    • (2024) Open-Vocabulary Video Scene Graph Generation via Union-aware Semantic Alignment. Proceedings of the 32nd ACM International Conference on Multimedia, 8566-8575. DOI: 10.1145/3664647.3681061. Online publication date: 28-Oct-2024.
    • (2024) VrdONE: One-stage Video Visual Relation Detection. Proceedings of the 32nd ACM International Conference on Multimedia, 1437-1446. DOI: 10.1145/3664647.3680833. Online publication date: 28-Oct-2024.
    • (2024) In Defense of Clip-Based Video Relation Detection. IEEE Transactions on Image Processing, 33, 2759-2769. DOI: 10.1109/TIP.2024.3379935. Online publication date: 2024.
    • (2024) Video Visual Relation Detection Based on Trajectory Fusion. 2024 International Joint Conference on Neural Networks (IJCNN), 1-9. DOI: 10.1109/IJCNN60899.2024.10650663. Online publication date: 30-Jun-2024.
    • (2024) Leveraging spatial residual attention and temporal Markov networks for video action understanding. Neural Networks, 169(C), 378-387. DOI: 10.1016/j.neunet.2023.10.047. Online publication date: 4-Mar-2024.
    • (2024) Visual Relationship Transformation. Computer Vision – ECCV 2024, 251-272. DOI: 10.1007/978-3-031-73650-6_15. Online publication date: 21-Nov-2024.
    • (2023) Redundancy-aware Transformer for Video Question Answering. Proceedings of the 31st ACM International Conference on Multimedia, 3172-3180. DOI: 10.1145/3581783.3612577. Online publication date: 26-Oct-2023.
    • (2023) Partial Annotation-based Video Moment Retrieval via Iterative Learning. Proceedings of the 31st ACM International Conference on Multimedia, 4330-4339. DOI: 10.1145/3581783.3612088. Online publication date: 26-Oct-2023.
    • (2023) Triple Correlations-Guided Label Supplementation for Unbiased Video Scene Graph Generation. Proceedings of the 31st ACM International Conference on Multimedia, 5153-5163. DOI: 10.1145/3581783.3612024. Online publication date: 27-Oct-2023.
