Abstract
Human-object interaction (HOI) detection comprises two core problems: human-object association detection and interaction understanding. For association detection, previous methods tend to detect only obvious human-object interaction pairs while ignoring pairs with potential interaction relationships, which is contrary to real-world conditions. For interaction understanding, traditional methods struggle with long-tailed distributions and zero-shot detection and thus cannot flexibly handle complex, changing real-world scenarios. To this end, adaptive Query Learning for HOI Detection via vision-language knowledge Transfer (QLDT) is proposed. Specifically, a two-stage matching and scoring algorithm based on dynamically changing thresholds and scores is designed to discover obscure human-object (H-O) pairs and label them, enlarging the sample size. In addition, the vision-language pre-trained model GLIP (Grounded Language-Image Pre-training) is introduced to enhance the model's interaction understanding: visual and linguistic features of the images are extracted through GLIP, a cross-entropy loss minimizes the gap between these features and the predicted values, and the maximum of the prediction score and the obscure H-O pair score is taken as the final prediction, which preserves the model's positive predictions. The proposed method shows excellent performance on both the HICO-DET and V-COCO datasets: on HICO-DET, QLDT achieves 35.37% mAP on the full category set and 30.15% mAP on the rare categories, and improves on all five zero-shot metrics; on V-COCO, it achieves 62.74% mAP and 67.71% mAP under Scenario 1 and Scenario 2, respectively.
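The two-stage matching and max-score fusion described in the abstract can be sketched as follows. This is a minimal illustration only: the threshold values, decay schedule, and function names are assumptions for exposition, not the paper's actual algorithm or parameters.

```python
import numpy as np

def dynamic_match(pair_scores, base_thresh=0.5, decay=0.9, min_thresh=0.2):
    """Two-stage matching sketch (hypothetical parameters).

    Stage 1 keeps pairs whose scores clear the base threshold ("obvious"
    pairs). Stage 2 progressively lowers the threshold to recover lower-
    scoring "obscure" pairs, enlarging the positive sample set.
    """
    obvious = [i for i, s in enumerate(pair_scores) if s >= base_thresh]
    thresh, obscure = base_thresh, []
    while thresh > min_thresh and not obscure:
        thresh *= decay  # dynamically relax the threshold
        obscure = [i for i, s in enumerate(pair_scores)
                   if thresh <= s < base_thresh]
    return obvious, obscure

def fuse_scores(model_scores, obscure_pair_scores):
    """Take the element-wise maximum of the model's prediction scores and
    the obscure H-O pair scores, as the abstract describes for the final
    prediction."""
    return np.maximum(model_scores, obscure_pair_scores)
```

For example, with scores `[0.9, 0.48, 0.1]` the first stage keeps index 0 and the relaxed second stage recovers index 1, while index 2 stays below the floor.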
Code Availability
The code of this paper is available at https://github.com/KKKarlW/QLDT.git
References
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 770–778
Ren S, He K, Girshick R, Sun J (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2961–2969
Luo W, Zhang H, Li J, Wei X-S (2020) Learning semantically enhanced feature for fine-grained image classification. IEEE Signal Process Lett 27:1545–1549
Zhang H, Qian F, Shang F, Du W, Qian J, Yang J (2020) Global convergence guarantees of (a) gist for a family of nonconvex sparse learning problems. IEEE Trans Cybernet 52(5):3276–3288
Wu G, Ning X, Hou L, He F, Zhang H, Shankar A (2023) Three-dimensional softmax mechanism guided bidirectional gru networks for hyperspectral remote sensing image classification. Signal Process 212:109151
Zhang H, Qian F, Zhang B, Du W, Qian J, Yang J (2022) Incorporating linear regression problems into an adaptive framework with feasible optimizations. IEEE Trans Multimed
Li LH, Zhang P, Zhang H, Yang J, Li C, Zhong Y, Wang L, Yuan L, Zhang L, Hwang J-N et al (2022) Grounded language-image pre-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10965–10975
Xie C, Zeng F, Hu Y, Liang S, Wei Y (2023) Category query learning for human-object interaction classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15275–15284
Hou Z, Yu B, Qiao Y, Peng X, Tao D (2021) Affordance transfer learning for human-object interaction detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 495–504
Zhong X, Qu X, Ding C, Tao D (2021) Glance and gaze: inferring action-aware points for one-stage human-object interaction detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13234–13243
Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inf Process Syst 32
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J et al (2021) Learning transferable visual models from natural language supervision. In: Proceedings of the international conference on machine learning, pp 8748–8763
Li Y, Fan H, Hu R, Feichtenhofer C, He K (2023) Scaling language-image pre-training via masking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23390–23400
Wu M, Gu J, Shen Y, Lin M, Chen C, Sun X (2023) End-to-end zero-shot hoi detection via vision and language knowledge distillation. In: Proceedings of the AAAI conference on artificial intelligence, pp 2839–2846
Liao Y, Zhang A, Lu M, Wang Y, Li X, Liu S (2022) Gen-vlkt: simplify association and enhance interaction understanding for hoi detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20123–20132
Ning S, Qiu L, Liu Y, He X (2023) Hoiclip: efficient knowledge transfer for hoi detection with vision-language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23507–23517
Chao Y-W, Liu Y, Liu X, Zeng H, Deng J (2018) Learning to detect human-object interactions. In: Proceedings of the 2018 IEEE winter conference on applications of computer vision, pp 381–389
Gupta S, Malik J (2015) Visual semantic role labeling. arXiv preprint arXiv:1505.04474
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: Proceedings of the european conference on computer vision, pp 213–229
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Kim B, Lee J, Kang J, Kim E-S, Kim HJ (2021) Hotr: end-to-end human-object interaction detection with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 74–83
Tamura M, Ohashi H, Yoshinaga T (2021) Qpic: query-based pairwise human-object interaction detection with image-wide contextual information. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10410–10419
Zhang A, Liao Y, Liu S, Lu M, Wang Y, Gao C, Li X (2021) Mining the benefits of two-stage and one-stage hoi detection. Adv Neural Inf Process Syst 34:17209–17220
Zhou D, Liu Z, Wang J, Wang L, Hu T, Ding E, Wang J (2022) Human-object interaction detection via disentangled transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19568–19577
Zou C, Wang B, Hu Y, Liu J, Wu Q, Zhao Y, Li B, Zhang C, Zhang C, Wei Y et al (2021) End-to-end human object interaction detection with hoi transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11825–11834
Chan S, Wang W, Shao Z, Bai C (2023) Sgpt: the secondary path guides the primary path in transformers for hoi detection. In: Proceedings of the IEEE international conference on robotics and automation, pp 7583–7590
Lei T, Caba F, Chen Q, Jin H, Peng Y, Liu Y (2023) Efficient adaptive human-object interaction detection with concept-guided memory. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6480–6490
Cao Y, Tang Q, Yang F, Su X, You S, Lu X, Xu C (2023) Re-mine, learn and reason: exploring the cross-modal semantic correlations for language-guided hoi detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 23492–23503
Chen M, Liao Y, Liu S, Chen Z, Wang F, Qian C (2021) Reformulating hoi detection as adaptive set prediction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9004–9013
Dong L, Li Z, Xu K, Zhang Z, Yan L, Zhong S, Zou X (2022) Category-aware transformer network for better human-object interaction detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19538–19547
Qu X, Ding C, Li X, Zhong X, Tao D (2022) Distillation using oracle queries for transformer-based human-object interaction detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19558–19567
Zhong X, Ding C, Li Z, Huang S (2022) Towards hard-positive query mining for detr-based human-object interaction detection. In: Proceedings of the european conference on computer vision, pp 444–460
Jia C, Yang Y, Xia Y, Chen Y-T, Parekh Z, Pham H, Le Q, Sung Y-H, Li Z, Duerig T (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In: Proceedings of the international conference on machine learning, pp 4904–4916
Zhou P, Chi M (2019) Relation parsing neural network for human-object interaction detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 843–851
Gu X, Lin T-Y, Kuo W, Cui Y (2021) Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921
Kamath A, Singh M, LeCun Y, Synnaeve G, Misra I, Carion N (2021) Mdetr-modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1780–1790
Patashnik O, Wu Z, Shechtman E, Cohen-Or D, Lischinski D (2021) Styleclip: text-driven manipulation of stylegan imagery. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2085–2094
Luo H, Ji L, Zhong M, Chen Y, Lei W, Duan N, Li T (2022) Clip4clip: an empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 508:293–304
Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Proceedings of the european conference on computer vision, pp 696–712
Rao Y, Zhao W, Chen G, Tang Y, Zhu Z, Huang G, Zhou J, Lu J (2022) Denseclip: language-guided dense prediction with context-aware prompting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 18082–18091
Shen L, Yeung S, Hoffman J, Mori G, Fei-Fei L (2018) Scaling human-object interaction recognition through zero-shot learning. In: Proceedings of the 2018 IEEE winter conference on applications of computer vision, pp 1568–1576
Bansal A, Rambhatla SS, Shrivastava A, Chellappa R (2020) Detecting human-object interactions via functional generalization. In: Proceedings of the AAAI conference on artificial intelligence, pp 10460–10469
Gupta T, Schwing A, Hoiem D (2019) No-frills human-object interaction detection: factorization, layout encodings, and training techniques. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9677–9685
Hou Z, Peng X, Qiao Y, Tao D (2020) Visual compositional learning for human-object interaction detection. In: Proceedings of the european conference on computer vision, pp 584–600
Hou Z, Yu B, Qiao Y, Peng X, Tao D (2021) Detecting human-object interaction via fabricated compositional learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14646–14655
Liu Y, Yuan J, Chen CW (2020) Consnet: learning consistency graph for zero-shot human-object interaction detection. In: Proceedings of the 28th ACM international conference on multimedia, pp 4235–4243
Peyre J, Laptev I, Schmid C, Sivic J (2019) Detecting unseen visual relations using analogies. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1981–1990
Xia L, Li R (2020) Multi-stream neural network fused with local information and global information for hoi detection. Appl Intell 50(12):4495–4505
He H, Yuan Y, Yue X, Hu H (2022) Rankseg: adaptive pixel classification with image category ranking for segmentation. In: Proceedings of the european conference on computer vision, pp 682–700
Gupta A, Narayan S, Joseph K, Khan S, Khan FS, Shah M (2022) Ow-detr: open-world detection transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9235–9244
Kuhn HW (1955) The hungarian method for the assignment problem. Naval Res Logist Q 2(1–2):83–97
Liu X, Li Y-L, Wu X, Tai Y-W, Lu C, Tang C-K (2022) Interactiveness field in human-object interactions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20113–20122
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Gkioxari G, Girshick R, Dollár P, He K (2018) Detecting and recognizing human-object interactions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8359–8367
Qi S, Wang W, Jia B, Shen J, Zhu S-C (2018) Learning human-object interactions by graph parsing neural networks. In: Proceedings of the european conference on computer vision, pp 401–417
Li Y-L, Zhou S, Huang X, Xu L, Ma Z, Fang H-S, Wang Y, Lu C (2019) Transferable interactiveness knowledge for human-object interaction detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3585–3594
Gao C, Xu J, Zou Y, Huang J-B (2020) Drg: dual relation graph for human-object interaction detection. In: Proceedings of the european conference on computer vision, pp 696–712
Ulutan O, Iftekhar A, Manjunath BS (2020) Vsgnet: spatial attention network for detecting human object interactions using graph convolutions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13617–13626
Li Y-L, Liu X, Wu X, Li Y, Lu C (2020) Hoi analysis: integrating and decomposing human-object interaction. Adv Neural Inf Process Syst 33:5011–5022
Zhang FZ, Campbell D, Gould S (2021) Spatially conditioned graphs for detecting human-object interactions. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 13319–13327
Zhang FZ, Campbell D, Gould S (2022) Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20104–20112
Liu X, Zhu X, Li M, Wang L, Zhu E, Liu T, Kloft M, Shen D, Yin J, Gao W (2019) Multiple kernel k-means with incomplete kernels. IEEE Trans Pattern Anal Mach Intell 42(5):1191–1204
Zhou Z, Zhang B, Yu X (2022) Immune coordination deep network for hand heat trace extraction. Infrared Phys Tech 127:104400
Yu X, Ye X, Zhang S (2022) Floating pollutant image target extraction algorithm based on immune extremum region. Digital Signal Process 123:103442
Yu X, Zhou Z, Gao Q, Li D, Ríha K (2018) Infrared image segmentation using growing immune field and clone threshold. Infrared Phys Tech 88:184–193
Author information
Contributions
Xincheng Wang performed the methodology and conceptualization; Yongbin Gao performed the review and supervision; Wenjun Yu performed the data curation; Chenmou Wu performed the validation; Mingxuan Chen performed the formal analysis; Honglei Ma performed the investigation; Zhichao Chen performed the data checks.
Ethics declarations
Ethical and Informed Consent for Data Used
Informed consent for the publication of this article was obtained from all authors and from Shanghai University of Engineering Science and COMAC Shanghai Aircraft Manufacturing Co., Ltd.
Competing Interests
The corresponding author of this paper is the associate editor of Applied Intelligence.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, X., Gao, Y., Yu, W. et al. QLDT: adaptive Query Learning for HOI Detection via vision-language knowledge Transfer. Appl Intell 54, 9008–9027 (2024). https://doi.org/10.1007/s10489-024-05653-1