research-article

High-Order Interaction Learning for Image Captioning

Authors:

Yanhui Wang,

Ning Xu,

An-An Liu,

Wenhui Li,

Yongdong Zhang

IEEE Transactions on Circuits and Systems for Video Technology, Volume 32, Issue 7

Pages 4417 - 4430

https://doi.org/10.1109/TCSVT.2021.3121062

Published: 01 July 2022

Abstract

Image captioning aims to understand the semantic concepts in an image (e.g., objects and relationships) and integrate them into a sentence-level description. Hence, it is necessary to learn the interactions among these concepts. If we define the context of an interaction as a <italic>subject-predicate-object</italic> triplet, most current methods focus only on a single triplet, i.e., the first-order interaction, to generate sentences. Intuitively, humans are able to perceive the high-order interaction among concepts from two or more triplets when describing an image. For example, when we see the triplets <italic>man-cutting-sandwich</italic> and <italic>man-with-knife</italic>, it is natural to integrate them and predict the sentence <italic>man cutting sandwich with knife</italic>. This depends on the high-order interaction between <italic>cutting</italic> and <italic>knife</italic>, which belong to different triplets. Therefore, exploiting high-order interaction is expected to benefit image captioning, with a focus on reasoning. In this paper, we introduce a novel high-order interaction learning method over detected objects and relationships for image captioning within the encoder-decoder framework. We first extract a set of object and relationship features from an image. During the encoding stage, an interactive refining network is proposed to learn high-order representations by modeling intra- and inter-object feature interaction in a self-attention fashion. During the decoding stage, an interactive fusion network is proposed to integrate object and relationship information by strengthening their high-order interaction based on the language context for sentence generation. In this way, we learn the object-relationship dependencies in different stages, which provides abundant cues for both visual understanding and caption generation. Extensive experiments show that the proposed method achieves competitive performance against state-of-the-art methods on the MSCOCO dataset. Additional ablation studies further validate its effectiveness.
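
As a rough illustration of what modeling feature interaction "in the self-attention fashion" can look like, the sketch below applies generic single-head scaled dot-product self-attention to a stacked set of object and relationship features. It is not the paper's interactive refining network: the feature dimension, the random features and projection matrices, and the names `self_attention`, `objects`, and `relationships` are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(features, w_q, w_k, w_v):
    """Generic single-head scaled dot-product self-attention.

    features: (n, d) array, one row per object or relationship feature.
    Returns the refined (n, d) features and the (n, n) attention weights.
    """
    q = features @ w_q
    k = features @ w_k
    v = features @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical toy inputs: random vectors standing in for detected features.
rng = np.random.default_rng(0)
d = 8
objects = rng.normal(size=(3, d))        # e.g. man, sandwich, knife
relationships = rng.normal(size=(2, d))  # e.g. cutting, with

# Stacking both sets lets every feature attend to every other one,
# including features that come from different triplets.
tokens = np.concatenate([objects, relationships], axis=0)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

refined, weights = self_attention(tokens, w_q, w_k, w_v)
print(refined.shape)   # (5, 8)
print(weights.shape)   # (5, 5)
```

Each row of `weights` sums to 1, so a refined feature is a weighted mixture of all object and relationship features. In this toy setting, that mixture is where a cue such as <italic>knife</italic> could influence the representation of <italic>cutting</italic>.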

Index Terms

High-Order Interaction Learning for Image Captioning
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
    2. Natural language processing
  2. Machine learning

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

IEEE Transactions on Circuits and Systems for Video Technology Volume 32, Issue 7

July 2022

803 pages

ISSN:1051-8215

Publisher

IEEE Press

Qualifiers

Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 23
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Zhang X, Jia A, Ji J, Qu L, Ye Q (2025). Intra- and Inter-Head Orthogonal Attention for Image Captioning. IEEE Transactions on Image Processing 34 (594-607). doi:10.1109/TIP.2025.3528216. Online publication date: 1-Jan-2025
https://dl.acm.org/doi/10.1109/TIP.2025.3528216
Luo J, Li Y, Pan Y, Yao T, Feng J, Chao H, Mei T (2025). Exploring Vision-Language Foundation Model for Novel Object Captioning. IEEE Transactions on Circuits and Systems for Video Technology 35:1 (91-102). doi:10.1109/TCSVT.2024.3452437. Online publication date: 1-Jan-2025
https://dl.acm.org/doi/10.1109/TCSVT.2024.3452437
Mao Y, Xiao J, Zhang D, Cao M, Shao J, Zhuang Y, Chen L (2024). Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards. ACM Transactions on Multimedia Computing, Communications, and Applications 20:12 (1-24). doi:10.1145/3694683. Online publication date: 24-Sep-2024
https://dl.acm.org/doi/10.1145/3694683
Huang Q, Li P, Huang Y, Shuang F, Cai Y (2024). Region-Focused Network for Dense Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6 (1-20). doi:10.1145/3648370. Online publication date: 15-Feb-2024
https://dl.acm.org/doi/10.1145/3648370
An-An L, Zimu L, Ning X, Min L, Chenggang Y, Bolun Z, Bo L, Yulong D, Zhuang S, Xuanya L (2024). Multi-stage reasoning on introspecting and revising bias for visual question answering. ACM Transactions on the Web 18:4 (1-13). doi:10.1145/3616399. Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3616399
Lu Z, Jin L, Chen Z, Tian C, Sun X, Li X, Zhang Y, Li Q, Xu G (2024). Relation-Aware Multi-Pass Comparison Deconfounded Network for Change Captioning. IEEE Transactions on Circuits and Systems for Video Technology 34:12_Part_2 (13349-13363). doi:10.1109/TCSVT.2024.3445337. Online publication date: 1-Dec-2024
https://dl.acm.org/doi/10.1109/TCSVT.2024.3445337
Li J, Zhang L, Zhang K, Hu B, Xie H, Mao Z (2024). Cascade Semantic Prompt Alignment Network for Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology 34:7 (5266-5281). doi:10.1109/TCSVT.2023.3343520. Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1109/TCSVT.2023.3343520
Liu C, Mu Y (2024). Multi-Granularity Interaction for Multi-Person 3D Motion Prediction. IEEE Transactions on Circuits and Systems for Video Technology 34:3 (1546-1558). doi:10.1109/TCSVT.2023.3298755. Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.1109/TCSVT.2023.3298755
Zhang Y, Chen J, Ma X, Wang G, Bhatti U, Huang M (2024). Interactive medical image annotation using improved Attention U-net with compound geodesic distance. Expert Systems with Applications: An International Journal 237:PA. doi:10.1016/j.eswa.2023.121282. Online publication date: 27-Feb-2024
https://dl.acm.org/doi/10.1016/j.eswa.2023.121282
Sharma D, Dhiman C, Kumar D (2024). XGL-T transformer model for intelligent image captioning. Multimedia Tools and Applications 83:2 (4219-4240). doi:10.1007/s11042-023-15291-3. Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1007/s11042-023-15291-3