
High-Order Interaction Learning for Image Captioning

Published: 01 July 2022

Abstract

Image captioning aims at understanding various semantic concepts (e.g., objects and relationships) in an image and integrating them into a sentence-level description. Hence, it is necessary to learn the interaction among these concepts. If we define the context of an interaction as a subject-predicate-object triplet, most current methods focus only on a single triplet, i.e., the first-order interaction, to generate sentences. Intuitively, humans can perceive the high-order interaction among concepts across two or more triplets when describing an image. For example, given the triplets man-cutting-sandwich and man-with-knife, it is natural to integrate them and predict the sentence "man cutting sandwich with knife". This depends on the high-order interaction between cutting and knife in different triplets. Therefore, exploiting high-order interaction is expected to benefit image captioning by supporting such reasoning. In this paper, we introduce a novel high-order interaction learning method over detected objects and relationships for image captioning under the umbrella of the encoder-decoder framework. We first extract a set of object and relationship features from an image. During the encoding stage, an interactive refining network is proposed to learn high-order representations by modeling intra- and inter-object feature interaction in a self-attention fashion. During the decoding stage, an interactive fusion network is proposed to integrate object and relationship information by strengthening their high-order interaction based on language context for sentence generation. In this way, we learn object-relationship dependencies at different stages, which provides abundant cues for both visual understanding and caption generation. Extensive experiments show that the proposed method achieves competitive performance against state-of-the-art methods on the MSCOCO dataset. Additional ablation studies further validate its effectiveness.
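The abstract describes a two-stage design: self-attention refinement of object and relationship features during encoding, and language-conditioned fusion during decoding. Below is a minimal PyTorch sketch of that general pattern only; the module names, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class InteractionRefiner(nn.Module):
    """Encoding-stage sketch: self-attention over the concatenated object and
    relationship features, so every feature can attend to all others
    (intra- and inter-object interaction)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_features, dim)
        refined, _ = self.attn(feats, feats, feats)
        return self.norm(feats + refined)  # residual connection

class InteractiveFusion(nn.Module):
    """Decoding-stage sketch: cross-attention from the current language-context
    vector to the refined visual features, producing one fused context vector
    per decoding step."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lang_ctx: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # lang_ctx: (batch, 1, dim) decoder state; visual: (batch, N, dim)
        fused, _ = self.cross_attn(lang_ctx, visual, visual)
        return fused.squeeze(1)

# Toy forward pass with random tensors standing in for detector outputs.
batch, n_obj, n_rel, dim = 2, 5, 4, 512
obj_feats = torch.randn(batch, n_obj, dim)   # object region features (assumed)
rel_feats = torch.randn(batch, n_rel, dim)   # relationship/triplet features (assumed)

refiner = InteractionRefiner(dim)
fusion = InteractiveFusion(dim)

visual = refiner(torch.cat([obj_feats, rel_feats], dim=1))  # encoding stage
lang_ctx = torch.randn(batch, 1, dim)                        # decoder state at step t
context = fusion(lang_ctx, visual)                           # decoding-stage fusion
print(context.shape)  # torch.Size([2, 512])
```

In the actual method, the fused context would feed the caption decoder at each word-generation step; here it is only printed to show the shapes line up.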




        Published In

        IEEE Transactions on Circuits and Systems for Video Technology, Volume 32, Issue 7
        July 2022, 803 pages

        Publisher

        IEEE Press

        Publication History

        Published: 01 July 2022

        Qualifiers

        • Research-article

        Cited By

        • (2025) Intra- and Inter-Head Orthogonal Attention for Image Captioning. IEEE Transactions on Image Processing, 34:594-607. DOI: 10.1109/TIP.2025.3528216. Online publication date: 1-Jan-2025.
        • (2025) Exploring Vision-Language Foundation Model for Novel Object Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 35(1):91-102. DOI: 10.1109/TCSVT.2024.3452437. Online publication date: 1-Jan-2025.
        • (2024) Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(12):1-24. DOI: 10.1145/3694683. Online publication date: 24-Sep-2024.
        • (2024) Region-Focused Network for Dense Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(6):1-20. DOI: 10.1145/3648370. Online publication date: 15-Feb-2024.
        • (2024) Multi-stage reasoning on introspecting and revising bias for visual question answering. ACM Transactions on the Web, 18(4):1-13. DOI: 10.1145/3616399. Online publication date: 8-Oct-2024.
        • (2024) Relation-Aware Multi-Pass Comparison Deconfounded Network for Change Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 34(12, Part 2):13349-13363. DOI: 10.1109/TCSVT.2024.3445337. Online publication date: 1-Dec-2024.
        • (2024) Cascade Semantic Prompt Alignment Network for Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 34(7):5266-5281. DOI: 10.1109/TCSVT.2023.3343520. Online publication date: 1-Jul-2024.
        • (2024) Multi-Granularity Interaction for Multi-Person 3D Motion Prediction. IEEE Transactions on Circuits and Systems for Video Technology, 34(3):1546-1558. DOI: 10.1109/TCSVT.2023.3298755. Online publication date: 1-Mar-2024.
        • (2024) Interactive medical image annotation using improved Attention U-net with compound geodesic distance. Expert Systems with Applications, 237(Part A). DOI: 10.1016/j.eswa.2023.121282. Online publication date: 27-Feb-2024.
        • (2024) XGL-T transformer model for intelligent image captioning. Multimedia Tools and Applications, 83(2):4219-4240. DOI: 10.1007/s11042-023-15291-3. Online publication date: 1-Jan-2024.
