research-article

High-Order Interaction Learning for Image Captioning

Published: 01 July 2022

Abstract

Image captioning aims to understand the semantic concepts in an image (e.g., objects and relationships) and to integrate them into a sentence-level description. Hence, it is necessary to learn the interaction among these concepts. If we define the context of an interaction as a subject-predicate-object triplet, most current methods focus only on a single triplet, i.e., first-order interaction, to generate sentences. Intuitively, humans are able to perceive the high-order interaction among concepts from two or more triplets when describing an image. For example, when we see the triplets man-cutting-sandwich and man-with-knife, it is natural to integrate them and predict the sentence "man cutting sandwich with knife". This depends on the high-order interaction between "cutting" and "knife" in different triplets. Exploiting high-order interaction is therefore expected to benefit image captioning and to emphasize reasoning. In this paper, we introduce a novel high-order interaction learning method over detected objects and relationships for image captioning within the encoder-decoder framework. We first extract a set of object and relationship features from an image. During the encoding stage, an interactive refining network is proposed to learn high-order representations by modeling intra- and inter-object feature interaction in a self-attention fashion. During the decoding stage, an interactive fusion network is proposed to integrate object and relationship information by strengthening their high-order interaction based on language context for sentence generation. In this way, we learn the object-relationship dependencies in different stages, which can provide abundant cues for both visual understanding and caption generation.
Extensive experiments show that the proposed method achieves competitive performance against state-of-the-art methods on the MSCOCO dataset. Additional ablation studies further validate its effectiveness.
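
To make the idea of feature interaction "in a self-attention fashion" concrete, the following is a minimal sketch of generic scaled dot-product self-attention applied jointly to object and relationship feature vectors. It is not the interactive refining network described in the abstract: the projections are the identity (no learned weights), and the names `self_attention`, `object_feats`, and `relation_feats`, as well as the toy feature values, are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    Each output vector is a convex combination of all input vectors,
    weighted by the similarity between the query and every key.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    refined = []
    for query in features:
        # Similarity of this feature to every feature (itself included).
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in features
        ]
        weights = softmax(scores)
        # Weighted sum of all feature vectors.
        refined.append([
            sum(w * value[d] for w, value in zip(weights, features))
            for d in range(dim)
        ])
    return refined


# Hypothetical toy features; real systems would use detector outputs.
object_feats = [
    [1.0, 0.0, 0.0, 0.0],  # stands in for "man"
    [0.0, 1.0, 0.0, 0.0],  # stands in for "knife"
]
relation_feats = [
    [0.5, 0.5, 0.0, 0.0],  # stands in for "cutting"
]

# Attending over objects and relationships together lets each
# refined vector mix information from both kinds of concept.
tokens = object_feats + relation_feats
refined = self_attention(tokens)
```

In this sketch every refined vector depends on all objects and relationships at once, which is one simple way that concepts from different triplets (such as "cutting" and "knife") could influence each other; the paper's actual networks presumably add learned projections and further structure.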

Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 32, Issue 7
July 2022
803 pages

Publisher

IEEE Press
