Nothing Special   »   [go: up one dir, main page]

End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration
Author:
Affiliation:

Clc Number:

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    To date, Transformer-based pre-trained models have demonstrated powerful capabilities of modality representation, leading to a shift towards a fully end-to-end paradigm for multimodal downstream tasks such as image captioning, and enabling better performance and faster inference. However, the grid features extracted with the pre-trained model lack regional visual information, which leads to inaccurate descriptions of the object content by the model. Thus, the applicability of using pre-trained models for image captioning remains largely unexplored. Toward this goal, this paper proposes a novel end-to-end image captioning method based on Visual Region Aggregation and Dual-level Collaboration (VRADC). Specifically, to learn regional visual information, this paper designs a visual region aggregation that aggregates grid features with similar semantics to obtain a compact visual region representation. Next, dual-level collaboration uses the cross-attention mechanism to learn more representative semantic information from the two visual features, which in turn generates more fine-grained descriptions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed method, VRADC, can significantly improve the quality of image captioning, and achieves state-of-the-art performance.

    Reference
    Related
    Cited by
Get Citation

Jingkuan Song, Pengpeng Zeng, Jiayang Gu, Jinkuan Zhu, Lianli Gao. End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration. International Journal of Software and Informatics, 2023,13(2):221~241

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:April 18,2022
  • Revised:
  • Adopted:August 24,2022
  • Online: June 29,2023
  • Published: