DOI: 10.1145/3474085.3475607

Direction Relation Transformer for Image Captioning

Published: 17 October 2021

Abstract

Image captioning is a challenging task at the intersection of computer vision and natural language processing: generating a textual description of the content of an image. Recently, Transformer-based encoder-decoder architectures have shown great success in image captioning, where the multi-head attention mechanism is used to capture the contextual interactions between object regions. However, such methods treat region features as a bag of tokens and ignore the directional relationships between them, which makes it hard to understand the relative positions of objects in the image and to generate correct captions. In this paper, we propose a novel Direction Relation Transformer, termed DRT, which improves the perception of orientation between visual features by incorporating relative direction embeddings into multi-head attention. We first generate a relative direction matrix from the positional information of the object regions, and then explore three forms of direction-aware multi-head attention to integrate the direction embeddings into the Transformer architecture. We conduct experiments on the challenging Microsoft COCO image captioning benchmark. The quantitative and qualitative results demonstrate that, by integrating relative directional relations, our approach achieves significant improvements over the baseline model on all evaluation metrics; e.g., DRT improves the task-specific CIDEr score from 129.7% to 133.2% on the offline "Karpathy" test split.
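The abstract does not spell out how the relative direction matrix is built or what the three direction-aware attention variants look like, so the following minimal PyTorch sketch only illustrates the general idea under stated assumptions: pairwise angles between region-box centers are quantized into a fixed number of direction bins (the binning scheme and `num_bins` are assumptions), and a learned per-head bias looked up from the bin index is added to the attention logits (one plausible form of direction-aware attention). All names, such as `relative_direction_matrix` and `DirectionAwareAttention`, are hypothetical, and batching is omitted for clarity.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_direction_matrix(boxes, num_bins=8):
    """Quantize pairwise angles between region centers into discrete
    direction indices (hypothetical binning; the paper's exact scheme
    is not given in the abstract).

    boxes: (N, 4) tensor of [x1, y1, x2, y2] region coordinates.
    Returns an (N, N) long tensor of direction-bin indices.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    dx = cx.unsqueeze(0) - cx.unsqueeze(1)   # (N, N) x-offsets from region i to j
    dy = cy.unsqueeze(0) - cy.unsqueeze(1)   # (N, N) y-offsets from region i to j
    angle = torch.atan2(dy, dx)              # pairwise angles in [-pi, pi]
    bins = torch.floor((angle + math.pi) / (2 * math.pi / num_bins))
    return bins.clamp(0, num_bins - 1).long()

class DirectionAwareAttention(nn.Module):
    """One plausible form of direction-aware multi-head attention:
    a learned per-head scalar bias, indexed by the relative direction
    bin, is added to the attention logits before the softmax."""

    def __init__(self, d_model=512, num_heads=8, num_bins=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one learned scalar bias per (direction bin, head) pair
        self.dir_bias = nn.Embedding(num_bins, num_heads)

    def forward(self, x, dir_idx):
        # x: (N, d_model) region features; dir_idx: (N, N) direction indices
        N = x.size(0)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(N, self.num_heads, self.d_head).transpose(0, 1)  # (H, N, d)
        k = k.view(N, self.num_heads, self.d_head).transpose(0, 1)
        v = v.view(N, self.num_heads, self.d_head).transpose(0, 1)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (H, N, N)
        logits = logits + self.dir_bias(dir_idx).permute(2, 0, 1)   # add direction bias
        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(0, 1).reshape(N, -1)             # (N, d_model)
        return self.out(out)
```

For example, given region features `x` of shape `(N, 512)` and their bounding boxes, `DirectionAwareAttention()(x, relative_direction_matrix(boxes))` yields direction-conditioned region features. The paper explores three integration forms; this additive-bias variant is only one plausible reading of the abstract, not the authors' confirmed method.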




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. direction embedding
  2. direction relation transformer
  3. image captioning
  4. multi-head attention

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program
  • National Natural Science Foundation of China

Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 54
  • Downloads (last 6 weeks): 4
Reflects downloads up to 12 Nov 2024


Cited By

  • (2024) Region-Focused Network for Dense Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(6), 1-20. DOI: 10.1145/3648370. Online publication date: 26-Mar-2024
  • (2024) Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-23. DOI: 10.1145/3638558. Online publication date: 22-Jan-2024
  • (2024) Cascade Semantic Prompt Alignment Network for Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology 34(7), 5266-5281. DOI: 10.1109/TCSVT.2023.3343520. Online publication date: Jul-2024
  • (2024) Multi-Modal Graph Aggregation Transformer for Image Captioning. Neural Networks, 106813. DOI: 10.1016/j.neunet.2024.106813. Online publication date: Oct-2024
  • (2023) A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages. Applied Sciences 13(19), 10894. DOI: 10.3390/app131910894. Online publication date: 30-Sep-2023
  • (2023) Deep-learning-based image captioning: analysis and prospects. Journal of Image and Graphics 28(9), 2788-2816. DOI: 10.11834/jig.220660. Online publication date: 2023
  • (2023) Learning from Easy to Hard Pairs: Multi-step Reasoning Network for Human-Object Interaction Detection. Proceedings of the 31st ACM International Conference on Multimedia, 4368-4377. DOI: 10.1145/3581783.3612581. Online publication date: 26-Oct-2023
  • (2023) Improving Image Captioning through Visual and Semantic Mutual Promotion. Proceedings of the 31st ACM International Conference on Multimedia, 4716-4724. DOI: 10.1145/3581783.3612480. Online publication date: 27-Oct-2023
  • (2023) CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning. Proceedings of the 31st ACM International Conference on Multimedia, 1750-1758. DOI: 10.1145/3581783.3612245. Online publication date: 26-Oct-2023
  • (2023) A Comparative Analysis of Image Captioning Techniques. 2023 4th International Conference for Emerging Technology (INCET), 1-8. DOI: 10.1109/INCET57972.2023.10170043. Online publication date: 26-May-2023
