DOI: 10.1145/3474085.3475607

Direction Relation Transformer for Image Captioning

Published: 17 October 2021

Abstract

Image captioning is a challenging task at the intersection of computer vision and natural language processing: generating a textual description of the content of an image. Recently, Transformer-based encoder-decoder architectures have shown great success in image captioning, where the multi-head attention mechanism is used to capture the contextual interactions between object regions. However, such methods treat region features as a bag of tokens and ignore the directional relationships between them, which makes it hard to understand the relative positions of objects in the image and to generate correct captions. In this paper, we propose a novel Direction Relation Transformer, termed DRT, which improves the perception of orientation between visual features by incorporating relative direction embeddings into multi-head attention. We first generate a relative direction matrix from the positional information of the object regions, and then explore three forms of direction-aware multi-head attention to integrate the direction embeddings into the Transformer architecture. We conduct experiments on the challenging Microsoft COCO image captioning benchmark. The quantitative and qualitative results demonstrate that, by integrating relative directional relations, our approach achieves significant improvements over the baseline model on all evaluation metrics; e.g., DRT improves the task-specific CIDEr score from 129.7% to 133.2% on the offline "Karpathy" test split.
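The abstract does not spell out how the relative direction matrix is built or what the three direction-aware attention variants look like, so the following minimal PyTorch sketch only illustrates the general idea under stated assumptions: pairwise angles between region-box centers are quantized into a fixed number of direction bins (the binning scheme and `num_bins` are assumptions), and a learned per-head bias looked up from the bin index is added to the attention logits (one plausible form of direction-aware attention). All names, such as `relative_direction_matrix` and `DirectionAwareAttention`, are hypothetical, and batching is omitted for clarity.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_direction_matrix(boxes, num_bins=8):
    """Quantize pairwise angles between region centers into discrete
    direction indices (hypothetical binning; the paper's exact scheme
    is not given in the abstract).

    boxes: (N, 4) tensor of [x1, y1, x2, y2] region coordinates.
    Returns an (N, N) long tensor of direction-bin indices.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    dx = cx.unsqueeze(0) - cx.unsqueeze(1)   # (N, N) x-offsets from region i to j
    dy = cy.unsqueeze(0) - cy.unsqueeze(1)   # (N, N) y-offsets from region i to j
    angle = torch.atan2(dy, dx)              # pairwise angles in [-pi, pi]
    bins = torch.floor((angle + math.pi) / (2 * math.pi / num_bins))
    return bins.clamp(0, num_bins - 1).long()

class DirectionAwareAttention(nn.Module):
    """One plausible form of direction-aware multi-head attention:
    a learned per-head scalar bias, indexed by the relative direction
    bin, is added to the attention logits before the softmax."""

    def __init__(self, d_model=512, num_heads=8, num_bins=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one learned scalar bias per (direction bin, head) pair
        self.dir_bias = nn.Embedding(num_bins, num_heads)

    def forward(self, x, dir_idx):
        # x: (N, d_model) region features; dir_idx: (N, N) direction indices
        N = x.size(0)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(N, self.num_heads, self.d_head).transpose(0, 1)  # (H, N, d)
        k = k.view(N, self.num_heads, self.d_head).transpose(0, 1)
        v = v.view(N, self.num_heads, self.d_head).transpose(0, 1)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (H, N, N)
        logits = logits + self.dir_bias(dir_idx).permute(2, 0, 1)   # add direction bias
        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(0, 1).reshape(N, -1)             # (N, d_model)
        return self.out(out)
```

For example, given region features `x` of shape `(N, 512)` and their bounding boxes, `DirectionAwareAttention()(x, relative_direction_matrix(boxes))` yields direction-conditioned region features. The paper explores three integration forms; this additive-bias variant is only one plausible reading of the abstract, not the authors' confirmed method.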




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. direction embedding
  2. direction relation transformer
  3. image captioning
  4. multi-head attention

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program
  • National Natural Science Foundation of China

Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 54
  • Downloads (last 6 weeks): 4
Reflects downloads up to 12 Nov 2024


Cited By

  • (2024) Region-Focused Network for Dense Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(6), 1-20. DOI: 10.1145/3648370. Online publication date: 26-Mar-2024
  • (2024) Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-23. DOI: 10.1145/3638558. Online publication date: 22-Jan-2024
  • (2024) Cascade Semantic Prompt Alignment Network for Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology 34(7), 5266-5281. DOI: 10.1109/TCSVT.2023.3343520. Online publication date: Jul-2024
  • (2024) Multi-Modal Graph Aggregation Transformer for Image Captioning. Neural Networks, 106813. DOI: 10.1016/j.neunet.2024.106813. Online publication date: Oct-2024
  • (2023) A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages. Applied Sciences 13(19), 10894. DOI: 10.3390/app131910894. Online publication date: 30-Sep-2023
  • (2023) Deep-learning-based image captioning: analysis and prospects. Journal of Image and Graphics 28(9), 2788-2816. DOI: 10.11834/jig.220660. Online publication date: 2023
  • (2023) Learning from Easy to Hard Pairs: Multi-step Reasoning Network for Human-Object Interaction Detection. Proceedings of the 31st ACM International Conference on Multimedia, 4368-4377. DOI: 10.1145/3581783.3612581. Online publication date: 26-Oct-2023
  • (2023) Improving Image Captioning through Visual and Semantic Mutual Promotion. Proceedings of the 31st ACM International Conference on Multimedia, 4716-4724. DOI: 10.1145/3581783.3612480. Online publication date: 27-Oct-2023
  • (2023) CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning. Proceedings of the 31st ACM International Conference on Multimedia, 1750-1758. DOI: 10.1145/3581783.3612245. Online publication date: 26-Oct-2023
  • (2023) A Comparative Analysis of Image Captioning Techniques. 2023 4th International Conference for Emerging Technology (INCET), 1-8. DOI: 10.1109/INCET57972.2023.10170043. Online publication date: 26-May-2023
