DOI: 10.1145/3582649.3582658

CITE: Compact Interactive TransformEr for Multilingual Image Captioning

Published: 07 April 2023

Abstract

Current state-of-the-art image captioning models generate captions in a single language, so building a multilingual image captioning system requires combining multiple language-specific models. As the number of supported languages increases, however, the parameters of such a system grow linearly. To tackle this issue, we propose a single Compact Interactive TransformEr (CITE) model that can describe an image in multiple languages simultaneously, making the captioning system more compact. Specifically, building on the standard Transformer, we share the encoder and decoder backbone parameters across languages and replace the self-attention sub-layer in the decoder with an interactive attention sub-layer. In addition, we extend the traditional monolingual reinforcement learning mechanism to a multilingual version to promote better caption generation. Given the wide use of Chinese and English, we evaluate CITE by generating English and Chinese captions simultaneously. We extend the image captions of the whole MSCOCO dataset and release the resulting COCO-EN-CN dataset. Extensive experiments on COCO-EN-CN show that our single, more parameter-efficient CITE model matches or even surpasses the performance of monolingual captioning models.
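To make the architecture concrete, below is a minimal PyTorch-style sketch of a decoder layer in which the usual masked self-attention sub-layer is replaced by an interactive attention sub-layer. It assumes interactive attention fuses attention over a caption's own prefix with attention over the other language's partial caption via a simple learned gate; all module names, the fusion scheme, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class InteractiveAttentionDecoderLayer(nn.Module):
    """Sketch only: one decoder layer shared by both language streams."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Causal attention over this language's own partial caption.
        self.own_attn = nn.MultiheadAttention(d_model, n_heads,
                                              dropout=dropout, batch_first=True)
        # Interactive attention over the other language's partial caption.
        self.other_attn = nn.MultiheadAttention(d_model, n_heads,
                                                dropout=dropout, batch_first=True)
        # Cross-attention over image features from the shared encoder.
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads,
                                                 dropout=dropout, batch_first=True)
        self.gate = nn.Parameter(torch.tensor(0.5))  # assumed fusion weight
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x_self, x_other, img_mem, causal_mask):
        # x_self:  (B, T, d) states of the language currently being decoded
        # x_other: (B, T, d) states of the other language's decoder stream
        # img_mem: (B, N, d) image region features from the shared encoder
        h_own, _ = self.own_attn(x_self, x_self, x_self, attn_mask=causal_mask)
        h_oth, _ = self.other_attn(x_self, x_other, x_other)
        x = self.norms[0](x_self + self.drop(h_own + self.gate * h_oth))
        h_vis, _ = self.visual_attn(x, img_mem, img_mem)
        x = self.norms[1](x + self.drop(h_vis))
        return self.norms[2](x + self.drop(self.ffn(x)))
```

Because the layer's parameters are shared, decoding would invoke it twice per time step, once per language, with the roles of x_self and x_other swapped; that is how a single parameter set can serve both languages. The multilingual reinforcement learning objective can likewise be read as a per-language self-critical baseline whose losses are summed; the equal weighting and the greedy-decoded baseline in the sketch below are assumptions:

```python
def multilingual_scst_loss(logp, rewards):
    """logp:    dict lang -> (B,) summed token log-probs of sampled captions
    rewards: dict lang -> (r_sample, r_greedy) pair of (B,) reward tensors"""
    loss = 0.0
    for lang, lp in logp.items():
        r_sample, r_greedy = rewards[lang]
        advantage = r_sample - r_greedy          # greedy caption as baseline
        loss = loss - (advantage.detach() * lp).mean()
    return loss
```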

Cited By

(2023) Positional Feature Generator-Based Transformer for Image Captioning. In 2023 18th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), 418-425. DOI: 10.1109/ISKE60036.2023.10481500. Online publication date: 17-Nov-2023.

Index Terms

  1. CITE: Compact Interactive TransformEr for Multilingual Image Captioning

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      ICIGP '23: Proceedings of the 2023 6th International Conference on Image and Graphics Processing
      January 2023
      246 pages
ISBN: 9781450398572
DOI: 10.1145/3582649
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Interactive Attention
      2. Multilingual Image Captioning
      3. Parameter Sharing
      4. Reinforcement Learning
      5. Transformer

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • NSFC
      • The University Synergy Innovation Program of Anhui Province

      Conference

      ICIGP 2023
