
Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards

Published: 15 October 2019
DOI: 10.1145/3343031.3350996

Abstract

Generating image descriptions in different languages is essential to serving users worldwide. However, collecting a large-scale paired image-caption dataset for every target language, which is critical for training decent image captioning models, is prohibitively expensive. Previous works tackle unpaired cross-lingual image captioning through a pivot language, i.e., with the help of paired image-caption data in the pivot language and pivot-to-target machine translation models. However, such a language-pivoted approach suffers from inaccuracies introduced by the pivot-to-target translation, including disfluency and visual irrelevancy errors. In this paper, we propose to generate cross-lingual image captions with self-supervised rewards in a reinforcement learning framework to alleviate these two types of errors. We employ self-supervision from a monolingual corpus in the target language to provide a fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards. We conduct extensive experiments on unpaired cross-lingual image captioning in both English and Chinese on two widely used image caption corpora. The proposed approach achieves significant performance improvements over state-of-the-art methods.
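
As context for the approach the abstract describes, the following is a minimal, hypothetical sketch (not the authors' released code) of how a self-critical policy-gradient update could combine a fluency reward from a target-language language model with sentence-level and concept-level visual relevancy rewards. All module names, tensor interfaces, and reward weights below are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of the reward structure the abstract describes, assuming
# PyTorch. `lm`, the embedding inputs, and the weight `alpha` are
# hypothetical stand-ins, not the paper's actual components.
import torch
import torch.nn.functional as F

def fluency_reward(lm, caption_ids):
    """Average per-token log-likelihood of a caption under a language
    model trained on a monolingual target-language corpus."""
    with torch.no_grad():
        logits = lm(caption_ids[:, :-1])                      # (B, T-1, V)
        logp = F.log_softmax(logits, dim=-1)
        tok = logp.gather(-1, caption_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return tok.mean(dim=1)                                    # (B,)

def visual_reward(img_emb, cap_emb, concept_scores, concept_mask):
    """Sentence-level cosine similarity in a joint visual-semantic
    embedding space, plus a concept-level term rewarding captions that
    mention visual concepts detected in the image."""
    sent = F.cosine_similarity(img_emb, cap_emb)              # (B,)
    concept = (concept_scores * concept_mask).sum(dim=1)      # (B,)
    return sent + concept

def total_reward(r_fluency, r_visual, alpha=0.5):
    # Hypothetical linear mix of the two self-supervised rewards.
    return alpha * r_fluency + (1.0 - alpha) * r_visual

def scst_loss(sample_logp, r_sample, r_greedy):
    """Self-critical sequence training (Rennie et al., 2017): the reward
    of the greedily decoded caption serves as the baseline, so only
    sampled captions that beat the greedy decode are reinforced.
    Padding masks are omitted for brevity."""
    advantage = (r_sample - r_greedy).detach()                # (B,)
    return -(advantage.unsqueeze(1) * sample_logp).mean()
```

In this kind of setup, the sampled and greedy captions would each be scored with `total_reward`, and `scst_loss` would be backpropagated through the captioner's log-probabilities only, with the reward models kept frozen.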




Information

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019


Author Tags

  1. cross-lingual
  2. image captioning
  3. reinforcement learning
  4. self-supervision

Qualifiers

  • Research-article

Funding Sources

  • Beijing Natural Science Foundation
  • National Natural Science Foundation of China
  • National Key Research and Development Plan

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions (27%)
Overall Acceptance Rate: 995 of 4,171 submissions (24%)




Cited By
  • (2024) Color Enhanced Cross Correlation Net for Image Sentiment Analysis. IEEE Transactions on Multimedia, 26, 4097-4109. https://doi.org/10.1109/TMM.2021.3118208
  • (2024) RA3: A Human-in-the-loop Framework for Interpreting and Improving Image Captioning with Relation-Aware Attribution Analysis. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 330-341. https://doi.org/10.1109/ICDE60146.2024.00032
  • (2023) Visual-linguistic-stylistic Triple Reward for Cross-lingual Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(4), 1-23. https://doi.org/10.1145/3634917
  • (2023) CITE: Compact Interactive TransformEr for Multilingual Image Captioning. Proceedings of the 2023 6th International Conference on Image and Graphics Processing, 175-181. https://doi.org/10.1145/3582649.3582658
  • (2023) From Show to Tell: A Survey on Deep Learning-Based Image Captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 539-559. https://doi.org/10.1109/TPAMI.2022.3148210
  • (2023) TSFNet: Triple-Steam Image Captioning. IEEE Transactions on Multimedia, 25, 6904-6916. https://doi.org/10.1109/TMM.2022.3215861
  • (2023) Unpaired Image Captioning by Image-Level Weakly-Supervised Visual Concept Recognition. IEEE Transactions on Multimedia, 25, 6702-6716. https://doi.org/10.1109/TMM.2022.3214090
  • (2023) Knowing What it is: Semantic-Enhanced Dual Attention Transformer. IEEE Transactions on Multimedia, 25, 3723-3736. https://doi.org/10.1109/TMM.2022.3164787
  • (2023) Memorial GAN With Joint Semantic Optimization for Unpaired Image Captioning. IEEE Transactions on Cybernetics, 53(7), 4388-4399. https://doi.org/10.1109/TCYB.2022.3175012
  • (2023) Dual-Path Rare Content Enhancement Network for Image and Text Matching. IEEE Transactions on Circuits and Systems for Video Technology, 33(10), 6144-6158. https://doi.org/10.1109/TCSVT.2023.3254530
