
Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards

Published: 15 October 2019
DOI: 10.1145/3343031.3350996

Abstract

Generating image descriptions in different languages is essential to serving users worldwide. However, collecting a large-scale paired image-caption dataset for every target language, which is critical for training decent image captioning models, is prohibitively expensive. Previous works tackle unpaired cross-lingual image captioning through a pivot language, i.e., with the help of paired image-caption data in the pivot language and pivot-to-target machine translation models. However, such a language-pivoted approach suffers from inaccuracies introduced by the pivot-to-target translation, including disfluency and visual irrelevancy errors. In this paper, we propose to generate cross-lingual image captions with self-supervised rewards in a reinforcement learning framework to alleviate these two types of errors. We employ self-supervision from a monolingual corpus in the target language to provide a fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards. We conduct extensive experiments on unpaired cross-lingual image captioning in both English and Chinese on two widely used image caption corpora. The proposed approach achieves significant performance improvements over state-of-the-art methods.
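
As context for the approach the abstract describes, the following is a minimal, hypothetical sketch (not the authors' released code) of how a self-critical policy-gradient update could combine a fluency reward from a target-language language model with sentence-level and concept-level visual relevancy rewards. All module names, tensor interfaces, and reward weights below are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of the reward structure the abstract describes, assuming
# PyTorch. `lm`, the embedding inputs, and the weight `alpha` are
# hypothetical stand-ins, not the paper's actual components.
import torch
import torch.nn.functional as F

def fluency_reward(lm, caption_ids):
    """Average per-token log-likelihood of a caption under a language
    model trained on a monolingual target-language corpus."""
    with torch.no_grad():
        logits = lm(caption_ids[:, :-1])                      # (B, T-1, V)
        logp = F.log_softmax(logits, dim=-1)
        tok = logp.gather(-1, caption_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return tok.mean(dim=1)                                    # (B,)

def visual_reward(img_emb, cap_emb, concept_scores, concept_mask):
    """Sentence-level cosine similarity in a joint visual-semantic
    embedding space, plus a concept-level term rewarding captions that
    mention visual concepts detected in the image."""
    sent = F.cosine_similarity(img_emb, cap_emb)              # (B,)
    concept = (concept_scores * concept_mask).sum(dim=1)      # (B,)
    return sent + concept

def total_reward(r_fluency, r_visual, alpha=0.5):
    # Hypothetical linear mix of the two self-supervised rewards.
    return alpha * r_fluency + (1.0 - alpha) * r_visual

def scst_loss(sample_logp, r_sample, r_greedy):
    """Self-critical sequence training (Rennie et al., 2017): the reward
    of the greedily decoded caption serves as the baseline, so only
    sampled captions that beat the greedy decode are reinforced.
    Padding masks are omitted for brevity."""
    advantage = (r_sample - r_greedy).detach()                # (B,)
    return -(advantage.unsqueeze(1) * sample_logp).mean()
```

In this kind of setup, the sampled and greedy captions would each be scored with `total_reward`, and `scst_loss` would be backpropagated through the captioner's log-probabilities only, with the reward models kept frozen.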




Information

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019


Author Tags

  1. cross-lingual
  2. image captioning
  3. reinforcement learning
  4. self-supervision

Qualifiers

  • Research-article

Funding Sources

  • Beijing Natural Science Foundation
  • National Natural Science Foundation of China
  • National Key Research and Development Plan

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions (27%)
Overall Acceptance Rate: 995 of 4,171 submissions (24%)




Cited By
  • (2024) Color Enhanced Cross Correlation Net for Image Sentiment Analysis. IEEE Transactions on Multimedia, 26, 4097-4109. https://doi.org/10.1109/TMM.2021.3118208
  • (2024) RA3: A Human-in-the-loop Framework for Interpreting and Improving Image Captioning with Relation-Aware Attribution Analysis. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 330-341. https://doi.org/10.1109/ICDE60146.2024.00032
  • (2023) Visual-linguistic-stylistic Triple Reward for Cross-lingual Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(4), 1-23. https://doi.org/10.1145/3634917
  • (2023) CITE: Compact Interactive TransformEr for Multilingual Image Captioning. Proceedings of the 2023 6th International Conference on Image and Graphics Processing, 175-181. https://doi.org/10.1145/3582649.3582658
  • (2023) From Show to Tell: A Survey on Deep Learning-Based Image Captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 539-559. https://doi.org/10.1109/TPAMI.2022.3148210
  • (2023) TSFNet: Triple-Steam Image Captioning. IEEE Transactions on Multimedia, 25, 6904-6916. https://doi.org/10.1109/TMM.2022.3215861
  • (2023) Unpaired Image Captioning by Image-Level Weakly-Supervised Visual Concept Recognition. IEEE Transactions on Multimedia, 25, 6702-6716. https://doi.org/10.1109/TMM.2022.3214090
  • (2023) Knowing What it is: Semantic-Enhanced Dual Attention Transformer. IEEE Transactions on Multimedia, 25, 3723-3736. https://doi.org/10.1109/TMM.2022.3164787
  • (2023) Memorial GAN With Joint Semantic Optimization for Unpaired Image Captioning. IEEE Transactions on Cybernetics, 53(7), 4388-4399. https://doi.org/10.1109/TCYB.2022.3175012
  • (2023) Dual-Path Rare Content Enhancement Network for Image and Text Matching. IEEE Transactions on Circuits and Systems for Video Technology, 33(10), 6144-6158. https://doi.org/10.1109/TCSVT.2023.3254530
