Abstract
Encoder-decoder architectures are widely used for image captioning, most commonly with convolutional encoders and recurrent decoders. Recent transformer-based designs have achieved state-of-the-art performance on a range of language and vision tasks. This work investigates whether a transformer-based encoder and decoder can form an effective image-captioning pipeline. An adversarial objective, implemented with a Generative Adversarial Network, is used to improve the diversity of the generated captions. The generator component of our model combines a ViT encoder with a transformer decoder to produce semantically meaningful captions for a given image. To improve the quality and authenticity of the generated captions, we introduce a discriminator built on a transformer decoder that evaluates a caption jointly with its image. Training this architecture encourages the generator to produce captions that are indistinguishable from real ones, raising the overall quality of the generated outputs. Through extensive experimentation, we demonstrate the effectiveness of our approach in generating diverse and contextually appropriate captions for a variety of images. We evaluate the model on benchmark datasets and compare its performance against existing state-of-the-art image captioning methods. The proposed approach outperforms previous methods, as shown by improvements in BLEU-3, BLEU-4, and other caption accuracy metrics.
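To make the described architecture concrete, the following is a minimal PyTorch sketch of a generator (ViT-style encoder plus transformer decoder) and a transformer-decoder discriminator that scores a caption conditioned on the image. All class names, layer counts, dimensions, and the mean-pooled sigmoid scoring head are illustrative assumptions; the abstract does not specify the paper's exact configuration or training procedure.

```python
import torch
import torch.nn as nn


class ViTEncoder(nn.Module):
    """ViT-style encoder: patch embedding followed by a transformer encoder."""

    def __init__(self, image_size=224, patch_size=16, dim=512, depth=6, heads=8):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                        # images: (B, 3, H, W)
        x = self.patch_embed(images)                  # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)              # (B, num_patches, dim)
        return self.encoder(x + self.pos_embed)       # (B, num_patches, dim)


def causal_mask(size, device):
    """Upper-triangular mask so each caption position attends only to its past."""
    return torch.triu(torch.full((size, size), float("-inf"), device=device), diagonal=1)


class CaptionGenerator(nn.Module):
    """Generator: ViT encoder + transformer decoder predicting caption tokens."""

    def __init__(self, vocab_size, dim=512, depth=6, heads=8, max_len=40):
        super().__init__()
        self.encoder = ViTEncoder(dim=dim, depth=depth, heads=heads)
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, images, captions):              # captions: (B, T) token ids
        memory = self.encoder(images)                 # image patch features
        T = captions.size(1)
        tgt = self.token_embed(captions) + self.pos_embed[:, :T]
        out = self.decoder(tgt, memory, tgt_mask=causal_mask(T, captions.device))
        return self.lm_head(out)                      # (B, T, vocab_size) logits


class CaptionDiscriminator(nn.Module):
    """Discriminator: a transformer decoder that scores a caption against its image."""

    def __init__(self, vocab_size, dim=512, depth=3, heads=8, max_len=40):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.score = nn.Linear(dim, 1)

    def forward(self, image_features, captions):      # image_features from the encoder
        T = captions.size(1)
        tgt = self.token_embed(captions) + self.pos_embed[:, :T]
        out = self.decoder(tgt, image_features)       # cross-attend caption to image
        return torch.sigmoid(self.score(out.mean(dim=1)))  # probability caption is real
```

In an adversarial setup of this kind, the discriminator's real/fake score on (image, caption) pairs would serve as an additional training signal for the generator alongside the usual captioning objective, which is the mechanism the abstract credits for more diverse captions.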