Abstract
Image captioning is a challenging task that combines Computer Vision and Natural Language Processing to generate accurate and descriptive textual captions for input images. Research efforts in this field mainly focus on developing novel architectural components to extend image captioning models and on using large-scale image-text datasets crawled from the web to boost final performance. In this work, we explore an alternative to web-crawled data and augment the training dataset with synthetic images generated by a latent diffusion model. In particular, we propose a simple yet effective synthetic data augmentation framework that significantly improves the quality of captions generated by a standard Transformer-based model, leading to competitive results on the COCO dataset.
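The abstract summarizes the approach at a high level; as a rough illustration of the data-augmentation step it describes, the sketch below uses ground-truth captions as prompts for an off-the-shelf latent diffusion model to produce additional synthetic image-caption pairs. The Stable Diffusion checkpoint, the diffusers API calls, the annotation path, and the generation hyper-parameters are assumptions for illustration and do not reproduce the authors' exact pipeline.

```python
# Minimal sketch of the synthetic augmentation idea: use existing COCO
# captions as prompts for a latent diffusion model and save the generated
# images as extra (image, caption) training pairs. Checkpoint name, file
# paths, and generation hyper-parameters are illustrative assumptions.
import json
import os

import torch
from diffusers import StableDiffusionPipeline

# Publicly available latent diffusion checkpoint (assumed, not the paper's).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical path to a COCO-style annotation file.
with open("annotations/captions_train2014.json") as f:
    captions = [ann["caption"] for ann in json.load(f)["annotations"]]

os.makedirs("synthetic", exist_ok=True)
synthetic_pairs = []
for i, caption in enumerate(captions[:1000]):  # small subset for illustration
    image = pipe(caption, num_inference_steps=50, guidance_scale=7.5).images[0]
    path = f"synthetic/{i:06d}.jpg"
    image.save(path)
    # Pair the synthetic image with the caption that generated it; these pairs
    # would later be mixed with real COCO data when training the captioner.
    synthetic_pairs.append({"image": path, "caption": caption})

with open("synthetic/pairs.json", "w") as f:
    json.dump(synthetic_pairs, f)
```

In the actual framework, such synthetic pairs would be interleaved with real COCO data when training the Transformer-based captioning model; the mixing ratio and sampling strategy are design choices detailed in the paper, not fixed by this sketch.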
Acknowledgements
This work has been partially supported by the European Commission under the PNRR-M4C2 (PE00000013) project “FAIR - Future Artificial Intelligence Research”, by the Horizon Europe project “European Lighthouse on Safe and Secure AI (ELSA)” (HORIZON-CL4-2021-HUMAN-01-03), co-funded by the European Union, and by the PRIN project “CREATIVE: Cross-modal understanding and generation of Visual and textual content” (CUP B87G22000460001), co-funded by the Italian Ministry of University.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Caffagni, D., Barraco, M., Cornia, M., Baraldi, L., Cucchiara, R. (2023). SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14233. Springer, Cham. https://doi.org/10.1007/978-3-031-43148-7_10
Print ISBN: 978-3-031-43147-0
Online ISBN: 978-3-031-43148-7