SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning

  • Conference paper
  • In: Image Analysis and Processing – ICIAP 2023 (ICIAP 2023)

Abstract

Image captioning is a challenging task that combines Computer Vision and Natural Language Processing to generate accurate and descriptive textual captions for input images. Research efforts in this field mainly focus on developing novel architectural components to extend image captioning models and on using large-scale image-text datasets crawled from the web to boost final performance. In this work, we explore an alternative to web-crawled data and augment the training dataset with synthetic images generated by a latent diffusion model. In particular, we propose a simple yet effective synthetic data augmentation framework that is capable of significantly improving the quality of captions generated by a standard Transformer-based model, leading to competitive results on the COCO dataset.
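
As a rough illustration of the augmentation idea described above (a hypothetical sketch, not the authors' implementation; class names, file paths, and the mixing strategy are placeholders), synthetic image-caption pairs can simply be pooled with the real COCO pairs before training a Transformer-based captioner:

    # Hypothetical sketch: pool real and synthetic (image, caption) pairs for training.
    # Names and paths below are illustrative, not taken from the paper.
    from torch.utils.data import Dataset, ConcatDataset, DataLoader

    class CaptionPairs(Dataset):
        """Wraps a list of (image_path, caption) tuples; image loading omitted for brevity."""
        def __init__(self, pairs):
            self.pairs = pairs
        def __len__(self):
            return len(self.pairs)
        def __getitem__(self, idx):
            return self.pairs[idx]

    real = CaptionPairs([("coco/train2014/000001.jpg", "a dog running on the beach")])
    synthetic = CaptionPairs([("synth/000001.png", "a dog running on the beach")])

    # Simple pooling; the actual ratio of synthetic to real data is an experimental choice.
    train_set = ConcatDataset([real, synthetic])
    loader = DataLoader(train_set, batch_size=2, shuffle=True)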


Notes

  1. https://huggingface.co/CompVis/stable-diffusion-v1-4.
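
The checkpoint referenced in the footnote can be loaded with the Hugging Face diffusers library; the following is a minimal sketch (assumed prompt and settings, not the authors' exact generation pipeline) of turning a caption into a synthetic training image:

    # Minimal sketch: generate a synthetic image from a caption with Stable Diffusion v1-4.
    # The prompt and output path are illustrative; default sampler settings are assumed.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    caption = "a group of people riding bikes down a city street"
    image = pipe(caption).images[0]  # returns a PIL.Image
    image.save("synthetic_sample.png")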


Acknowledgements

This work has been partially supported by the European Commission under the PNRR-M4C2 (PE00000013) project “FAIR - Future Artificial Intelligence Research”, by the Horizon Europe project “European Lighthouse on Safe and Secure AI (ELSA)” (HORIZON-CL4-2021-HUMAN-01-03), co-funded by the European Union, and by the PRIN project “CREATIVE: Cross-modal understanding and generation of Visual and textual content” (CUP B87G22000460001), co-funded by the Italian Ministry of University.

Author information

Corresponding author

Correspondence to Marcella Cornia.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Caffagni, D., Barraco, M., Cornia, M., Baraldi, L., Cucchiara, R. (2023). SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14233. Springer, Cham. https://doi.org/10.1007/978-3-031-43148-7_10


  • DOI: https://doi.org/10.1007/978-3-031-43148-7_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43147-0

  • Online ISBN: 978-3-031-43148-7

  • eBook Packages: Computer Science, Computer Science (R0)
