Transformer-Based Cross-Modal Recipe Embeddings with Large Batch Training

  • Conference paper
  • In: MultiMedia Modeling (MMM 2023)

Abstract

Cross-modal recipe retrieval aims to exploit the relationships between recipe images and texts and to achieve retrieval in both directions, a task that is intuitive for humans but arduous to formulate. Although many previous works have endeavored to solve this problem, most did not exploit the cross-modal information in recipe data efficiently. In this paper, we present a frustratingly straightforward cross-modal recipe retrieval framework, the Transformer-based Network for Large Batch Training (TNLBT), which achieves high performance on both recipe retrieval and image generation tasks and is designed to exploit the rich cross-modal information efficiently. In the proposed framework, Transformer-based encoders are applied to both image and text encoding for cross-modal embedding learning. We also adopt several loss functions, such as a self-supervised learning loss on recipe text, to further promote cross-modal embedding learning. Since contrastive learning benefits from a larger batch size according to the recent literature on self-supervised learning, we adopt a large batch size during training and validate its effectiveness. The experimental results showed that TNLBT significantly outperformed the current state-of-the-art frameworks in both cross-modal recipe retrieval and image generation on the Recipe1M benchmark. We also found that CLIP-ViT performs better than ViT-B as the image encoder backbone. To our knowledge, this is the first work to confirm the effectiveness of large batch training for cross-modal recipe embedding learning.
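To make the large-batch contrastive idea in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of a symmetric InfoNCE-style loss over a batch of paired image and recipe embeddings. It is not the authors' exact TNLBT objective; the temperature, embedding dimension, and batch size are illustrative assumptions. Because every non-matching pair in the batch serves as a negative, a larger batch directly supplies more negatives, which is the rationale for large batch training.

```python
# Minimal sketch (NOT the authors' exact TNLBT loss): symmetric InfoNCE-style
# contrastive objective over a batch of image/recipe embedding pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalContrastiveLoss(nn.Module):
    """Bidirectional contrastive loss: every non-matching pair in the
    batch acts as a negative, so larger batches give more negatives."""

    def __init__(self, temperature: float = 0.07):  # temperature is an assumption
        super().__init__()
        self.temperature = temperature

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products are cosine similarities.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / self.temperature  # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
        # Image-to-recipe and recipe-to-image retrieval directions.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy usage: random tensors standing in for image-encoder (e.g. ViT/CLIP-ViT)
    # and recipe-text-encoder outputs; batch size and dimension are placeholders.
    batch_size, dim = 768, 1024
    img_emb = torch.randn(batch_size, dim)
    txt_emb = torch.randn(batch_size, dim)
    loss = CrossModalContrastiveLoss()(img_emb, txt_emb)
    print(f"contrastive loss: {loss.item():.4f}")
```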


Notes

  1. We used the open-source code at https://github.com/mseitzer/pytorch-fid; a usage sketch follows these notes.

  2. We borrow the FID scores of CHEF [17] and ACME [24] reported in X-MRS [9].
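As mentioned in note 1, FID scores were computed with the open-source pytorch-fid package. Below is a minimal, hypothetical usage sketch: the directory paths are placeholders, and the call uses the package's public calculate_fid_given_paths helper as found in recent versions, so the exact signature should be checked against the installed release.

```python
# Hypothetical usage of pytorch-fid (https://github.com/mseitzer/pytorch-fid)
# to compare a folder of real dish photos against generated ones.
# Directory paths are placeholders.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fid = calculate_fid_given_paths(
    paths=["data/real_dish_images", "data/generated_dish_images"],  # placeholder dirs
    batch_size=50,   # images per Inception-v3 forward pass
    device=device,
    dims=2048,       # pool3 features, the standard FID setting
)
print(f"FID: {fid:.2f}")  # lower is better

# Equivalent CLI:
#   python -m pytorch_fid data/real_dish_images data/generated_dish_images
```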

References

  1. Carvalho, M., Cadène, R., Picard, D., Soulier, L., Thome, N., Cord, M.: Cross-modal retrieval in the cooking context: learning semantic text-image embeddings. In: Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 35–44 (2018)

  2. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of International Conference on Machine Learning (2020)

  3. Deng, J., et al.: ImageNet: a large-scale hierarchical image database. In: Proceedings of IEEE Computer Vision and Pattern Recognition (2009)

  4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2019)

  5. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 (2020)

  6. Fu, H., Wu, R., Liu, C., Sun, J.: MCEN: bridging cross-modal gap between cooking recipes and dish images with latent variable model. In: Proceedings of IEEE Computer Vision and Pattern Recognition (2020)

  7. Goodfellow, I.J., et al.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)

  8. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)

  9. Guerrero, R., Pham, H.X., Pavlovic, V.: Cross-modal retrieval and synthesis (X-MRS): closing the modality gap in shared representation learning. In: Proceedings of ACM International Conference on Multimedia (2021)

  10. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Advances in Neural Information Processing Systems, pp. 5767–5777 (2017)

  11. Guo, W., Wang, J., Wang, S.: Deep multimodal representation learning: a survey. IEEE Access 7, 63373–63394 (2019)

  12. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. arXiv:1703.07737 (2017)

  13. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6629–6640 (2017)

  14. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of International Conference on Learning Representations (2015)

  15. Li, J., Sun, J., Xu, X., Yu, W., Shen, F.: Cross-modal image-recipe retrieval via intra- and inter-modality hybrid fusion. In: Proceedings of ACM International Conference on Multimedia Retrieval, pp. 173–182 (2021). https://doi.org/10.1145/3460426.3463618

  16. Marin, J., et al.: Recipe1M+: a dataset for learning cross-modal embeddings for cooking recipes and food images. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 187–203 (2019)

  17. Pham, H.X., Guerrero, R., Pavlovic, V., Li, J.: CHEF: cross-modal hierarchical embeddings for food domain retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2423–2430 (2021)

  18. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of International Conference on Machine Learning, vol. 139, pp. 8748–8763 (2021)

  19. Salvador, A., Gundogdu, E., Bazzani, L., Donoser, M.: Revamping cross-modal recipe retrieval with hierarchical transformers and self-supervised learning. In: Proceedings of IEEE Computer Vision and Pattern Recognition (2021)

  20. Salvador, A., et al.: Learning cross-modal embeddings for cooking recipes and food images. In: Proceedings of IEEE Computer Vision and Pattern Recognition (2017)

  21. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of IEEE Computer Vision and Pattern Recognition (2015)

  22. Sugiyama, Y., Yanai, K.: Cross-modal recipe embeddings by disentangling recipe contents and dish styles. In: Proceedings of ACM International Conference on Multimedia (2021)

  23. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  24. Wang, H., Sahoo, D., Liu, C., Lim, E., Hoi, S.C.: Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In: Proceedings of IEEE Computer Vision and Pattern Recognition, pp. 11572–11581 (2019)

  25. Wang, H., et al.: Cross-modal food retrieval: learning a joint embedding of food images and recipes with semantic consistency and attention mechanism. arXiv:2003.03955 (2020)

  26. Zan, Z., Li, L., Liu, J., Zhou, D.: Sentence-based and noise-robust cross-modal retrieval on cooking recipes and food images. In: Proceedings of the International Conference on Multimedia Retrieval, pp. 117–125 (2020)

  27. Zhu, B., Ngo, C.W., Chen, J., Hao, Y.: R2GAN: cross-modal recipe retrieval with generative adversarial network. In: Proceedings of IEEE Computer Vision and Pattern Recognition (2019)


Author information


Corresponding author

Correspondence to Keiji Yanai.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yang, J., Chen, J., Yanai, K. (2023). Transformer-Based Cross-Modal Recipe Embeddings with Large Batch Training. In: Dang-Nguyen, D.T., et al. (eds.) MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol. 13834. Springer, Cham. https://doi.org/10.1007/978-3-031-27818-1_39


  • DOI: https://doi.org/10.1007/978-3-031-27818-1_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-27817-4

  • Online ISBN: 978-3-031-27818-1

  • eBook Packages: Computer Science; Computer Science (R0)
