Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15133)

Abstract

We introduce Affective Visual Dialog, an emotion explanation and reasoning task that serves as a testbed for research on understanding constructed emotions in response to visually grounded conversations. The task involves three skills: (1) Dialog-based Question Answering, (2) Dialog-based Emotion Prediction, and (3) Affective Explanation Generation based on the dialog. Our key contribution is the collection of a large-scale dataset, dubbed AffectVisDial, consisting of 50K 10-turn visually grounded dialogs together with concluding emotion attributions and dialog-informed textual emotion explanations, amounting to a total of 27,180 working hours. Notably, the dataset spans a broad range of visual stimuli, covering human heritage and contemporary life, with an average per-turn answer length of about 12 words (5 times that of the VisDial dataset) and explanations exceeding 28 words on average. We explain the key design decisions behind the dataset collection, including the data inclusion and exclusion criteria applied to an initial pool of over 100K dialogs for quality control, and introduce the questioner and answerer tasks associated with the participants in the conversation. We propose and demonstrate solid Affective Visual Dialog baselines adapted from state-of-the-art multimodal models. Remarkably, the responses generated by our models show promising emotional reasoning abilities in response to visually grounded conversations. Our project page with the dataset is available at https://affective-visual-dialog.github.io.
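
To make the task structure described in the abstract concrete, the sketch below shows one way a single AffectVisDial example could be represented and consumed for the three subtasks. It is a minimal illustration under assumed conventions: the field names (image_id, dialog, emotion, explanation), the JSON layout, and the loader are hypothetical and are not taken from the released dataset; consult the project page for the actual format.

    # Illustrative sketch only: field names and file layout below are assumptions
    # for exposition, not the official AffectVisDial release format.
    import json
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class AffectVisDialExample:
        image_id: str                   # identifier of the visual stimulus
        dialog: List[Tuple[str, str]]   # ten (question, answer) turns
        emotion: str                    # concluding emotion attribution, e.g. "contentment"
        explanation: str                # dialog-informed textual emotion explanation

    def load_examples(path: str) -> List[AffectVisDialExample]:
        """Load examples from a hypothetical JSON file containing a list of records."""
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        return [
            AffectVisDialExample(
                image_id=r["image_id"],
                dialog=[(turn["question"], turn["answer"]) for turn in r["dialog"]],
                emotion=r["emotion"],
                explanation=r["explanation"],
            )
            for r in records
        ]

    # The three subtasks from the abstract, phrased as input/output interfaces:
    #   (1) dialog-based question answering:  (visual/caption context, dialog history, question) -> answer
    #   (2) dialog-based emotion prediction:  full dialog -> emotion label
    #   (3) affective explanation generation: full dialog + emotion -> free-form explanation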

Acknowledgement

This project is funded by KAUST BAS/1/1685-01-01 and the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence. The authors thank Jack Urbanek, Sirojiddin Karimov, and Umid Nejmatullayev for their valuable assistance in setting up the data collection. Lastly, the authors are grateful for the diligent efforts of the Amazon Mechanical Turk workers and the DeepenAI and SmartOne teams, whose contributions were indispensable to the successful completion of this work.

Author information

Corresponding author

Correspondence to Kilichbek Haydarov.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 16875 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Haydarov, K. et al. (2025). Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15133. Springer, Cham. https://doi.org/10.1007/978-3-031-73226-3_2

  • DOI: https://doi.org/10.1007/978-3-031-73226-3_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73225-6

  • Online ISBN: 978-3-031-73226-3

  • eBook Packages: Computer Science, Computer Science (R0)
