Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition

  • Research
  • Published in: Cognitive Computation

Abstract

The capabilities, reasoning, and efficiency of artificial intelligence (AI)-based tools and systems have advanced significantly. Noteworthy examples include generative AI-based large language models (LLMs) such as generative pretrained transformer 3.5 (GPT-3.5), generative pretrained transformer 4 (GPT-4), and Bard. LLMs are versatile and effective for various tasks such as composing poetry, writing code, generating essays, and solving puzzles. Until recently, LLMs could only process text-based input effectively; recent advancements, however, have enabled them to handle multimodal inputs, such as text, images, and audio, making them highly general-purpose tools. Because LLMs have achieved decent performance in pattern recognition tasks (such as classification), it is natural to ask whether general-purpose LLMs can perform comparably, or even superiorly, to specialized deep learning models (DLMs) trained specifically for a given task. In this study, we compared the performance of fine-tuned DLMs with that of general-purpose LLMs for image-based emotion recognition. We trained DLMs, namely, two convolutional neural networks (\(CNN_1\) and \(CNN_2\)), ResNet50, and VGG-16, on an image dataset for emotion recognition and then tested their performance on another dataset. Subsequently, we subjected the same test dataset to two vision-enabled LLMs (LLaVA and GPT-4). \(CNN_2\) was found to be the best-performing model, with an accuracy of 62%, whereas VGG-16 produced the lowest accuracy, at 31%. Among the LLMs, GPT-4 performed best, with an accuracy of 55.81%. LLaVA achieved a higher accuracy than the \(CNN_1\) and VGG-16 models. Other performance metrics, such as precision, recall, and F1-score, followed similar trends. Notably, GPT-4 performed the best with small datasets.
The weaker results observed for the LLMs can be attributed to their general-purpose nature: despite extensive pretraining, they may not capture the features required for specific tasks, such as emotion recognition in images, as effectively as models fine-tuned for those tasks. Although the LLMs did not surpass the specialized models, they achieved comparable performance, making them a viable option for specific tasks without additional training. In addition, LLMs can be considered a good alternative when the available dataset is small.
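As a concrete illustration of the evaluation described above, the sketch below shows how accuracy and macro-averaged precision, recall, and F1-score can be computed from predicted emotion labels in pure Python. The label set and the example predictions are hypothetical placeholders, not values from the paper's dataset.

```python
# Hypothetical emotion label set (not the paper's exact classes).
EMOTIONS = ["angry", "happy", "sad", "neutral"]

def macro_scores(y_true, y_pred, labels):
    """Per-class precision/recall/F1, averaged equally over classes (macro)."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy ground-truth and model predictions for illustration only.
y_true = ["happy", "sad", "happy", "angry", "neutral", "happy"]
y_pred = ["happy", "sad", "sad",   "angry", "happy",   "happy"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
prec, rec, f1 = macro_scores(y_true, y_pred, EMOTIONS)
print(f"accuracy={accuracy:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Macro averaging weights every emotion class equally, which matters for imbalanced test sets; in practice, a library routine such as scikit-learn's `precision_recall_fscore_support` would typically be used instead of hand-rolled loops.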



Data Availability

The dataset used in the current study is publicly available.


Acknowledgements

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-RP23058).

Author information

Authors and Affiliations

Authors

Contributions

Mohammad Nadeem: conceptualization, data curation, formal analysis, methodology and writing — original draft. Shahab Saquib Sohail: conceptualization, formal analysis, methodology, project administration and writing — original draft. Laeeba Javed: methodology, project administration, resources, writing — original draft. Faisal Anwer: data curation, formal analysis and visualization. Abdul Khader Jilani Saudagar: visualization, supervision and writing — review and editing. Khan Muhammad: conceptualization, supervision, project administration, writing — review and editing.

Corresponding author

Correspondence to Khan Muhammad.

Ethics declarations

Ethics Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Informed consent was not required as no humans or animals were involved.

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nadeem, M., Sohail, S.S., Javed, L. et al. Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition. Cogn Comput 16, 2566–2579 (2024). https://doi.org/10.1007/s12559-024-10281-5

