Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition

  • Research
  • Published in: Cognitive Computation

Abstract

The capabilities, reasoning, and efficiency of artificial intelligence (AI)-based tools and systems have advanced significantly. Noteworthy examples include generative AI-based large language models (LLMs) such as generative pretrained transformer 3.5 (GPT-3.5), generative pretrained transformer 4 (GPT-4), and Bard. LLMs are versatile and effective for various tasks such as composing poetry, writing code, generating essays, and solving puzzles. Until recently, LLMs could only process text-based input effectively; recent advancements, however, have enabled them to handle multimodal inputs, such as text, images, and audio, making them highly general-purpose tools. Because LLMs have achieved decent performance in pattern recognition tasks (such as classification), it is natural to ask whether general-purpose LLMs can perform comparably, or even superiorly, to specialized deep learning models (DLMs) trained specifically for a given task. In this study, we compared the performance of fine-tuned DLMs with that of general-purpose LLMs for image-based emotion recognition. We trained DLMs, namely, two convolutional neural networks (\(CNN_1\) and \(CNN_2\)), ResNet50, and VGG-16, on an image dataset for emotion recognition and then tested their performance on another dataset. Subsequently, we subjected the same test dataset to two vision-enabled LLMs (LLaVA and GPT-4). \(CNN_2\) was found to be the best-performing model, with an accuracy of 62%, whereas VGG-16 produced the lowest accuracy, at 31%. Among the LLMs, GPT-4 performed best, with an accuracy of 55.81%. LLaVA achieved a higher accuracy than the \(CNN_1\) and VGG-16 models. Other performance metrics, such as precision, recall, and F1-score, followed similar trends. Notably, GPT-4 performed the best with small datasets.
The weaker results observed for the LLMs can be attributed to their general-purpose nature: despite extensive pretraining, they may not capture the features required for specific tasks, such as emotion recognition in images, as effectively as models fine-tuned for those tasks. Although the LLMs did not surpass the specialized models, they achieved comparable performance, making them a viable option for specific tasks without additional training. In addition, LLMs can be considered a good alternative when the available dataset is small.
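As a concrete illustration of the evaluation described above, the sketch below shows how accuracy and macro-averaged precision, recall, and F1-score can be computed from predicted emotion labels in pure Python. The label set and the example predictions are hypothetical placeholders, not values from the paper's dataset.

```python
# Hypothetical emotion label set (not the paper's exact classes).
EMOTIONS = ["angry", "happy", "sad", "neutral"]

def macro_scores(y_true, y_pred, labels):
    """Per-class precision/recall/F1, averaged equally over classes (macro)."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy ground-truth and model predictions for illustration only.
y_true = ["happy", "sad", "happy", "angry", "neutral", "happy"]
y_pred = ["happy", "sad", "sad",   "angry", "happy",   "happy"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
prec, rec, f1 = macro_scores(y_true, y_pred, EMOTIONS)
print(f"accuracy={accuracy:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Macro averaging weights every emotion class equally, which matters for imbalanced test sets; in practice, a library routine such as scikit-learn's `precision_recall_fscore_support` would typically be used instead of hand-rolled loops.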



Data Availability

The dataset used in the current study is publicly available.


Acknowledgements

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-RP23058).

Author information

Authors and Affiliations

Authors

Contributions

Mohammad Nadeem: conceptualization, data curation, formal analysis, methodology and writing — original draft. Shahab Saquib Sohail: conceptualization, formal analysis, methodology, project administration and writing — original draft. Laeeba Javed: methodology, project administration, resources, writing — original draft. Faisal Anwer: data curation, formal analysis and visualization. Abdul Khader Jilani Saudagar: visualization, supervision and writing — review and editing. Khan Muhammad: conceptualization, supervision, project administration, writing — review and editing.

Corresponding author

Correspondence to Khan Muhammad.

Ethics declarations

Ethics Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Informed consent was not required as no humans or animals were involved.

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nadeem, M., Sohail, S.S., Javed, L. et al. Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition. Cogn Comput 16, 2566–2579 (2024). https://doi.org/10.1007/s12559-024-10281-5

