Abstract
Research on affective computing in the wild is underpinned by databases, yet existing multimodal emotion databases collected under real-world conditions are few and small, cover a limited number of subjects, and are recorded in a single language. To address this gap, we collected, annotated, and prepared to release a new natural-state video database, HEU Emotion. HEU Emotion contains 19,004 video clips in total, divided into two parts according to data source. The first part contains videos downloaded from Tumblr, Google, and Giphy, covering 10 emotions and two modalities (facial expression and body posture). The second part contains clips extracted manually from movies, TV series, and variety shows, covering 10 emotions and three modalities (facial expression, body posture, and emotional speech). With 9,951 subjects, HEU Emotion is by far the most extensive multimodal emotion database. To provide a benchmark for emotion recognition, we evaluated HEU Emotion with a range of conventional machine learning and deep learning methods. We also propose a multimodal attention module that fuses multimodal features adaptively. After multimodal fusion, the recognition accuracies on the two parts increased by 2.19% and 4.01%, respectively, over single-modal facial expression recognition.
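The exact design of the multimodal attention module is given in the full paper, not in this abstract. Purely as a hedged illustration, the sketch below shows one common form such adaptive fusion can take: a small gating network scores each modality's feature vector, the scores are softmax-normalized across modalities, and the fused representation is the attention-weighted sum. The class name, dimensions, and gating scheme here are assumptions for illustration, not the authors' actual module.

```python
# Hypothetical sketch of attention-based multimodal fusion (NOT the
# paper's actual module): a shared gating MLP scores each modality's
# feature vector, scores are softmax-normalized across modalities,
# and the fused feature is the attention-weighted sum.
import torch
import torch.nn as nn

class MultimodalAttentionFusion(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        # Gating MLP: one scalar score per modality feature.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.Tanh(),
            nn.Linear(feat_dim // 2, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modalities, feat_dim),
        # e.g. face / body / speech features stacked along dim 1.
        scores = self.score(feats)              # (batch, M, 1)
        weights = torch.softmax(scores, dim=1)  # attention over modalities
        fused = (weights * feats).sum(dim=1)    # (batch, feat_dim)
        return fused

# Example: fuse three modality features of dimension 512.
fusion = MultimodalAttentionFusion(feat_dim=512)
x = torch.randn(8, 3, 512)
print(fusion(x).shape)  # torch.Size([8, 512])
```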
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Chen, J., Wang, C., Wang, K. et al. HEU Emotion: a large-scale database for multimodal emotion recognition in the wild. Neural Comput & Applic 33, 8669–8685 (2021). https://doi.org/10.1007/s00521-020-05616-w