DOI: 10.1145/3136755.3143005

Multi-modal emotion recognition using semi-supervised learning and multiple neural networks in the wild

Published: 03 November 2017

Abstract

Human emotion recognition is a research topic that receives continuous attention in the computer vision and artificial intelligence communities. This paper proposes a method for classifying human emotions with multiple neural networks operating on multi-modal signals, consisting of image, facial landmark, and audio data captured in the wild. The proposed method has the following features. First, the learning performance of the image-based network is greatly improved by combining multi-task learning with semi-supervised learning that exploits the spatio-temporal structure of videos. Second, a new model that converts one-dimensional (1D) facial landmark information into two-dimensional (2D) images is proposed, together with a CNN-LSTM network built on this representation for better emotion recognition. Third, based on the observation that audio signals are often highly effective for specific emotions, we propose an audio deep learning mechanism that is robust for those emotions. Finally, a so-called emotion adaptive fusion is applied so that the multiple networks reinforce one another. With its fifth submission to the EmotiW2017 challenge test set, the proposed method achieved a classification accuracy of 57.12%.
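Two of the abstract's ideas lend themselves to a concrete illustration: rasterizing 1D facial-landmark coordinates into 2D images that a CNN-LSTM can consume, and fusing per-network class scores with emotion-dependent weights. The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation; the function names (landmarks_to_image, adaptive_fusion), the 64x64 image size, and the fusion weights are hypothetical.

```python
# Hypothetical sketch of two ideas from the abstract (not the authors' code):
# (1) rasterize 1D (x, y) landmark coordinates into a 2D image per frame,
# (2) "emotion adaptive" fusion, approximated here as per-(network, class)
#     weighted averaging of softmax outputs.
import numpy as np

def landmarks_to_image(landmarks, size=64):
    """Rasterize one frame of (x, y) landmark coordinates into a 2D image.

    landmarks: array of shape (n_points, 2), coordinates in any scale.
    Returns a (size, size) float32 image with 1.0 at landmark positions.
    """
    img = np.zeros((size, size), dtype=np.float32)
    pts = np.asarray(landmarks, dtype=np.float32)
    # Normalize coordinates to [0, 1] per frame, then scale to the pixel grid.
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    norm = (pts - mins) / np.maximum(maxs - mins, 1e-6)
    xy = np.clip((norm * (size - 1)).astype(int), 0, size - 1)
    img[xy[:, 1], xy[:, 0]] = 1.0  # row index is y, column index is x
    return img

def adaptive_fusion(scores, class_weights):
    """Fuse per-network class scores with per-(network, class) weights.

    scores: (n_networks, n_classes) softmax outputs for one clip.
    class_weights: (n_networks, n_classes), e.g. tuned so the audio
    network dominates for the emotions it predicts reliably.
    """
    fused = (scores * class_weights).sum(axis=0)
    return fused / fused.sum()  # renormalize to a probability vector

# Toy usage: 3 networks (image, landmark, audio) over 7 emotion classes.
rng = np.random.default_rng(0)
frame_img = landmarks_to_image(rng.random((68, 2)))     # one landmark frame
clip_scores = rng.dirichlet(np.ones(7), size=3)         # fake softmax outputs
weights = np.ones((3, 7))
weights[2, [0, 5]] = 2.0  # boost the audio network on two classes
print(adaptive_fusion(clip_scores, weights))
```

In a full pipeline one would stack such landmark images frame by frame into a sequence for the CNN-LSTM, and the per-class fusion weights would encode the observation that the audio network is most reliable for specific emotions.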




Information & Contributors

    Published In

    ICMI '17: Proceedings of the 19th ACM International Conference on Multimodal Interaction
    November 2017
    676 pages
    ISBN:9781450355438
    DOI:10.1145/3136755
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 November 2017


    Author Tags

    1. EmotiW 2017 challenge
    2. Emotion recognition
    3. Multi modal signal
    4. Multi-task learning
    5. Semi-supervised learning

    Qualifiers

    • Research-article

    Funding Sources

• Ministry of Trade, Industry & Energy (MI, Korea)

    Conference

    ICMI '17

    Acceptance Rates

ICMI '17 Paper Acceptance Rate: 65 of 149 submissions, 44%
Overall Acceptance Rate: 453 of 1,080 submissions, 42%

Bibliometrics

    Article Metrics

• Downloads (Last 12 months): 31
• Downloads (Last 6 weeks): 1
    Reflects downloads up to 19 Feb 2025

Cited By
• (2024) Systematic Review of Emotion Detection with Computer Vision and Deep Learning. Sensors, 24:11 (3484). DOI: 10.3390/s24113484. Online publication date: 28-May-2024.
• (2023) Frustration Recognition Using Spatio-Temporal Data: A Novel Dataset and GCN Model to Recognize In-Vehicle Frustration. IEEE Transactions on Affective Computing, 14:4 (2864-2875). DOI: 10.1109/TAFFC.2022.3229263. Online publication date: 1-Oct-2023.
• (2023) Multitask Learning From Augmented Auxiliary Data for Improving Speech Emotion Recognition. IEEE Transactions on Affective Computing, 14:4 (3164-3176). DOI: 10.1109/TAFFC.2022.3221749. Online publication date: 1-Oct-2023.
• (2023) Open Datasets in Human Activity Recognition Research—Issues and Challenges: A Review. IEEE Sensors Journal, 23:22 (26952-26980). DOI: 10.1109/JSEN.2023.3317645. Online publication date: 15-Nov-2023.
• (2023) Toward Label-Efficient Emotion and Sentiment Analysis. Proceedings of the IEEE, 111:10 (1159-1197). DOI: 10.1109/JPROC.2023.3309299. Online publication date: Oct-2023.
• (2023) Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (22814-22825). DOI: 10.1109/CVPR52729.2023.02185. Online publication date: Jun-2023.
• (2023) An intelligent experimental container suite: using a chemical experiment with virtual-real fusion as an example. Virtual Reality & Intelligent Hardware, 5:4 (317-337). DOI: 10.1016/j.vrih.2022.07.008. Online publication date: Aug-2023.
• (2022) OGSSL: A Semi-Supervised Classification Model Coupled With Optimal Graph Learning for EEG Emotion Recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 30 (1288-1297). DOI: 10.1109/TNSRE.2022.3175464. Online publication date: 2022.
• (2022) Deep Facial Expression Recognition: A Survey. IEEE Transactions on Affective Computing, 13:3 (1195-1215). DOI: 10.1109/TAFFC.2020.2981446. Online publication date: 1-Jul-2022.
• (2022) Relational structure predictive neural architecture search for multimodal fusion. Soft Computing. DOI: 10.1007/s00500-022-06772-y. Online publication date: 3-Feb-2022.
