DOI: 10.1145/3136755.3143005

Multi-modal emotion recognition using semi-supervised learning and multiple neural networks in the wild

Published: 03 November 2017

Abstract

Human emotion recognition is a research topic that receives continuous attention in the computer vision and artificial intelligence communities. This paper proposes a method for classifying human emotions with multiple neural networks operating on multi-modal signals, consisting of image, facial landmark, and audio data captured in the wild. The proposed method has the following features. First, the learning performance of the image-based network is greatly improved by combining multi-task learning with semi-supervised learning that exploits the spatio-temporal structure of videos. Second, a new model that converts one-dimensional (1D) facial landmark information into two-dimensional (2D) images is proposed, together with a CNN-LSTM network built on this representation for better emotion recognition. Third, based on the observation that audio signals are often highly effective for specific emotions, we propose an audio deep learning mechanism that is robust for those emotions. Finally, a so-called emotion adaptive fusion is applied so that the multiple networks reinforce one another. With its fifth submission to the EmotiW2017 challenge test set, the proposed method achieved a classification accuracy of 57.12%.
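Two of the abstract's ideas lend themselves to a concrete illustration: rasterizing 1D facial-landmark coordinates into 2D images that a CNN-LSTM can consume, and fusing per-network class scores with emotion-dependent weights. The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation; the function names (landmarks_to_image, adaptive_fusion), the 64x64 image size, and the fusion weights are hypothetical.

```python
# Hypothetical sketch of two ideas from the abstract (not the authors' code):
# (1) rasterize 1D (x, y) landmark coordinates into a 2D image per frame,
# (2) "emotion adaptive" fusion, approximated here as per-(network, class)
#     weighted averaging of softmax outputs.
import numpy as np

def landmarks_to_image(landmarks, size=64):
    """Rasterize one frame of (x, y) landmark coordinates into a 2D image.

    landmarks: array of shape (n_points, 2), coordinates in any scale.
    Returns a (size, size) float32 image with 1.0 at landmark positions.
    """
    img = np.zeros((size, size), dtype=np.float32)
    pts = np.asarray(landmarks, dtype=np.float32)
    # Normalize coordinates to [0, 1] per frame, then scale to the pixel grid.
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    norm = (pts - mins) / np.maximum(maxs - mins, 1e-6)
    xy = np.clip((norm * (size - 1)).astype(int), 0, size - 1)
    img[xy[:, 1], xy[:, 0]] = 1.0  # row index is y, column index is x
    return img

def adaptive_fusion(scores, class_weights):
    """Fuse per-network class scores with per-(network, class) weights.

    scores: (n_networks, n_classes) softmax outputs for one clip.
    class_weights: (n_networks, n_classes), e.g. tuned so the audio
    network dominates for the emotions it predicts reliably.
    """
    fused = (scores * class_weights).sum(axis=0)
    return fused / fused.sum()  # renormalize to a probability vector

# Toy usage: 3 networks (image, landmark, audio) over 7 emotion classes.
rng = np.random.default_rng(0)
frame_img = landmarks_to_image(rng.random((68, 2)))     # one landmark frame
clip_scores = rng.dirichlet(np.ones(7), size=3)         # fake softmax outputs
weights = np.ones((3, 7))
weights[2, [0, 5]] = 2.0  # boost the audio network on two classes
print(adaptive_fusion(clip_scores, weights))
```

In a full pipeline one would stack such landmark images frame by frame into a sequence for the CNN-LSTM, and the per-class fusion weights would encode the observation that the audio network is most reliable for specific emotions.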




Information & Contributors

    Published In

    ICMI '17: Proceedings of the 19th ACM International Conference on Multimodal Interaction
    November 2017
    676 pages
    ISBN:9781450355438
    DOI:10.1145/3136755
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 November 2017


    Author Tags

    1. EmotiW 2017 challenge
    2. Emotion recognition
    3. Multi modal signal
    4. Multi-task learning
    5. Semi-supervised learning

    Qualifiers

    • Research-article

    Funding Sources

• Ministry of Trade, Industry & Energy (MI, Korea)

    Conference

    ICMI '17

    Acceptance Rates

ICMI '17 Paper Acceptance Rate: 65 of 149 submissions, 44%
Overall Acceptance Rate: 453 of 1,080 submissions, 42%

Bibliometrics

    Article Metrics

• Downloads (Last 12 months): 31
• Downloads (Last 6 weeks): 1
    Reflects downloads up to 19 Feb 2025

Cited By
• (2024) Systematic Review of Emotion Detection with Computer Vision and Deep Learning. Sensors, 24:11 (3484). DOI: 10.3390/s24113484. Online publication date: 28-May-2024.
• (2023) Frustration Recognition Using Spatio-Temporal Data: A Novel Dataset and GCN Model to Recognize In-Vehicle Frustration. IEEE Transactions on Affective Computing, 14:4 (2864-2875). DOI: 10.1109/TAFFC.2022.3229263. Online publication date: 1-Oct-2023.
• (2023) Multitask Learning From Augmented Auxiliary Data for Improving Speech Emotion Recognition. IEEE Transactions on Affective Computing, 14:4 (3164-3176). DOI: 10.1109/TAFFC.2022.3221749. Online publication date: 1-Oct-2023.
• (2023) Open Datasets in Human Activity Recognition Research—Issues and Challenges: A Review. IEEE Sensors Journal, 23:22 (26952-26980). DOI: 10.1109/JSEN.2023.3317645. Online publication date: 15-Nov-2023.
• (2023) Toward Label-Efficient Emotion and Sentiment Analysis. Proceedings of the IEEE, 111:10 (1159-1197). DOI: 10.1109/JPROC.2023.3309299. Online publication date: Oct-2023.
• (2023) Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (22814-22825). DOI: 10.1109/CVPR52729.2023.02185. Online publication date: Jun-2023.
• (2023) An intelligent experimental container suite: using a chemical experiment with virtual-real fusion as an example. Virtual Reality & Intelligent Hardware, 5:4 (317-337). DOI: 10.1016/j.vrih.2022.07.008. Online publication date: Aug-2023.
• (2022) OGSSL: A Semi-Supervised Classification Model Coupled With Optimal Graph Learning for EEG Emotion Recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 30 (1288-1297). DOI: 10.1109/TNSRE.2022.3175464. Online publication date: 2022.
• (2022) Deep Facial Expression Recognition: A Survey. IEEE Transactions on Affective Computing, 13:3 (1195-1215). DOI: 10.1109/TAFFC.2020.2981446. Online publication date: 1-Jul-2022.
• (2022) Relational structure predictive neural architecture search for multimodal fusion. Soft Computing. DOI: 10.1007/s00500-022-06772-y. Online publication date: 3-Feb-2022.
