DOI: 10.1145/3136755.3143016

Emotion recognition with multimodal features and temporal models

Published: 03 November 2017

Abstract

This paper presents our methods for the Audio-Video Based Emotion Recognition subtask of the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task is to predict one of seven basic emotions for short video segments. We extract different features from the audio and facial-expression modalities, and we explore a temporal LSTM model that takes per-frame facial features as input, which improves on the non-temporal model. Fusing the modality features with the temporal model achieves 58.5% accuracy on the test set, demonstrating the effectiveness of our methods.
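
The temporal-model idea can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example (not the authors' implementation): an LSTM classifies a video from per-frame facial features, and its scores are combined with audio-side scores by weighted late fusion. The feature dimension, hidden size, and fusion weights are illustrative assumptions.

# Minimal sketch (assumed details, not the paper's code): an LSTM over
# per-frame facial features for 7-class emotion prediction, with a simple
# weighted late fusion of per-modality class scores.
import torch
import torch.nn as nn

NUM_EMOTIONS = 7  # the seven basic emotion classes used in EmotiW

class FrameLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=128):  # dims are assumptions
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, NUM_EMOTIONS)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) facial features per frame
        _, (h_n, _) = self.lstm(frame_feats)
        return self.head(h_n[-1])  # classify from the final hidden state

def late_fusion(logits_per_modality, weights):
    # Weighted sum of per-modality class probabilities (weights are assumed).
    probs = [torch.softmax(l, dim=-1) for l in logits_per_modality]
    return sum(w * p for w, p in zip(weights, probs))

# Usage with random stand-in features: 4 videos, 16 frames, 256-dim features.
model = FrameLSTMClassifier()
video_logits = model(torch.randn(4, 16, 256))
audio_logits = torch.randn(4, NUM_EMOTIONS)  # stand-in for an audio model
fused = late_fusion([video_logits, audio_logits], weights=[0.6, 0.4])
predictions = fused.argmax(dim=-1)  # one of the seven emotions per video

Weighted late fusion of class scores is one common way to combine modalities; the paper's actual fusion scheme and feature extractors are described in the full text.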




      Published In

      ICMI '17: Proceedings of the 19th ACM International Conference on Multimodal Interaction
      November 2017
      676 pages
      ISBN:9781450355438
      DOI:10.1145/3136755
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 03 November 2017


      Author Tags

      1. CNN
      2. Emotion Recognition
      3. LSTM
      4. Multimodal Features

      Qualifiers

      • Research-article

      Conference

ICMI '17

      Acceptance Rates

ICMI '17 Paper Acceptance Rate: 65 of 149 submissions, 44%
Overall Acceptance Rate: 453 of 1,080 submissions, 42%

Article Metrics

• Downloads (last 12 months): 14
• Downloads (last 6 weeks): 3
Reflects downloads up to 10 Nov 2024

Cited By
• (2024) Multimodal emotion recognition model via hybrid model with improved feature level fusion on facial and EEG feature set. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-024-19171-2. Online publication date: 26-Apr-2024.
• (2023) A joint hierarchical cross-attention graph convolutional network for multi-modal facial expression recognition. Computational Intelligence 40(1). https://doi.org/10.1111/coin.12607. Online publication date: 25-Oct-2023.
• (2023) MTTM: Metamorphic Testing for Textual Content Moderation Software. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2387-2399. https://doi.org/10.1109/ICSE48619.2023.00200. Online publication date: May-2023.
• (2023) An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 1339-1351. https://doi.org/10.1109/ASE56229.2023.00189. Online publication date: 11-Sep-2023.
• (2022) A Comparison of Machine Learning Algorithms and Feature Sets for Automatic Vocal Emotion Recognition in Speech. Sensors 22(19), 7561. https://doi.org/10.3390/s22197561. Online publication date: 6-Oct-2022.
• (2021) User-generated video emotion recognition based on key frames. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-020-10203-1. Online publication date: 22-Jan-2021.
• (2019) Extracting Audio-Visual Features for Emotion Recognition Through Active Feature Selection. 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 1-5. https://doi.org/10.1109/GlobalSIP45357.2019.8969360. Online publication date: Nov-2019.
