DOI: 10.1145/3136755.3143016

Emotion recognition with multimodal features and temporal models

Published: 03 November 2017

Abstract

This paper presents our methods for the Audio-Video Based Emotion Recognition subtask of the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task is to predict one of seven basic emotions for short video segments. We extract different features from the audio and facial-expression modalities, and we explore a temporal LSTM model that takes per-frame facial features as input, which improves on the non-temporal model. Fusing the modality features with the temporal model achieves 58.5% accuracy on the test set, demonstrating the effectiveness of our methods.
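
The temporal-model idea can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example (not the authors' implementation): an LSTM classifies a video from per-frame facial features, and its scores are combined with audio-side scores by weighted late fusion. The feature dimension, hidden size, and fusion weights are illustrative assumptions.

# Minimal sketch (assumed details, not the paper's code): an LSTM over
# per-frame facial features for 7-class emotion prediction, with a simple
# weighted late fusion of per-modality class scores.
import torch
import torch.nn as nn

NUM_EMOTIONS = 7  # the seven basic emotion classes used in EmotiW

class FrameLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=128):  # dims are assumptions
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, NUM_EMOTIONS)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) facial features per frame
        _, (h_n, _) = self.lstm(frame_feats)
        return self.head(h_n[-1])  # classify from the final hidden state

def late_fusion(logits_per_modality, weights):
    # Weighted sum of per-modality class probabilities (weights are assumed).
    probs = [torch.softmax(l, dim=-1) for l in logits_per_modality]
    return sum(w * p for w, p in zip(weights, probs))

# Usage with random stand-in features: 4 videos, 16 frames, 256-dim features.
model = FrameLSTMClassifier()
video_logits = model(torch.randn(4, 16, 256))
audio_logits = torch.randn(4, NUM_EMOTIONS)  # stand-in for an audio model
fused = late_fusion([video_logits, audio_logits], weights=[0.6, 0.4])
predictions = fused.argmax(dim=-1)  # one of the seven emotions per video

Weighted late fusion of class scores is one common way to combine modalities; the paper's actual fusion scheme and feature extractors are described in the full text.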




      Published In

      ICMI '17: Proceedings of the 19th ACM International Conference on Multimodal Interaction
      November 2017
      676 pages
      ISBN:9781450355438
      DOI:10.1145/3136755
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 03 November 2017


      Author Tags

      1. CNN
      2. Emotion Recognition
      3. LSTM
      4. Multimodal Features

      Qualifiers

      • Research-article

      Conference

ICMI '17

      Acceptance Rates

ICMI '17 Paper Acceptance Rate: 65 of 149 submissions, 44%
Overall Acceptance Rate: 453 of 1,080 submissions, 42%

Article Metrics

• Downloads (last 12 months): 14
• Downloads (last 6 weeks): 3
Reflects downloads up to 10 Nov 2024

Cited By
• (2024) Multimodal emotion recognition model via hybrid model with improved feature level fusion on facial and EEG feature set. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-024-19171-2. Online publication date: 26-Apr-2024.
• (2023) A joint hierarchical cross-attention graph convolutional network for multi-modal facial expression recognition. Computational Intelligence 40(1). https://doi.org/10.1111/coin.12607. Online publication date: 25-Oct-2023.
• (2023) MTTM: Metamorphic Testing for Textual Content Moderation Software. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2387-2399. https://doi.org/10.1109/ICSE48619.2023.00200. Online publication date: May-2023.
• (2023) An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 1339-1351. https://doi.org/10.1109/ASE56229.2023.00189. Online publication date: 11-Sep-2023.
• (2022) A Comparison of Machine Learning Algorithms and Feature Sets for Automatic Vocal Emotion Recognition in Speech. Sensors 22(19), 7561. https://doi.org/10.3390/s22197561. Online publication date: 6-Oct-2022.
• (2021) User-generated video emotion recognition based on key frames. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-020-10203-1. Online publication date: 22-Jan-2021.
• (2019) Extracting Audio-Visual Features for Emotion Recognition Through Active Feature Selection. 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 1-5. https://doi.org/10.1109/GlobalSIP45357.2019.8969360. Online publication date: Nov-2019.
