
Learning supervised scoring ensemble for emotion recognition in the wild

Published: 03 November 2017
DOI: 10.1145/3136755.3143009

Abstract

State-of-the-art approaches in previous Emotion Recognition in the Wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that increasing the depth or width of a CNN usually brings improved prediction accuracy, existing top approaches provide supervision only at the output feature layer, resulting in insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. First, we extend the recent idea of deep supervision to the emotion recognition problem: by adding supervision not only to deep layers but also to intermediate and shallow layers, the training of deep CNNs is considerably eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse, complementary feature layers are concatenated and used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show that the proposed learning method brings large, consistent accuracy gains across diverse backbone networks. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, surpassing all previously reported results.
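
To make the abstract's two ideas concrete, below is a minimal PyTorch-style sketch of the training scheme it describes: auxiliary class-wise scoring branches attached at shallow, intermediate, and deep layers, each with its own supervision, whose score vectors are concatenated and fed to a second-level classifier. The toy backbone, layer sizes, seven-class output, and equal loss weighting are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

NUM_CLASSES = 7  # assumption: seven basic emotion categories

class ScoringBranch(nn.Module):
    # Auxiliary head mapping an intermediate feature map to class-wise scores.
    def __init__(self, in_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, NUM_CLASSES)

    def forward(self, x):
        return self.fc(self.pool(x).flatten(1))

class SSENet(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy three-stage backbone standing in for any deep CNN.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU())
        # One supervised scoring branch at a shallow, an intermediate,
        # and a deep layer.
        self.branches = nn.ModuleList(ScoringBranch(c) for c in (32, 64, 128))
        # Second-level supervision over the concatenated class-wise scores.
        self.fusion = nn.Linear(3 * NUM_CLASSES, NUM_CLASSES)

    def forward(self, x):
        feats = []
        for stage in (self.stage1, self.stage2, self.stage3):
            x = stage(x)
            feats.append(x)
        scores = [branch(f) for branch, f in zip(self.branches, feats)]
        fused = self.fusion(torch.cat(scores, dim=1))
        return scores, fused

# Each branch and the fusion head receive their own loss, so supervision
# reaches shallow and intermediate layers directly rather than only the
# output layer.
model = SSENet()
criterion = nn.CrossEntropyLoss()
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (4,))
branch_scores, fused_scores = model(images)
loss = criterion(fused_scores, labels) + sum(criterion(s, labels) for s in branch_scores)

The branch placements, loss weighting, and backbone here are placeholders; the paper attaches such scoring blocks to specific layers of its backbone networks.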


Information

Published In

ICMI '17: Proceedings of the 19th ACM International Conference on Multimodal Interaction
November 2017
676 pages
ISBN: 9781450355438
DOI: 10.1145/3136755


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. Convolutional Neural Networks
  2. Deep Learning
  3. EmotiW 2017 Challenge
  4. Emotion Recognition
  5. Supervised Learning

Qualifiers

  • Research-article

Conference

ICMI '17

Acceptance Rates

ICMI '17 Paper Acceptance Rate: 65 of 149 submissions, 44%
Overall Acceptance Rate: 453 of 1,080 submissions, 42%



Citations

Cited By

  • (2024) Accelerate Evolution Strategy by Proximal Policy Optimization. Proceedings of the Genetic and Evolutionary Computation Conference, 1064–1072. DOI: 10.1145/3638529.3654090. Online publication date: 14-Jul-2024.
  • (2024) Using Emotion Analysis, Eye tracking, and Head Movement to Monitor Student Engagement among ESL Students with Facial Recognition Algorithm (Mediapipe). 2024 7th International Conference on Advanced Algorithms and Control Engineering (ICAACE), 509–513. DOI: 10.1109/ICAACE61206.2024.10548871. Online publication date: 1-Mar-2024.
  • (2024) Adaptively Enhancing Facial Expression Crucial Regions via a Local Non-local Joint Network. Machine Intelligence Research 21(2), 331–348. DOI: 10.1007/s11633-023-1417-9. Online publication date: 11-Jan-2024.
  • (2024) Survey of deep emotion recognition in dynamic data using facial, speech and textual cues. Multimedia Tools and Applications 83(25), 66223–66262. DOI: 10.1007/s11042-023-17944-9. Online publication date: 22-Jan-2024.
  • (2024) Feature fusion for human compound emotion recognition: a fusion of facial expression texture and action unit data. Pattern Analysis and Applications 27(4). DOI: 10.1007/s10044-024-01369-7. Online publication date: 14-Nov-2024.
  • (2023) An Assessment of In-the-Wild Datasets for Multimodal Emotion Recognition. Sensors 23(11), 5184. DOI: 10.3390/s23115184. Online publication date: 30-May-2023.
  • (2023) LASTNet: A Swin Transformer with LANets Network for Video emotion recognition. Proceedings of the 4th International Conference on Artificial Intelligence and Computer Engineering, 291–294. DOI: 10.1145/3652628.3652676. Online publication date: 17-Nov-2023.
  • (2023) Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition. IEEE Transactions on Affective Computing 14(4), 3231–3243. DOI: 10.1109/TAFFC.2023.3258900. Online publication date: 1-Oct-2023.
  • (2023) Spatial-Temporal Graphs Plus Transformers for Geometry-Guided Facial Expression Recognition. IEEE Transactions on Affective Computing 14(4), 2751–2767. DOI: 10.1109/TAFFC.2022.3181736. Online publication date: 1-Oct-2023.
  • (2023) Impact of Facial Landmark Localization on Facial Expression Recognition. IEEE Transactions on Affective Computing 14(2), 1267–1279. DOI: 10.1109/TAFFC.2021.3124142. Online publication date: 1-Apr-2023.
