
Learning supervised scoring ensemble for emotion recognition in the wild

Published: 03 November 2017
DOI: 10.1145/3136755.3143009

Abstract

State-of-the-art approaches in previous Emotion Recognition in the Wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that increasing the depth or width of a CNN usually brings improved prediction accuracy, existing top approaches provide supervision only at the output feature layer, resulting in insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. First, we extend the recent idea of deep supervision to the emotion recognition problem: by adding supervision not only to deep layers but also to intermediate and shallow layers, the training of deep CNNs is considerably eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse, complementary feature layers are concatenated and used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show that the proposed learning method brings large, consistent accuracy gains across diverse backbone networks. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, surpassing all previously reported results.
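
To make the abstract's two ideas concrete, below is a minimal PyTorch-style sketch of the training scheme it describes: auxiliary class-wise scoring branches attached at shallow, intermediate, and deep layers, each with its own supervision, whose score vectors are concatenated and fed to a second-level classifier. The toy backbone, layer sizes, seven-class output, and equal loss weighting are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

NUM_CLASSES = 7  # assumption: seven basic emotion categories

class ScoringBranch(nn.Module):
    # Auxiliary head mapping an intermediate feature map to class-wise scores.
    def __init__(self, in_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, NUM_CLASSES)

    def forward(self, x):
        return self.fc(self.pool(x).flatten(1))

class SSENet(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy three-stage backbone standing in for any deep CNN.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU())
        # One supervised scoring branch at a shallow, an intermediate,
        # and a deep layer.
        self.branches = nn.ModuleList(ScoringBranch(c) for c in (32, 64, 128))
        # Second-level supervision over the concatenated class-wise scores.
        self.fusion = nn.Linear(3 * NUM_CLASSES, NUM_CLASSES)

    def forward(self, x):
        feats = []
        for stage in (self.stage1, self.stage2, self.stage3):
            x = stage(x)
            feats.append(x)
        scores = [branch(f) for branch, f in zip(self.branches, feats)]
        fused = self.fusion(torch.cat(scores, dim=1))
        return scores, fused

# Each branch and the fusion head receive their own loss, so supervision
# reaches shallow and intermediate layers directly rather than only the
# output layer.
model = SSENet()
criterion = nn.CrossEntropyLoss()
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (4,))
branch_scores, fused_scores = model(images)
loss = criterion(fused_scores, labels) + sum(criterion(s, labels) for s in branch_scores)

The branch placements, loss weighting, and backbone here are placeholders; the paper attaches such scoring blocks to specific layers of its backbone networks.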


Information

Published In

ICMI '17: Proceedings of the 19th ACM International Conference on Multimodal Interaction
November 2017
676 pages
ISBN: 9781450355438
DOI: 10.1145/3136755


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. Convolutional Neural Networks
  2. Deep Learning
  3. EmotiW 2017 Challenge
  4. Emotion Recognition
  5. Supervised Learning

Qualifiers

  • Research-article

Conference

ICMI '17

Acceptance Rates

ICMI '17 Paper Acceptance Rate: 65 of 149 submissions, 44%
Overall Acceptance Rate: 453 of 1,080 submissions, 42%



Citations

Cited By

  • (2024) Accelerate Evolution Strategy by Proximal Policy Optimization. Proceedings of the Genetic and Evolutionary Computation Conference, 1064–1072. DOI: 10.1145/3638529.3654090. Online publication date: 14-Jul-2024.
  • (2024) Using Emotion Analysis, Eye tracking, and Head Movement to Monitor Student Engagement among ESL Students with Facial Recognition Algorithm (Mediapipe). 2024 7th International Conference on Advanced Algorithms and Control Engineering (ICAACE), 509–513. DOI: 10.1109/ICAACE61206.2024.10548871. Online publication date: 1-Mar-2024.
  • (2024) Adaptively Enhancing Facial Expression Crucial Regions via a Local Non-local Joint Network. Machine Intelligence Research 21(2), 331–348. DOI: 10.1007/s11633-023-1417-9. Online publication date: 11-Jan-2024.
  • (2024) Survey of deep emotion recognition in dynamic data using facial, speech and textual cues. Multimedia Tools and Applications 83(25), 66223–66262. DOI: 10.1007/s11042-023-17944-9. Online publication date: 22-Jan-2024.
  • (2024) Feature fusion for human compound emotion recognition: a fusion of facial expression texture and action unit data. Pattern Analysis and Applications 27(4). DOI: 10.1007/s10044-024-01369-7. Online publication date: 14-Nov-2024.
  • (2023) An Assessment of In-the-Wild Datasets for Multimodal Emotion Recognition. Sensors 23(11), 5184. DOI: 10.3390/s23115184. Online publication date: 30-May-2023.
  • (2023) LASTNet: A Swin Transformer with LANets Network for Video emotion recognition. Proceedings of the 4th International Conference on Artificial Intelligence and Computer Engineering, 291–294. DOI: 10.1145/3652628.3652676. Online publication date: 17-Nov-2023.
  • (2023) Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition. IEEE Transactions on Affective Computing 14(4), 3231–3243. DOI: 10.1109/TAFFC.2023.3258900. Online publication date: 1-Oct-2023.
  • (2023) Spatial-Temporal Graphs Plus Transformers for Geometry-Guided Facial Expression Recognition. IEEE Transactions on Affective Computing 14(4), 2751–2767. DOI: 10.1109/TAFFC.2022.3181736. Online publication date: 1-Oct-2023.
  • (2023) Impact of Facial Landmark Localization on Facial Expression Recognition. IEEE Transactions on Affective Computing 14(2), 1267–1279. DOI: 10.1109/TAFFC.2021.3124142. Online publication date: 1-Apr-2023.
