
DR-FER: Discriminative and Robust Representation Learning for Facial Expression Recognition

Published: 28 December 2023

Abstract

Learning discriminative and robust representations is important for facial expression recognition (FER) because emotional faces differ only subtly and their annotations are subjective. Previous works usually pursue only one of these goals, since the two appear contradictory to optimize jointly, and their performance inevitably suffers from the challenges addressed by the other. In this article, by considering the problem from two novel perspectives, we demonstrate that discriminative and robust representations can be learned in a unified approach, i.e., DR-FER, and can mutually benefit each other. Moreover, we achieve this with supervision from only the original annotations. Specifically, to learn discriminative representations, we propose performing masked image modeling (MIM) as an auxiliary task that forces our network to discover expression-related facial areas. This is the first attempt to employ MIM to explore discriminative patterns in a self-supervised manner. To extract robust representations, we present a category-aware self-paced learning schedule that mines high-quality annotated (easy) expressions and incorrectly annotated (hard) counterparts. We further introduce a retrieval similarity-based relabeling strategy to correct the annotations of hard expressions, exploiting them more effectively. With the enhanced discrimination ability of the FER classifier acting as a bridge, these two learning goals significantly strengthen each other. Extensive experiments on several popular benchmarks demonstrate the superior performance of our DR-FER. Moreover, thorough visualizations and additional experiments on manually annotation-corrupted datasets show that our approach successfully learns discriminative and robust representations simultaneously.
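To make the two learning signals described in the abstract more concrete, the sketch below shows, in PyTorch, one plausible way to pair a masked-image-modeling auxiliary loss with a category-aware self-paced selection of easy samples. All function names, the masking ratio, and the keep ratio are illustrative assumptions; the paper's actual architecture, relabeling strategy, and hyperparameters are not reproduced here.

```python
import torch
import torch.nn.functional as F


def mim_auxiliary_loss(images, encoder, decoder, patch=16, mask_ratio=0.5):
    """Masked-image-modeling auxiliary loss (hypothetical setup): randomly hide
    a fraction of patches, encode the masked image, and ask a lightweight
    decoder to reconstruct the original pixels. Only the masked patches
    contribute to the reconstruction loss."""
    B, C, H, W = images.shape
    ph, pw = H // patch, W // patch
    # Boolean mask over the patch grid: True = patch is hidden.
    mask = torch.rand(B, ph, pw, device=images.device) < mask_ratio
    pixel_mask = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    pixel_mask = pixel_mask.unsqueeze(1).float()          # (B, 1, H, W)
    masked = images * (1.0 - pixel_mask)                  # zero out hidden patches
    recon = decoder(encoder(masked))                      # assumed (B, C, H, W) output
    # Mean squared error computed over masked pixels only.
    return ((recon - images) ** 2 * pixel_mask).sum() / (pixel_mask.sum() * C).clamp(min=1)


def category_aware_easy_mask(logits, labels, num_classes, keep_ratio=0.7):
    """Category-aware self-paced selection (illustrative): within each class,
    keep the keep_ratio fraction of samples whose predicted probability for
    their annotated label is highest; the remainder are treated as 'hard'."""
    probs = F.softmax(logits, dim=1)
    conf = probs[torch.arange(len(labels)), labels]       # confidence in the given label
    keep = torch.zeros_like(labels, dtype=torch.bool)
    for c in range(num_classes):
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if len(idx) == 0:
            continue
        k = max(1, int(keep_ratio * len(idx)))
        top = conf[idx].topk(k).indices
        keep[idx[top]] = True
    return keep                                           # True = easy sample
```

In a joint training loop of this kind, the classifier would be trained with cross-entropy on the easy subset while the MIM loss is applied to all images; hard samples could then be relabeled, for example by retrieving their most similar easy neighbors in feature space, in the spirit of the retrieval similarity-based relabeling the abstract describes.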



Published In

IEEE Transactions on Multimedia, Volume 26, 2024, 9891 pages

Publisher

IEEE Press

Publication History

Published: 28 December 2023

Qualifiers

  • Research-article
