
Exploring the Interplay between Facial Expression Recognition and Physical States

Maria Francesca Roig-Maimó, Ciències Matemàtiques i Informàtica, Universitat de les Illes Balears, Spain, xisca.roig@uib.es
Miquel Mascaró-Oliver, Ciències Matemàtiques i Informàtica, Universitat de les Illes Balears, Spain, miquel.mascaro@uib.es
Esperança Amengual-Alcover, Ciències Matemàtiques i Informàtica, Universitat de les Illes Balears, Spain, eamengual@uib.es
Ramon Mas-Sansó, Ciències Matemàtiques i Informàtica, Universitat de les Illes Balears, Spain, ramon.mas@uib.es

This paper proposes a new viewpoint in Facial Expression Recognition (FER), moving beyond conventional approaches focused on understanding human emotions to also include expressions of physical states such as pain and effort. These expressions involve facial muscle activity that deviates from straightforward emotional expressions and is often overlooked by existing datasets and classifiers, which predominantly focus on emotional states. The study presented addresses inaccuracies in facial expression reporting when the input image corresponds to a physical state. By applying a pre-trained FER classifier to a specialized dataset, this research analyses the implications of lacking classifiers tailored for physical states. Critical issues in FER tasks are highlighted, revealing how datasets without physical-state labels introduce bias and impact accuracy. We consider the UIBVFED Physical States dataset, a dataset featuring facial expressions of physical states, to be a significant contribution. This dataset addresses biased estimations in FER tasks and enhances the training of recognition systems, improving their suitability across diverse scenarios.

CCS Concepts: Human-centered computing∼Human computer interaction (HCI), Computing methodologies∼Machine learning

KEYWORDS: facial expression recognition, machine learning, facial expression datasets, synthetic avatars, convolutional neural network, HCI

ACM Reference Format:
Maria Francesca Roig-Maimó, Miquel Mascaró-Oliver, Esperança Amengual-Alcover∗ and Ramon Mas-Sansó. 2024. Exploring the Interplay between Facial Expression Recognition and Physical States. In XXIV International Conference on Human Computer Interaction (INTERACCION 2024), June 19–21, 2024, A Coruña, Spain. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3657242.3658602

1 INTRODUCTION

Facial Expression Recognition (FER) systems based on machine learning are trained using diverse databases accessible to the research community.

Some recent works in the field focus on exploring advancements in FER. As outlined in [1], the methodologies commonly used in real-world FER applications reveal a multifaceted approach to understanding and interpreting facial expressions. Within this landscape, Convolutional Neural Networks (CNNs) stand out as a widely employed deep learning methodology in image recognition tasks, emphasizing the importance of the quality of the datasets used in various scenarios. The results obtained vary depending on the approach and the datasets considered, including Fer2013, CKC, FERC, RAF, DEAP, CASMEI, CASME-I, SAMM, MAHNOB-HCI, Oulu-CASIA, and LIRIS-CSE. Previous work [2] concludes that the most widely used dataset in FER studies is Fer2013, while DAiSEE is the most utilized emotion dataset at the academic level. Although facial expression databases may vary in expression labelling, they consistently reference Ekman's seven universal expressions [3]. As networks are trained on these datasets, researchers have dedicated their efforts to constructing models that classify emotions into seven categories, encompassing the six universally recognized emotions alongside a neutral state. This approach is motivated by two primary factors. Firstly, most available datasets comprise between six and eight labelled expressions. Secondly, it is often more feasible to categorize emotions into a smaller set of classes. Nevertheless, in [4] the authors adopt a different approach by utilizing a dataset that contains avatars performing 32 facial expressions (plus the neutral facial expression). In their work, a simple CNN model is trained on this extensive dataset, thereby obtaining a 33-facial-expression classifier. Predicted facial expressions are subsequently translated to their corresponding emotion (refer to Table 2). Impressively, an accuracy of 0.95 is achieved, a performance level comparable to that of more complex models currently in use.

A common point in most of the reviewed literature [1], [2], [5] is the need to transcend mere facial expression recognition towards a more comprehensive understanding of human emotions in diverse scenarios. In this context, physical states such as pain, effort and sleepiness involve facial muscle activity, shaping expressions that are not straightforwardly linked to emotions. To the best of our knowledge, none of the existing facial expression databases comprehensively addresses the nuances of the various physical states. Existing classifiers focus on recognizing facial expressions attributed to emotional states. Our main goal is to analyse how facial expressions are wrongly reported when the input corresponds to a physical state. To do this, we use a FER classifier pre-trained only on facial expressions attributed to emotional states and test it on a new dataset created specifically to include facial expressions associated with physical states. We then analyse the results reported by the classifier to understand the implications of lacking classifiers tailored for physical states.

2 RELATED WORK

2.1 Facial expressions associated with physical states

Gary Faigin classifies 33 facial expressions attributed to emotional states based on the universal emotions, and also 17 facial expressions associated with physical states [6]. The latter are presented in Table 1.

Table 1: Faigin's physical states expressions
Expression | Description
Pain 1 | Extreme pain. The face undergoes spasms in an intense contraction
Pain 2 | Sudden, unexpected pain, often expressed with a shout
Exertion 1 | Similar to expressions of pain. The eyes almost never clench as tightly and often remain open, somewhat narrowed
Exertion 2 | Lip pressing, an equally common response to physical effort
Drowsiness 1 | The levator palpebrae relaxes. The eyes attempt to stay open through the action of the frontal muscle lifting the brows
Drowsiness 2 | Eyelids droop, and vision becomes blurred. Successive blinking
Yawning 1 | General clenching of facial muscles. The mouth opens as wide as possible
Yawning 2 | Brows are as often lowered as raised in a yawn. Eyes partly shut in a squint
Singing 1 | Only two of the multiple mouth positions are illustrated. Only involves the lower half of the face. Illustrates the singing of the sound "ah"
Singing 2 | Many vowel sounds, sung and spoken, involve the action of the lip muscle that purses the lips. Illustrates the singing of the sound "eeeee"
Shouting | Usually accompanied by squinting. The lower lip is drawn downward into a square shape, and the upper lip rises
Passion / Asleep | Slightly opened mouth and closed, relaxed eyes. The expression has many possible interpretations, such as sleeping, stupor, or sexual passion
Intensity / Attention | When not seen as anger, the combination of widened eyes and lowered brows is understood as intense concentration and interest
Brows down / Perplexed | Although it may be confused with sternness or anger, the act of lowering the brows can be interpreted as confusion, reflection, and frustration
Brows up | The brow lift is one of the most common actions in conversational expressions. It corresponds to surprise, emphasis, greeting, etc.
Shock | Closely related to the emotion of fear. It can express distress
Facial shrug | The facial equivalent of shrugging the shoulders. It is a gesture of resignation

Despite having been described since 2012, to the best of our knowledge there is no dataset that includes physical states. We therefore find it valuable to share such a dataset with the scientific community, as it could enrich the understanding of human expressions by providing a new baseline for the training of neural networks.

2.2 UIBVFED and UIBVFEDPlus-Light

UIBVFED [7] is the first facial expression dataset with virtual characters. It is annotated with the 33 expressions of Gary Faigin's classification based on the universal emotions. The images of the UIBVFED dataset were generated according to the Facial Action Coding System (FACS) [3] and, therefore, objective labelling can be assured. The use of synthetic datasets is becoming popular as they provide automatic, objective labelling, are free of data privacy issues, and have proved to be a good substitute for real images, since synthetic characters obtain recognition rates similar to those of real ones [8], [9]. The original dataset was expanded by adding new characters, a measure aimed at mitigating overfitting in facial recognition training processes. This led to the development of UIBVFEDPlus-Light [13], which accommodates various lighting configurations.

Both datasets have successfully been employed for conducting FER experiments with CNN models [4], [10], [11]. Based on the insights gained from these studies, the dataset has recently been expanded with 20 new characters.

3 UIBVFED PHYSICAL STATES

Continuing within the same work setting previously described in Section 2.2, the expressions corresponding to the physical states defined by Gary Faigin have been reproduced. Figure 1 shows the different modeled expressions, except for the Brows down / Perplexed expression, which has been excluded because it does not differ significantly from the Intensity / Attention expression. The difference highlighted by Faigin regarding this expression is the distinct pressure on the eyelids. Unfortunately, this variation cannot be accurately replicated by our synthetic avatars due to the lack of precision and deformers in the ocular region.

The modeled physical states have been defined over 100 avatars created using the Autodesk Character Generator tool [12], deforming the facial meshes according to the muscular activity described by Faigin. The images have been generated within a Unity 3D environment, similar to how UIBVFED has been developed.

The result is a new dataset, UIBVFED Physical States, with 100 synthetic avatars, each reproducing the 16 expressions related to the considered physical states. The dataset is openly accessible on the Zenodo repository, adhering to the FAIR principles1.

Figure 1: UIBVFED Physical States

4 METHODOLOGY

This work uses a pre-trained CNN model fed with an extension of the original UIBVFED dataset [7]. The dataset used is composed of 100 gender-balanced avatars performing 32 facial expressions plus the neutral one (see the second column in Table 2). This dataset is available as a subset of the UIBVFEDPlus-Light dataset [13]. To the best of our knowledge, this CNN model is the only one that classifies the 33 facial expressions attributed to emotional states described by Faigin (see Table 2). Mascaró-Oliver et al. [4] reported a global accuracy of 0.8 when classifying the 33 facial expressions, and a global accuracy of 0.95 when translating the facial expression reported by the model into its associated emotion. The details of the CNN model can be found in [4].

In this work, we apply the pre-trained CNN model to the new UIBVFED Physical States dataset.
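As an illustration of this step, the sketch below shows how such a pre-trained classifier could be applied to the new images. It assumes a Keras/TensorFlow model and a generic preprocessing pipeline; the file name, input size and normalization are hypothetical and do not necessarily correspond to the actual implementation described in [4].

# Minimal sketch: apply a pre-trained 33-expression classifier to new images.
# File name, input size and 0-1 normalization are illustrative assumptions.
import numpy as np
from tensorflow import keras

def predict_expression(model, class_names, image_path, input_size=(224, 224)):
    """Return the predicted facial expression label for a single image."""
    img = keras.utils.load_img(image_path, target_size=input_size)
    x = keras.utils.img_to_array(img) / 255.0          # assumed preprocessing
    probs = model.predict(x[np.newaxis, ...], verbose=0)[0]
    return class_names[int(np.argmax(probs))]

# Hypothetical usage:
# model = keras.models.load_model("uibvfed_cnn_33.h5")  # assumed file name
# class_names = [...]  # the 33 expression labels, in the model's output order
# label = predict_expression(model, class_names, "avatar_001_pain_1.png")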

Table 2: Faigin's facial expressions and their associated emotion
Emotion | Facial expressions
Anger | Enraged Compressed Lips, Enraged Shouting, Mad, Sternness Anger
Disgust | Disdain, Disgust, Physical Repulsion
Fear | Afraid, Terror, Very Frightened, Worried
Joy | False Laughter 1, False Smile, Smiling closed mouth, Smiling open mouthed, Stifled Smile, Laughter, Uproarious Laughter, False Laughter 2, Abashed Smile, Eager Smile, Ingratiating Smile, Sly Smile, Melancholy Smile, Debauched Smile
Neutral | Neutral
Sadness | Crying closed mouth, Crying open mouthed, Miserable, Nearly crying, Sad, Suppressed sadness
Surprise | Surprise
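For illustration, Table 2 can be encoded as a lookup table that collapses the classifier's 33-way output into the seven emotion categories. The sketch below is a minimal rendering of this translation; the variable and function names are ours and are not taken from [4].

# Table 2 as a lookup table: emotion -> facial expressions.
EMOTION_TO_EXPRESSIONS = {
    "Anger": ["Enraged Compressed Lips", "Enraged Shouting", "Mad", "Sternness Anger"],
    "Disgust": ["Disdain", "Disgust", "Physical Repulsion"],
    "Fear": ["Afraid", "Terror", "Very Frightened", "Worried"],
    "Joy": ["False Laughter 1", "False Smile", "Smiling closed mouth",
            "Smiling open mouthed", "Stifled Smile", "Laughter", "Uproarious Laughter",
            "False Laughter 2", "Abashed Smile", "Eager Smile", "Ingratiating Smile",
            "Sly Smile", "Melancholy Smile", "Debauched Smile"],
    "Neutral": ["Neutral"],
    "Sadness": ["Crying closed mouth", "Crying open mouthed", "Miserable",
                "Nearly crying", "Sad", "Suppressed sadness"],
    "Surprise": ["Surprise"],
}
# Flat expression -> emotion lookup (the inverse of Table 2).
EXPRESSION_TO_EMOTION = {expr: emo
                         for emo, exprs in EMOTION_TO_EXPRESSIONS.items()
                         for expr in exprs}

def to_emotion(predicted_expression):
    """Translate a predicted facial expression into its associated emotion."""
    return EXPRESSION_TO_EMOTION[predicted_expression]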

5 RESULTS AND DISCUSSION

This section presents the results returned by the 33-facial-expression classifier applied to the images of the UIBVFED Physical States dataset, translated to their associated emotions according to Table 2 and following the procedure described in [4].

As expected, none of the images has been correctly classified, since the model was trained on images corresponding to a different set of categories that did not include physical states. Table 3 illustrates how the CNN model classifies the facial expressions corresponding to physical states, translated to their associated emotions.
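As a minimal sketch of how the figures reported in Table 3 could be tabulated, each image's predicted expression is first translated to its emotion and then counted per physical state. The record format and names below are illustrative assumptions rather than the exact implementation used.

from collections import Counter, defaultdict

def emotion_distribution(predictions, expression_to_emotion):
    """predictions: iterable of (physical_state, predicted_expression) pairs.
    Returns, per physical state, the percentage of images whose predicted
    expression translates to each emotion (the layout of Table 3).
    Emotions never predicted for a state are simply absent from its entry."""
    counts = defaultdict(Counter)
    for state, predicted_expression in predictions:
        counts[state][expression_to_emotion[predicted_expression]] += 1
    return {state: {emotion: round(100 * n / sum(c.values()))
                    for emotion, n in c.items()}
            for state, c in counts.items()}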

Table 3: Percentage of images of each physical state in UIBVFED Physical States classified into each emotion (derived from the facial expression returned by the model), and the similar facial expressions expected by Faigin [6]
Expression | Anger (%) | Disgust (%) | Fear (%) | Joy (%) | Neutral (%) | Sadness (%) | Surprise (%) | Similar to
Pain 1 | 13 | 3 | 10 | 62 | 0 | 11 | 1 | Crying, Laughing, Exertion
Pain 2 | 59 | 21 | 1 | 14 | 0 | 5 | 0 | Crying, Laughing, Exertion
Exertion 1 | 22 | 43 | 1 | 29 | 0 | 5 | 0 | Pain, Anger, Laughing
Exertion 2 | 3 | 8 | 0 | 36 | 0 | 53 | 0 | Pain, Suppressed Laugh, Anger, Cry
Drowsiness 1 | 0 | 10 | 14 | 8 | 34 | 34 | 0 | Surprise, Fear
Drowsiness 2 | 0 | 16 | 9 | 4 | 56 | 15 | 0 | -
Yawning 1 | 8 | 11 | 1 | 57 | 0 | 21 | 2 | Singing, Shouting
Yawning 2 | 8 | 12 | 1 | 54 | 0 | 23 | 2 | Singing, Shouting
Singing 1 | 3 | 6 | 18 | 2 | 4 | 4 | 63 | Surprise
Singing 2 | 10 | 2 | 39 | 3 | 1 | 37 | 8 | -
Shouting | 2 | 33 | 2 | 50 | 0 | 1 | 12 | Singing, Yawning
Passion / Asleep | 8 | 6 | 19 | 26 | 0 | 31 | 10 | Singing
Intensity / Attention | 46 | 1 | 16 | 2 | 33 | 2 | 0 | Sternness / Anger
Brows up | 6 | 2 | 32 | 2 | 45 | 12 | 1 | Surprise
Shock | 12 | 1 | 35 | 3 | 44 | 5 | 0 | Fear
Facial shrug | 2 | 0 | 68 | 1 | 2 | 25 | 2 | Sadness

In Table 3, the first column displays the physical-state expressions, while the last column indicates the expressions with which they could be confused according to Faigin (see Table 2). Columns 2 to 8 show the percentage of images of each physical state classified into facial expressions related to each emotion. These data allow for a proper interpretation of the level of confusion of our model. For instance, the CNN model classifies the Pain 1 expression images mostly as Joy (62%), Anger (13%), and Sadness (11%). According to Faigin, the Pain 1 physical expression has a theoretical potential confusion with the Crying, Laughing and Exertion expressions. As Crying relates to the emotion of Sadness, Laughing relates to Joy, and Exertion relates to Anger, the behavior of the classification (despite being incorrect with respect to physical states) agrees in 86% of the cases with the theoretical confusion behavior reported by Faigin.
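The 86% figure is simply the sum of the percentages assigned to the emotions associated with Faigin's similar expressions. The following worked example uses the Pain 1 row of Table 3 and is only a minimal illustration of the arithmetic:

# Pain 1 row of Table 3 (percentage of images per emotion returned by the model).
pain_1 = {"Anger": 13, "Disgust": 3, "Fear": 10, "Joy": 62,
          "Neutral": 0, "Sadness": 11, "Surprise": 1}
# Emotions associated with Faigin's similar expressions:
# Crying -> Sadness, Laughing -> Joy, Exertion -> Anger.
faigin_expected = {"Sadness", "Joy", "Anger"}
agreement = sum(p for emotion, p in pain_1.items() if emotion in faigin_expected)
print(agreement)  # 86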

The same analysis can be carried out for each of the physical states, leading to a classification of the physical states within the six universal emotions that is "coherent with the literature". However, these results are inaccurate. Even if it is true that the facial expression of Laughing shares facial features with the facial expressions of Pain, and therefore they can be confused, it is not acceptable to classify a Pain physical state as the Joy emotion. Therefore, it is necessary to have classifiers able to correctly label physical states in addition to emotions. For that reason, it is mandatory to create new datasets to train such models.

6 LIMITATIONS

One limitation of the present work is the classification of the physical-state expressions. In this study, we adopted Faigin's classification, which has some shortcomings that we justify below. Regarding facial expressions induced by the reproduction of visemes, only the Singing 1 and Singing 2 expressions are considered. Aspects of speech cannot be completely separated from emotions, although their treatment entails a complexity beyond the scope of this work. Other physical-state expressions may allow for variations that have not been taken into consideration, for example, configurations of Exertion with open eyes or Drowsiness with the eyes completely closed. We understand that, despite the associated computational cost, including all these subtle variations would increase precision.

Another limitation concerns the realism of the avatars. Our work is constrained by the avatars generated with the Autodesk Character Generator tool and the accompanying deformers. We have encountered shortcomings in defining movements of pressure on the eyelids and in the precise control of the lips. In addition to these deficiencies related to geometric deformation, there are also issues associated with the texture applied to the avatars. Specifically, the facial wrinkles of our avatars are defined by textures applied to the color and bump maps of the material. Wrinkles are a crucial part of facial expressions, and being detached from the geometric deformers implies a lack of control over their manipulation.

Any study of FER based on static images suffers from the lack of context inherent to an isolated image. A significant improvement would be achieved not only by defining physical states with video sequences, but also by using videos for the definition of facial expressions associated with emotional states.

7 CONCLUSION

In this work, we have highlighted the issues arising from the omission of facial expressions related to physical states in the context of Facial Expression Recognition (FER) tasks. It becomes evident that the datasets used to train recognition systems, lacking labels associated with physical states, inevitably introduce bias into their estimations. Our research offers a quantitative analysis of this bias, comparing the results obtained with the expected interpretations provided by Faigin. While our findings are specific to a simple CNN model, they contribute valuable insights into the distribution of errors, shedding light on potential challenges across other recognition methods.

As a significant contribution, we offer the research community a dataset containing facial expressions of physical states. This novel dataset aims to address and alleviate the issues associated with biased estimations in FER tasks. By including a comprehensive range of physical states, the UIBVFED Physical States dataset facilitates more inclusive and accurate training of recognition systems, enhancing their robustness and applicability across diverse scenarios.

ACKNOWLEDGMENTS

The authors acknowledge the Project PID2022-136779OB-C32 (PLEISAR) funded by MCIN/AEI/10.13039/501100011033/ and by FEDER "A way to make Europe".

REFERENCES

[1] K. I. K. Jajan and P. D. E. A. M. Abdulazeez, ‘Facial Expression Recognition Based on Deep Learning: A Review’, Indones. J. Comput. Sci., vol. 13, no. 1, Art. no. 1, 2024, doi: 10.33022/ijcs.v13i1.3705.
[2] J. X.-Y. Lek and J. Teo, ‘Academic Emotion Classification Using FER: A Systematic Review’, Hum. Behav. Emerg. Technol., vol. 2023, pp. 1–27, May 2023, doi: 10.1155/2023/9790005.
[3] P. Ekman and W. V. Friesen, ‘Facial Action Coding System: A Technique for the Measurement of Facial Movement’, Consult. Psychol. Press, 1978.
[4] M. Mascaró-Oliver, R. Mas-Sansó, E. Amengual-Alcover, and M. F. Roig-Maimó, ‘On the Convenience of Using 32 Facial Expressions to Recognize the 6 Universal Emotions’, in Information Systems and Technologies, Lecture Notes in Networks and Systems, vol. 800, A. Rocha, H. Adeli, G. Dzemyda, F. Moreira, and V. Colla, Eds. Cham: Springer Nature Switzerland, 2024, pp. 625–634. doi: 10.1007/978-3-031-45645-9_60.
[5] Y. Wang et al., ‘A systematic review on affective computing: emotion models, databases, and recent advances’, Inf. Fusion, vol. 83–84, pp. 19–52, Jul. 2022, doi: 10.1016/j.inffus.2022.03.009.
[6] G. Faigin, The artist's complete guide to facial expression. Watson-Guptill, 2012.
[7] M. M. Oliver and E. A. Alcover, ‘UIBVFED: Virtual facial expression dataset’, PLOS ONE, vol. 15, no. 4, p. e0231266, Apr. 2020, doi: 10.1371/journal.pone.0231266.
[8] L. Colbois, T. de Freitas Pereira, and S. Marcel, ‘On the use of automatically generated synthetic image datasets for benchmarking face recognition’, in 2021 IEEE International Joint Conference on Biometrics (IJCB), Aug. 2021, pp. 1–8. doi: 10.1109/IJCB52358.2021.9484363.
[9] J. Del Aguila, L. M. González-Gualda, M. A. Játiva, P. Fernández-Sotos, A. Fernández-Caballero, and A. S. García, ‘How Interpersonal Distance Between Avatar and Human Influences Facial Affect Recognition in Immersive Virtual Reality’, Front. Psychol., vol. 12, p. 675515, 2021, doi: 10.3389/fpsyg.2021.675515.
[10] G. Carreto Picón, M. F. Roig-Maimó, M. Mascaró Oliver, E. Amengual Alcover, and R. Mas-Sansó, ‘Do Machines Better Understand Synthetic Facial Expressions than People?’, in Proceedings of the XXII International Conference on Human Computer Interaction (Interacción ’22), New York, NY, USA: Association for Computing Machinery, Sep. 2022, pp. 1–5. doi: 10.1145/3549865.3549908.
[11] G. del Castillo Torres, M. F. Roig-Maimó, M. Mascaró-Oliver, E. Amengual-Alcover, and R. Mas-Sansó, ‘Understanding How CNNs Recognize Facial Expressions: A Case Study with LIME and CEM’, Sensors, vol. 23, no. 1, Art. no. 1, Jan. 2023, doi: 10.3390/s23010131.
[12] ‘Autodesk Character Generator’. Accessed: Mar. 07, 2024. [Online]. Available: https://charactergenerator.autodesk.com/
[13] M. Mascaró-Oliver, E. Amengual-Alcover, M. F. Roig-Maimó, and R. Mas-Sansó, ‘UIBVFEDPlus-Light: Virtual facial expression dataset with lighting’, PLOS ONE, vol. 18, no. 9, p. e0287006, Sep. 2023, doi: 10.1371/journal.pone.0287006.

FOOTNOTE

∗Corresponding author.

1 https://zenodo.org/records/10793613

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

INTERACCION 2024, June 19–21, 2024, A Coruña, Spain

© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-1787-1/24/06…$15.00.
DOI: https://doi.org/10.1145/3657242.3658602