1. Introduction
The issue of accurately detecting fatigue is of paramount importance, given that fatigue directly endangers numerous lives, particularly those of complex system operators [1]. Operator fatigue is often the underlying cause of high-cost errors and mishaps in the control of complex systems, such as industrial robotic complexes, chemical reaction systems, or systems of multiple coordinated units [2].
Fatigue is typically induced when an individual does not receive an adequate amount of sleep or engages in extended periods of laborious activity. A comprehensive assessment of sleep patterns among the U.S. population found that 37% of the workforce received less than the recommended baseline of seven hours of sleep [3]. The principal challenge associated with fatigue is the difficulty of recognizing the onset of this state before it escalates to a point where it significantly impairs work and system operation efficiency. Fatigued individuals often exhibit slower decision-making and prolonged reaction times, which render them less effective in their professional roles and more susceptible to risks while working [4].
Numerous solutions have been proposed to tackle the issue of fatigue detection, with the primary aims of identifying signs of fatigue and subsequently alerting the affected individual to this hazardous state. The majority of these methodologies employ machine learning classifiers or leverage deep learning models to construct high-efficacy fatigue detection systems [5]. Most research on fatigue monitoring has concentrated on identifying fatigue during cognitive tasks, such as driving [6], while some researchers have focused on identifying signs of fatigue in natural-viewing situations, where individuals are not actively engaged in cognitive tasks [7]. Non-intrusive techniques, including those that utilize remote or webcam-based eye tracking, generally monitor variations in an individual’s pupil response, blink rate, and eye movement patterns to deduce signs of fatigue during such cognitive tasks [8].
Within the fields of machine learning and deep learning, the selection of the training dataset is a critical factor that influences several aspects of the model, including its accuracy and generalizability. Consequently, numerous datasets have been curated, developed, and utilized for the advancement of fatigue or drowsiness detection techniques [5]. Some of these datasets rely on sensors and devices to collect pertinent features associated with an individual’s fatigue state and mental cognition, including eye trackers [9], electroencephalogram (EEG) signals [10], and electrocardiography (ECG) signals [11], as well as other physiological indicators. However, the evolution of data collection techniques has led to the creation of more sophisticated datasets comprising videos of drivers or operators engaged in a task. The availability of such datasets opens up new avenues for experimentation, particularly with deep learning models, which can process video data to recognize signs of fatigue without requiring any additional devices on the individual. Such non-intrusive methods enhance the applicability and usability of these approaches in safety and monitoring systems, thereby offering potential solutions for real-world fatigue detection challenges.
In this article, we present a selection of datasets that offer various advantages, particularly in terms of the wide range of facial characteristics, behaviors, ethnic backgrounds, lighting conditions, recording settings, and facial poses they encompass. However, we also draw attention to a significant limitation: existing datasets primarily focus on simulating driving scenarios, which has created a lack of recordings depicting operator (computer user) tasks in which fatigue exerts a substantial influence on both work performance and productivity. Therefore, our current research focuses on operator fatigue detection in work and academic settings. In this study, subjects experienced the causes of the two types of mental fatigue [12]. The first type is active fatigue, which arises from prolonged task-related perceptual–motor adjustments; for example, operators reading scientific articles develop cognitive strategies that enhance their ability to process complex information. This adaptation includes adjustments in reading speed and focused attention on key concepts. The second type is passive fatigue, which occurs due to extended periods of inactivity, such as sitting in front of a computer for a long time, leading to mental exhaustion. The originality of the proposal lies in the use of physiological indicators, obtained via computer vision, for human fatigue state detection based on facial videos. The videos are taken from the public dataset OperatorEYEVP [13]. Accordingly, we applied multiple advanced deep learning and computer vision models to analyze the dataset videos and successfully extracted physiological characteristics, which represents a novel and original contribution to the field. Unlike wearable or other sensors attached to the human body, computer vision techniques can be more useful in many situations, since they are non-intrusive and do not limit human mobility. Our contribution can be summarized as follows:
Innovative approach to non-intrusive fatigue detection: Diverging from conventional sensor-based research on mental fatigue, this study introduces a novel, non-intrusive method combining deep learning, computer vision, and machine learning models. Instead of leveraging sensor data for mental fatigue detection, we extract features using a simple camera and apply machine learning techniques for fatigue prediction. This shift offers practical benefits due to its accessibility and user-friendliness, demonstrating the potential of our approach as a valuable tool for fatigue research.
Estimation of physiological indicators: We utilize the OperatorEYEVP dataset [13], which contains a diverse collection of videos capturing the daily tasks of employees and students. Through the application of deep learning and computer vision models developed in our previous work, we estimate a set of physiological indicators that evaluate the subject’s physiological condition in these videos.
Comprehensive dataset HFAVD: The resulting dataset includes physiological indicators alongside other data (Landolt rings test results, VAS-F scores, and sensor data related to eye tracking and HRV). These indicators comprise head movement (as Euler angles: roll, pitch, and yaw), vital signs, and eye and mouth states (eye closure and yawning), all automatically calculated and detected by deep learning models, eliminating the manual intervention often required by other existing datasets.
Applicability to work and academic settings: The resulting dataset holds particular significance for work and academic environments, where the impact of fatigue on work performance and efficiency is significant. Therefore, it has the potential to advance fatigue detection research in these contexts.
Our contribution provides a significant opportunity for researchers to advance their comprehension of the intricate relationship between mental fatigue and the diverse range of state changes displayed by subjects.
The paper is structured as follows:
Section 2 describes several approaches and datasets for the task of fatigue and drowsiness detection.
Section 3 begins by providing an overview of the original dataset we used for our research. Next, we explain how we made our own contribution to this dataset. Then, we describe the structure of the models used, along with details of the final tables that describe the dataset.
Section 4 outlines the experiments we conducted to analyze the dataset, along with their results.
3. Dataset Description
This article introduces a dataset comprising video data integrated with numerical metrics, offering a comprehensive assessment of the subject’s physiological condition as depicted in the videos. The dataset provides parameters such as head movement in terms of Euler angles (roll, pitch, and yaw), vital signs, and eye and mouth states (eye closure and yawning), eliminating the necessity for manual calculation or detection of these parameters, which is commonly required by other existing datasets. In the following subsections, we describe the original dataset we worked with, explain our main contribution to this dataset, and delineate the structure of the models used, with a detailed explanation of the final tables.
3.1. OperatorEYEVP Dataset
For this purpose, we utilized the OperatorEYEVP dataset introduced in [13]. This dataset features video recordings of ten unique individuals participating in diverse activities. These activities were documented at three distinct moments throughout the day (morning at 9:00, afternoon at 13:00, and evening at 18:00), over a period spanning eight to ten days. The recordings occurred on weekdays from December 2022 to June 2023. The recorded data not only offer a frontal view of the participants’ facial expressions, but they also include a set of additional information. This encompasses data pertaining to pupil and head movement (including specific coordinates), scene visuals, pulse rate for each interval, and Choice Reaction Time measured twice. Furthermore, the dataset incorporates responses to various questionnaires and scales, notably the VAS-F scale. The VAS-F scale, an 18-question measure, addresses the subjective experiences of fatigue among participants. It is worth noting that these questions were administered prior to the beginning of the experimental session. The timeline of a session is shown in Figure 1.
The experimental procedure included a daily sleep quality survey conducted prior to the morning session. This was succeeded by the VAS-F questionnaire, a Choice Reaction Time (CRT) task, reading a text in a scientific style, performing the “Landolt rings” correction test, playing the “Tetris” game, and a second CRT. The latter was incorporated at the discretion of the authors, cognizant of the potential variability in operator fatigue levels between the inception and termination of the recording session. The aggregate duration of such sessions, on average, approximated one hour.
Throughout the Choice Reaction Time (CRT) assessment, a comprehensive group of parameters was recorded, such as the average reaction time, its standard deviation, and the number of errors made by participants during task performance. Participants engaged in the reading of scientific-style text, which can be a typical work-related activity. This task served as a static cognitive load task aimed at evaluating cognitive performance. The Landolt rings correction test is a well-established method for evaluating visual acuity. Based on [13], the following parameters were recorded during the Landolt rings test:
Time spent (t).
The total number of characters viewed up to the last selected character (N).
The total number of lines viewed (C).
The total number of characters to be crossed out (n).
The total number of characters crossed out (M).
Correctly selected letters (S).
Letter characters missed (P).
Wrongly selected characters (O).
These parameters facilitate the computation of various metrics, including attention productivity (K), mental performance (Au), and stability of attention concentration (Ku), using the equations provided in [13].
Finally, the Tetris game served as a dynamic active task to investigate hand–eye coordination. The recorded variables included the number of games played, the levels reached, the scores achieved, and the lines cleared.
This dataset’s demographic is primarily composed of individuals with a mean age of 27.3 years, extending from a minimum age of 21 to a maximum of 50 years. It is evident that the sample is skewed towards the younger age spectrum, with the majority below the age of 33. Furthermore, there were no reported instances of significant health issues that could impact vital signs, with the exception of the first participant, who exhibited symptoms of depression. However, it is important to note that the participants predominantly have lighter skin tones. This demographic characteristic may present limitations for future applications of this approach, particularly in assessing fatigue among older individuals, those with serious health conditions, and individuals with darker skin tones. This observation highlights the future need to expand the dataset to include a more diverse range of ages, health conditions, and ethnicities, thereby enhancing the generalizability of the approach.
3.2. Annotation of Dataset with Physiological Indicators
Our primary contribution to the dataset entails the annotation, performed at one-minute intervals, of the videos encompassing four main physiological indicators: blood pressure, heart rate, oxygen saturation, and respiratory rate. We selected the one-minute interval because, as we discuss in more detail in Section 3.3, the models used for parameter estimation make predictions at the level of seconds, frames, or minutes. Therefore, a one-minute interval was deemed the most suitable choice for synchronizing the estimations. Moreover, supplementary metrics were computed, including estimation of head pose via Euler angles (roll, pitch, and yaw), the proportion of frames exhibiting Euler angles surpassing 30 degrees relative to the total frames within a one-minute span, ratios of eye closure and mouth opening, instances of yawning, numbers of prolonged eye closures exceeding two seconds, and characteristics pertaining to breathing patterns, encompassing rhythmicity and stability.
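To make these per-minute metrics concrete, the following sketch derives two of them, the “angles > 30” ratio and the count of eye closures longer than two seconds, from per-frame model outputs. The frame rate and the synthetic input streams are illustrative assumptions, not values taken from OperatorEYEVP.

```python
import numpy as np

FPS = 30  # assumed camera frame rate; the actual OperatorEYEVP rate may differ

# Hypothetical per-frame model outputs for one minute of video:
# Euler angles in degrees and a binary eye-state stream (True = closed).
rng = np.random.default_rng(0)
roll, pitch, yaw = (rng.normal(0, 15, 60 * FPS) for _ in range(3))
eyes_closed = rng.random(60 * FPS) < 0.1

# Share of frames in the minute where any Euler angle exceeds 30 degrees.
angles_over_30 = np.mean(
    (np.abs(roll) > 30) | (np.abs(pitch) > 30) | (np.abs(yaw) > 30)
)

def count_long_closures(closed, fps, min_seconds=2.0):
    """Count runs of consecutive closed-eye frames longer than min_seconds."""
    padded = np.concatenate(([0], closed.astype(int), [0]))
    edges = np.diff(padded)
    starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
    return int(np.sum((ends - starts) / fps > min_seconds))

print(f"angles>30 ratio: {angles_over_30:.3f}")
print(f"eye closures > 2 s: {count_long_closures(eyes_closed, FPS)}")
```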
The computation of the aforementioned parameters was carried out by using computer vision and deep learning approaches developed by members of our laboratory, which are discussed in the subsequent subsection. These approaches were applied to each video within the OperatorEYEVP dataset, and annotations were made for the 15 parameters at each minute of the recordings. For each participant, a comprehensive CSV file was created to document all the details of the session. This file encompassed information related to the CRT and Tetris tasks, as well as the eye-tracking and HRV parameters provided by the dataset, in addition to the parameters extracted by our models and additional features we calculated from the coordinates of the eye-tracking data. A more in-depth exploration of the CSV files, including their structure and contents, is provided later in the discussion.
3.3. HFAVD Dataset Structure
In this subsection, we present an overview of the models utilized for extracting the physiological features used in the annotation of the videos. These models leverage deep learning, machine learning, and computer vision techniques to estimate specific physiological indicators based on facial video recordings of the subjects, as shown in Figure 2. Through these models, we were able to estimate 15 features at each minute of every video, capturing a comprehensive range of physiological information. The physiological features were extracted using deep learning models with a sampling frequency of one second, as detailed in subsequent sections. However, many of these features exhibit minimal variation within a one-second interval, which can result in numerous identical or highly similar samples. This redundancy may compromise the reliability of the experimental outcomes. Consequently, we opted to compute the average of these indicators for each minute and label them with the corresponding mental performance value. This approach was selected to retain as much data as possible, ensuring they remained adequate for the machine learning models. Increasing the sampling interval from 1 min to, for example, 5 min could lead to insufficient data, adversely affecting performance, while decreasing it would result in duplicated samples, thereby rendering the experiments invalid and unreliable.
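A minimal sketch of this aggregation step is shown below, assuming hypothetical per-second prediction streams; the mental performance labels attached here are placeholders rather than values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical per-second model outputs for a 10-minute session.
rng = np.random.default_rng(1)
seconds = pd.timedelta_range("0s", periods=600, freq="1s")
per_second = pd.DataFrame(
    {
        "heart_rate": rng.normal(72, 3, 600),
        "respiratory_rate": rng.normal(16, 1, 600),
        "oxygen_saturation": rng.normal(97, 0.5, 600),
    },
    index=seconds,
)

# Average each indicator over one-minute windows, then attach the
# per-minute mental performance (Au) label as the ground truth.
per_minute = per_second.resample("1min").mean()
per_minute["Au"] = rng.uniform(0.2, 0.8, len(per_minute))  # placeholder labels
print(per_minute.head())
```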
The estimated physiological features were compiled and organized into comprehensive CSV files. These CSV files enhance the distinctiveness of the dataset compared with existing ones. They include almost every detectable or extractable feature from the subject while the latter performs various tasks, without the need for sensors or external devices apart from the widely available camera. This distinguishing characteristic contributes to the dataset’s versatility and accessibility for researchers. These models are explained as follows.
3.3.1. Respiratory Rate and Breathing Characteristics
To estimate the respiratory rate, the approach proposed by [39] was employed, encompassing the following steps: (a) Chest keypoint detection using OpenPose: The OpenPose framework was utilized to detect the keypoint corresponding to the chest region in the video frames. (b) Displacement detection using SelFlow: An Optical Flow-based neural network called SelFlow was employed to detect the displacement of the chest keypoint between consecutive frames. (c) Projection of displacement along the x- and y-axes: The displacement values obtained were projected onto the x- and y-axes, separating the movement into vertical (up/down) and horizontal (left/right) directions. (d) Signal processing techniques: Signal processing methods, including filtering and detrending, were applied to the displacement data to enhance the accuracy of respiratory rate estimation. (e) Calculation of true peak count and scaling: The true peak count was calculated based on the processed displacement data. This count was then scaled to estimate the number of breaths per minute. Furthermore, in addition to estimating the respiratory rate, various breathing characteristics, such as stability and rhythmicity, were assessed based on the amplitude and wavelength of the respiratory wave. The mean absolute error (MAE) for respiratory rate estimation was 1.5 breaths per minute (bpm). Regarding the limitations of this model, its effectiveness is restricted by its inability to handle body movement efficiently and its reliance on the subject maintaining a stationary position during recording. This requirement poses the risk of inaccurate estimations. Consequently, the model is unsuitable for use when the subject is walking, engaged in motion, or even driving on the road, due to the presence of vehicle vibrations.
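The signal-processing half of this pipeline (steps (d) and (e)) can be illustrated with a short sketch. Here a synthetic chest-displacement signal stands in for the OpenPose/SelFlow output, and the frame rate and the 0.1–0.7 Hz breathing band are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, detrend, filtfilt, find_peaks

FS = 25.0  # assumed video frame rate (Hz)

# Hypothetical vertical chest displacement over 60 s: a 0.25 Hz breathing
# wave (15 breaths/min) plus slow drift and noise.
t = np.arange(0, 60, 1 / FS)
disp = (np.sin(2 * np.pi * 0.25 * t) + 0.02 * t
        + 0.1 * np.random.default_rng(2).normal(size=t.size))

# (d) Detrend and band-pass filter to a plausible breathing band (0.1-0.7 Hz).
disp = detrend(disp)
b, a = butter(2, [0.1, 0.7], btype="bandpass", fs=FS)
filtered = filtfilt(b, a, disp)

# (e) Count true peaks and scale to breaths per minute.
peaks, _ = find_peaks(filtered, distance=FS / 0.7)  # at most 0.7 breaths/s
rr_bpm = len(peaks) * 60 / t[-1]
print(f"Estimated respiratory rate: {rr_bpm:.1f} breaths/min")
```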
3.3.2. Heart Rate
The heart rate estimation approach proposed by [40] was employed in this study. The methodology involved the following steps: (a) Extraction of the Region of Interest (Face): Landmarks obtained from the 3DDFA_V2 framework [41] were utilized to extract the Region of Interest, specifically the face, from the input data. (b) Processing through a Vision Transformer: The extracted face region was then processed through a Vision Transformer model, incorporating multi-skip connections. This process generated features from five different levels. (c) Block processing: The output from each level of the Vision Transformer was passed through a block structure. This block structure consisted of a Bidirectional Long Short-Term Memory (BiLSTM) layer, batch normalization, 1D convolution, and a fully connected layer. (d) Averaging of block outputs: The outputs from the five different blocks were averaged to obtain the minimum, maximum, and mean heart rates. (e) Weighted averaging: The minimum, maximum, and mean heart rates were then combined by using a weighted averaging scheme to estimate the final heart rate. This model demonstrated superior accuracy in heart rate estimation compared with previously published methods, as evidenced by its performance on the LGI-PPGI and V4V datasets, achieving mean absolute errors (MAEs) of 2.68 and 9.996 bpm, respectively. Nevertheless, it is important to note that the dataset utilized for training the model exhibited a lack of samples representing both very low and very high heart rates. Consequently, the model may produce inaccurate estimations when presented with subjects exhibiting specific medical conditions characterized by extreme heart rate values.
3.3.3. Blood Pressure
The estimation of blood pressure followed the approach proposed by [42]. The methodology consisted of the following steps: (a) Extraction of Regions of Interest: The left and right cheeks were identified as the Regions of Interest (ROIs) in each frame of every video. These sequential images were used as input for subsequent analysis. (b) Convolutional Neural Network (CNN) for spatial feature extraction: A CNN architecture was employed to capture spatial features from the ROIs. For systolic blood pressure estimation, EfficientNet_B3 was utilized for the left cheek, while EfficientNet_B5 was used for the right cheek. Conversely, for diastolic blood pressure estimation, an ensemble approach was adopted. This involved combining EfficientNet_B3 and ResNet50V2 for the left cheek and the right cheek, respectively. (c) Long Short-Term Memory (LSTM) for temporal feature extraction: The outputs obtained from the previous step were fed into an LSTM network. This enabled the extraction of temporal features within the sequence of images, considering the dynamic nature of blood pressure. (d) Fully connected layers for blood pressure estimation: Two fully connected layers were employed to derive the blood pressure values based on the extracted spatial and temporal features. Upon training and testing the proposed model, the mean absolute errors recorded were 11.8 and 10.7 mmHg for systolic and diastolic blood pressure, respectively. In parallel, the model’s mean accuracies were found to be 89.5% for systolic blood pressure and 86.2% for diastolic blood pressure. The main challenge faced in training this model was the lack of diversity in the dataset, particularly in terms of the subjects’ skin color. Most of the subjects had light-to-medium skin tones, which impacted the accuracy of the models in predicting blood pressure in individuals with dark skin tones. Another limitation was the dataset’s imbalance, with a scarcity of unusual values for systolic and diastolic blood pressure.
3.3.4. Oxygen Saturation
The estimation of oxygen saturation in this study was conducted based on the method proposed by [43]. The procedure involved the following steps: (a) Extraction of face region: The face region was initially extracted by using the 3DDFA_V2 framework. This facilitated the isolation of the relevant area for subsequent analysis. (b) Input to VGG19: The extracted face region was then used as input for the VGG19 model, which had been pre-trained on the ImageNet dataset. This allowed for the extraction of relevant features from the face region. (c) Utilization of XGBoost: The output obtained from VGG19 was subsequently fed into an XGBoost algorithm. XGBoost is a machine learning framework that combines multiple decision trees to make predictions. In this case, it was employed to obtain the estimation of the oxygen saturation value. Following the training phase, the model was subjected to testing by using two distinct datasets. The derived mean absolute error (MAE) for the first test set was found to be 1.17%, while for the second test set, it was marginally lower, at 0.84%. The primary challenge encountered in training this model pertained to the limited number of samples depicting SpO2 levels below 85%. This scarcity poses a significant obstacle for the models, particularly when examining subjects with specific health conditions that are associated with low SpO2 levels.
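A rough sketch of this two-stage design (frozen VGG19 features feeding a gradient-boosted regressor) is given below. The face-extraction step via 3DDFA_V2 is omitted, the input tensors are random stand-ins for face crops, and the XGBoost hyperparameters are assumed values.

```python
import numpy as np
import torch
from torchvision import models
from xgboost import XGBRegressor

# Load VGG19 pre-trained on ImageNet and use it as a frozen feature extractor.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg.classifier = vgg.classifier[:-1]  # drop the final 1000-class layer
vgg.eval()

def face_features(face_batch: torch.Tensor) -> np.ndarray:
    """Map a batch of 224x224 RGB face crops to 4096-D VGG19 features."""
    with torch.no_grad():
        return vgg(face_batch).numpy()

# Hypothetical training data: random "face crops" and SpO2 targets.
faces = torch.rand(32, 3, 224, 224)
spo2 = np.random.default_rng(3).uniform(90, 99, 32)

# Gradient-boosted trees regress SpO2 from the deep features.
model = XGBRegressor(n_estimators=100, max_depth=3)
model.fit(face_features(faces), spo2)
print(model.predict(face_features(faces[:2])))
```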
3.3.5. Head Pose
The determination of the head pose, characterized by Euler angles (roll, pitch, and yaw), was carried out by using the methodology described in [44]. The approach involved the following steps: (a) Face detection using YOLO Tiny: Initially, the YOLO Tiny framework was utilized to detect the face in the input data. This allowed for the identification of the Region of Interest. (b) Three-dimensional face reconstruction for landmark alignment: Subsequently, a 3D face reconstruction technique was applied to align the facial landmarks. This process ensured accurate landmark detection even for those not directly visible to the camera. (c) Landmark analysis for Euler angle calculation: Once the facial landmarks were detected, the transitions and rotations between landmarks across successive frames were analyzed. This analysis enabled the calculation of the Euler angles, providing information about the head pose. A limitation of this approach is that the proposed model can only estimate head angles up to a maximum value of 70 degrees. This restricts its applicability in scenarios where larger head angles need to be accurately detected. Additionally, the model’s performance is slower compared with the Dlib detector, which itself loses accuracy when the head angle exceeds 35 degrees. This slower performance could hinder real-time applications that require fast and accurate head angle estimation.
3.3.6. Eye and Mouth State
The eye state was assessed through the utilization of a trained model. This model takes the detected face, identified via the FaceBoxes framework, as its input and generates an output that signifies whether the eyes are open or closed. Similarly, the yawning state is recognized by employing a modified version of MobileNet, as proposed by [45]. This model achieved an accuracy of 95.20%. However, a limitation of this model is that it relies on a private dataset for validation. This could potentially limit the generalizability of the findings, as the dataset may not be representative of the broader population.
The videos in the OperatorEYEVP dataset exhibit a high degree of similarity to the conditions found in the datasets used for training the models. These similarities encompass factors such as lighting conditions, skin color, and the demographic characteristics of the subjects, who were healthy and young. Consequently, the likelihood of encountering extreme or abnormal values in vital signs, particularly oxygen saturation, is minimized. However, the insufficiency of sensor data in the OperatorEYEVP dataset prevents the calculation of estimation errors. Nevertheless, considering the aforementioned conditions, it is reasonable to assume that the models exhibit good accuracy. This assumption is further supported by the alignment of the obtained results with expectations and their consistency with the existing literature.
3.4. Data Example
As previously stated, comprehensive CSV files were generated to meticulously record all the details of each video within the OperatorEYEVP dataset. In this subsection, we provide a thorough description of these files, elucidating the significance of each feature included therein, as well as their intended purpose. Additionally, we provide examples of how these files could potentially be utilized by other researchers.
For each participant, a CSV file encompassing 199 columns was generated to provide an extensive description of the session performed by the subject (see Table 3). The initial columns within the file comprise metadata associated with the session, including participant ID, session name, the type of activity undertaken, and the time of day during which the session occurred. Subsequently, information pertaining to the CRT sessions, specifically the first and second sessions, is listed. This information includes metrics such as the average reaction time, standard deviation of reaction time, and the number of errors made.
Following that, we included the pertinent information related to the Landolt rings test, encompassing a variety of indicators and metrics. These include indicators of attention, stability of attention, and work accuracy; the quantification of mental productivity through a mental productivity index; and the evaluation of mental performance. Additionally, we incorporated a concentration indicator, the qualitative characteristic of concentration, the capacity of visual information processing, and the processing rate. Furthermore, we listed additional information related to the test results, such as the total number of characters crossed out, letters correctly selected, letter characters missed, wrongly selected characters, total number of lines viewed, and other relevant variables.
Subsequent to this, we added a collection of features related to eye movement strategy (pupil movement) observed during each session. These features, totaling 60 in number, encompass measurements related to fixation and saccadic time and velocity, alongside additional parameters associated with acceleration. Additionally, an extensive set of 79 parameters was included, encompassing various domains such as time, frequency, and nonlinearity, all of which are related to heart rate variability. Notable examples within this set include VLF, LF, and HF band frequencies, as well as relative and absolute power measurements for these frequency bands, among numerous other relevant metrics. In addition, we extracted the absolute heart rate for each minute from the PPI signal provided. We considered it useful to include both the real and the predicted heart rate (estimated by our model; Section 3.3.2), so that researchers can carry out experiments and comparisons to evaluate the possibility of integrating and relying on computer vision-based estimation for such tasks. All the aforementioned features were provided by [13].
The next stage involved extracting the relevant features from the x- and y-coordinates from the eye movement data, which were represented as two separate time series for each activity. To accomplish this, we employed basic Python (v. 3.6.9) code in conjunction with the NumPy library to compute seven key features from each series, namely, mean, standard deviation, minimum, maximum, 25th percentile, 50th percentile, and 75th percentile.
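The computation is straightforward; a sketch equivalent to what we describe, applied to synthetic coordinate series, might look as follows.

```python
import numpy as np

def summarize(series: np.ndarray) -> dict:
    """Seven summary features for one eye-movement coordinate series."""
    return {
        "mean": np.mean(series),
        "std": np.std(series),
        "min": np.min(series),
        "max": np.max(series),
        "p25": np.percentile(series, 25),
        "p50": np.percentile(series, 50),
        "p75": np.percentile(series, 75),
    }

# Hypothetical x- and y-coordinate series for one activity.
rng = np.random.default_rng(4)
x, y = rng.normal(640, 120, 5000), rng.normal(360, 80, 5000)
features = {f"x_{k}": v for k, v in summarize(x).items()}
features.update({f"y_{k}": v for k, v in summarize(y).items()})
print(features)
```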
Finally, we added the 15 features extracted by our models after processing the videos from the OperatorEYEVP dataset. These features included respiratory rate, rhythmicity and stability of breathing, eye closure ratio, mouth opening ratio, head Euler angles (roll, pitch, and yaw), the ratio of frames where any Euler angle exceeds 30 degrees relative to the total frames in one minute (angles > 30), the count of eye closures lasting more than 2 s (count of eye closure > 2 s), the count of yawns, heart rate, systolic and diastolic blood pressure, and blood oxygen saturation. The final column in the dataset represents the subjective assessment of fatigue, specifically measured by using the Visual Analogue Scale for Fatigue (VAS-F). The participants themselves provided this assessment, indicating their perceived level of fatigue.
We believe that our contribution to the OperatorEYEVP dataset offers a valuable opportunity for researchers to enhance their understanding of the relationship between mental fatigue and various responses exhibited by subjects. These responses encompass both observable cues, such as head movement and eye and mouth state, as well as underlying indicators, such as vital signs. Researchers can gain insights beyond fatigue alone by exploring the correlations between these responses and multiple features, including attention, work accuracy, concentration stability, and mental performance. We consider that our previous approaches and research have been employed in an optimal manner. The advantage of utilizing these approaches lies in their ability to provide access to all the aforementioned features without the need for sensors or external devices. This not only saves considerable time and effort in conducting studies but also eliminates the need to develop costly approaches to detecting specific features. Consequently, researchers can promptly apply and test their hypotheses and theories. Furthermore, the availability of annotated videos capturing various activities, not solely limited to fatigue, can provide a deeper understanding of the underlying human mechanisms.
4. The Proposed Approach and Its Application to the HFAVD Dataset
In this section, we delve into the experiments used to analyze the dataset, along with presenting the resulting outcomes. The dataset provided was utilized to detect fatigue levels based on the physiological features extracted by our developed models, as explained in Section 3.2. The overall framework is illustrated in Figure 3.
As illustrated in the diagram, our approach involves extracting key physiological indicators from the videos of operators using a set of specialized models developed for this purpose. These indicators include heart rate, respiratory rate, blood pressure, oxygen saturation, eye closure ratio, head pose, and yawning frequency. Each of these physiological signals is calculated on a per-minute basis, ensuring a continuous, time-resolved input into the fatigue detection process. Once these indicators are extracted, they are used as input features for our fatigue detection model. This model is designed to assess the operator’s fatigue state based on mental performance (Au), which is derived from the results of the Landolt rings test, a well-known measure of visual acuity and attention. The Landolt test provides insights into the operator’s cognitive state, specifically focusing on attention and concentration.
To determine the optimal threshold for Au, which separates the fatigue state from the non-fatigue state, we conducted a detailed analysis of the data. This analysis involved evaluating mental performance across different Au values to identify the point at which fatigue became evident. In addition to Au, we also analyzed two other important cognitive measures: the attention level (K) and the stability of concentration (Ku). These metrics helped us gain a more comprehensive understanding of how attention and focus degrade over time, particularly as fatigue sets in.
We employed the results of the Landolt test, particularly mental performance, to assess the fatigue level due to its capacity to effectively capture the cognitive and attentional aspects associated with fatigue. Unlike subjective measures such as the Visual Analogue Scale for Fatigue (VAS-F), which rely on self-reported perceptions that can be swayed by individual biases and emotional states, Au provides a more objective and standardized evaluation of fatigue.
Table 4 shows the mental performance (Au), attention (K), and stability of attention concentration (Ku) for each participant.
Given that attention decreases when a person is fatigued, we can draw meaningful insights from our data. For example, participant number 6 has an attention level above 60, which is considered good [46]. This indicates that participant 6 is not fatigued. Therefore, we can conclude that the fatigue threshold for mental performance must be below 0.59, as individuals with mental performance scores above this threshold maintain high attention levels.
To predict the fatigue state, we based our study on the hypothesis that mental performance decreases as fatigue increases. Figure 4 illustrates the relationship between mental performance (as measured by the Landolt rings test) and the inverse of fatigue (1/Fatigue). We assume that there is a threshold value (a point or a small range of points) where a change in fatigue level causes a significant change in physiological indicators. We also assume that this change can be captured by a machine learning model.
We assessed each participant’s mental performance, classifying them as fatigued if their performance fell below a predetermined threshold. To identify the optimal threshold, we conducted multiple experiments using the random forest algorithm, testing thresholds from 0.2 (chosen empirically from our dataset) to 0.59 (discussed above) in increments of 0.015. This increment was chosen experimentally, as performance differences were negligible with smaller increments. We selected the random forest algorithm for evaluation due to the superior results it demonstrated in our previous research [2]. Figure 5 shows the F1-score for different threshold values obtained through these experiments.
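The threshold sweep can be expressed compactly as in the sketch below; the feature matrix and Au scores are synthetic stand-ins, so the resulting numbers are meaningless, but the procedure mirrors the one described.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical per-minute physiological features and continuous mental
# performance scores (Au) from the Landolt rings test.
rng = np.random.default_rng(5)
X = rng.normal(size=(1200, 15))
au = rng.uniform(0.1, 0.8, 1200)

best = (None, -1.0)
for thr in np.arange(0.2, 0.59 + 1e-9, 0.015):
    y = (au < thr).astype(int)  # 1 = fatigued when Au falls below the threshold
    if y.min() == y.max():      # skip degenerate splits with a single class
        continue
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    score = f1_score(y_te, clf.predict(X_te))
    if score > best[1]:
        best = (thr, score)

print(f"best threshold: {best[0]:.3f}, F1: {best[1]:.4f}")
```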
Moreover, we recognized that vital signs vary significantly between individuals. To address this, we meticulously normalized these metrics before incorporating them into our predictive models. This normalization allowed our models to accurately account for individual differences, thereby providing more reliable predictions of fatigue. We used the min–max normalization function shown in Equation (4):

$$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}, \tag{4}$$

where $X$ is the original vital sign value, $X_{\mathrm{norm}}$ is the normalized vital sign value, and $X_{\min}$ and $X_{\max}$ are the minimum and maximum values of the vital sign for the participant.
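Because the normalization is per participant, it is conveniently expressed as a grouped transform; the following sketch applies Equation (4) to hypothetical heart-rate values for two participants.

```python
import pandas as pd

# Hypothetical per-minute vital signs for two participants.
df = pd.DataFrame(
    {
        "participant": ["p1"] * 4 + ["p2"] * 4,
        "heart_rate": [62, 70, 75, 68, 80, 95, 88, 84],
    }
)

# Equation (4): per-participant min-max normalization, so each person's
# vitals are scaled to [0, 1] relative to their own range.
def min_max(col: pd.Series) -> pd.Series:
    return (col - col.min()) / (col.max() - col.min())

df["heart_rate_norm"] = df.groupby("participant")["heart_rate"].transform(min_max)
print(df)
```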
As can be seen from Figure 5, the threshold value of 0.38 achieves the best F1-score.
To predict the fatigue state, several methods were evaluated, including Support Vector Classifier (SVC), logistic regression, Multi-Layer Perceptron (MLP), decision tree, XGBoost, and random forest. Each method has distinct characteristics, which are shown in Table 5.
The hyperparameters for each model were fine-tuned by using grid search to optimize performance. For SVC, we used a Radial Basis Function (RBF) kernel with a regularization parameter C = 1.0 and the kernel coefficient set to ‘scale’. In logistic regression, we applied an L2 (Ridge) penalty with a regularization strength C = 0.001 and used the ‘lbfgs’ solver for optimization. The MLP classifier was configured with two hidden layers; ReLU was applied as the activation function for the neurons, Adam was selected as the solver, and the learning rate was set to . For the decision tree, the Gini impurity was used as the splitting criterion, with a maximum tree depth of 20, a minimum of two samples required per leaf, and a minimum of two samples required to split a node. In the XGBoost model, the ‘gbtree’ booster was utilized. Key hyperparameters included a learning rate of 0.1, a maximum depth of 3, a minimum loss reduction (gamma) of 2, a feature subsample ratio (colsample_bytree) of 0.8, and 100 estimators (trees). The random forest classifier also employed 100 estimators with the Gini impurity criterion. It had a maximum tree depth of 20, used ‘log2’ as the maximum number of features to consider when splitting, and required a minimum of two samples to split a node. The environment used for these experiments was Python, with scikit-learn employed for SVC, logistic regression, MLP, decision tree, and random forest, while the XGBoost library was used for the XGBoost method.
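The configurations above can be reproduced roughly as follows; parameters the text does not specify (the MLP hidden-layer sizes and learning rate) are filled with assumed values and marked as such.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Classifiers with the hyperparameters stated in the text.
models = {
    "SVC": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "LogReg": LogisticRegression(penalty="l2", C=0.001, solver="lbfgs"),
    "MLP": MLPClassifier(
        hidden_layer_sizes=(64, 64),   # assumed sizes (not given in the text)
        activation="relu",
        solver="adam",
        learning_rate_init=1e-3,       # assumed value (not given in the text)
    ),
    "DecisionTree": DecisionTreeClassifier(
        criterion="gini", max_depth=20, min_samples_leaf=2, min_samples_split=2
    ),
    "XGBoost": XGBClassifier(
        booster="gbtree", learning_rate=0.1, max_depth=3, gamma=2,
        colsample_bytree=0.8, n_estimators=100,
    ),
    "RandomForest": RandomForestClassifier(
        n_estimators=100, criterion="gini", max_depth=20,
        max_features="log2", min_samples_split=2,
    ),
}
# Each model is then fit and scored on the same train/test split,
# e.g., models["RandomForest"].fit(X_train, y_train).
```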
Table 5 shows the performance of the models using the threshold of 0.38 to detect the fatigue state based on the mental performance.
The random forest method addresses the issue of overfitting commonly associated with decision trees by combining the predictions of multiple trees to improve accuracy and robustness. While a single decision tree may capture noise in the data, random forests reduce this risk by averaging predictions across many trees, leading to better generalization. Random forest is an ensemble learning method that builds multiple decision trees using bootstrap sampling (bagging) and random feature selection to reduce correlation between trees [47]. Each decision tree is constructed using algorithms such as ID3 (Iterative Dichotomiser 3), C4.5, or CART (Classification and Regression Trees). These algorithms split nodes based on criteria such as Gini impurity or entropy/information gain for classification problems, like detecting human fatigue. The final random forest prediction is made by aggregating results (majority voting for classification). It also uses the out-of-bag (OOB) error for performance estimation and can calculate feature importance.
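A brief illustration of these mechanics on synthetic data: enabling oob_score gives the built-in out-of-bag generalization estimate, and feature_importances_ exposes the importance calculation mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the fatigue feature matrix (imbalanced classes).
X, y = make_classification(n_samples=1000, n_features=15, weights=[0.7],
                           random_state=0)

# Bagging with OOB scoring: each tree is fit on a bootstrap sample, and the
# held-out (out-of-bag) rows give a built-in generalization estimate.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
print("feature importances:", rf.feature_importances_.round(3))
```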
It is important to mention that the dataset, collected during working hours at various times of the day, reflects a greater number of non-fatigue state samples compared with fatigue state samples. For the testing dataset, this imbalance is evident, with the non-fatigue state samples (853) being more than twice as numerous as the fatigue state samples (358). This imbalance can make the accuracy metric less reliable, as it may not fully capture the performance of the models on the minority class. To address this issue, we also evaluated the F1-score, which provides a more comprehensive measure by considering both precision and recall. As shown in Table 5, while all models exhibit high accuracy, there is significant variation in the F1-scores. Models such as SVC and logistic regression, despite their high accuracy, have relatively low F1-scores (0.4831 and 0.5215, respectively), indicating poorer performance in predicting the minority class. Conversely, models like MLP, decision tree, XGBoost, and random forest demonstrate both high accuracy and high F1-scores, with random forest achieving the highest F1-score, 0.9465. This indicates better handling of the imbalanced data. The experiments revealed that the threshold of 0.38 yielded the best F1-score, with the random forest model achieving an F1-score of 0.9465 in fatigue prediction based on vital signs.
To analyze the effect of feature correlation on classification performance, it is essential to assess the relationships between the input variables. In this context, features exhibiting high correlation include num_eye_closure > 2 s and eye_closure_ratio, as well as rhythmicity_coeff and stability_coef, which were assessed by using the Spearman rank correlation coefficient. However, other potential correlations, such as those between head Euler angles (roll, pitch, and yaw) with angles exceeding 30 degrees and the relationship between mouth_openness_ratio and num_yawns, were not captured by the Spearman method. This limitation may arise from the non-monotonic relationship between these variables. Notably, when training a random forest classifier without including the participant ID, the model achieved an F1-score of 0.7990 by using the original 15 features. Subsequently removing the redundant features identified through correlation analysis led to an improvement in the F1-score to 0.8384, highlighting the significance of feature selection in enhancing classification performance. This finding underscores the importance of addressing feature redundancy to optimize model performance in classification tasks.
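One possible form of this redundancy check is sketched below on a hypothetical feature frame; the correlated pairs are simulated to mimic those reported, and the 0.9 cut-off is an assumed value.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame; correlated pairs mimic the ones reported above.
rng = np.random.default_rng(6)
n = 500
eye_closure_ratio = rng.random(n)
rhythmicity = rng.random(n)
df = pd.DataFrame(
    {
        "eye_closure_ratio": eye_closure_ratio,
        "num_eye_closure_gt_2s": eye_closure_ratio * 10 + rng.normal(0, 0.2, n),
        "rhythmicity_coeff": rhythmicity,
        "stability_coef": rhythmicity + rng.normal(0, 0.05, n),
        "heart_rate": rng.normal(72, 5, n),
    }
)

# Spearman rank correlations; flag one feature from each highly related pair.
corr = df.corr(method="spearman").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("candidates to drop:", redundant)
```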
Another crucial aspect is analyzing the most relevant characteristics for classification. Given the numerous variables used to characterize the problem, it is essential to determine their relevance and impact on classification performance. The relevance of each feature was analyzed, revealing that, from most to least relevant, the features are BP Diastolic, participant ID, BP Systolic, eye_closure_ratio, heart_rate, oxygen_saturation, average pitch, average roll, mouth_openness_ratio, average yaw, average RR, rhythmicity_coeff, and stability_coef. The experiments shown in Table 6 revealed that using all 13 features resulted in an accuracy of 98.1% and an F1-score of 0.8658. Reducing the number of features to 11 increased the accuracy to 98.27% and the F1-score to 0.8748. A further reduction to nine features improved the accuracy to 98.43% and the F1-score to 0.8891. However, using only seven features slightly decreased the accuracy to 98.1% and dropped the F1-score to 0.8597. Leaving fewer than seven features produced significantly worse results. These results indicate that reducing the number of features can enhance classification performance up to a point, highlighting that not all features are equally relevant. Thus, focusing on the most significant features can enhance the model’s efficiency and accuracy. Another notable observation is that the high importance of the participant ID feature means that the fatigue assessment models tend to be person-dependent.
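The top-n experiment can be sketched as follows on a synthetic 13-feature matrix; the reported accuracies will not be reproduced, since the data here are random stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 13-feature fatigue matrix.
X, y = make_classification(n_samples=1200, n_features=13, n_informative=9,
                           weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0,
                                          stratify=y)

# Rank features once with a full model, then retrain on the top-n subsets.
ranking = np.argsort(
    RandomForestClassifier(random_state=0).fit(X_tr, y_tr).feature_importances_
)[::-1]
for n_feats in (13, 11, 9, 7):
    cols = ranking[:n_feats]
    clf = RandomForestClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
    pred = clf.predict(X_te[:, cols])
    print(f"top {n_feats:2d}: acc={accuracy_score(y_te, pred):.4f} "
          f"F1={f1_score(y_te, pred):.4f}")
```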
Table 7 shows the decline in the methods’ performance after removing the participant ID feature.
The participant ID’s role in classification processes is underscored by a body of research observing differential reactions to changes in mental fatigue states among subjects, as evidenced by alterations in vital signs and eye movement. In one study, the authors of [48] discovered that exposure to a fatigue-related task led to an increase in blood pressure in some participants, while others registered a decrease. This observation extended to respiratory rates, reinforcing the individual variance in physiological responses to fatigue. Moreover, the authors of [49] found a correlation between mental fatigue and eye movement data (including blink frequency) in three of their subjects, as captured by an eye detection system, while the rest of the subjects did not show any significant relationship. These findings underscore the crucial nature of participant ID as a parameter. It enables machine learning models to account for person-specific dependencies, thereby enhancing their predictive accuracy for fatigue levels. Thus, the participant ID emerges as a significant factor in optimizing classification processes by accommodating individual variability.
Table 8 presents a comparative analysis of existing methodologies for detecting fatigue through images or videos without the use of sensors or specialized equipment. While our approach demonstrates good accuracy (98.43%), it is important to note that the dataset employed is relatively small compared with those utilized in other studies, such as [22]. The accuracy rates mentioned in the table were calculated on different data, where fatigue was identified based on different metrics (subjective, objective, or based on monitoring certain features); hence, the approaches listed cannot be compared directly. Nevertheless, the accuracy of our models is high enough for the proposed models to be used for fatigue detection. Additionally, this limitation stems from the scarcity of public datasets addressing mental fatigue among operators in work and academic contexts, highlighting a significant gap that our study aims to fill and encouraging further research efforts in this direction. The only approach that allows for a valid comparison is [28], as it utilized the same dataset; our approach demonstrates superior performance by achieving higher accuracy through the use of physiological features rather than relying on eye-tracking data. Additionally, it is noteworthy that many existing approaches primarily rely on features related to the eyes and mouth. However, factors such as the use of sunglasses, reflections from lighting on eyewear, or mouth coverings for medical purposes can affect the detection accuracy. In contrast, our method incorporates a diverse range of parameters that account for both the internal and external states of the subject, thereby enhancing the robustness of our approach in managing these challenges. Furthermore, the majority of the existing datasets rely on subjective indicators to evaluate fatigue levels among participants. These subjective measures can be influenced by personal perceptions and biases, which may compromise the validity of the findings. In contrast, our study adopts a more robust approach by employing an objective metric to assess fatigue. This choice not only minimizes potential biases but also enhances the overall reliability and validity of our results.
5. Discussion
The results of the article shed light on the proposed association among mental performance, mental fatigue, and attention indicators. This is particularly noteworthy considering the existing literature that suggests strong correlations between mental fatigue and a range of physiological features, including head movement [50], vital signs (blood pressure, heart rate, oxygen saturation, and respiratory rate) [48,51], blinks [52], and yawns [53]. These findings provide empirical support for the idea that these physiological measures can serve as reliable indicators of mental fatigue. Moreover, the study’s findings reveal that employing the identified mental performance threshold value of 0.38 leads to a high F1-score. These findings further indicate a robust relationship between mental performance and the aforementioned physiological features, which subsequently reflect the presence of mental fatigue.
In addition, the study findings not only reinforce the proposed connection among mental performance, mental fatigue, and attention indicators, but they also provide empirical evidence of the reliability and validity of utilizing specific physiological features and thresholds to assess mental fatigue. These insights contribute to a deeper understanding of the intricate relationship between mental performance and associated physiological manifestations, highlighting the potential for employing these measures in improving fatigue detection systems.
In our analysis, we believe that advanced deep learning techniques such as Transformers, CNNs, and RNNs are suited for more complex tasks, such as extracting deep-layer features from video; hence, these types of models, having demonstrated high capability thanks to their training on extensive datasets, are used to extract vital signs from video data, a critical aspect of our research. Conversely, for the specific classification task at hand, traditional machine learning algorithms are more appropriate. This is primarily because they require significantly fewer data and are computationally efficient, allowing us to streamline our approach and avoid unnecessary complexity and time consumption.
Another reason that led us to exclude advanced deep learning techniques for the classification task is that the available data are insufficient for the effective implementation of deep learning algorithms. Additionally, it is noteworthy that most state-of-the-art (SOTA) fatigue assessment methodologies primarily address driving fatigue, often overlooking the fatigue associated with office work. Consequently, we consider our research to be novel and innovative, as it specifically targets this issue through objective measurement techniques, i.e., mental performance assessment using the Landolt rings test.
This study faced certain limitations due to the relatively small size of the dataset. However, this problem cannot be solved by merging our dataset with the other datasets mentioned in Section 2.2, because we collected different fatigue indicators in our dataset, such as VAS-F, the Landolt rings test, and CRT. Based on our experiments described in [28], we found that the Landolt rings test is the most reasonable choice in the considered case of fatigue detection. As a result, our models have been trained on the fatigue level estimated by the Landolt rings test, which is absent from the datasets mentioned in Section 2.2; hence, it is not possible to extend our dataset with those datasets. Additionally, another limitation was the inclusion of a group of participants with limited age diversity and ethnic background. As people age, heart rate and respiratory rate remain relatively stable, but maximal heart rate decreases, blood pressure (particularly systolic) tends to rise due to stiffening arteries, and oxygen saturation may decline, especially in those with chronic conditions. However, we have not studied how the data change across different age groups and ethnicities. These factors may hinder the generalizability of the proposed approach. One potential constraint that could impede the generalizability of the trained model is the fact that fatigue detection is human-dependent. This factor could pose challenges in applying the proposed model to novel subjects. Moreover, there remains ambiguity in determining the fatigue threshold, as the VAS-F result is subjective. To address this, we opted for the mental performance indicator as a replacement for fatigue. However, for greater reliability, it is recommended to establish an explicit value for the fatigue threshold.
Another challenge that opens up the discussion is the assessment of the validity of the data predicted by deep learning models. The majority of models employed for physiological indicator estimation have undergone testing on multiple datasets distinct from the ones used for training, as detailed in the respective cited articles. The outcomes demonstrated high performance and low error margins, thereby validating their application to the current dataset. This dataset bears a significant resemblance to the conditions present in the training datasets, including aspects such as lighting conditions and the skin tone and demographic characteristics of the subjects, who were predominantly healthy and young. As a result, the probability of encountering abnormal values in vital signs is considerably reduced. However, this limitation can be effectively addressed by expanding the training datasets to include a more diverse population. Specifically, incorporating individuals from various age ranges and ethnic backgrounds, as well as those who suffer from serious health conditions that may impact their vital signs, will enhance the models’ robustness. Doing so would better equip the models to generalize across different demographic groups and health conditions, ultimately improving their accuracy and reliability in estimating physiological indicators for a broader population. This approach would not only mitigate existing biases but also ensure that the models can more accurately reflect the complexities of real-world health scenarios.
6. Conclusions
This research article emphasizes the significance of fatigue detection in system operation contexts and its direct impact on efficiency. The objective of this study is to assess the mental performance of operators who work long hours in front of their computers. Mental performance is considered the opposite of mental fatigue. Accordingly, we concentrate on office routines that do not involve physical work. This study starts by exploring various approaches and datasets that utilize videos, sensors, and devices to capture features associated with fatigue. This paper introduces an ML-based approach that integrates video data with features computed by computer vision and deep learning models. These features encompass head movement (roll, pitch, and yaw), vital signs (blood pressure, heart rate, oxygen saturation, and respiratory rate), and eye and mouth states (blinking and yawning). By incorporating these features, the need for manual calculation or external sensors is eliminated, distinguishing our dataset from existing ones that rely on such methods. The primary objective of this research is to advance fatigue detection in work and academic settings.
The experiments were carried out by using the publicly available dataset OperatorEYEVP. The choice to apply the estimation process to this specific public dataset was motivated by the relevance of the videos provided, which depict the daily tasks of operators and students. As the main objective of this research is to investigate fatigue in this specific context, it was deemed appropriate to utilize this dataset. The specific task examined is reading scientific articles, as it necessitates a focused and alert mind to effectively analyze the information presented. Additionally, some individuals engage in gaming during their breaks, which is represented in the dataset by the Tetris game, which also requires a significant level of concentration. Furthermore, we used an objective measurement of mental performance by using the Landolt rings test to evaluate visual acuity. The physiologically indicative parameters, computed through computer vision techniques, were compiled into another publicly accessible dataset, HFAVD, presented in this paper. Future endeavors are directed towards applying this methodology to other video datasets and expanding the HFAVD dataset.
Through a series of experiments utilizing machine learning techniques, the dataset was analyzed to assess the fatigue state based on the features predicted by our deep learning models (Section 3.3). The mental performance indicator obtained from the Landolt rings test was utilized as a measure of mental fatigue. The primary objective was to investigate the relationship between this indicator and the attention metric in order to determine the optimal starting threshold for dividing our ground truth into fatigued and non-fatigued classes. Furthermore, a series of experiments were conducted to identify the most effective final value for identifying sessions characterized by mental fatigue in participants.
The research findings indicate that the threshold of 0.38 exhibits the highest potential for effectively discriminating between subjects experiencing mental fatigue and those who are not, achieving superior performance in the classification process. We employed a range of machine learning techniques, including Support Vector Classifier (SVC), logistic regression, Multi-Layer Perceptron (MLP), decision tree, XGBoost, and random forest. To optimize model performance, we utilized grid search to identify the hyperparameters that yield the highest accuracy and F1-score. Our results indicate that most of these techniques achieved impressive accuracy levels, all exceeding 90%. However, we prioritized the F1-score as a critical performance metric, given that our test set was imbalanced. By attaining a high F1-score, we ensured that the models performed well across both classes, thereby mitigating potential bias issues. The outcomes consistently showed that random forest achieved the highest F1-score, 94%. These results highlight random forest as a highly promising technique for fatigue detection. XGBoost, decision tree, and MLP reached good F1-scores of 86%, 82%, and 87%, respectively.
It was also revealed that participant ID emerged as a significant factor in optimizing classification, showing that accounting for person-specific dependencies enhances the predictive accuracy and F1-score of machine learning models for the fatigue state. This observation was made by comparing the classification metrics when incorporating participant ID as a feature in the model versus when it was excluded. We found that the F1-score declined from 94% to 81%, highlighting the significance of individual differences in the context of mental fatigue classification.
Additionally, the concept of feature importance motivated us to investigate which features contribute most significantly to enhancing classification performance. Consequently, we conducted an experiment in which we sequentially analyzed the top n features based on their relevance. The results indicate that using 9 out of the 13 features achieved the highest accuracy, 98.43%. This finding suggests that we can streamline our approach by excluding certain less impactful features, such as those related to respiratory rate, thereby reducing the time required for implementation.
In summary, this study contributes to the topic of fatigue detection by presenting an ML-based approach and its application to a comprehensive dataset, demonstrating the effectiveness of machine learning in accurately assessing fatigue levels. The integration of video data and the use of deep learning techniques offer a more efficient and practical approach to identifying signs of fatigue, potentially enhancing safety in various contexts. Further research and development in this area can lead to the implementation of effective fatigue detection systems that mitigate risks and improve overall well-being.