1 Introduction

Artificial intelligence (AI) is the ability of a digital computer to perform tasks commonly associated with intelligent beings, matching the performance of human experts in tasks that require everyday knowledge, such as discovering meaning, estimating head poses, and reasoning. Head pose estimation (HPE) is an important computer vision task that has garnered significant attention in recent years owing to its wide range of applications in various fields, including surveillance, robotics, and human-computer interaction (HCI) (Zhang et al. 2023). The goal of HPE is to determine the position and orientation of a person’s head relative to a fixed coordinate system, typically defined by a camera (Borghi et al. 2018). Pose is usually expressed by rotational components (yaw, pitch, and roll) (Cobo et al. 2024), translational components (x, y, and z) (Algabri and Choi 2021, 2022), or both. With the recent rapid development of computer vision technology, deep learning techniques have outperformed classical techniques in various tasks, including AI-based HPE systems. The field of HPE has seen significant progress in recent years, with many state-of-the-art techniques achieving high accuracy and robustness in various applications. However, many challenges and open problems in HPE still require further research, such as the gimbal lock and discontinuous representation problems, as well as the need to handle dynamic head motions and achieve real-time performance. Moreover, standardized evaluation protocols and benchmarks are needed to enable fair comparisons between different HPE techniques. Most existing methods use Euler angle or quaternion representations to train their networks, leading to high performance for deep learning-based HPE. More recently, attention has turned to the rotation matrix representation with convolutional networks to mitigate the discontinuous representation and gimbal lock problems.

Given the rapid advancements in the HPE topic, this survey aims to trace the recent progress and recap these achievements to present a clear picture of existing research on deep learning techniques with advanced elements of the holistic system. Before diving deeper into HPE, it is essential to understand some related concepts:

  • Face detection is the process of locating and identifying faces of people within an image or video frame, as shown in Fig. 1a. The main objective of facial detection is to identify the presence and location of faces within an image or video stream. It is a crucial step in various applications such as facial recognition and emotion analysis. Face detection techniques include traditional methods such as Haar cascades and more advanced approaches using deep learning models such as convolutional neural networks (CNNs). These methods help in identifying the bounding boxes (rectangles) around detected faces. See Minaee et al. (2021), Liu et al. (2023) for more details.

  • Facial landmark detection (Lee et al. 2020) is the process of identifying specific points (landmarks) on a person’s face, such as the corners of the eyes and mouth and the tip of the nose, as shown in Fig. 1b. Facial landmark detection provides precise information about the locations of key facial features. These landmarks are often used as a foundation for various facial analysis tasks, including face alignment and emotion recognition (Tomar et al. 2023). Like face detection, facial landmark detection can be performed using deep learning models, especially those designed for keypoint detection. These models are trained to predict the coordinates of facial landmarks.

  • Head detection is a broader task than face detection and involves identifying and localizing the entire head in an image or video frame (Kumar et al. 2019). This can include detecting both the face and the surrounding head region, as shown in Fig. 1c.

  • HPE refers to the process of determining the position and orientation of a person’s head in three-dimensional (3D) space. It provides information regarding the direction people are looking, which can be valuable for understanding user engagement or gaze tracking in applications such as virtual reality, driver monitoring systems (DMSs), or HCI. Estimating the head pose typically involves identifying facial landmarks and using geometric and trigonometric calculations to determine the head’s orientation, as shown in Fig. 1d; a minimal sketch of this landmark-plus-geometry pipeline follows this list. HPE is the topic covered in this survey.
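
To make the landmark-plus-geometry idea concrete, the following minimal sketch recovers a head pose from six detected landmarks with OpenCV's perspective-n-point solver. The 3D model points, 2D pixel coordinates, and camera intrinsics are illustrative assumptions, not values from any particular method or dataset:

```python
import cv2
import numpy as np

# Generic rigid 3D face model points in millimeters (approximate values
# commonly used for illustration; any consistent rigid model works).
model_points = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye outer corner
    (225.0, 170.0, -135.0),    # right eye outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
])

# Matching 2D landmarks, e.g., from a landmark detector (placeholder values).
image_points = np.array([
    (359, 391), (399, 561), (337, 297), (513, 301), (345, 465), (453, 469)
], dtype="double")

# Pinhole intrinsics approximated from the image size; zero lens distortion.
h, w = 480, 640
camera_matrix = np.array([[w, 0, w / 2],
                          [0, w, h / 2],
                          [0, 0, 1]], dtype="double")

ok, rvec, tvec = cv2.solvePnP(model_points, image_points,
                              camera_matrix, np.zeros((4, 1)))
R, _ = cv2.Rodrigues(rvec)  # 3x3 head rotation; tvec is the head translation
```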

Fig. 1 Example images of four different setups

1.1 Previous surveys and our contributions

Several related reviews and surveys on HPE have been conducted previously (Murphy-Chutorian and Trivedi 2008; Shao et al. 2020; Khan et al. 2021; Abate et al. 2022; Asperti and Filippini 2023). Murphy-Chutorian and Trivedi (2008) reviewed research on HPE based on traditional learning approaches conducted before 2009. Shao et al. (2020), Khan et al. (2021), and Abate et al. (2022) surveyed both traditional and deep learning approaches in the context of HPE. Shao et al. (2020) summarized HPE methods, from classical to recent deep learning-based ones, encompassing nearly 60 papers up to 2019. Asperti and Filippini (2023) presented a survey of deep learning-based methods in the context of HPE. However, that survey focuses on the general field of deep learning-based HPE and ignores the effect of the individual elements on the holistic system.

An online search and the collection of literature were performed using multiple search engines, including IEEE Xplore, Google Scholar, Science Direct, PubMed, Springer, ACM, and others. The search process included a recursive review of the cited references. The keywords used for the search included HPE, head pose estimation, head poses, head orientation, AI-based head pose estimation, deep learning HPE, real-time head pose estimation, HPE metrics, and HPE datasets. We included papers published in recent years that focused on AI-based methods, selecting those with high citations, available code, or those presenting novel methods or significant improvements in the field of HPE.

This survey covers recently published papers, given the rapid development of HPE over the past few years. Moreover, it offers a comprehensive discussion of the effect of each element of the framework on the subsequent process and of its preprocessing on the overall system, addressing aspects neglected in prior surveys. As indicated by detailed analyses and existing experiments, the holistic system’s performance relies on each element. Consequently, a comprehensive review of these elements is essential for readers aspiring to construct a state-of-the-art HPE system from the ground up. This survey seeks to offer valuable insights for a comprehensive understanding of the broader context of end-to-end HPE and to facilitate a systematic exploration. The primary contributions can be succinctly summarized as follows:

  • This article comprehensively surveys the elements of HPE, encompassing over 210 papers published up to 2024. We review and categorize the recent advancements of each element in detail so that readers can understand them systematically.

  • We survey the eleven reviewed elements from many aspects: choice of application or academic research to improve performance, multi-task or single-task design, environment, dataset type, angle range, rotation representation, degrees of freedom, techniques used, landmark method, rotation type, and metrics to evaluate the model, as well as their challenges (Table 3). Moreover, we point out the influence of each element on the system’s overall performance by highlighting the weaknesses and strengths of the different methods for each element.

  • We gather the current challenges for each element and its sub-categories, aiming to understand their history and status and then support future research from the point of view of the holistic framework (Table 1).

  • We present an overview of current applications of HPE, such as DMSs, surveillance, virtual try-on (VTON), augmented reality (AR), and healthcare.

  • We provide a comprehensive comparison of several publicly available datasets in tabular form, summarizing the HPE datasets and their annotations (Table 2).

1.2 Article organization

Fig. 2 Structure of this survey

This paper is divided into five main sections: Introduction (Sect. 1), Main steps of HPE frameworks (Sect. 2), Datasets and ground-truth techniques (Sect. 3), Discussion, challenges, and future directions (Sect. 4), and Conclusions (Sect. 5). Figure 2 shows the structure of this survey. The HPE frameworks (Sect. 2) comprise eleven main steps under four broader groups (application context, data handling and preparation, techniques and methodologies, and evaluation metrics), as shown in Fig. 2 (right side). Every step is divided into subcategories, including choice of application, environment, task types, dataset type, angle range, rotation representation, number of degrees of freedom (DoF), techniques used, landmark-based or landmark-free methods, rotation type, and evaluation metrics. In datasets and ground-truth techniques (Sect. 3), the main characteristics of the available datasets are discussed, including the number of participants and their gender (i.e., female and male), angles (i.e., yaw, pitch, and roll) and range (i.e., full or narrow range), the environment in which the dataset was captured (i.e., indoor or outdoor), data type (i.e., two-dimensional (2D) image or depth), resolution, and ground-truth tools. Moreover, the discussion, challenges, and future research directions are presented briefly (Sect. 4). Finally, the article presents the conclusions of this survey (Sect. 5).

2 Main steps of head pose estimation frameworks

In this section, we categorize the process for HPE into eleven major steps: application choice, multi-task or single-task design, environment, dataset type, angle range, rotation representation, degrees of freedom, techniques used, landmark method, rotation type, and metrics to evaluate the model. These categories can be organized into four broader groups that logically sequence the steps and aspects involved in developing an HPE system, better reflecting the logical relationships between these categories, as follows: (1) application context: this group includes the choice of application, the specific tasks, and the environment in which the system will operate; (2) data handling and preparation: this encompasses the type of dataset, the range of angles, the representation method, and the degrees of freedom involved; (3) techniques and methodologies: this includes the techniques used, the approach to landmark detection, and the type of rotation; (4) evaluation metrics: this group covers the metrics used to evaluate the system’s performance. Moreover, we summarize the advantages and disadvantages of all these categories in Table 1.
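
For readers who prefer a compact view, this grouping can be written down directly as a data structure. The sketch below simply restates the taxonomy above; the variable and key names are ours, not from any codebase:

```python
# The eleven HPE steps organized into the four broader groups described above.
HPE_TAXONOMY = {
    "application context": ["application choice", "task type", "environment"],
    "data handling and preparation": ["dataset type", "angle range",
                                      "rotation representation",
                                      "degrees of freedom"],
    "techniques and methodologies": ["techniques used", "landmark method",
                                     "rotation type"],
    "evaluation metrics": ["evaluation metrics"],
}

assert sum(len(steps) for steps in HPE_TAXONOMY.values()) == 11
```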

Fig. 3 Taxonomy of the step order for head pose estimation

Organizing the diverse categorizations and processing orders for HPE and unifying them under a single ubiquitous taxonomy is both a challenge and an ambition. We explored the idea of employing a functional categorization, grouping each method based on its process order and respective operating domain. This approach draws a clear distinction between different methods. Through this organization, we facilitate discussions on the progression of various techniques while steering clear of ambiguities that may appear when these techniques are applied beyond their original functional boundaries. As shown in Fig. 3, our evolutionary taxonomy comprises eleven steps describing the conceptual approaches utilized for HPE, where the small gray circles numbered 1 through 4 refer to the groups, each with its respective subcategories:

2.1 Application context

This group includes the choice of application, the specific tasks, and the environment in which the system will operate.

2.1.1 Application

HPE has extensive applications, such as HCI (Liu et al. 2021a, b; Madrigal and Lerasle 2020), DMS (Lu et al. 2023; Chai et al. 2023; Wang et al. 2022; Jha and Busso 2022; Hu et al. 2020), examinee monitoring system (EMS) (Chuang et al. 2017), VTON (Shao 2022), AR (Huang et al. 2012), gaming and entertainment (Kulshreshth and LaViola Jr 2013; Malek and Rossi 2021), audience analysis (Alghowinem et al. 2013), emotion recognition (Mellouk and Handouzi 2020), head fake detection (Becattini et al. 2023), health and medicine (Hammadi et al. 2022; Ritthipravat et al. 2024), security (Bisogni et al. 2024), surveillance (Rahmaniar et al. 2022), age estimation (Zhang and Bao 2022), human-robot interaction (HRI) and control (Mogahed and Ibrahim 2023; Wang and Li 2023; Edinger et al. 2023; Hwang et al. 2023), sports analysis (Kredel et al. 2017), attention span prediction (Singh et al. 2021; Xu and Teng 2020), and human behavior analysis (Liu et al. 2021; Baltrusaitis et al. 2018). Overall, HPE has diverse applications across industries and domains, enabling more sophisticated and natural interactions between humans and machines.

2.1.2 Task method

The task method can be approached based on single-task (i.e., HPE only) or multi-task (that is, HPE with face detection, gaze detection, facial landmark detection, or other tasks) methods, as shown in Fig. 4.

Fig. 4 Task type

Single task It focuses on solving the HPE problem as a standalone task. Most recent landmark-free approaches have been designed to estimate head pose as a single task by employing deep-learning models to address the occlusion problem. For example, latent space regression (LSR) (Celestino et al. 2023), FSA-Net (Yang et al. 2019), 6DHPENet (Chen et al. 2022), and 6DoF-HPE (Algabri et al. 2024) are single task methods.

LSR (Celestino et al. 2023) is designed to estimate head poses under occlusions based on a multi-loss scheme and the ResNet-50 backbone. However, it requires substantial graphics processing unit (GPU) power, as it has over 23 million parameters, and it needs to generate its own occluded and unoccluded datasets using the same annotations. In Yang et al. (2019), the authors proposed a model called FSA-Net that performs only a single task, HPE, based on feature aggregation and a soft stagewise regression architecture. In Chen et al. (2022), the authors proposed a single-task approach named 6DHPENet using a 6D rotation representation with multi-regression loss for fine-grained HPE. 6DoF-HPE (Algabri et al. 2024) is designed to estimate the head pose in real time using a RealSense D435 camera. This method reported results for both full- and narrow-range angles using various datasets.

The advantages of these methods with a single task are as follows: (1) Simplicity: single-task models are often easier to develop and train because they have a single objective, simplifying the training process. (2) Specificity: these models are designed for a specific task, making them well-suited for applications where HPE is the primary concern. However, these methods have the following disadvantages: (1) Lack of context: single-task models may not take advantage of additional information in the image or video that could improve the accuracy of pose estimation. (2) Limited use cases: they suit only applications that require a single task.

Multiple tasks In contrast to the aforementioned topic, multi-task networks are employed to estimate head pose and other tasks, such as face detection, face alignment, emotion recognition, gender, and age.

HyperFace (Ranjan et al. 2017) is a landmark-based approach for gender recognition, landmark localization, HPE, and face detection using a CNN. The advantage of HyperFace is that it fuses intermediate layers to boost the performance of the landmark localization task; the lower layers of the CNN contain local information, which becomes relatively invariant as the depth increases. Zhang et al. (2018) addressed the challenge of facial expression recognition (FER) under arbitrary head poses through a generative adversarial network (GAN) model using the geometry of the face. This method is a landmark-based end-to-end deep-learning model. Xia et al. (2022) presented an alignment, tracking, and pose network (ATPN), a multi-task neural network specifically developed for face tracking, face alignment, and HPE. ATPN improves face alignment by integrating a shortcut connection between deep and shallow layers, effectively utilizing structural facial information. Additionally, ATPN generates a heatmap from face alignment to boost the performance of HPE and provide attention cues for face tracking. Thai et al. (2022) introduced MHPNet, a lightweight multi-task model. This end-to-end deep model is designed to address both HPE and masked face classification simultaneously. The authors adjusted the narrow-range angles from [- 99, 99] to [- 93, 93] degrees based on their observation that the ground-truth Euler angles predominantly fall within this range. Consequently, the backbone generated a 62-dimensional distribution vector for each angle, representing 62 bins instead of 66 bins as in HopeNet. Bafti et al. (2022) proposed an architecture for improving the performance of dense prediction tasks using a multi-task learning model named MBMT-Net. The proposed architecture is a multi-backbone (Mask-RCNN and ResNet50), multi-task deep CNN that simultaneously estimates head pose and age. Malakshan et al. (2023) presented an approach that includes task-related components such as classification, representation alignment, head pose adversarial, and regression losses. These components work together to improve the accuracy of HPE for low-resolution faces. Basak et al. (2021) presented a methodology based on a semi-supervised approach for learning 3D head poses from synthetic data. This method also generates synthetic head pose data with a diverse range of variations in gender, race, and age. Fard et al. (2021) presented the active shape model network (ASMNet), a CNN designed for the multiple tasks of face alignment and pose estimation. ASMNet is engineered to be efficient and lightweight to improve performance in detecting facial landmarks and estimating the pose of a human face. Chen et al. (2021) presented an end-to-end multi-task method called TRFH to estimate head poses and detect faces simultaneously. The authors adopted DLA-34, a deep learning backbone. In Wu et al. (2021), the authors introduced the Synergy method, which employs 3D facial landmarks and a 3D morphable model (3DMM) to detect the 3D face mesh, landmarks, and texture, and to estimate head pose. The 3DMM has advantages in tasks such as face analysis because its semantic meaning is well-known, and it prevents possible tracking failures caused by the sudden emergence of face regions (Yu et al. 2018). Khan et al. (2021) presented a face segmentation and 3D HPE approach based on deep learning. This approach involves segmenting a face image into seven distinct classes. The proposed method uses a probabilistic classification method and creates probability maps (PMAPS) for HPE along with segmentation results to extract features from the CNNs and build a soft-max classifier for face parsing. Dapogny et al. (2020) introduced an approach that combines HPE and facial landmark alignment inside an attentional cascade. This cascade employs a geometry transfer network (GTN) to enhance landmark localization accuracy by integrating diverse annotations. Additionally, the authors introduced a doubly conditional fusion method to select relevant feature maps and regions based on both head pose and landmark estimates, creating a single deep network for these tasks. Valle et al. (2020) presented an architecture combining landmark-based face alignment and HPE, called a multi-task neural network (MNN), based on bottleneck residual blocks with a U-Net encoder-decoder, using four in-the-wild landmark-related datasets and one dataset acquired in laboratory conditions. Jha et al. (2023) proposed a framework that uses CNNs to take the driver’s head pose and eye appearance as inputs, creating a fusion model for estimating probabilistic gaze maps of the driver. The data used for this study were obtained from the multimodal driver monitoring (MDM) corpus (Jha et al. 2021).

Multi-task methods aim to collectively address HPE alongside other correlated tasks to improve overall performance. The advantages of these methods with multiple tasks are as follows: (1) Information fusion: multi-task models can take advantage of additional information, such as facial landmarks, which can improve HPE accuracy. (2) Resource efficiency: by sharing features across tasks, multi-task models can be more resource-efficient than training separate models for each task. However, these methods have the following disadvantages: (1) Task interference: tasks can potentially interfere with each other, and optimizing one task can negatively impact the performance of another. (2) Increased complexity: multi-task models are more complex and may require additional data and training, making them more computationally intensive; developing and training them is also more challenging than single-task models.
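
As an illustration of the information-fusion idea, multi-task HPE models are typically trained by summing weighted per-task losses computed over shared features. The sketch below is a generic pattern under our own assumptions (the loss choices and the 0.5 weight are illustrative, not any surveyed method's settings):

```python
import torch
import torch.nn.functional as F

def multitask_loss(pose_pred, pose_gt, lmk_pred, lmk_gt, lam=0.5):
    """Joint loss for HPE plus landmark localization over a shared backbone.
    lam trades off the auxiliary landmark task against the pose task."""
    pose_loss = F.l1_loss(pose_pred, pose_gt)   # MAE on yaw/pitch/roll
    lmk_loss = F.mse_loss(lmk_pred, lmk_gt)     # L2 on landmark coordinates
    return pose_loss + lam * lmk_loss
```

The task interference noted above shows up in practice as sensitivity to lam: a weight that helps one task can degrade the other.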

The choice between single-task and multi-task HPE depends on the specific requirements of the application. If HPE is the primary focus, a single-task model might suffice. However, if the application needs additional information or resource usage optimization, a multi-task approach could be more suitable. Ultimately, the choice should be driven by a project’s specific needs and trade-offs.

2.1.3 Environment

HPE systems are designed to determine the orientation or position of an individual’s head in a given environment. These systems find applications in various fields, as mentioned in Sect. 2.1.1. The challenges and requirements for HPE can differ between indoor and outdoor environments.

Indoor environments Indoor environments, such as museums (Zhao et al. 2024), laboratories, and offices, which are characterized by controlled settings and predictable conditions, serve as compelling spaces for various technological applications. For example, Algabri et al. (2024) proposed a novel real-time HPE framework using red, green, blue, and depth (RGB-D) data and deep learning without relying on facial landmark localization in an indoor environment. Bafti et al. (2022) introduced a multi-backbone architecture for simultaneously estimating head poses and age. Celestino et al. (2023) introduced a deep learning-based methodology to autonomously assist in feeding disabled people with a robotic arm. Song et al. (2023) proposed an algorithm that combines HPE and human pose tracking with automatic name detection for autistic children. The authors claimed that the experimental results demonstrated high consistency between the proposed approach and clinical diagnosis.

The advantages of these approaches in indoor environments are as follows: (1) Controlled lighting: Controlled and consistent lighting conditions indoors facilitate more reliable and accurate HPE. (2) Structured backgrounds: indoor environments often have less complex and more structured backgrounds, making it easier to distinguish the head from the surroundings and track the head. (3) Calibration: camera calibration is generally more straightforward indoors, allowing for accurate geometric transformations. However, these methods have the following disadvantages in indoor environments: (1) Limited realism: indoor environments may not completely capture the diversity of real scenarios, limiting the realism of the training dataset. (2) Restricted applications: indoor HPE may not directly apply to outdoor scenarios, limiting the system’s versatility.

Outdoor environments Roth and Gavrila (2023) introduced a technique named IntrApose for an in-car application using a BBox camera. Fard et al. (2021) proposed a lightweight CNN for face pose estimation and facial landmark point detection in the wild. In Bisogni et al. (2021), the authors proposed a fractal-based technique called FASHE. This method determines the closest match between the fractal code of the target image and the reference array using the Hamming distance for HPE in the wild.

The advantages of these methods in outdoor environments are as follows: (1) Real-world scenarios: outdoor HPE directly applies to real-world scenarios, such as surveillance in public spaces or AR experiences. (2) Diverse backgrounds: outdoor scenes provide more diverse and dynamic backgrounds, challenging the system to adapt to various environments. (3) Adaptability: systems designed for outdoor use are often more adaptable to lighting conditions and scenarios. (4) Practical applications: outdoor HPE is crucial for applications such as AR navigation, HCI in public spaces, and crowd monitoring. However, these methods have the following disadvantages in outdoor environments: (1) Uncontrolled lighting: outdoor environments experience variable and sometimes unpredictable lighting conditions, thus requiring robust algorithms to handle changes in illumination. (2) Background complexity: outdoor scenes may have complex and dynamic backgrounds, posing challenges in accurately isolating and tracking the head. (3) Camera calibration challenges: calibrating cameras in outdoor environments is more challenging owing to the absence of fixed calibration patterns and larger distances.

Both environments Other studies have addressed both indoor and outdoor environments. For example, Berral-Soler et al. (2021) proposed a single-stage model called RealHePoNet, which is a landmark-free method to work in indoor and outdoor environments based on single-channel (i.e., grayscale) images.

Common considerations for both environments (1) Robustness to occlusions: HPE systems should be robust to partial occlusions of the head. (2) Adaptability to different head movements: systems need to accommodate a wide range of head movements, including rotations, tilts, and nods. (3) Integration with other systems: HPE is often part of a broader system, and integration with other components, such as gaze tracking or facial expression analysis, may be important. (4) Privacy concerns: in both indoor and outdoor scenarios, privacy concerns should be considered, and systems should adhere to ethical guidelines.

Developers of HPE systems need to carefully consider these factors based on the specific requirements and challenges posed by the target environment, whether indoor or outdoor.

2.2 Data handling and preparation

2.2.1 Dataset type

Selecting the dataset type is the most important step for HPE frameworks. RGB images and RGB-D images are the two main types of data used in HPE. These data sometimes come as videos, whether RGB or depth video; a video is a sequence of images displayed in rapid succession to create the illusion of motion. The survey in Abate et al. (2022) places its main emphasis on the categorization of dataset types. RGB images provide surface appearance and texture details of the head but lack depth information. HPE using RGB images relies on analyzing facial features and their relative positions to estimate head orientation. RGB images are more appropriate for applications that do not require 6DoF and that operate in nearly uniform lighting environments.

RGB-D images consist of color images along with depth information for each pixel in the image; that is, they include an additional depth channel alongside the RGB channels. This depth channel provides information about the distance of each pixel from the camera. When used for HPE, RGB-D images can offer more accurate and robust results, as they enable the system to account for the 3D structure of the head, making it less susceptible to lighting variations and occlusions. RGB-D images can be better for applications that require 6DoF.
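
The depth channel is what makes the translational components recoverable: given pinhole camera intrinsics, a pixel plus its depth back-projects to a 3D point in the camera frame. The sketch below illustrates this standard relation; the intrinsic values are generic assumptions, not tied to any sensor or dataset:

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z (meters) to a 3D point in the
    camera frame: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# e.g., a head-center pixel at (320, 240) observed 0.8 m from the camera
print(backproject(320, 240, 0.8, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
# -> [0. 0. 0.8] (this pixel lies on the optical axis under these intrinsics)
```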

RGB images These images represent visual data in two dimensions: width and height. Most HPE methods use RGB images owing to their abundance. 2DHeadPose (Wang et al. 2023), HeadPosr (Dhingra 2021), and other methods use RGB images. Dhingra (2021) proposed HeadPosr, an end-to-end trainable model that estimates head poses based on transformer encoders using a single RGB image. Hu et al. (2021) presented a Bernoulli heatmap method to create CNNs without fully connected layers from a single RGB image for HPE. Liu et al. (2021) presented an approach that eliminates the need for training with head pose labels. Instead, it relies on matches between the 2D input single RGB image and a reconstructed 3D face model for HPE. It uses CNNs that jointly optimize asymmetric Euclidean and keypoint losses. This method comprises two main components: 3D face reconstruction and 3D-2D keypoint matching. The authors conducted a comparative analysis of their method against those utilizing different data types, asserting that their approach demonstrated superior performance.

The primary benefit of employing head poses estimated from RGB images as the ground truth in numerous HPE datasets is that a large dataset can be amassed through a relatively straightforward and cost-effective process without requiring special setups. Conversely, the most significant disadvantage is that landmark-free HPE methods may be trained and evaluated using an imperfect ground truth derived from head poses estimated from RGB images (Li et al. 2022). Moreover, this data category relies on active sensing, and its application outdoors and in brightly lit environments poses challenges owing to the potential exposure of active lighting (Shao et al. 2020). Although RGB images are widely employed and understood in computer vision and image processing, they are weak under severe illumination conditions and inaccurate for 6DoF systems.

RGB-D images These images provide information about the distance of each point in a 3D scene from the camera or sensor (Du et al. 2021). The 6DoF-HPE (Algabri et al. 2024) and Face-from-Depth (Borghi et al. 2018) methods use either RGB-D images with depth (D) information, point clouds, or both. For example, 6DoF-HPE (Algabri et al. 2024) used depth data to estimate the translational components (x, y, and z) and RGB data to estimate the rotational components (yaw, pitch, and roll) relative to the camera pose in real time. However, it required more GPU power because it used deep learning models, and the mean absolute error (MAE) was still above 5% for the full-range dataset. Borghi et al. (2018) employed a deterministic conditional GAN model to transform depth images into grayscale images. The authors introduced a complete end-to-end model called Face-from-Depth based on the \(POSEidon^{+}\) network, designed for tracking driver body posture, with a primary focus on estimating head and shoulder poses from gestures captured in depth images only. Ma et al. (2021) presented an end-to-end framework based on the PointNet network and DRF to estimate the head pose from a single depth image. Luo et al. (2019) presented a system based on an iterative closest point (ICP) algorithm that can estimate the head pose and generate a realistic face model in real time from a single depth image; the system aligns a deformable face model to the RGB-D image to generate the face model. Hu et al. (2021) adopted a method to extract discriminative head pose information, leveraging temporal information across frames without handcrafted features using point cloud data. Xu et al. (2022) proposed a network architecture that takes a 3D point cloud generated from depth as input, departing from the use of RGB or RGB-D images for HPE. In Chen et al. (2023), the authors introduced an HPE technique that leverages asymmetry-aware bilinear pooling on RGB-D feature pairs. This approach aims to capture asymmetry patterns and multi-modal interactions within head pose-related regions, and the bilinear pooling is effective for merging multi-modal information to support various tasks. Nevertheless, the high memory requirements of bilinear pooling can pose limitations, particularly on devices with low computational power, as highlighted in López-Sánchez et al. (2020). Importantly, the proposed method’s performance degrades notably when dealing with large pose angles, where crucial facial features may become occluded. In Wang et al. (2023a), the authors developed a complete HDPNet pipeline utilizing RGB-D images for head detection and HPE in complicated environments. The proposed method is similar to the HopeNet architecture in terms of HPE.

In contrast to RGB images, depth data maps may exhibit reduced texture detail (Wang et al. 2023b). Depth data offers a solution to mitigate certain limitations of RGB data, such as issues related to illumination, while also providing more dependable facial landmark detection (Drouard et al. 2017). However, RGB-D images require special sensors, leading to a high-cost process.

Other datasets Liu et al. (2022) presented a Gaussian mixed distribution learning (GMDL) model for HPE to understand student attention using infrared (IR) data. However, the authors reported results for only two angles (yaw and pitch). Another study (Kim et al. 2023) proposed a real-time DMS based on HPE and facial landmark-based eye closure detection to monitor driver behavior using IR data.

In summary, RGB images are based solely on color and texture, while RGB-D images incorporate depth information, enhancing the precision and reliability of HPE algorithms. Given the importance of this step, in Sect. 3 we describe in detail the following public datasets published after 2010: 2DHeadPose (Wang et al. 2023), Dad-3dheads (Martyniuk et al. 2022), avatars in geography optimized for regression analysis (AGORA) (Patel et al. 2021), MDM corpus (Jha et al. 2021), ETH-XGaze (Zhang et al. 2020), GOTCHA-I (Barra et al. 2020), DD-Pose (Roth and Gavrila 2019), VGGFace2 (Cao et al. 2018), WFLW (Wu et al. 2018), SynHead (Gu et al. 2017), DriveAHead (Schwarz et al. 2017), SASE (Lüsi et al. 2017), Pandora (Borghi et al. 2017), 300W across Large Pose (300W-LP) (Zhu et al. 2016), AFLW2000 (Zhu et al. 2016), CCNU (Liu et al. 2016), UPNA (Ariz et al. 2016), WIDER FACE (Yang et al. 2016), Carnegie Mellon university (CMU) Panoptic (Joo et al. 2015), Dali3DHP (Tulyakov et al. 2014), EYEDIAP (Funes Mora et al. 2014), McGill (Demirkus et al. 2014), Biwi (Fanelli et al. 2013), ICT-3DHP (Baltrušaitis et al. 2012), annotated faces in the wild (AFW) (Zhu and Ramanan 2012), and NIMH-ChEFS (Egger et al. 2011).

2.2.2 Range angle method

The choice between narrow-range and full-range angles for HPE relies on the specific requirements and constraints of the application. Figure 5 shows the range of narrow and full angles.

Fig. 5 Different range angles

Narrow-range angle It typically refers to estimating the head pose within a limited field of view, such as a smaller range of yaw, pitch, and roll angles. For example, narrow-range angle methods might focus on head poses that are primarily within a \(\pm 90^{\circ }\) range around the frontal view (Figs. 5 and 6). The THESL-Net (Zhu et al. 2022), AGCNNs (Ju et al. 2022), DADL (Zhao et al. 2024), HPNet-RF (Thai et al. 2023), DSFNet (Li et al. 2023), and DS-HPE (Menan et al. 2023) methods, among others, use narrow-range angles.

Dhingra (2022) proposed LwPosr, a lightweight network that combines transformer encoder layers and depthwise separable convolution (DSC). Organized in three stages and two streams, these layers collectively aim to deliver precise regression for HPE. According to Li et al. (2022), landmark-free techniques fail to address the issue of perspective distortion in facial images, which arises from the misalignment of the face with the camera’s coordinate system. To mitigate this problem and enhance the accuracy of HPE using a lightweight network, the authors introduced an image rectification approach within narrow-range head pose angles. Liu et al. (2022) proposed a method for only two narrow-range angles (yaw and pitch) for attention understanding in learning and instruction scenarios. Huang et al. (2020) used narrow-range datasets in a framework similar to HopeNet with average top-k regression. Although HeadFusion (Yu et al. 2018) was designed to track \(360^{\circ }\) head poses, the range of angles in the data used was between [0 and 80] degrees.

Fig. 6 Narrow-range angles (Biwi dataset)

Wang et al. (2022) presented a method that includes a regional information exchange fusion network and a four-branch feature selective extraction network (FSEN). The proposed approach aims to address the challenges posed by complex environments, such as occlusions, lighting variations, and cluttered backgrounds. The four-branch FSEN is designed to extract three independent discriminative features of pose angles through three branches, and features corresponding to multiple pose angles through one branch, from the input images. The regional information exchange fusion network is then used to fuse the extracted features to estimate the head pose.

The advantages of these methods with narrow-range angles are as follows: (1) Reduced computational complexity: narrow-range angles, often limited to a specific range (e.g., \(-45^{\circ }\) to \(45^{\circ }\)), can simplify the computational process, making it faster and more efficient. (2) Simplified classification: in applications like facial expression analysis or gaze tracking, focusing on a narrow range of angles can simplify classification tasks, as it reduces the number of possible pose categories. (3) Improved accuracy: narrow-range angles can lead to improved estimation accuracy because the model can focus on a specific subset of possible poses, reducing ambiguity (Guo et al. 2020). (4) Reduced noise: by excluding extreme angles, narrow-range approaches may be less sensitive to noise or outliers in the input data. However, these methods have the following disadvantages: (1) Limited coverage: narrow-range angles may not provide a complete representation of head pose, limiting the applicability of the system in scenarios where a wider range of poses needs to be detected (Zhou and Gregson 2020), such as security cameras (Viet et al. 2021). (2) Loss of information: by discarding angles outside the narrow range, valuable information about head orientation may be lost, which can be crucial in some applications (Zhou and Gregson 2020).

Full-range angle It involves estimating the head pose across a wider range of yaw, pitch, and roll angles. This approach aims to cover all possible head orientations, including at least the frontal and back views (see Figs. 5 and 7).

Few studies in the field of HPE have focused on predicting head poses across the full range of angles, because the full range lacks the rich visual features of the narrow range, which focuses on frontal or large-angle faces and yields satisfactory performance in most scenarios. Zhou and Gregson (2020) introduced WHENet, the first HPE approach to encompass the full range of head angles, by combining the 300W-LP and CMU Panoptic datasets. WHENet is an extension of HopeNet (Ruiz et al. 2018) that expands the number of bins for yaw prediction to the full range using EfficientNet as a backbone. Zhou et al. (2023) proposed DirectMHP, a one-stage network architecture trained end-to-end based on YOLOv5 that predicts full-range angle head poses. However, DirectMHP and WHENet face challenges related to the gimbal lock problem owing to their use of Euler angles for HPE, as mentioned in Zhou et al. (2023). Viet et al. (2021) introduced the multitask-net model to estimate full-range angle head poses; to improve HPE, the authors switched from Euler angles to vectors of the rotation matrix as the representation of the human face. Hempel et al. (2024) extended their previous method (Hempel et al. 2022) to cover full-range angles using the CMU Panoptic dataset. The limitation of this work is that robustness and accuracy may decrease in application scenarios with unusual head poses and camera angles. Viet et al. (2021) developed FSANet-Wide with a full-range angle dataset called UET-Headpose; however, this dataset is not available.
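
Since several of the methods above (HopeNet, WHENet, MHPNet) predict binned angles and decode them through an expectation, a brief sketch of that decoding step may help. The 66-bin, 3-degree layout below matches HopeNet's \([-99, 99]\) range, and expanding to full-range yaw simply widens this grid; the function name and demo input are ours:

```python
import numpy as np

def expected_angle(logits, n_bins=66, bin_width=3.0, min_angle=-99.0):
    """HopeNet-style decoding: softmax over angle bins, then take the
    expectation to map the classification back to a continuous angle.
    66 bins of 3 degrees cover [-99, 99]; WHENet widens the yaw grid."""
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    centers = min_angle + bin_width * (np.arange(n_bins) + 0.5)
    return float(np.sum(p * centers))

yaw = expected_angle(np.random.randn(66))  # continuous yaw in (-99, 99)
```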

Fig. 7 Full-range angles (CMU dataset)

The advantages of these methods with full-range angles are as follows: (1) Comprehensive pose estimation: full-range angles offer a more comprehensive representation of head pose, allowing the system to estimate head orientation across a wide spectrum of positions (Zhou and Gregson 2020). (2) Versatility: full-range approaches are suitable for applications that require detecting head poses in diverse scenarios, including extreme angles (Zhou and Gregson 2020). (3) Flexibility: a full-range system can handle variations in head pose that may occur in unconstrained environments, making it adaptable to different real-world situations (Zhou et al. 2023). However, these methods have the following disadvantages: (1) Increased computational complexity: estimating full-range angles can be more computationally demanding, especially when dealing with a wide range of pose possibilities. (2) Potentially lower accuracy or complex classification: the larger number of possible pose categories can complicate classification tasks, potentially leading to reduced accuracy due to increased ambiguities in the estimation. (3) Greater sensitivity to noise: full-range angle estimation may be more sensitive to noise, outliers, or inaccuracies in the input data (Zhou et al. 2023).

In summary, the choice between narrow-range and full-range angles for HPE should consider the specific application requirements, computational constraints, and trade-off between accuracy and coverage. Narrow-range angles may be suitable for scenarios where simplicity, speed, and specific pose categories are prioritized, whereas full-range angles offer a more comprehensive solution for applications demanding versatility and adaptability across a wide range of head poses.

2.2.3 Rotation representation method

HPE requires a rotation representation to estimate the orientation of the head. Different representations offer unique advantages in capturing complex head movements and are used to enhance HPE accuracy and robustness; each comes with its own advantages and disadvantages. Kim and Kim (2023) conducted a thorough examination of rotation representations frequently employed in industry and academia, including rotation matrices, Euler angles, rotation axis angles, unit complex numbers, and unit quaternions, elucidating rotations in both 2D and 3D spaces. Common representations include Euler angles, quaternions, and rotation matrices, as shown in Fig. 8.

Fig. 8 Different rotation representations

Euler angles They represent rotations using a set of three angles, typically denoted as (\(\alpha , \beta , \gamma\)), which describe rotations around the X, Y, and Z axes, respectively. The order and direction of rotations can vary, and several different conventions exist, such as XYZ, XZY, and YXZ. Euler angles are a widely used representation for describing HPE. They offer an intuitive way to understand how the head is oriented in 3D space. Euler angles decompose rotations into three sequential angles around fixed axes (yaw, pitch, and roll). The DirectMHP (Zhou et al. 2023), 2DHeadPose (Wang et al. 2023), OsGG-Net (Mo and Miao 2021), Hopenet (Ruiz et al. 2018), WHENet (Zhou and Gregson 2020), FSA-Net (Yang et al. 2019), LSR (Celestino et al. 2023), HeadDiff (Wang et al. 2024), and HHP-Net (Cantarini et al. 2022) methods use the Euler angle representation.

Barra et al. (2022) presented a method based on a previously partitioned iterated function system (PIFS) using gradient boosting regression for HPE by Euler angles. The proposed method aims to decrease the computational cost while maintaining an acceptable accuracy. In Kuhnke and Ostermann (2023), the authors introduced a semi-supervised learning approach called relative pose consistency. This method utilized Euler angles to represent the HPE.

The advantages of these methods with Euler angles are as follows: 1) Intuitive interpretation: Euler angles provide a straightforward and intuitive understanding of head orientation by representing rotations around distinct axes (Toso et al. 2015). 2) Compactness: Euler angles require storing only three numbers to represent a rotation (Bernardes and Viollet 2022), making them memory-efficient compared with other representations like quaternions or rotation matrices. 3) Suitability for a single degree of freedom: in cases with only one degree of freedom, Euler angles can be sufficient to represent the rotation (Bernardes and Viollet 2022), making them a practical choice for problems with limited rotational complexity. 4) Compatibility with legacy systems: Euler angles have enjoyed extensive use over many years, and their continued prevalence in legacy systems and algorithms ensures compatibility with a substantial body of existing work. 5) Visualization: Euler angles are easy to visualize, as they describe rotations in a fixed reference frame. However, these methods have the following disadvantages: 1) Gimbal lock: in certain orientations, Euler angles can encounter gimbal lock, causing a loss of one degree of freedom and inaccuracies in tracking (Liu et al. 2021); when gimbal lock occurs, a small change in the input angles can lead to a large, sudden change in the resulting orientation, causing a discontinuity in the rotation representation. 2) Sequence dependency: the order of rotations significantly affects the final orientation, leading to complexities in calculations and potential confusion (Hsu et al. 2018). 3) Limited range: Euler angles might exhibit limitations when dealing with complex or extreme rotations, reducing their suitability for certain applications. 4) Discontinuity: this representation is discontinuous for neural networks and difficult to learn (Zhou et al. 2019). 5) Difficulty in vector transformation: Euler angles lack a simple algorithm for vector transformation (Janota et al. 2015), making them less suitable for applications requiring frequent vector transformations. 6) Singularities: Euler angles suffer from singularities, where two or more angle combinations can produce the same orientation. This can lead to ambiguity in the representation of orientation and problems with numerical stability (Hsu et al. 2018).
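
To make the gimbal lock discontinuity tangible, the sketch below composes rotations in a yaw-pitch-roll (Y-X-Z) convention, one of the several conventions mentioned above, and shows that at \(90^{\circ }\) pitch a yaw offset becomes indistinguishable from a roll offset, so one degree of freedom collapses:

```python
import numpy as np

def euler_to_matrix(yaw, pitch, roll):
    """Intrinsic Y-X-Z composition (one convention among many): R = Ry Rx Rz."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw about Y
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch about X
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll about Z
    return Ry @ Rx @ Rz

# At pitch = 90 deg, yaw by +a equals roll by -a: the yaw and roll axes
# align and one degree of freedom is lost (gimbal lock).
a = 0.3
R1 = euler_to_matrix(a, np.pi / 2, 0.0)
R2 = euler_to_matrix(0.0, np.pi / 2, -a)
print(np.allclose(R1, R2))  # True
```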

Euler angles provide a straightforward interpretation of head pose rotations, making them easy to understand and implement. However, their susceptibility to gimbal lock and potential complexity in certain scenarios might lead to considering alternative representations for more accurate and robust HPE in applications requiring precise orientation analysis.

Quaternions They are a mathematical extension of complex numbers and are used to represent rotations in 3D space. A quaternion is represented as q = [w, x, y, z], where w is the scalar part and (x, y, z) is the vector part. Quaternions have gained prominence as an alternative representation for conveying head pose rotations across various applications. These four-dimensional (4D) mathematical constructs provide an efficient means of representing orientation, avoiding the gimbal lock issues associated with Euler angles. Quaternions offer seamless interpolation and are well-suited for tasks involving smooth motion tracking. Zhu et al. (2017) extended their previous 3D dense face alignment (3DDFA) study (Zhu et al. 2016), in which ambiguity was a major limitation when the yaw angle reached \(90^{\circ }\), by replacing Euler angles with quaternions to eliminate the ambiguity and improve the performance of face alignment in large poses. Hsu et al. (2018) proposed a landmark-free method called QuatNet and conducted a study of HPE using quaternions to mitigate the ambiguity of the Euler angle representation. The study used a CNN with a multi-regression loss approach, which combines ordinal regression and L2 regression losses to train a dedicated CNN using RGB images. The ordinal regression loss addresses changing facial features with varying head angles, enhancing feature robustness; the L2 regression loss then utilizes these features to achieve accurate angle predictions. As an advantage, QuatNet shows resilience to nonstationary cases of head poses (Nejkovic et al. 2022). Zeng et al. (2022) introduced an approach called the structural relation-aware network (SRNet) for HPE by transforming the problem into the quaternion representation space. The proposed approach explicitly investigates the correlation among various face regions to extract global facial structural information. In Höffken et al. (2014), the authors proposed a tracking module to estimate head poses based on an extended Kalman filter (EKF) using the quaternion space.

The advantages of these methods with quaternions are as follows: (1) No gimbal lock: quaternions avoid gimbal lock, a limitation faced by Euler angles, ensuring an accurate representation of orientation (Hsu et al. 2018). (2) Smooth interpolation: they enable smooth interpolation (slerp, spherical linear interpolation) between orientations, resulting in seamless animations and tracking transitions (Hsu et al. 2018; Peretroukhin et al. 2020); by contrast, interpolating rotation matrices involves more complex calculations, including matrix multiplication and normalization. (3) Efficient computations: quaternion operations, such as conjugation, normalization, and multiplication, are computationally efficient and use simple formulas, making them appropriate for real-time applications like gaming and robotics (Dantam 2021). (4) Normalization: quaternions can be normalized trivially, which is much more efficient than coping with the corresponding matrix orthogonalization problem (Bernardes and Viollet 2022). (5) Compact representation: quaternions are more compact than rotation matrices; a quaternion comprises merely four components (a scalar and a vector), whereas a 3D rotation matrix requires nine. This compactness contributes to the memory efficiency of quaternions and accelerates computation. (6) Numerical stability: quaternions are more numerically stable than matrices, particularly when combining multiple rotations or handling small rotations; quaternion operations, such as normalization and multiplication, involve fewer floating-point operations and are less prone to accumulated numerical errors. However, these methods have the following disadvantages: (1) Complexity: quaternions are not as intuitive as Euler angles, making it challenging for non-experts, e.g., beginning roboticists and computer vision practitioners, to grasp their meaning (Toso et al. 2015). This can make them less suitable for applications where simplicity and ease of use are important. (2) Storage: quaternions consist of four parameters, requiring more memory and computation than the three parameters of Euler angles (Holzinger and Gerstmayr 2021). This can be a disadvantage in applications with limited memory resources. (3) Antipodal problem: quaternions have an antipodal ambiguity, where two quaternions representing the same rotation can have opposite signs (Roth and Gavrila 2023). This can lead to ambiguity in certain calculations and applications. (4) Discontinuity: neural networks find it challenging to learn when faced with discontinuities (Zhou et al. 2019).
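
Two of the points above, slerp and the antipodal sign ambiguity, fit in a few lines. The following is a minimal sketch of the standard algorithm, not code from any surveyed method:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions [w, x, y, z],
    returning the rotation a fraction t of the way from q0 to q1."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:             # antipodal pair: q and -q encode the same rotation,
        q1, dot = -q1, -dot   # so flip one sign to interpolate along the short arc
    if dot > 0.9995:          # nearly parallel: normalized lerp is stable enough
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```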

Quaternions have proven valuable for accurate HPE owing to their efficiency, lack of gimbal lock, and suitability for continuous motion analysis, making them a compelling choice for representing complex head movements in different contexts. Their gimbal lock prevention and smooth interpolation are particularly valuable for HPE in dynamic scenarios. Although their interpretation and computation might pose challenges, their benefits often outweigh the complexities, especially in applications where precision and continuity are paramount.

Rotation matrix It is a (\(3 \times 3\)) matrix that describes a rotation in 3D space. Each column of the matrix represents the transformed basis vectors of the coordinate system after the rotation. Rotation matrices are a fundamental approach employed to represent head pose rotations across various applications. These matrices describe orientation using a set of orthogonal vectors, offering a robust and straightforward representation of head movement within 3D space. The TriNet (Cao et al. 2021), MFDNet (Liu et al. 2021), and TokenHPE-E (Liu et al. 2023) methods, among others, use rotation matrix representation.

Cao et al. (2021) employed a \(3 \times 3\) orthogonal rotation matrix within their framework to estimate head pose, evaluating the performance of TriNet using the mean absolute error of vectors (MAEV). MFDNet (Liu et al. 2021) was designed to tackle the challenge of low pose tolerance amidst various disturbances by introducing an exponential probability density model based on the rotation matrix and the matrix Fisher distribution. TokenHPE-E (Liu et al. 2023) was developed to steer the orientation tokens in acquiring the desired regional relationships and similarities using the rotation matrix representation; according to the authors, it addresses extreme orientations and low illumination. In Kim et al. (2023), the authors presented a multi-task network architecture to estimate landmarks and head poses using the rotation matrix representation, following 6DRepNet for HPE. Hempel et al. (2022) proposed a landmark-free technique called 6DRepNet, an end-to-end network with a continuous representation for HPE whose final output is a rotation matrix. The 6D rotation representation is used as part of the deep learning architecture to improve performance and achieve a continuous rotation representation, which will be discussed in detail in Sect. 2.3.3.
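
As a sketch of the continuous 6D representation that 6DRepNet builds on (Zhou et al. 2019), a network regresses six numbers that are mapped to a valid rotation matrix by Gram-Schmidt orthonormalization. The function below illustrates that mapping; it is our minimal rendering, not the authors' implementation:

```python
import numpy as np

def rotation_from_6d(x):
    """Map a 6D vector (two stacked 3D vectors) to a 3x3 rotation matrix via
    Gram-Schmidt; the mapping is continuous, unlike Euler angles or quaternions."""
    a1, a2 = np.asarray(x[:3], float), np.asarray(x[3:], float)
    b1 = a1 / np.linalg.norm(a1)          # first column: normalize a1
    a2 = a2 - np.dot(b1, a2) * b1         # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)          # second column: unit, orthogonal to b1
    b3 = np.cross(b1, b2)                 # third column completes the basis
    return np.stack([b1, b2, b3], axis=1)

R = rotation_from_6d([1, 0, 0, 1, 1, 0])
assert np.allclose(R.T @ R, np.eye(3))    # orthonormal by construction
```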

The advantages of these methods with rotation matrices are as follows: 1) Precise representation: rotation matrices provide an accurate and unambiguous representation of orientation in 3D space. 2) Orthogonality: rotation matrices are orthonormal, meaning their columns are orthogonal unit vectors (Evans 2001). This property ensures that the length of any vector and the angle between any pair of vectors remain unchanged during the rotation. 3) Applicable to compositions: rotation matrices can be easily combined to compose multiple rotations without encountering a gimbal lock or singularities. However, these methods have the following disadvantages: 1) Complex calculations: computation of rotation matrices involves matrix multiplications and trigonometric functions, which can be computationally intensive (Hsu et al. 2018). This can make them less suitable for applications where simplicity and ease of use are important. 2) Numerical stability: small numerical errors during calculations can accumulate, potentially leading to inaccuracies over time. 3) Inefficient storage: rotation matrices require nine values, resulting in higher memory requirements than quaternion or Euler angle representations.
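
The orthonormality and composition properties above are easy to check numerically. This small sketch (using SciPy's rotation utilities, an implementation choice of ours) composes two rotations and verifies that vector lengths are preserved:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Compose two head rotations by matrix multiplication; the result is again a
# valid rotation, with no gimbal lock or singularity in this form.
R = (Rotation.from_euler("YXZ", [0.4, 0.2, 0.1])
     * Rotation.from_euler("YXZ", [-0.3, 0.5, 0.0])).as_matrix()

v = np.array([1.0, 2.0, 3.0])
assert np.isclose(np.linalg.norm(R @ v), np.linalg.norm(v))  # lengths preserved
assert np.allclose(R.T @ R, np.eye(3))                       # columns orthonormal
```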

It is important to note that the advantages and disadvantages mentioned above are general characteristics of these rotation representation methods. The choice of representation depends on the specific requirements and limitations of the application.

Other rotation representations Rodrigues’ formula, axis-angle, and Lie algebra are other rotation representations (Cao et al. 2021). Researchers are continuously exploring different rotation representations to improve the accuracy and efficiency of HPE in various applications. The choice of head pose rotation representation depends on the specific application’s requirements. Each representation balances factors such as accuracy, efficiency, and handling of complex rotations. By selecting the appropriate representation, HPE systems can accurately interpret and analyze human head movements in diverse scenarios.

2.2.4 Degrees of Freedom (DoF)

In the context of motion and mechanics, “Degrees of Freedom (DoF)” refers to the number of independent parameters or ways in which a mechanical system can move or rotate in 3D space. It is a concept used in physics, engineering, and computer graphics to describe the flexibility and mobility of an object or a system. Figure 9 shows the number of degrees of freedom.

Fig. 9 Number of degrees of freedom

6DoF The term 6DoF refers to the ability of an object or system to move freely in 3D space along three rotational axes (pitch, yaw, and roll) and three translational axes (up-down, left-right, and forward-backward). In virtual reality and robotics, 6DoF allows for a more immersive and realistic experience by enabling users or objects to move and interact with their environment naturally and without constraint, as shown in Fig. 9b.

Roth and Gavrila (2023) presented a method called IntrApose, which relies on intensity information together with camera intrinsics, using a single camera image without landmark localization or prior detection, for continuous 6DoF HPE. Kao et al. (2023) utilized the perspective-n-point (PnP) method with facial landmarks to estimate a 6DoF face pose from monocular images. To support this approach, they curated a comprehensive 3D face dataset called the ARKitFace dataset, which consists of 902,724 2D facial images of 500 subjects spanning a diverse range of expressions, poses, and ages. It is worth noting that, despite employing deep learning techniques for HPE, Kao et al. (2023) adopted a landmark-based methodology. Luo et al. (2019) estimated rotation angles and position using an RGB-D image. These systems track the head's position and orientation, enhancing realism and enabling greater interaction in 3D space. However, 6DoF systems typically entail higher costs and more complex space requirements.

3DoF Three degrees of freedom describe movement capability along three specific axes in 3D space. In the context of virtual reality, 3DoF typically refers to rotational movement around the pitch, yaw, and roll axes (Ma et al. 2024; Yao and Huang 2024), as shown in Fig. 9a. Unlike 6DoF, which allows both rotational and translational movement, 3DoF only captures rotational changes (Tomenotti et al. 2024), limiting the range of motion and spatial interaction. This technology is often used in simpler VR systems or devices requiring constrained movement.

Bisogni et al. (2021) presented a PIFS model to improve face recognition, estimating the head pose simultaneously with recognition. The major limitation of the proposed method is that system performance can drop drastically if a few frames differ substantially from those provided as input. Another study (Hu et al. 2022) introduced an integrated framework comprising yawning, blinking, gaze, and head rotation, using attention-based feature decoupling and a heteroscedastic loss for multi-state driver monitoring. These systems are simple and easy to use compared with 6DoF systems. However, 3DoF systems have some limitations, such as reduced realism due to the limited range of movement.

Other DoF Liu et al. (2022) presented an asymmetric relation-based head pose estimation (ARHPE) model, which exploits an asymmetric relation cue between the pitch and yaw angles to learn discriminative representations of adjacent head pose images. The ARHPE model computes the overall MAE for only two degrees of freedom, the pitch and yaw angles, with different weights allocated to the two orientations using the half-width at half-maximum of the 2D Lorentz distribution. Vo et al. (2019) introduced a method based on an extreme gradient boosting neural network and histogram of oriented gradients (HOG) features in multi-stacked autoencoders to predict yaw and pitch for HPE. Berral-Soler et al. (2021) estimated head poses with 2DoF using the pitch and yaw angles. Hsu and Chung (2020) presented a complete representation (CR) pipeline that adaptively learns and generates two comprehensive representations (CR-region and CR-center) of the same individual; this study focuses on eye center localization for head poses based on geometric transformations with the CR-center and image translation learning with the CR-region, using five image databases tested with only two DoF (yaw and roll angles). Liu et al. (2021a) estimated head poses with 2D Euler angles, using the pitch and yaw angles.

The choice of the number of DoF depends on the specific requirements and goals of the application or system, as well as factors such as budget, hardware complexity, and the interaction needed. Each has its strengths and weaknesses, and the decision should align with the intended use case.

2.3 Techniques and methodologies

2.3.1 Landmark-based vs. landmark-free methods

In the realm of facial analysis and computer vision, two fundamental approaches have emerged for detecting and analyzing facial features: landmark-free and landmark-based methods (Yan and Zhang 2024). Each approach offers distinct advantages and disadvantages, catering to different application needs and challenges. In this discussion, we will delve into the benefits and drawbacks of both approaches, shedding light on their respective strengths and limitations.

Landmark-based methods These methods rely on the detection and localization of specific keypoints or landmarks on the person's face, such as the positions of the eyebrows, eyes, lips, nose, and mouth (Zhao et al. 2024). Figure 10 shows the visualization of facial landmarks of different versions. For HPE, five to seven landmarks are typically sufficient. These keypoints act as reference points for determining the head's orientation in 3D space: by analyzing the spatial arrangement of these landmarks, the head pose angles (e.g., roll, pitch, and yaw) can be estimated. This approach benefits from using easily recognizable and distinct facial features and is commonly utilized in applications such as facial recognition, gaze tracking, and AR. These approaches first detect either dense or sparse facial landmarks and subsequently estimate the head pose based on these keypoints. For instance, solvePnP (Gao et al. 2003) may be employed for this purpose.
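As a sketch of the solvePnP route just described (the 3D model coordinates, approximate intrinsics, and landmark order below are illustrative assumptions, not values from any cited paper), a handful of 2D landmarks and a generic 3D face model suffice to recover a head pose with OpenCV:

```python
import cv2
import numpy as np

# Generic 3D face model points (millimetres, illustrative values); the 2D
# points would come from any landmark detector, in the same order.
model_3d = np.array([
    [0.0,    0.0,    0.0],    # nose tip
    [0.0,  -63.6,  -12.5],    # chin
    [-43.3,  32.7, -26.0],    # left eye outer corner
    [43.3,   32.7, -26.0],    # right eye outer corner
    [-28.9, -28.9, -24.1],    # left mouth corner
    [28.9,  -28.9, -24.1],    # right mouth corner
], dtype=np.float64)

def head_pose_from_landmarks(landmarks_2d, frame_w, frame_h):
    """landmarks_2d: (6, 2) float64 array of pixel coordinates."""
    # Approximate intrinsics: focal length ~ image width, principal point at centre.
    K = np.array([[frame_w, 0, frame_w / 2],
                  [0, frame_w, frame_h / 2],
                  [0, 0, 1]], dtype=np.float64)
    dist = np.zeros(4)  # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(model_3d, landmarks_2d, K, dist)
    R, _ = cv2.Rodrigues(rvec)  # axis-angle -> 3x3 rotation matrix
    return R, tvec
```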

Fig. 10 Different versions of facial landmarks representation

Wu and Ji (2019) provide a literature survey of facial landmark detection, classifying methods into three major categories: regression-based, constrained local model (CLM), and holistic methods. In Gupta et al. (2019), the authors presented a deep-learning method to estimate the head pose; the method uses uncertainty maps in the form of 2D soft localization heatmaps over five facial keypoints (the nose, right eye, right ear, left eye, and left ear), which are passed through a CNN to obtain Euler angles. Common methods for detecting facial landmarks include the face alignment network (FAN) (Bulat and Tzimiropoulos 2017), Dlib (Kazemi and Sullivan 2014), MediaPipe (Lugaresi et al. 2019), 3DDFA (Zhu et al. 2016), and 3DDFA_v2 (Guo et al. 2020). FAN (Bulat and Tzimiropoulos 2017) is a DNN technique designed to detect facial landmarks, with added potential for HPE applications; the detected facial landmarks are used to map RGB images to 3D head poses. Converting 12 facial point coordinates to a 3D head pose in this way yields MAEs of \(9.12^{\circ }\) and \(8.84^{\circ }\) on the annotated facial landmarks in the wild (AFLW) (Koestinger et al. 2011) and AFW (Zhu and Ramanan 2012) datasets, respectively (Huang et al. 2020). Dlib (Kazemi and Sullivan 2014) is a C++ toolkit that includes classical algorithms and tools for building complex software that solves real-world problems. One of its applications is HPE, which is widely used in computer vision applications such as VR, hands-free gesture control, driver-attention detection, and gaze estimation; Dlib provides 68 informative landmarks that facilitate head pose estimation (Al-Nuimi and Mohammed 2021). MediaPipe (Lugaresi et al. 2019) is a framework for building perception pipelines that can be used for various computer vision applications, including HPE; it detects the face region with 468 landmarks (Al-Nuimi and Mohammed 2021). 3DDFA (Zhu et al. 2016) is a regression-based cascaded CNN that fits a dense 3D facial model onto an image and concurrently predicts the head pose. 3DDFA_v2 (Guo et al. 2020), the second version of 3DDFA, is designed to balance speed, stability, and accuracy using a lightweight backbone. Separately, Ariz et al. (2019) improved a method named weighted POSIT (wPOSIT), based on 2D tracking of the face, to enhance the performance of both 2D point tracking and 3D HPE on the BU (La Cascia et al. 2000) and UPNA (Ariz et al. 2016) databases; a set of 12 facial landmarks was selected to track the face in 2D.
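For illustration, a minimal Dlib sketch follows ("face.jpg" and the predictor file path are placeholders; the 68-point model file is distributed separately by the Dlib project). The extracted points can then feed a PnP-style solver such as the one sketched above:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# Placeholder path to Dlib's pre-trained 68-point landmark model.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("face.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
for face in detector(gray):
    shape = predictor(gray, face)
    points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    # A subset of these points (nose tip, chin, eye/mouth corners) can be
    # passed to the solvePnP pipeline sketched earlier to recover the pose.
```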

The previous methods suffer from inevitable difficulties involving large angles, low resolution, or partial occlusion. In particular, when the detected landmarks become unreliable, the precision of HPE is significantly compromised; consequently, researchers frequently refrain from directly employing the generated landmarks (Zhou et al. 2023). For example, Mo and Miao (2021) introduced a one-step graph generation network (OsGG-Net), an end-to-end integration of a graph convolutional network (GCN) and a CNN that estimates head poses using the Euler angle representation and the landmark-based method. HHP-Net (Cantarini et al. 2022) estimates head pose angles from single images using a small set of automatically computed head keypoints; the authors adopted OpenPose (Cao et al. 2017) as the keypoint extractor, given its balance of effectiveness and efficiency. The approach uses a carefully designed loss function to quantify the heteroscedastic uncertainties of the three angles, and the correlation between uncertainty values and error offers supplementary information for subsequent computational processes. KEPLER (Kumar et al. 2017) presents an iterative method that learns local and global features simultaneously using a Heatmap-CNN with a modified GoogLeNet architecture for joint keypoint estimation and HPE in unconstrained facial scenarios.

Nevertheless, landmark-based approaches suffer from a significant limitation in model expressive capability, making it challenging for them to reach performance levels comparable to landmark-free methods. To address this challenge, Xin et al. (2021) introduced an approach that creates a landmark-connection graph and uses a GCN to capture the intricate nonlinear relationships between graph topologies and head pose angles. Moreover, to address unstable landmark detection, the approach incorporates an edge-vertex attention (EVA) mechanism and further enhances performance through a densely-connected architecture (DCA) and adaptive channel attention (ACA).

The benefits of the landmark-based approaches are as follows: 1) Precise localization: landmark-based approaches provide accurate and specific facial feature localization, making them suitable for tasks requiring detailed facial structure information. 2) Multi-task applicability: these methods are well suited to multiple tasks, such as facial expression analysis, head motion, HPE, facial deformation, and facial animation, because of their ability to provide detailed landmark information (Çeliktutan et al. 2013). 3) Widely known and used: landmark-based methods are probably the most commonly used; they can bring the face into a canonical configuration, typically a frontal head pose (Belmonte et al. 2021). 4) Interpretable results: explicit landmarks allow for easier interpretation of the model's decisions and behavior; moreover, the head can be segmented into small partitions (Zubair et al. 2022). However, these methods have the following drawbacks: 1) Sensitivity to landmark quality: landmark-based methods can struggle with variations in facial expressions, poses, occlusion, extreme rotation, and lighting conditions, potentially leading to less robust performance (Hempel et al. 2022; Roth and Gavrila 2023). 2) Data requirements: training landmark-based models often requires large annotated datasets with accurately labeled landmarks. The primary issue arises from the data itself: manually annotating landmarks on faces with large poses is an extremely tedious task, especially when occluded landmarks must be estimated, making it impractical for most annotators (Zhu et al. 2016). 3) Computational complexity: detecting and tracking multiple landmarks can be computationally intensive, affecting real-time performance (Xia et al. 2022). 4) Limited generalization, applicability, and angular range: landmark-based methods may struggle to generalize to unseen variations or populations that differ significantly from the training data. A significant drawback is that locating landmarks becomes challenging when the face orientation exceeds \(60^{\circ }\); consequently, to circumvent this limitation, other models perform HPE directly from images without relying on landmark detection (Viet et al. 2021).

Landmark-free method Conversely, these methods do not depend on the explicit detection and use of predefined facial keypoints for HPE. Instead, they utilize other facial features or the full image to infer the head's orientation, analyzing head contours and head patterns (see Fig. 11) or utilizing depth information from depth sensors or depth estimation algorithms to estimate the head pose.

Fig. 11 Head pose using a landmark-free method

Landmark-free techniques can be advantageous when dealing with challenging conditions, such as partial occlusion of facial features or when keypoints are difficult to detect accurately.

HopeNet (Ruiz et al. 2018), LSR (Celestino et al. 2023), 6DRepNet (Hempel et al. 2022), and TokenHPE (Zhang et al. 2023), among others, are landmark-free methods. HopeNet, introduced by Ruiz et al. (2018), was among the pioneering landmark-free techniques. It incorporates multiple loss functions to estimate Euler angles, first employing a cross-entropy loss to classify the angles into bins and then refining the fine-grained predictions by minimizing the mean-squared error between the predicted pose and the ground-truth labels. In Celestino et al. (2023), the authors presented a deep-learning approach based on LSR with a multi-loss objective for HPE under occlusion, designed for a new application: autonomously feeding disabled people with a robotic arm. The authors generated synthetic occluded datasets by covering parts of the human faces at six occlusion levels without changing the original head pose labels. Hempel et al. (2022) employed the approach outlined in Zhou et al. (2019) to introduce 6DRepNet, an end-to-end network without landmarks. In Zhang et al. (2023), the authors presented a landmark-free method called TokenHPE for efficient HPE using vision transformers (ViTs), which allow the model to better capture the spatial relationships between different face parts. This method was designed to address severe occlusion and extreme head pose randomness, as stated by the authors.

The advantages of the landmark-free approaches are as follows: 1) Flexibility: landmark-free methods do not rely on predefined facial landmarks, making them suitable for facial expressions, a wide range of head poses, and variations in appearance. 2) Robustness: they can handle partial occlusion and variations in lighting and background, making them robust in real-world scenarios. 3) Reduced annotation effort: landmark-free methods typically require less manual annotation of facial landmarks, simplifying data preparation. However, these methods have the following disadvantages: 1) Coarser estimation: they may provide coarser head pose estimates than landmark-based methods, which can limit their precision in some applications. 2) Limited spatial information: landmark-free methods might not capture fine-grained spatial information about facial features, which can be important in certain contexts. 3) Performance variability: the accuracy of landmark-free methods can vary depending on the complexity of the head pose and the quality of the input data.

Hybrid methods These methods combine landmark-based and landmark-free techniques for estimating the head pose, leveraging the strengths of each approach to improve accuracy, robustness, and generalization in HPE tasks. Xia et al. (2022) proposed a collaborative learning framework based on CNNs that utilizes two branches, one landmark-based and the other landmark-free, for HPE, as shown in Fig. 12. The two branches work together, engaging in both implicit and explicit information interactions for mutual promotion and complementary semantic learning.

Fig. 12 Head pose using landmark-free (bottom backbone network) and landmark-based (top backbone network) methods

Fu et al. (2023) proposed an adaptive occlusion hybrid second-order attention network comprising a second-order attention module, an occlusion-aware module, and exponential-map-based pose prediction. The method is implemented in the PyTorch deep learning framework, with ResNet50 as the feature extraction backbone. It has some limitations owing to the difficulty of extracting keypoints in the presence of occlusion and extreme poses in real-world scenarios; additionally, landmark detection adds computational cost.

In conclusion, both landmark-free and landmark-based approaches exhibit unique strengths and limitations. The choice between them should be driven by the specific requirements of the application, the available resources, and the desired level of precision. While landmark-free methods offer adaptability and robustness, landmark-based methods excel in precision and benefit from established techniques. As the field of facial analysis continues to evolve, a balanced consideration of these advantages and disadvantages will pave the way for more effective and accurate facial analysis solutions.

2.3.2 Technique

Classical techniques Early approaches employed classical models for HPE, such as template matching, quad-tree, random forest (RF), HOG, Haar cascades, and support vector machine (SVM) (Abate et al. 2022).

The method proposed in Abate et al. (2019) described a quad-tree adaptation for facial landmark representation, enabling HPE with a discrete angular resolution of \(5^{\circ }\) in pitch, yaw, and roll. In Abate et al. (2020), a method based on overlaying a web shape on landmark detection models for HPE was proposed; the web-shaped model (WSM) is employed to identify the corresponding sector for each of the 68 landmarks, thereby facilitating accurate landmark position prediction within images. Höffken et al. (2014) employed synchronized submanifold embedding (SSE), an EKF, and the quaternion representation to estimate head poses; SSE is a nonlinear regression method comprising barycentric coordinate estimation, k-nearest neighbor search, and dimensionality reduction. Benini et al. (2019) proposed a multi-feature framework based on SVM and RF to perform classification tasks on expression, gender, and head pose. However, these traditional techniques are trained on small to medium-sized datasets and require manual extraction of pertinent features from the data. Barra et al. (2022) presented a fractal method based on extreme gradient boosting and gradient boosting regressors for HPE with Euler angles. The work of Höffken et al. achieved a high MAE, even though it was trained and tested on the same datasets (Biwi and AFLW2000); most papers are instead trained on the 300W-LP dataset and tested on the Biwi and AFLW2000 datasets for fair comparison.

Deep learning techniques The field of deep learning has garnered significant research interest, driven by its diverse applications in online retail, art and film production, video conferencing, and virtual agents. The progress in deep learning has facilitated the on-demand generation of a person's visual attributes, including their face and pose. The major focus of this survey, like that of Asperti and Filippini (2023), is on deep learning techniques.

6DoF-HPE (Algabri et al. 2024), 6DHPENet (Chen et al. 2022), and 6DRepNet360 (Hempel et al. 2024) were implemented on a RepVGG backbone (Ding et al. 2021) with the 6D continuous rotation representation. In Celestino et al. (2023), Wang et al. (2023), and Zhu et al. (2022), the proposed methods were implemented on a ResNet backbone (He et al. 2016) with the Euler angle representation, and in Roth and Gavrila (2023) on a ResNet backbone (He et al. 2016) with a differentiable rotation representation. In Zhou et al. (2023), the authors proposed a one-stage network architecture built upon the YOLOv5 framework and the Euler angle representation. In Chen et al. (2023), the authors designed a network architecture with multi-modal attention using asymmetry-aware bilinear pooling to estimate head pose. In another work, Chen et al. (2023) proposed a lightweight network called Neko-Net for HPE that does not rely on facial keypoints; Neko-Net consists of soft stagewise regression (SSR), external attention modules, and a dual-stream lightweight backbone network, and it aims to achieve high precision with low computational cost and few model parameters.

These techniques are advanced models with multiple layers of interconnected nodes that are trained effectively on large datasets to achieve high performance and are designed to extract pertinent features from the data automatically. However, these techniques require large datasets, demand high computational power, and are difficult to interpret.

2.3.3 Rotation type method

In the context of HPE, the choice of rotation representation can significantly impact the accuracy and performance of the methods. Some representations are more suitable for capturing the full range of head motions and providing smooth, continuous estimates of pitch, yaw, and roll, whereas others suffer from discontinuities and limitations in certain scenarios. Continuous and discontinuous representations refer to different ways of representing the head pose. Zhou et al. (2019) explored the concept of continuous representations in the context of DNNs and, based on topological arguments, proved that any rotation representation with four or fewer dimensions is discontinuous. Based on these concepts, we categorize works into continuous and discontinuous representations.

Continuous representation A rotation matrix is a continuous representation that can accurately capture the full range of head motions without suffering from discontinuities or gimbal lock. Six- and five-dimensional (6D and 5D) and vector-based representations are also continuous, as reported in Zhou et al. (2019). 6DoF-HPE (Algabri et al. 2024), 6DHPENet (Chen et al. 2022), 6DRepNet360 (Hempel et al. 2024), TriNet (Cao et al. 2021), MFDNet (Liu et al. 2021), TokenHPE-E (Liu et al. 2023), and TokenHPE (Zhang et al. 2023) use continuous rotation representations.

6DoF-HPE (Algabri et al. 2024), 6DHPENet (Chen et al. 2022), and 6DRepNet (Hempel et al. 2022) predict a 6D output that is then mapped to the nine parameters of a rotation matrix using the Gram–Schmidt process. TriNet (Cao et al. 2021) employed a (\(3 \times 3\)) orthogonal rotation matrix, predicting its three column vectors. TokenHPE-E (Liu et al. 2023) and TokenHPE (Zhang et al. 2023) employed a transformer architecture to predict the final (\(3 \times 3\)) rotation matrix. Roth and Gavrila (2023) adopted a continuous, differentiable rotation representation known as \(SVDO^{+}\), a well-known symmetric orthogonalization (Levinson et al. 2020) that maps an unconstrained nine-value representation directly onto SO(3).
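A minimal NumPy sketch of this 6D-to-rotation-matrix mapping, following the Gram–Schmidt construction of Zhou et al. (2019) (the function name is ours; real implementations operate on batched framework tensors):

```python
import numpy as np

def six_d_to_rotation_matrix(x):
    """Map an unconstrained 6D network output to a valid 3x3 rotation matrix
    via Gram-Schmidt orthogonalization, following Zhou et al. (2019)."""
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)            # first column: normalize
    a2 = a2 - np.dot(b1, a2) * b1           # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)            # second column
    b3 = np.cross(b1, b2)                   # third column: cross product
    return np.stack([b1, b2, b3], axis=-1)  # columns form an orthonormal basis

R = six_d_to_rotation_matrix(np.random.randn(6))
assert np.allclose(R.T @ R, np.eye(3))      # orthonormal
assert np.isclose(np.linalg.det(R), 1.0)    # proper rotation
```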

In HPE, this continuity is beneficial when precise and smooth tracking of head movements is needed. Researchers and developers can use 6D, 5D, vector-based, or rotation matrix-based representations to estimate the head's orientation continuously.

The benefits of the continuous representation are as follows: (1) Smooth transformations: it allows smooth and continuous transitions between angles, making it suitable for applications where gradual changes are essential, such as animation or robotic control systems. (2) Differentiability: it is often differentiable, which is crucial for optimization algorithms and neural networks, enabling efficient gradient-based training. (3) Better performance: continuous representations generally yield higher HPE accuracy than discontinuous ones. However, the continuous representation has the following drawbacks: (1) Ambiguity: it can lead to unexpected behavior or difficulties in interpretation. (2) Computational complexity: some continuous rotation representations, especially those involving trigonometric functions, can be computationally expensive, particularly in high-dimensional spaces.

Discontinuous representation Quaternion, Lie algebra, and Euler representations are discontinuous for 3D rotations, including HPE, because they have three or four dimensions (Cao et al. 2021); as shown in Zhou et al. (2019), a continuous representation of 3D rotations requires at least five dimensions. However, if researchers and developers discretize the Euler angles into bins or use them carefully, these representations can still be useful for applications that do not require a perfectly smooth transition between poses. HopeNet (Ruiz et al. 2018), QuatNet (Hsu et al. 2018), the Lie algebra residual architecture (LARNeXt) (Yang et al. 2023), WHENet (Zhou and Gregson 2020), FSA-Net (Yang et al. 2019), LSR (Celestino et al. 2023), and HHP-Net (Cantarini et al. 2022) use discontinuous rotation representations.

HopeNet (Ruiz et al. 2018), WHENet (Zhou and Gregson 2020), FSA-Net (Yang et al. 2019), LSR (Celestino et al. 2023), and HHP-Net (Cantarini et al. 2022) predict only three dimensions because they use the Euler representation, whereas QuatNet (Hsu et al. 2018) predicts four dimensions because it uses the quaternion representation. LARNeXt (Yang et al. 2023) is an integrated, embedded, end-to-end method on a ResNet-50 backbone that performs head pose estimation and face recognition using the Lie algebra representation; its limitation is that accuracy decreases dramatically when parts of the face are occluded.
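To illustrate how discretized Euler angles are turned back into continuous predictions, the following sketch computes the softmax expectation over angle bins in the spirit of HopeNet's combined classification-regression scheme (the bin layout and function name here are illustrative assumptions, not taken verbatim from the paper):

```python
import numpy as np

def expected_angle(logits, num_bins=66, bin_width=3.0, angle_min=-99.0):
    """Convert per-bin classification logits to a continuous angle by taking
    the softmax expectation over bin centers (illustrative bin layout)."""
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    bin_centers = angle_min + bin_width * (np.arange(num_bins) + 0.5)
    return float(np.sum(probs * bin_centers))

yaw = expected_angle(np.random.randn(66))   # e.g. one head of a three-head network
```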

The advantages of the discontinuous representation are as follows: 1) Simplicity: it is often simpler to implement and understand, which can be advantageous when transparency and interpretability are important. 2) Intuitive interpretation: Euler angles in particular may be more intuitive for some users, as they map onto terms such as roll, pitch, and yaw that are in common use in disciplines such as robotics and aviation. However, the discontinuous representation has the following disadvantages: 1) Singularities and ambiguity: Euler angles can encounter gimbal lock, a phenomenon in which the representation becomes singular and loses the ability to depict specific rotations precisely; quaternions avoid gimbal lock but suffer from an antipodal ambiguity, since q and \(-q\) encode the same rotation. 2) Difficulty in interpolation: interpolating discontinuous representations can make smooth transitions difficult to achieve, potentially causing problems in animation and rendering processes.
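A small SciPy check makes gimbal lock tangible (assuming an intrinsic ZYX yaw-pitch-roll convention; conventions differ between papers): at \(90^{\circ }\) pitch, distinct (yaw, roll) pairs collapse onto the same rotation matrix, so one degree of freedom is lost.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# At pitch = 90 deg the first and third Euler axes align: only the
# difference between yaw and roll matters, so one DoF is lost.
r1 = R.from_euler("ZYX", [30.0, 90.0, 0.0], degrees=True)    # yaw=30, roll=0
r2 = R.from_euler("ZYX", [0.0, 90.0, -30.0], degrees=True)   # yaw=0, roll=-30
print(np.allclose(r1.as_matrix(), r2.as_matrix()))           # True: same rotation
```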

The choice between continuous and discontinuous rotation representations should be made based on the application’s specific needs. Continuous representations are better for smooth transitions, whereas discontinuous representations are appropriate for discrete decision points or categories. Careful consideration of the trade-offs and suitability of each representation is essential for effectively addressing the requirements of a problem involving angles.

2.4 Evaluation metrics

Some researchers used two or more metrics to evaluate their work. For example, Firintepe et al. (2020) adopted several metrics to evaluate HPE, including MAE, balanced mean angular error (BMAE), root mean squared error (RMSE), and standard deviation (STD). Cao et al. (2021) adopted the MAE and MAEV metrics, and WHENet (Zhou and Gregson 2020) was also evaluated with MAE and the mean absolute wrapped error (MAWE). Meanwhile, Khan et al. (2020) proposed a framework that evaluated HPE on four databases, namely AFLW, ICT-3DHP, BU, and Pointing'04 (Gourier 2004), using MAE and accuracy. The details of these metrics are presented in the following subsections.

2.4.1 Mean absolute error (MAE)

It is a widely adopted metric for assessing HPE frameworks and is frequently employed across the papers reviewed in this study. Its popularity stems from its ability to offer a concise and informative performance evaluation encompassing all three angles (i.e., roll, pitch, and yaw). Its mathematical equation is as follows:

$$\begin{aligned} {\textrm{MAE}}=\frac{1}{n} \sum _{i=1}^n\left| \theta _i-{\hat{\theta }}_i\right| , \end{aligned}$$
(1)

where \({\hat{\theta }}_i\) is the prediction angle and \(\theta _i\) is the ground truth angle. A smaller MAE value indicates superior performance when comparing methods.
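As a minimal NumPy illustration of Eq. (1) (angles in degrees; in practice the MAE is computed per axis and the three values are averaged):

```python
import numpy as np

def mae(theta_true, theta_pred):
    """Mean absolute error, Eq. (1)."""
    return float(np.mean(np.abs(np.asarray(theta_true) - np.asarray(theta_pred))))

print(mae([10.0, -5.0, 30.0], [12.0, -4.0, 27.0]))  # (2 + 1 + 3) / 3 = 2.0
```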

2.4.2 Pose estimation accuracy (PEA)

The PEA is another metric used to assess HPE. As an accuracy metric, PEA relies on pose counts, providing limited insight into actual system performance; notably, most recent research has not employed PEA, even as a secondary metric, in the context of HPE. A high PEA value signifies better performance when comparing methods (Asperti and Filippini 2023).

2.4.3 Mean absolute error of vectors (MAEV)

Cao et al. (2021) raised concerns about the suitability of using the MAE of Euler angles as an evaluation metric, particularly for profile images, arguing that it may not accurately measure the performance of networks. Instead, they advocate for adopting the MAEV for evaluation. In their approach, three vectors derived from the rotation matrix are utilized to characterize head poses, and the disparity between the predicted vectors and the ground-truth vectors is computed. Their findings demonstrated the enhanced consistency of this representation and established MAEV as a more dependable indicator for assessing pose estimation outcomes. Its mathematical equation is as follows:

$$\begin{aligned} {\textrm{MAEV}}=\frac{1}{n} \sum _{i=1}^n \cos ^{-1}\left( \frac{v_g \cdot v_p}{\left| v_g\right| \left| v_p\right| }\right) , \end{aligned}$$
(2)

where \(v_p\) and \(v_g\) are the predicted and the ground truth head orientation vectors (Hempel et al. 2024).
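A NumPy sketch of Eq. (2), applied in the manner Cao et al. (2021) describe by averaging the angle between corresponding vectors of the ground-truth and predicted rotation matrices (here the three column vectors; the shapes and function name are our assumptions):

```python
import numpy as np

def maev(R_true, R_pred):
    """Mean absolute error of vectors, Eq. (2): average angle (degrees)
    between corresponding column vectors of ground-truth and predicted
    rotation matrices (iterables of 3x3 arrays)."""
    errors = []
    for Rg, Rp in zip(R_true, R_pred):
        for k in range(3):                  # the three pose vectors
            vg, vp = Rg[:, k], Rp[:, k]
            cos = np.dot(vg, vp) / (np.linalg.norm(vg) * np.linalg.norm(vp))
            errors.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return float(np.mean(errors))

print(maev([np.eye(3)], [np.eye(3)]))       # identical rotations -> 0.0
```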

2.4.4 Mean absolute wrapped error (MAWE)

The MAWE is a metric used to assess the accuracy of models for full-range angle data. It measures the mean absolute difference between the predicted and the ground truth values while considering the shortest angular path between them. However, MAWE is not a widely recognized metric in statistics or machine learning. MAWE is particularly useful in applications where the direction or phase is critical, such as methods of HPE for full-range angles. When applied to measure narrow-range angles, it yields identical results to MAE (Asperti and Filippini 2023). Its mathematical equation is:

$$\begin{aligned} {\textrm{MAWE}}=\frac{1}{n} \sum _{i=1}^n \min \left( \left| {\hat{\theta }}_i-\theta _i\right| , 360-\left| {\hat{\theta }}_i-\theta _i\right| \right) , \end{aligned}$$
(3)

where \({\hat{\theta }}_i\) is the prediction angle and \(\theta _i\) is the ground truth angle. A lower MAWE value signifies better performance when comparing methods (Viet et al. 2021).
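A minimal NumPy sketch of Eq. (3); the wrapping term matters only near the \(\pm 180^{\circ }\) boundary:

```python
import numpy as np

def mawe(theta_true, theta_pred):
    """Mean absolute wrapped error, Eq. (3), with angles in degrees."""
    diff = np.abs(np.asarray(theta_pred, float) - np.asarray(theta_true, float))
    return float(np.mean(np.minimum(diff, 360.0 - diff)))

print(mawe([-179.0], [179.0]))  # 2.0, whereas plain MAE would report 358.0
```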

2.4.5 Balanced mean angular error (BMAE)

In driving scenarios, there is a bias towards frontal orientations, resulting in an uneven distribution of various head orientations. Schwarz et al. (2017) proposed the BMAE metric to tackle this issue. Its mathematical equation is:

$$\begin{aligned} {\textrm{BMAE}}:=\frac{d}{k} \sum _i \phi _{i, i+d}, \quad i \in d {\mathbb {N}} \cap [0, k], \end{aligned}$$
(4)

where \(\phi _{i, i+d}\) is the average angular error between the ground truth and the prediction over the samples whose ground-truth angle lies in the interval \([i, i+d]\). The original study set d and k to \(5^{\circ }\) and \(75^{\circ }\), respectively (Schwarz et al. 2017). A lower BMAE value signifies superior performance when comparing methods.
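A sketch of Eq. (4) under our reading of Schwarz et al. (2017): the error is averaged within ground-truth angle bins of width d up to k, and the per-bin means are then averaged so that over-represented frontal poses do not dominate (the handling of empty bins is our assumption):

```python
import numpy as np

def bmae(theta_true, theta_pred, d=5.0, k=75.0):
    """Balanced mean angular error, Eq. (4), with angles in degrees."""
    theta_true = np.asarray(theta_true, dtype=float)
    err = np.abs(np.asarray(theta_pred, dtype=float) - theta_true)
    bin_means = []
    for lo in np.arange(0.0, k, d):
        # Samples whose ground-truth magnitude falls in this bin.
        mask = (np.abs(theta_true) >= lo) & (np.abs(theta_true) < lo + d)
        if mask.any():
            bin_means.append(err[mask].mean())   # phi_{i, i+d}
    return float(np.mean(bin_means)) if bin_means else 0.0
```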

2.4.6 Root mean squared error (RMSE)

The RMSE is a widely employed metric in machine learning and statistics to measure the average magnitude of errors between predicted \({\hat{\theta }}_i\) and actual \({\theta }_i\) values. It is estimated by obtaining the square root of the mean of the squared differences between actual and predicted values. RMSE measures how well a predictive model or algorithm performs, with smaller values indicating better predictive accuracy. However, RMSE is not a widely used metric in HPE frameworks. Its mathematical equation is as follows:

$$\begin{aligned} {\textrm{RMSE}}=\sqrt{\frac{1}{n} \sum _{i=1}^n({\theta }_i-{\hat{\theta }}_i)^2}, \end{aligned}$$
(5)

where n is the number of images for testing (Firintepe et al. 2020).

Table 1 Summary of main steps of HPE frameworks

3 Datasets and ground-truth techniques

This section discusses the datasets commonly used for HPE research. Table 2 presents an overview of currently available datasets. These datasets vary regarding the number of images, variety of head poses, annotations available, and other characteristics.

3.1 Datasets for head pose estimation

Many public datasets for HPE have been published since 2010. Ordered from the most recent to the oldest, the following datasets are included in the literature:

2DHeadPose dataset (Wang et al. 2023) contains 10,376 RGB images with a rich set of angles, dimensions, and attributes across 19 scenes, such as masked, obscured, dark, and blurred. A 3D virtual human head was used to simulate the head pose in the images, and Gaussian label smoothing was used to suppress annotation noise.

DAD-3DHeads dataset (Martyniuk et al. 2022) is an accurate, dense, large-scale, and diverse dataset for 3D head alignment from a single image. It is an in-the-wild collection featuring a wide range of head poses, facial expressions, challenging lighting conditions, image qualities, age groups, and occlusions. It comprises 44,898 images annotated using a 3D head model, with annotations for more than 3.5K landmarks that provide precise representations of 3D head shapes compared with ground-truth scans.

AGORA (Patel et al. 2021) is a 3D synthetic dataset for the 3D human pose and shape (3DHPS) estimation task. The dataset consists of training, validation, and test images; corresponding masks, SMPL-X (Pavlakos et al. 2019)/SMPL (Loper et al. 2015) ground truth, and camera information are provided for the training and validation splits. AGORA contains 3K test and 14K training images rendering between 5 and 15 subjects per image, including 173K individual person crops. It provides high-realism multi-person images at two resolutions, (\(3840 \times 2160\)) and (\(1280 \times 720\)), with ground-truth 3D bodies under complex clothing, environmental conditions, and occlusion. The original dataset does not provide head pose labels; DirectMHP (Zhou et al. 2023) generated head pose labels for each person in an image based on SMPL-X (Pavlakos et al. 2019) by extracting camera parameters and 3D face landmarks.

MDM corpus dataset (Jha et al. 2021) was gathered by recording 59 participants while driving and engaging in various tasks. The Fi-Cap helmet device, which continuously tracks head movement using fiducial markers, was employed to capture head poses. The dataset comprises 50.23 h of recordings at various frame rates (approximately one million frames) and encompasses a wide range of head poses across all three rotational axes. This diversity arises from the large number of subjects and the various primary and secondary driving activities considered during data collection. Specifically, yaw angles span \(\pm 80^{\circ }\) around the origin, whereas pitch angles exhibit an asymmetric span from \(-50^\circ\) to \(100^\circ\).

ETH-XGaze dataset (Zhang et al. 2020) is a substantial dataset created for gaze estimation, particularly under challenging illumination conditions involving extreme head poses and variations in gaze direction. It contains 1,083,492 images collected from 110 participants aged 19-41 years, with varying head orientations and gaze directions, captured using 18 digital SLR cameras.

GOTCHA-I dataset (Barra et al. 2020) is a large-scale dataset for face, gait, and HPE, containing 493 videos with an average duration of 4 min (approximately 137,826 images). The videos were captured with multiple mobile and body-worn cameras in 11 different video modes, in both indoor and outdoor environments, for 62 subjects (47 male and 15 female) with an average age between 18 and 20 years. Head poses span \(\pm 20^{\circ }\) in roll, \(\pm 30^{\circ }\) in pitch, and \(\pm 40^{\circ }\) in yaw, in \(5^{\circ }\) steps.

DD-Pose dataset (Roth and Gavrila 2019) is a large-scale driver benchmark consisting of \(2 \times 330k\) images of drivers in a car, captured using stereo and RGB cameras at a \(2048 \times 2048\) resolution. Continuous 6DoF head pose annotations were acquired by a motion capture system that estimates the pose of a marker fixed to the back of the person's head relative to a reference coordinate system; this coordinate system was calibrated by estimating the positions of eight facial landmarks for each subject's face. The dataset includes 27 subjects (21 male and 6 female) with an average age of 36 years; the oldest and youngest drivers were 64 and 20 years old. Head poses span \(\pm 100^{\circ }\) in yaw, \(\pm 40^{\circ }\) in pitch, and \(\pm 60^{\circ }\) in roll.

VGGFace2 (Cao et al. 2018) is a dataset for recognizing faces across pose and age, collected by the Visual Geometry Group at the University of Oxford. It includes over 3.31 million images of 9,131 subjects (an average of 362.6 images per subject). The dataset is designed for face recognition tasks and includes a wide variety of poses (yaw, pitch, and roll), expressions, and lighting conditions. The gender distribution is 59.3% male and 40.7% female.

SynHead dataset (Gu et al. 2017) is a collection of 3D synthetic images of human heads, comprising over 510,960 images, 10 head models (5 male and 5 female), and 70 motion tracks. It is designed to support research on HPE and related applications and includes annotations for the yaw, pitch, and roll angles of the heads in each image.

DriveAHead (Schwarz et al. 2017) is a driver's head pose dataset that contains images of drivers captured from the interior of a car. It consists of one million images of 20 drivers (16 male and 4 female), each recorded under realistic driving conditions. The images were captured using a Kinect V2 sensor with a resolution of \(512 \times 424\).

SASE (Lüsi et al. 2017) is a 3D dataset for HPE and emotion recognition. It consists of 30,000 images with head pose labels, covering 50 subjects (18 female and 32 male) aged 7-35. Images were captured using a Microsoft Kinect 2 at resolutions of \(1080 \times 1920\) (RGB frames) and \(424 \times 512\) (16-bit depth frames). Yaw angles range from \(-75^{\circ }\) to \(75^{\circ }\), whereas pitch and roll angles range from \(-45^{\circ }\) to \(45^{\circ }\).

Pandora (Borghi et al. 2017) is a 3D dataset for driver HPE and upper-body estimation under severe illumination changes, occlusions, and extreme poses. It contains more than 250k images with corresponding annotations: 110 annotated sequences of 22 subjects (12 female and 10 male), with five recordings per participant. Images were captured with the first Kinect version, providing full-resolution depth (\(512 \times 424\)) and RGB (\(1920 \times 1080\) pixels) frames. Annotations cover the yaw, pitch, and roll angles in the ranges \(\pm 125^{\circ }\), \(\pm 100^{\circ }\), and \(\pm 70^{\circ }\), respectively.

300W-LP (Zhu et al. 2016) is a widely used synthetic dataset generated from the 300W dataset (Sagonas et al. 2013a), which standardized multiple alignment databases with 68 landmarks. The authors applied their face profiling technique to generate 61,225 large-pose images collected from multiple datasets (1,786 from IBUG (Sagonas et al. 2013b), 5,207 from AFW (Zhu and Ramanan 2012), 16,556 from LFPW (Belhumeur et al. 2013), and 37,676 from HELEN (Le et al. 2012); XM2VTSDB (Messer et al. 1999) was not used). The dataset was further expanded to 122,450 images by image flipping. It provides annotations for the yaw, pitch, and roll angles, with the ground truth given in the Euler angle format (Hempel et al. 2022).

AFLW2000 (Zhu et al. 2016) is a commonly used dataset comprising the first 2,000 images of the AFLW dataset (Koestinger et al. 2011). It includes ground-truth 3D faces along with their corresponding 68 facial landmarks, with samples covering various in-the-wild settings and varying lighting and occlusion conditions.

Valle and colleagues (Valle et al. 2020) re-annotated the AFLW2000-3D dataset, incorporating poses estimated from accurate landmarks; the revised dataset is named AFLW2000-3D-POSIT. As a result, the mean MAE of their approach decreased to \(1.71^{\circ }\). This improvement is significant and demonstrates the importance of accurate landmark annotation in HPE. The AFLW2000-3D dataset is commonly employed to evaluate 3D facial landmark detection models.

CCNU (Liu et al. 2016) is a 2D dataset for HPE consisting of 4,350 images of 58 human subjects in 75 different poses. The images were captured indoors, covering yaw from \(-90^{\circ }\) to \(90^{\circ }\) and pitch from \(-45^{\circ }\) to \(90^{\circ }\), with various illuminations, expressions, low resolution (\(70 \times 80\)), and poses. The orientation and position ground truth of all head pose images was labeled using SensoMotoric Instruments (SMI) eye-tracking glasses.

UPNA (Ariz et al. 2016) is a dataset designed for HPE. It includes both 2D and 3D face data and features automatic annotation (roll, yaw, and pitch) based on 54 face landmarks using a magnetic sensor. The dataset consists of 120 videos of 10 individuals (6 male and 4 female), with 12 videos per subject; every video is 10 s long and contains 300 frames. The videos were recorded using a standard webcam at a resolution of \(1280 \times 720\) pixels.

WIDER FACE dataset (Yang et al. 2016) is a large-scale face detection benchmark containing 32,203 images with a total of 393,703 annotated faces, selected for their high variability in scale, pose, and occlusion. WIDER FACE provides manually annotated face bounding boxes but no head pose labels. Researchers have therefore applied existing HPE methods (e.g., RetinaFace (Deng et al. 2020) + PnP or FSA-Net (Yang et al. 2019)) to label the angles of each face in the images, and Img2Pose (Albiero et al. 2021) annotated the WIDER FACE dataset for head pose in a semi-supervised way. However, this weakly supervised labeling leaves room for improvement, particularly in handling small faces and automated labeling, and the labeled WIDER FACE contains no head samples with invisible faces.

CMU Panoptic (Joo et al. 2015) contains 65 sequences (5.5 h) collected with a massive multi-view system of synchronized HD video streams captured by many cameras inside a hemispherical dome. Some videos focus primarily on a single person, while others capture multi-person scenarios. The original dataset was not designed to provide head pose labels; however, other authors have used software techniques to obtain full-range head pose labels.

Dali3DHP (Tulyakov et al. 2014) is a 3D dataset for HPE research consisting of 60,000 depth and color images of 33 subjects. The head poses cover a range of yaw, pitch, and roll angles, including extreme poses: yaw ranges from \(-89.29^{\circ }\) to \(75.57^{\circ }\), pitch from \(-52.6^{\circ }\) to \(65.76^{\circ }\), and roll from \(-29.85^{\circ }\) to \(27.09^{\circ }\). Ground-truth labels were obtained with a Shimmer sensor.

EYEDIAP database (Funes Mora et al. 2014) contains 94 sessions for gaze estimation tasks from RGB and RGB-D data. Each session was recorded for 2 to 3 min using a Kinect camera at resolutions of \(1920 \times 1080\) and \(640 \times 480\), for a total of more than 4 h. The participants were 16 people (12 male and 4 female); several sessions were recorded twice for participants 14, 15, and 16 under varying lighting, camera distances, and day conditions. The yaw range covered \(40^{\circ }\) and was recorded manually; the dataset provides no annotations for the pitch and roll angles.

McGill database (Demirkus et al. 2014) is a real-world face and head video database comprising 60 videos of 60 subjects, recorded indoors and outdoors using a Canon PowerShot SD770 camera at \(640 \times 480\) resolution. Yaw angles vary within \(\pm 90^{\circ }\) and were labeled with a semi-automatic labeling framework; gender and face location are also labeled. For each participant, a 60-s video at 30 fps (i.e., 1,800 frames per participant) was recorded under free behavior; therefore, background clutter and arbitrary illumination conditions are present, especially outdoors. An expert manually labeled the 18,000 frames with head pose angles to obtain ground-truth annotations.

Biwi (Fanelli et al. 2013) is commonly employed in 3D face analysis and comprises 24 sequences featuring 20 individuals (6 female and 14 male, with 4 wearing glasses), for a total of 15,000 images and depth records. The dataset was recorded indoors while people sat and turned their heads in front of a Kinect camera positioned approximately one meter away. Head poses vary by \(\pm 50^{\circ }\) in roll, \(\pm 60^{\circ }\) in pitch, and \(\pm 75^{\circ }\) in yaw. Person-specific templates and iterative closest point (ICP) tracking were employed to annotate the data with 3D head locations and rotations.

ICT-3DHP (Baltrušaitis et al. 2012) was collected using a Kinect camera and contains 10 RGB-D videos (both RGB and depth data) totaling approximately 1,400 images of ten participants (6 male and 4 female). The ground truth was obtained using a Polhemus FASTRAK, based on electromagnetic technology, and the dataset was evaluated for all three angles: yaw, pitch, and roll.

AFW dataset (Zhu and Ramanan 2012) is a collection of face images designed to evaluate face detection and head pose. It contains 205 RGB images with 468 faces, annotated with facial landmarks and a range of yaw, pitch, and roll angles; the images were collected from the internet at various resolutions.

Table 2 Summary of datasets of HPE

In summary, Table 2 presents the characteristics of the HPE datasets: the number of images or videos and their dimensions, the number of participants and their gender (female and male), the angles (yaw, pitch, and roll) and their range (full or narrow), the environment in which the dataset was captured (indoor or outdoor), the data type (RGB or depth), and the publication year. Moreover, some datasets provide the age of participants, such as ETH-XGaze, GOTCHA-I, DD-Pose, and SASE, and the translational (x, y, and z) components, such as DD-Pose, Biwi, DriveAHead, SASE, CCNU, and UPNA. Most studies used two or more datasets to evaluate the performance of their methods; however, some works used only one dataset, such as Biwi in Chen et al. (2023) and AFLW2000 in Zhang and Yu (2022). Other HPE datasets published before 2010 are still employed by some researchers to evaluate their work, such as Bosphorus (Savran et al. 2008), CAS-PEAL (Gao et al. 2007), Pointing'04 (PRIMA) (Gourier 2004), BU (La Cascia et al. 2000), and FERET (Phillips et al. 2000). Overall, this discussion of datasets aims to provide a comprehensive understanding of benchmarking and evaluation practices in HPE research.

3.2 Ground-truth dataset

This is an important step in several areas of computer vision, including semantic segmentation, object detection, and pose estimation. A ground-truth dataset is the annotated data employed to train and evaluate HPE techniques, and its accuracy and completeness significantly affect an HPE algorithm's performance. Creating a high-quality ground-truth dataset is a time-consuming and iterative process that requires careful planning and attention to detail. The most common techniques for creating ground-truth data for HPE are described in the following subsections.

3.2.1 Software

In recent years, ground-truth datasets have been created using software-based automated labeling or annotation techniques (Wang et al. 2023; Martyniuk et al. 2022; Valle et al. 2020; Yang et al. 2016; Patel et al. 2021), which are often employed for annotation tasks in HPE. Software-based approaches can provide accurate and consistent annotations and can be faster and more cost-effective than other methods. However, their effectiveness relies on the quality of the software algorithms and the availability of appropriate training data; moreover, software methods may not handle nuanced or complex annotation tasks, as shown in Fig. 13. The CMU Panoptic (Joo et al. 2015) and AGORA (Patel et al. 2021) datasets originally did not provide head pose annotations. However, AGORA provides information about 3D facial landmarks and body poses; Zhou et al. (2023) leveraged this information and employed a software technique to annotate head poses at full-range angles.

Fig. 13 Examples of wrongly detected and annotated heads using a software technique (AGORA dataset)

3.2.2 Optical motion capture systems (Om-cap)

Om-cap systems are expensive, robust deployments primarily employed in professional cinematography to capture articulated body movements. A set of near-infrared cameras is typically calibrated by software algorithms with multi-view stereo to track reflective markers affixed to an individual (Liu et al. 2016). In the context of HPE, these markers can be applied to the rear of a subject's head (Roth and Gavrila 2019; Schwarz et al. 2017), allowing accurate position and orientation tracking. This method has facilitated the collection of diverse datasets varying in scope, accuracy, and availability. The Fi-Cap helmet device works similarly but without the need for expensive sensors (Jha et al. 2021).

3.2.3 Magnetic sensors

Sensors such as the Flock of Birds or the Polhemus FASTRAK operate by emitting and measuring a magnetic field. These sensors can be attached to a person's head to measure orientation angles (Baltrušaitis et al. 2012; Ariz et al. 2016) and head position. The method is relatively cost-effective and can collect accurate ground truth, so it has been widely adopted. However, magnetic sensors are susceptible to noise when metal is present in the surroundings; consequently, data collection with these sensors is severely constrained, making certain applications, such as automotive HPE, impractical.

3.2.4 Inertial sensors

These sensors employ components such as gyroscopes, accelerometers, or other motion-sensing devices, frequently incorporating a Kalman filter to mitigate noise (Tulyakov et al. 2014). Commercial low-cost sensors, such as the Shimmer sensor, provide orientation angle measurements but not position measurements. Unlike magnetic sensors, inertial sensors are immune to metallic interference. In HPE methods, inertial sensors can be attached to a person's head to capture data (Borghi et al. 2017; Tulyakov et al. 2014).

3.2.5 Camera arrays

This approach uses multiple cameras positioned at various fixed locations to capture head images simultaneously from diverse angles (Zhang et al. 2020). It offers an accurate ground-truth dataset when the subjects' head positions remain consistent during acquisition; however, it is limited to near-field images and is unsuitable for real-world video scenarios or fine poses.

3.2.6 Manual annotation

Probably the earliest way of generating a ground-truth dataset involves human observers who assign pose labels based on their subjective perception of head pose images. This method may suffice for a basic set of poses in a single DoF; nevertheless, it is inadequate for precise HPE, particularly for finer variations, owing to the increased likelihood of human error (Zhu and Ramanan 2012; Funes Mora et al. 2014). It is sometimes combined with other methods to improve the quality of dataset annotation, as was done in Demirkus et al. (2014) and Cao et al. (2018).

In summary, Table 2 lists the techniques used for obtaining ground-truth datasets, how they are measured (relative or absolute), and the sensors used to capture the datasets, with their resolutions and ranges. Other, older methods were also used to annotate head poses, such as directional suggestion with a laser pointer. Most datasets cover only narrow-range angles, as can be observed in Table 2; this challenge should be addressed in future work. For example, Fig. 14 shows the distributions of the angles in full- and narrow-range datasets and their averages.

Fig. 14 Pose labels distribution of the three angles

4 Discussion, challenges, and future directions

HPE still faces several open problems for each of the elements discussed above. In the following, we first compare the different HPE methods and then discuss the challenges of HPE, future research directions, and the advantages and limitations of this work.

4.1 Discussion

Table 3 compares different HPE methods, ranked from most recent to oldest. The comparison covers the choice of application, environment, number of tasks, dataset type, angle range, rotation representation, number of DoF, techniques used, landmark-based or landmark-free design, rotation type, evaluation metrics, and challenges. The table omits works, such as Yu et al. (2021), Ye et al. (2021), Perdana et al. (2021), and Indi et al. (2021), that rely on other works without providing further details. We compare 113 papers published in the last seven years: 4, 28, 29, 25, 14, 8, and 5 papers from 2024, 2023, 2022, 2021, 2020, 2019, and 2018, respectively.

We found that papers aimed at improving performance accuracy (Re.) account for around 67% of the total, whereas papers targeting specific applications (App) account for around 33%; work implemented in indoor (I) and outdoor (O) environments accounts for 80.6% and 19.4%, respectively, as shown in Fig. 15. Notably, we considered any work implemented without a camera to be implemented in an indoor environment. The strong focus on enhancing performance accuracy, which is critical for applications in DMSs, AR/VR environments, and so on, suggests a mature understanding of the core algorithms; the application-focused studies, in turn, illustrate the versatility of HPE technologies across domains including healthcare, entertainment, and security. However, the relatively smaller proportion of application-focused studies indicates a potential gap in translating these advancements into real-world scenarios, and future research could benefit from a balanced approach that equally prioritizes algorithmic improvements and practical applications. The large majority (80.6%) of the reviewed methods are designed for indoor environments, leveraging controlled conditions that simplify the estimation process; only 19.4% address outdoor environments, where challenges such as fluctuating lighting and complex backgrounds present significant hurdles. This highlights an opportunity for future research to develop robust methods for uncontrolled outdoor settings, which is essential for applications such as DMSs and public safety systems; this would involve improving models to cope with varying illumination, occlusions, and background clutter.

Furthermore, 25% of the methods address a single task, whereas 75% address multiple tasks, and 94.6% of the methods operate on narrow-range angles versus 5.4% on full-range angles. The prevalence of multi-task methods indicates a growing trend towards integrating HPE with other tasks, such as facial expression recognition or gaze estimation; this capability benefits comprehensive systems that must analyze multiple aspects of human behavior simultaneously. Only 5.4% of the methods handle full-range angles, covering the complete spectrum of possible head orientations, largely because few full-range datasets were available before 2020.

Moreover, 72.3% of the methods adopt Euler angles and 16.1% adopt rotation matrices, whereas 11.6% use other rotation representations; 77.7% of the methods operate with 3 DoF and 6.2% with 6 DoF, with the remaining 16.1% using other configurations. The majority use Euler angles because of their simplicity and the intuitive way they describe head orientation in 3D space, while rotation matrices (16.1%) and continuous representations (17.6%) are less common. The limited use of continuous representations, despite their potential to avoid gimbal lock and provide smooth transitions, suggests an area for further innovation. Most methods focus on 3 DoF, which covers basic head rotations but does not capture full spatial dynamics; only 6.2% tackle 6 DoF, which includes translational movements. Exploring 6 DoF representations is critical for applications such as robotics and immersive VR, where understanding full head and body movement is crucial, and adopting continuous rotation representations could further improve model robustness and accuracy.
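To make the gimbal-lock issue concrete, the following minimal NumPy sketch composes a rotation matrix from Euler angles under one common intrinsic Z-Y-X (yaw-pitch-roll) convention; axis orders differ across datasets, so this convention is an illustrative assumption rather than a standard shared by all surveyed methods.

```python
# A minimal sketch, assuming an intrinsic Z-Y-X (yaw-pitch-roll) convention.
import numpy as np

def euler_to_rotation_matrix(yaw, pitch, roll):
    """Compose a 3x3 rotation matrix from yaw (Z), pitch (Y), roll (X), in radians."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

# Gimbal lock: at pitch = 90 deg, yaw and roll act about the same axis,
# so distinct (yaw, roll) pairs collapse onto the same orientation.
R1 = euler_to_rotation_matrix(np.radians(30), np.radians(90), np.radians(0))
R2 = euler_to_rotation_matrix(np.radians(0), np.radians(90), np.radians(-30))
print(np.allclose(R1, R2))  # True: one rotational DoF is lost
```

This degeneracy, where two different Euler triples describe the same orientation, is precisely what motivates the continuous representations discussed below.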

Fig. 15 Proportion of different HPE methods

In addition, deep learning methods account for 70.5% of the surveyed works and classical learning methods for 24.1%, while hybrid methods (combining deep and classical learning) account for 5.4%. The number of deep learning-based HPE publications has increased continuously in recent years; this dominance reflects deep learning's superior ability to model complex patterns and handle large datasets.

Landmark-based and landmark-free methods account for 71% and 26.2%, respectively, while hybrid methods (combining both) account for 2.8%. Landmark-based methods remain predominant, leveraging specific facial features for pose estimation, but landmark-free methods have gained traction in recent years owing to their flexibility and reduced dependence on precise landmark detection. The shift towards landmark-free approaches is promising, particularly in scenarios where landmarks are not easily detectable. Future work should investigate hybrid approaches that combine the strengths of both methodologies to enhance accuracy and robustness across varied conditions.

Finally, 82.4% of the methods are based on discontinuous representations, whereas 17.6% use continuous representations, as shown in Fig. 15. The limited use of continuous representations, such as the 5D, 6D, and rotation-matrix (9D) forms, whose names refer to the dimensionality of the model's output (Zhou et al. 2019), is an area for further research, as these representations can handle head poses more accurately and robustly.
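As an illustration of the 6D continuous representation of Zhou et al. (2019), adopted by landmark-free methods such as 6DRepNet, the following sketch maps a raw six-dimensional network output onto a valid rotation matrix via a Gram-Schmidt-style orthonormalization; the variable names are ours, not taken from any particular implementation.

```python
# A sketch of the 6D continuous rotation representation (Zhou et al. 2019):
# the network emits two 3D vectors, which are orthonormalized into R.
import numpy as np

def six_d_to_rotation_matrix(x):
    """Map a raw 6D network output (shape (6,)) to a 3x3 rotation matrix."""
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)            # normalize the first vector
    a2 = a2 - np.dot(b1, a2) * b1           # remove its component along b1
    b2 = a2 / np.linalg.norm(a2)            # normalize the second vector
    b3 = np.cross(b1, b2)                   # third column via cross product
    return np.stack([b1, b2, b3], axis=-1)  # columns form R in SO(3)

R = six_d_to_rotation_matrix(np.array([0.9, 0.1, 0.2, -0.3, 1.1, 0.4]))
print(np.allclose(R.T @ R, np.eye(3)))   # True: R is orthonormal
print(np.isclose(np.linalg.det(R), 1.0)) # True: a proper rotation
```

Because every 6D output maps to a valid rotation and nearby outputs map to nearby rotations, this representation avoids the discontinuities that afflict Euler angles and quaternions.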

On the other hand, 23.4%, 20.3%, 17.6%, 6.2%, and 5.5% of the methods train their models on the Biwi, AFLW2000, 300W-LP, POINTING'04, and AFLW datasets, respectively, whereas 6.6% use their own datasets and 20.3% use other datasets. This distribution reflects a reliance on a few narrow-range-angle datasets, which may limit the diversity and comprehensiveness of the training data; consequently, these models risk failing to generalize to applications that require a wide angle range. Future work should therefore expand dataset variety, including more diverse and real-world data, to enhance the robustness and applicability of HPE models.

Moreover, most methods evaluate their performance using MAE, as shown in Fig. 16. Only around 25.9% of the methods provide publicly available code, as shown in Table 3. To foster transparency and accelerate progress in this field, the community should prioritize open-source contributions, including sharing datasets and model implementations.
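For reference, MAE here is the mean absolute difference between predicted and ground-truth Euler angles, reported per angle. A minimal sketch follows; the angle-wrapping step, which keeps each error in \([-180^{\circ}, 180^{\circ})\), matters for full-range yaw but is omitted by some benchmarks, so treat it as an assumption of this sketch.

```python
# A minimal sketch of the per-angle MAE metric over Euler angles.
import numpy as np

def mae_degrees(pred, gt):
    """pred, gt: (N, 3) arrays of (yaw, pitch, roll) in degrees."""
    diff = (pred - gt + 180.0) % 360.0 - 180.0  # wrap errors to [-180, 180)
    return np.abs(diff).mean(axis=0)            # per-angle MAE

pred = np.array([[179.0, 10.0, -5.0], [30.0, 0.0, 2.0]])
gt = np.array([[-179.0, 12.0, -4.0], [28.0, 1.0, 0.0]])
print(mae_degrees(pred, gt))  # [2.  1.5 1.5]
```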

Fig. 16 Proportion of different datasets and metrics

4.2 Challenges of HPE

HPE holds immense potential across the aforementioned diverse applications. Despite great success, HPE remains an open research topic, especially in unconstrained environments with complex human motion. Some of the challenges in HPE for real-time applications are as follows:

Accuracy: Achieving accurate HPE in real time can be difficult because of factors such as variations in lighting conditions, occlusions, and complex head movements. Ensuring high accuracy is crucial for applications such as facial recognition and gaze estimation; the MAE should be \(5^{\circ}\) or less (Asperti and Filippini 2023).

Speed: Real-time applications require fast (30 fps or faster) and efficient HPE algorithms to process video streams in real time (Murphy-Chutorian and Trivedi 2008). The challenge lies in developing algorithms that provide accurate results within the limited processing time available; a minimal timing sketch for checking this criterion appears after this list.

Robustness: HPE algorithms should be robust to variations in head appearance, such as strong illumination conditions, large head pose variations, and occlusions (Asperti and Filippini 2023; Xu et al. 2022). Robustness is essential to ensure accurate estimation regardless of the individual’s appearance.

Variability: People have different head shapes, sizes, and orientations, which adds to the challenge of HPE (Murphy-Chutorian and Trivedi 2008; Baltanas et al. 2020). Algorithms need to handle this variability and adapt to different individuals to provide accurate and consistent results.

Real-world conditions: Real-time HPE should be able to handle challenging real-world conditions, such as different light conditions, varying camera viewpoints, cluttered backgrounds, and noisy environments (Fanelli et al. 2012; Madrigal and Lerasle 2020). These factors can affect the accuracy and reliability of the estimation.

Computational resources: Real-time HPE requires efficient utilization of computational resources, especially in resource-constrained environments, such as embedded systems or mobile devices (Fanelli et al. 2011). Balancing accuracy and computational efficiency is a challenge in developing real-time algorithms.
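The following minimal sketch, referenced in the speed criterion above, shows one way to check the 30 fps requirement by timing repeated forward passes; `model` and `frame` are placeholders for an actual estimator and input image, not any specific library API.

```python
# A hedged timing sketch for the 30 fps real-time criterion.
import time
import numpy as np

def measure_fps(model, frame, warmup=10, iters=100):
    """Average frames per second of `model(frame)` over `iters` runs."""
    for _ in range(warmup):              # warm-up: caches, JIT, GPU init
        model(frame)
    start = time.perf_counter()
    for _ in range(iters):
        model(frame)
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Dummy stand-in: a "model" that simulates roughly 20 ms of work.
fake_model = lambda x: time.sleep(0.02)
frame = np.zeros((224, 224, 3), dtype=np.uint8)
fps = measure_fps(fake_model, frame)
print(f"{fps:.1f} fps -> {'meets' if fps >= 30 else 'misses'} the 30 fps target")
```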

Researchers and developers continue to address these challenges by exploring advanced techniques, such as deep learning models, optimization algorithms, and data augmentation methods, to improve the accuracy, speed, and robustness of HPE for real-time applications. Murphy-Chutorian and Trivedi (2008) proposed design criteria that an HPE method must satisfy: it must be accurate, autonomous, multi-person, monocular, resolution independent, and invariant to lighting, allow a full range of head motion, and provide real-time results.

In summary, advanced deep learning techniques have significantly enhanced the performance of HPE methods. In recent years, the utilization of neural networks with continuous rotation representations has resulted in notable improvements in landmark-free methods, exemplified by TriNet (Cao et al. 2021) and 6DRepNet (Hempel et al. 2022). Table 1 presents the advantages and disadvantages of the main steps of HPE.

4.3 Future research directions

In this survey, we identified several potential research directions for HPE. One area of interest is integrating multimodal information, such as audio and gaze cues, to improve the accuracy and robustness of HPE. For example, audio-visual synchronization techniques can enhance HPE in noisy environments, and integrating gaze tracking with HPE can be crucial in scenarios such as DMSs, where both head orientation and gaze direction are critical.

Another direction is exploring self-supervised and unsupervised learning methods to reduce the dependence on annotated data and improve generalization to novel scenarios. For example, contrastive learning could be used to learn robust feature representations from video sequences, allowing HPE models to adapt to new environments without extensive retraining. Additionally, more effective and efficient training strategies, such as curriculum learning or adversarial training, could improve the performance and scalability of HPE models; for instance, curriculum learning could begin with simpler tasks, such as predicting head rotation angles in a controlled environment, and progress to more complex ones, such as prediction in dynamic, multi-person scenarios (a minimal sketch of this idea appears at the end of this subsection).

Another promising direction is incorporating attention mechanisms and spatial reasoning to enable more fine-grained localization and understanding of head pose. For example, attention mechanisms integrated into the model architecture can selectively focus on key facial regions or contextual features, improving the model's ability to differentiate subtle head movements.

Furthermore, developing HPE techniques for specific applications, such as real-time tracking in VR or surveillance scenarios, could open new opportunities for the practical deployment of HPE models. To meet the requirements of such applications, the design criteria discussed above should serve as a roadmap for future developments: the MAE should be \(5^\circ\) or less under the full range of head motion, and the system should run autonomously, without manual initialization, estimating multiple people in a single image at both high and low resolution, at a high frame rate (30 fps or faster), from a monocular camera, under the dynamic lighting found in many real environments, thereby addressing all the challenges discussed in Sect. 4.2.

Moreover, the relationship between human pose and head pose can be explored through methodologies that jointly analyze the orientation and position of a person's body and head. This relationship is significant in applications such as HCI and surveillance, because understanding the head pose within the context of the overall body pose provides valuable insights into a person's actions and intentions.

Finally, the ethical and social implications of HPE, such as privacy concerns and potential biases, warrant further investigation in future research. Research is needed on privacy-preserving techniques, including guidelines for data collection and user consent and training schemes in which models are trained locally on devices without sharing sensitive data, as well as on bias mitigation strategies that ensure fair and equitable outcomes across diverse populations.
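As a concrete illustration of the curriculum idea mentioned above, the following hypothetical schedule orders training samples from near-frontal (easy) to extreme poses (hard) and enlarges the training pool over epochs; the difficulty proxy and growth schedule are illustrative assumptions, not a published recipe.

```python
# A hedged sketch of a pose-difficulty curriculum for HPE training.
import numpy as np

def curriculum_schedule(labels_deg, epochs):
    """Yield per-epoch index arrays, easiest samples first.

    labels_deg: (N, 3) ground-truth (yaw, pitch, roll) in degrees.
    """
    difficulty = np.abs(labels_deg).max(axis=1)  # proxy: largest |angle|
    order = np.argsort(difficulty)               # easy -> hard
    n = len(order)
    for epoch in range(epochs):
        frac = min(1.0, 0.3 + 0.7 * epoch / max(1, epochs - 1))
        yield order[: max(1, int(frac * n))]     # progressively larger subset

labels = np.random.uniform(-90, 90, size=(1000, 3))
for epoch, idx in enumerate(curriculum_schedule(labels, epochs=5)):
    print(f"epoch {epoch}: training on the {len(idx)} easiest samples")
```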
Overall, these potential research directions highlight the exciting opportunities and challenges in the field of HPE, and we hope that this survey paper will contribute to advancing the state-of-the-art and facilitating further research in this important area of computer vision.

4.4 Advantages and limitations of this work

The advantages and limitations of this work can be outlined as follows:

4.4.1 Advantages

Comprehensive coverage: The survey encompasses 214 papers published up to 2024, providing a broad overview of advancements in HPE, which is crucial given the rapid development of the field. This extensive review allows readers to gain insight into the various methodologies and applications of HPE.

Detailed categorization: The survey classifies HPE techniques into eleven categories organized into four groups. This structured approach aids in understanding the relationships between different components of HPE systems and facilitates systematic exploration of the field.

Discussion of state-of-the-art techniques: The survey includes descriptions of the latest advancements in HPE, such as the use of continuous rotation representations and attention mechanisms. This focus on cutting-edge methods makes it a valuable resource for researchers looking to understand current trends and innovations.

Comparison of datasets: The survey includes a thorough comparison of publicly available datasets relevant to HPE, summarizing their characteristics and annotations. This information is crucial for researchers when selecting appropriate datasets for training and evaluation.

Identification of challenges and future directions: This work identifies current challenges in HPE, such as handling occlusions, achieving real-time performance, and dealing with diverse conditions, and suggests future research directions. This forward-looking perspective is valuable for guiding ongoing and future studies in the field.

4.4.2 Limitations

Emphasis on recent developments: The focus on recent advancements may overlook foundational techniques that still hold relevance. A more balanced view, including both historical and contemporary methods, could enhance the survey's comprehensiveness.

Brief coverage of applications: Although the survey identifies HPE applications, their discussion is relatively brief owing to space limitations. A more in-depth exploration of these applications could better guide future research initiatives; indeed, HPE applications may warrant a separate survey paper.

5 Conclusions

In this survey paper, we have provided a comprehensive overview of recent techniques for AI-based HPE systems. The survey covered 214 articles related to AI-based HPE systems published over the last two decades. We compared and analyzed 113 articles published between 2018 and 2024, of which 70.5% focus on deep learning, 24.1% on classical machine learning, and 5.4% on hybrid approaches. We categorized the steps of HPE frameworks into eleven main categories, discussed the available datasets and evaluation metrics, and identified potential future research directions. The eleven steps were organized into four groups: (1) application context, covering the choice of application, the specific tasks, and the environment in which the system operates; (2) data handling and preparation, covering the dataset type, the range of angles, the representation method, and the degrees of freedom; (3) techniques and methodologies, covering the techniques used, the approach to landmark detection, and the rotation type; and (4) evaluation metrics. Moreover, we provided a comprehensive comparison of several publicly available datasets and visualized each category's proportion of different HPE methods.

Through this survey, we have highlighted the strengths and limitations of different approaches and provided insights into the challenges and opportunities in this area of computer vision. Overall, our analysis shows that HPE is a challenging and important problem with many potential applications for AI-based HPE systems. While significant progress has been made in recent years, many open research questions and practical challenges remain, such as robustness to occlusion and lighting variation, and the scalability and efficiency of models for applications that require full-range angles, particularly in unconstrained environments. We hope that this survey paper will provide a useful resource for researchers and practitioners in the field, facilitating a better understanding and comparison of the different approaches and stimulating further research and development in this important area of computer vision.