Abstract
This study offers a systematic literature review on the application of Convolutional Neural Networks in Virtual Reality, Augmented Reality, Mixed Reality, and Extended Reality technologies. We categorise these applications into three primary classifications: interaction, where the networks enhance how users engage with virtual and augmented environments; creation, showcasing the networks’ ability to assist in producing high-quality visual representations; and execution, emphasising the optimisation and adaptability of applications across diverse devices and situations. This research serves as a comprehensive guide for academics, researchers, and professionals in immersive technologies, offering profound insights into the cross-disciplinary realm of network applications in these realities. Additionally, we underscore the notable contributions concerning these realities and their intersection with neural networks.
1 Introduction
Convolutional Neural Networks (CNNs) are a subset of Machine Learning (ML) within the Deep Learning (DL) model family, commonly used in image and video analysis. They excel in detection, recognition, reconstruction, and object tracking, making them valuable for virtual reality (VR), augmented reality (AR), mixed reality (MR), and extended reality (XR) applications (Alzubaidi et al. 2021). The COVID-19 pandemic (Sarfraz et al. 2021; Lee and Trimi 2021) and Facebook’s shift towards Meta (Kraus et al. 2022) have accelerated advancements and popularity in these fields.
1.1 Contribution
In recent years, there has been an increase in the use of CNNs in the field of computer vision due to their strengths in processing images and video and in recognising or classifying their content (Alzubaidi et al. 2021; Bhatt et al. 2021). On the other hand, the use of Virtual, Augmented, Mixed, and Extended Realities has grown in popularity and interest, even in business fields. The use of virtual elements in socialisation, marketing, exploration, and education, among others, is becoming more common every day. Given these advances, it is essential to understand how CNNs have been used across the various fields of XR over the past decade.
Given the strong focus of CNNs on image and video processing, their application in the field of XR is particularly promising. While other deep learning methods are effective in various domains, CNNs are distinguished by their exceptional efficiency in processing visual data, making them potentially suitable for enhancing immersive experiences in XR environments.
In this paper, we explore the use of CNNs in the field of VR/AR/MR/XR. Based on the problem described above, this article aims to answer the following research question: "What is the use of CNNs in VR, AR, MR, and XR as a whole?". To answer this question, a systematic literature review has been performed.
1.2 Structure
This work is divided into the following sections: in Sect. 2, there is a description of the key concepts related to our work. In Sect. 3, the related studies that partially address the research question within the field of study are analysed. In Sect. 4, we outline the methodological process followed in the systematic review. In Sect. 5, the different classifications proposed in this research, their features, and distributions are depicted. In Sect. 6, we discuss the study’s scope, its limitations, and potential scenarios not considered in the research. Finally, in Sect. 7, we conclude with the most relevant findings.
2 Background
In this section, we address the concepts related to immersive technologies such as VR, AR, MR, and XR. We also go into depth on Deep Learning, a subfield of machine learning, and specifically on Convolutional Neural Networks.
2.1 Clarifying immersive technologies: virtual, augmented, mixed, and extended realities
Immersive technologies such as VR, AR, MR, and XR each offer unique ways of blending digital elements with real-world experiences, but they do so in distinctly different manners. VR provides a completely immersive experience where users interact within a computer-generated environment using specialised devices, completely isolating them from the real world to create a strong sense of presence (Burdea and Coiffet 2017). In contrast, AR overlays digital objects onto the real world, allowing users to see and interact with a blend of real and virtual elements without losing touch with their surroundings (Azuma 1997). MR goes a step further by not just overlaying but also anchoring virtual objects to the real world, enabling interactions that affect both the digital and physical elements simultaneously (Milgram and Kishino 1994). Finally, XR encompasses all the aforementioned technologies, serving as an umbrella term that covers the entire spectrum of interactions between the digital and physical realms (Billinghurst and Nebeling 2021; Gugenheimer et al. 2022). Each technology plays a unique role in how users perceive and interact with the blend of real and virtual environments, from complete immersion in VR to seamless integration in MR, and a holistic approach in XR.
2.1.1 Virtual reality
VR is defined by Burdea and Coiffet in their book "Virtual Reality Technology" (Burdea and Coiffet 2017) as an immersive technology that allows users to interact with a three-dimensional environment generated by a computer. Through specialised input and output devices, such as virtual reality glasses, VR immerses the user in a virtual environment, allowing them to experience and manipulate this environment as if it were real. This total immersion can generate a feeling of presence, that is, the perception of truly being inside the virtual environment.
2.1.2 Augmented reality
AR is defined as a variant of virtual environments that complements reality by allowing the user to see the real world with overlaid or combined virtual objects (Azuma 1997). Unlike VR, which fully immerses the user in a synthetic environment, AR maintains the user’s connection to the real world, creating an experience in which virtual and real objects seem to coexist in the same space (Azuma 1997). AR is positioned between the entirely synthetic and the entirely real (Milgram and Kishino 1994).
2.1.3 Mixed reality
MR is a concept that lies within the reality-virtuality continuum proposed by Milgram and Kishino (1994), combining elements of real and virtual environments to create enriched, unique experiences. This technology allows the simultaneous communication and integration of digital objects and elements with the real environment, creating a space in which both real and virtual objects coexist and influence each other (Milgram and Kishino 1994). In this way, MR enriches the user’s perception of and virtual integration with the surrounding world, providing a balance between reality and virtuality.
2.1.4 Extended reality
XR is an inclusive term that encompasses all immersive technologies that merge the physical world with the digital, such as VR, AR, and MR (Billinghurst and Nebeling 2021; Gugenheimer et al. 2022; Mhaidli and Schaub 2021; Ratclife et al. 2021). This spectrum between the virtual and the real can be observed in Fig. 1. To simplify the classification, in this research the term XR will be used to refer to those works that, within their structure, contemplate or use CNNs in VR, AR, and MR.
2.2 Deep learning and convolutional neural networks
DL, according to Lecun et al. (2015), is a branch of machine learning based on the concept of deep neural networks. It is characterised by the use of multiple layers of processing nodes, or "neurons", each of which can learn representations of data at a different level of abstraction. These representations allow deep neural networks to automatically learn from raw data, making them particularly effective for tasks such as voice recognition, computer vision, and natural language processing.
CNNs belong to the family of deep learning networks and are likewise models used in image and video analysis. According to LeCun et al. (1998), they are a type of deep neural network especially effective in image processing tasks. CNNs are characterised by their ability to automatically learn hierarchical representations of data through the application of convolutional filters, allowing the detection of local features and patterns in images. Additionally, CNNs are translation-invariant, that is, they can recognise a feature anywhere in the image, regardless of its location. This makes them particularly useful for tasks such as object recognition and medical image analysis.
In the context of image processing, CNNs are distinguished by their specialised architecture. Designed specifically for handling visual data, CNNs are composed of three main types of layers: convolutional layers, pooling layers, and fully-connected layers. The combination and stacking of these layers form the architecture of a CNN. Fig. 2 illustrates a simplified CNN architecture for MNIST classification based on LeCun et al. (1998), O’Shea and Nash (2015).
The convolutional layers apply filters over the input image to detect important features, such as edges and textures. The filters move through the image and calculate how they align with different parts of the image, creating a feature map that highlights the detected elements (O’Shea and Nash 2015; Li et al. 2022).
The pooling layers reduce the size of the feature maps generated by the convolutional layers. This is done by selecting the maximum or average value of small regions of the map, helping to reduce the amount of data and making the network more efficient and resilient to small variations in the image (O’Shea and Nash 2015; Li et al. 2022).
The fully-connected layers take the reduced feature map and transform it into a one-dimensional vector. Each neuron in these layers is connected to all neurons in the previous layer. This final vector is used to perform classification, assigning a score to each possible class to decide which category the image belongs to (O’Shea and Nash 2015; Li et al. 2022).
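As a concrete illustration of the layer stack described above (convolution, pooling, fully-connected), the following is a minimal sketch assuming PyTorch; the filter counts and sizes are illustrative and are not taken from LeCun et al. (1998).

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer: learns edge/texture filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer: downsamples the feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                # flatten feature maps into a one-dimensional vector
            nn.Linear(32 * 7 * 7, num_classes),          # fully-connected layer: one score per class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a batch of 28x28 grayscale images (MNIST-sized) mapped to 10 class scores.
logits = SimpleCNN()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```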
2.3 Terminology in the classification of interaction
In this section, we broadly define some concepts used as categories in the proposed classification of CNN use in XR technologies, considering interaction.

One of the concepts used is the Brain-Computer Interface (BCI), a system that allows direct communication and control between the human brain and electronic devices by translating neural signals into commands that can be interpreted by a computer (Lotte 2014). These systems often use electroencephalogram (EEG) signals, a non-invasive technique that measures and records brain electrical activity using electrodes placed on the scalp. This technique provides a real-time representation of human brain activity across different frequency waves (Chartier et al. 2009).

Human-computer interaction (HCI) is a multidisciplinary field that studies the design and use of computer technology, focusing on the interfaces between people (users) and computers. The main goal of HCI is to improve the interaction between users and computer systems, making it more efficient and effective (Rogers 2005).

Gestures are defined as physical movements or postures performed with some part of the body, primarily the hands, to convey information or interact with the environment. In computing, gesture recognition refers to the ability of computer systems to interpret these human gestures accurately and effectively, allowing for a more intuitive and natural interaction between humans and machines (Mitra and Acharya 2007).

A common term in this field is avatar, referring to a user’s digital representation, often used in online environments such as video games, internet forums, and virtual communities. These avatars can vary in realism and can significantly affect interaction and communication in virtual environments, influencing aspects like information disclosure, nonverbal communication, emotional recognition, and the sense of presence in dyadic interaction (Bailenson et al. 2006).

Foveated refers to rendering techniques inspired by the structure of the human eye. These techniques render in full detail only a limited portion of the visual field, corresponding to the fovea, where visual acuity is highest, and represent peripheral areas with less detail (Guenter et al. 2012).

Long Short-Term Memory (LSTM) networks are a special class of recurrent neural networks (RNN) designed to avoid the vanishing gradient problem when learning long sequences. LSTM networks use gates, units that can allow or block information based on certain criteria, to regulate the information that is retained or discarded over time in the network’s memory. This gate structure helps LSTMs retain relevant long-term information while forgetting non-essential details, making them highly effective for many tasks involving sequential data, such as natural language processing, time series analysis, and voice recognition (Hochreiter and Schmidhuber 1997). Moreover, they can be combined with CNNs in the CNN-LSTM model (Tara et al. 2015).
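Since several of the works reviewed later combine CNNs with LSTMs, the following is a minimal sketch of such a CNN-LSTM pipeline, assuming PyTorch; the frame size, feature width, and class count are illustrative placeholders rather than values from any cited study.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # one 8-dimensional feature vector per frame
        )
        self.lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape
        feats = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)                        # gated memory over the frame sequence
        return self.head(out[:, -1])                     # classify from the last time step

# Example: 2 clips of 16 RGB frames at 64x64 mapped to 5 class scores.
print(CNNLSTM()(torch.randn(2, 16, 3, 64, 64)).shape)   # torch.Size([2, 5])
```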
2.4 Terminology in the classification of executing
Photogrammetry is a scientific method used to obtain information about the physical properties of objects and the environment through the interpretation and analysis of photographs and light patterns (Cleveland and Wartman 2006). This technique is widely used to create maps or estimate the geometry of a scene. According to Chen et al. (2016), semantic segmentation refers to the task of assigning a semantic label to each pixel in an image. This computer vision technique aims to understand images at a more detailed level and provide precise descriptions of scenes. Object Detection (OD) refers to the process of identifying specific instances of objects from certain classes, such as people, items, animals, etc., within an image or video. This task involves not only classifying which object is present but also locating that object in space, usually represented by a bounding box around the object. OD differs from semantic segmentation in that the latter labels each pixel in an image with the object class to which it belongs, rather than just placing a bounding box around the entire object. Therefore, semantic segmentation provides a much more detailed and precise understanding of the image content, but it is also computationally more demanding (Girshick et al. 2013).
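To illustrate the distinction drawn above, the sketch below runs a pretrained object detector and a pretrained semantic segmentation model from torchvision (the API of torchvision 0.13 or later is assumed); the detector returns bounding boxes per object, whereas the segmentation model returns one class label per pixel.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.segmentation import deeplabv3_resnet50

image = torch.rand(3, 480, 640)                  # placeholder RGB image with values in [0, 1]

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
with torch.no_grad():
    det = detector([image])[0]                   # dict of boxes, labels, and scores per detection
print(det["boxes"].shape)                        # (num_detections, 4): one bounding box per object

segmenter = deeplabv3_resnet50(weights="DEFAULT").eval()
with torch.no_grad():
    seg = segmenter(image.unsqueeze(0))["out"]   # per-pixel class logits
print(seg.argmax(1).shape)                       # (1, 480, 640): one class label per pixel
```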
2.5 Terminology in the classification of creation
Simultaneous Localisation And Mapping (SLAM) is a technique from robotics that allows an autonomous device to generate a map of its environment and orient itself within it simultaneously. The SLAM technique combines data from various sensors to calculate the device’s position and update the environment’s map in real time (Cadena et al. 2016). In addition, Visual SLAM (VSLAM) is a variant of SLAM that uses cameras as the primary sensor for localisation and mapping. VSLAM can employ one or more cameras to capture images of the environment and use key visual features to calculate the device’s position and create or update the map (Cadena et al. 2016). Some of these applications use the corresponding coordinates in the 3D domain to represent a point; the grouping of these points in the reconstruction is known as a point cloud. A notable example of keypoint extraction techniques based on CNNs is SuperPoint, which was a leading contender in the CVPR2020 image-matching challenge (Liu et al. 2022) (Table 1).
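The following is a minimal sketch of the keypoint extraction and matching step on which VSLAM pipelines rely. Since SuperPoint itself requires trained network weights, the sketch uses OpenCV's classical ORB detector as a stand-in; its role in the pipeline (detect, describe, and match features across consecutive frames) is the same.

```python
import cv2
import numpy as np

# Placeholder frames; in practice these would be consecutive grayscale camera images.
frame_a = np.random.randint(0, 255, (480, 640), dtype=np.uint8)
frame_b = np.random.randint(0, 255, (480, 640), dtype=np.uint8)

orb = cv2.ORB_create(nfeatures=500)
kp_a, des_a = orb.detectAndCompute(frame_a, None)    # keypoints and binary descriptors, frame A
kp_b, des_b = orb.detectAndCompute(frame_b, None)    # keypoints and binary descriptors, frame B

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des_a, des_b)                # correspondences later used to estimate camera motion
print(len(kp_a), len(kp_b), len(matches))
```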
3 Related work
Several literature reviews have been published that consider CNNs in the field of AR, VR, MR, or XR. Most related works focus on a specific approach to the use of convolutional networks, where authors envision the application of their research in some area of XR. In Minaee et al. (2021), the authors conducted a meticulous examination of image segmentation algorithms based on deep learning techniques. These algorithms were grouped according to their architectural categories, aiming to perform a quantitative evaluation of their performance (Chitty-Venkata and Somani 2022). In Chen et al. (2021), the focus was on understanding fluid simulation in the field of ML. They also referenced papers that use CNNs as a data-based Eulerian solver to improve their simulations. The authors were enthusiastic about potential applications in VR fields, with significant potential for use in AR/VR. Jiang et al. (2022) reviewed animal pose estimation. They noted that some research still creates silhouettes manually, whereas semantic image segmentation using CNNs makes it possible to separate both the silhouette and the pose of objects, humans, and even animals (Cheng et al. 2023). Another related work was a review in which 2D instance segmentation types and various CNN architectures are discussed (Gu et al. 2022).
In other research work, the performance of some key disparity estimation methods in images was compared, evaluating depth, an essential element in scene reconstruction tasks, AR, and 3D modelling (Laga et al. 2020; Xu et al. 2019). The latter could be implemented using point clouds and CNNs (Zhang et al. 2020; Wang et al. 2020; Zhang et al. 2023), which require subsequent processing to be used in virtual reality graphic domains (Fahim et al. 2021). The identification of characteristic points (Zhang et al. 2020; Laga et al. 2020) and scene recognition for indoor navigation or movement were already part of the role of CNNs (Khan et al. 2022). As seen in SLAM, a convolutional network was trained instead of building an explicit map, thus maintaining a constant size without experiencing linear growth problems (Yan and Zha 2019).
In Wei et al. (2019), a method for stitching images or videos through semantic matching was proposed, with salient features extracted using CNNs. It was possible to construct a panoramic view from a sequence of images, and this panorama can be displayed in VR (Wei et al. 2019). In Hamza and Dao (2022), the authors reviewed sensing techniques that preserve user information, showcasing a use case employing CNNs, still within the context of AR/VR.
Estimating depth, as well as determining the position and detection of objects, are relevant topics in AR, VR, MR, or XR. In Sahin et al. (2020), a categorisation based on mathematical models was established for object pose estimation methods (Yan and Zha 2019; Zou et al. 2023) and for human motion and pose (Desmarais et al. 2021), both in 2D (Wang et al. 2021; Han et al. 2017) and 3D (Gamra and Akhloufi 2021; Ji et al. 2020; Chen et al. 2020). It concluded that the most accurate techniques utilise various architectures. Another related work is Hoque et al. (2021); thanks to this review, the different architectures used for object detection, together with pose estimation, viewpoint (Zhang and Fei 2019), and edges (Han and Zhao 2019), could be identified. Scale is a factor in such recognitions, with CNNs being among the techniques for detecting small or tiny objects (Tong and Wu 2022).
Quality evaluation of panoramic content is essential in virtualisation and immersive applications. Another related work is a literature review that analysed the techniques and measurements of \(360^\circ\) content. In this work, related articles were found that obtain a quality score from video patch samples using 3D-CNN architectures (Ullah et al. 2022).
To effectively apply CNNs within extended reality technologies, understanding their use is paramount. This becomes especially relevant when considering the need to equip both students and professionals for evolving professions by offering a flexible virtual environment for real-world case studies, leveraging AR/VR-enhanced learning spaces (Bermejo et al. 2023).
HCI is another related field, through gesture recognition in AR/VR domains (Yuanyuan et al. 2021), for example hand poses (Huang et al. 2021), BCIs (Hu et al. 2022; Miltiadous et al. 2023), or the study of the foveated visual approach, where eye movement was predicted using CNNs on AR/VR platforms (Mohanto et al. 2022). In D’Orazio et al. (2016), the importance of depth in the HCI field was emphasised, and within one category of methods the use of CNNs for hand gesture recognition was identified. In Yang et al. (2019), the different algorithms used for gesture recognition in virtual reality are presented; it can also be seen that the combination of CNNs and LSTMs is useful in this field.
4 Methodology of the systematic literature review
The goal of this paper is to present the last thirteen years of research on the use of CNNs in XR. A systematic literature review (SLR) is necessary to answer our research question. Through an SLR, we can identify, evaluate, and select relevant information in a rigorous and structured manner. To this end, this section presents a detailed description of the literature selection process, based on Kitchenham et al. (2009), covering the years 2010 to 2023.
4.1 Databases
From the wide spectrum of current scientific databases, we selected: the explorer of the Institute of Electrical and Electronics Engineers (IEEE), IEEE Xplore; ScienceDirect; and Clarivate Web of Science (WoS).
4.2 Keywords
In order to perform the most suitable query in the previous databases, we defined the following keywords: “CNN”, “Convolutional Neural Network”, “Virtual Reality”, “VR”, “Augmented Reality”, “AR”, “Mixed Reality”, “MR”, “Extended Reality”, “XR” and “Metaverse”. The keyword “metaverse” is included as a conceptual extension of Virtual Reality, due to the recent popular context of the name change of one of the most used social platforms in the world.
Considering the specified keywords and the research questions to be answered, we built the following logical expression: (“Convolutional Neural Network” OR “CNN”) AND (“Virtual Reality” OR “Augmented Reality” OR “Mixed Reality” OR “Extended Reality” OR “AR/VR” OR “MR/XR” OR “Metaverse”). The logical expression was evaluated from 2010 to 2023, narrowing the results through specific filters for each database. As a result, we obtained a total of 844 available research papers.
4.3 Search and selection process
The queries performed in each database are described in this section, together with the filters used and the adaptations and limitations of each search engine (see Fig. 3).
In the IEEE database, the previous keywords were used, adding a feature of the advanced search engine (the use of asterisks to perform an extended search over different inflections of a term). The logical expression used in IEEE was: ((“Convolutional Neural Network*” OR “CNN*”) AND (“Virtual* Reality*” OR “Augmented* Reality” OR “Mixed* Reality” OR “Extended* Reality” OR “AR/VR” OR “VR/AR” OR “Metaver*” OR “MR/XR”)). Additionally, the results were filtered to Journals and Magazines.
In ScienceDirect, the asterisks were removed since its search engine does not have this feature. The logical expression used was: (“Convolutional Neural Network” OR “CNN”) AND (“Virtual Reality” OR “Augmented Reality” OR “Mixed Reality” OR “Extended Reality” OR “AR/VR” OR “VR/AR” OR “Metaverse”). In this case, it was necessary to remove some terms, since this engine does not allow the use of more than eight logical connectives. To address potential gaps, an additional search was conducted, ensuring consistency in search logic. For both searches, the applied filters were: years between 2010 and 2023; type of article: review articles and research articles; journals ranked in quartiles Q1 and Q2; subject area: Computer Science.
In the WoS database, the search query was adjusted by prefixing with ‘ALL’, denoting a search across all editions. The search logic employed was: ALL=((“Convolutional Neural Network” OR “CNN”) AND (“Virtual Reality” OR “Augmented Reality” OR “Mixed Reality” OR “Extended Reality” OR “AR/VR” OR “VR/AR” OR “MR/XR” OR “Metaverse”)). The scope was restricted to publications from 2010 to 2023. Applied filters include: type of documents: review articles and research articles; journals ranked in quartiles Q1 and Q2.
To ensure rigorous selection and systematic analysis of the literature, we refined the description of our inclusion and exclusion criteria. Initially, our search yielded 844 works, as seen in intersection "D" of Fig. 4. These articles were then subjected to a detailed evaluation process, beginning with an initial screening based on the title, abstract, introduction, and conclusions.
Inclusion criteria for the review were as follows: articles had to specifically discuss the application or potential application of technologies within extended reality environments, involve or discuss the implementation of convolutional neural networks, and be published in journals and magazines to ensure credibility and scholarly value. Exclusion criteria included articles that did not specifically address extended reality technologies or did not incorporate CNNs, as well as incomplete or preliminary studies based solely on abstracts without full studies. This meticulous categorisation and selection process led to the final inclusion of 348 articles, which were further organised by categories relevant to the review’s focus.
4.4 Bibliometric results
The 348 selected works were analysed bibliometrically, obtaining useful information such as the year of publication (see Table 2). Since 2018 is a turning point in these research works, we divided the time spectrum into two blocks: before 2018 and after it. As shown in Fig. 5, there was an increase in research works before 2020, and 2022 is the year in which the most research works were published. Furthermore, between 2019 and 2020, the number of research works doubled compared to the previous period; although an increase in research is also evident in 2021 and 2022, the slope is not as steep as that of the period between 2018 and 2019. Additionally, a sharp decrease in articles is evident in 2023, which covers publications only up to August.
5 The use of CNNs in extended reality environments
In this section, we address the classification of research works focusing on crucial aspects that underline our goal of understanding the use of CNNs in AR/VR/MR/XR. We examine three fundamental characteristics: interaction, execution, and creation, in order to effectively cover the different research areas and applications of these technologies. The interaction category focuses on how users interact with virtual environments and how CNNs can enhance these interactions, including gesture recognition, EEG, gaze tracking, and the integration of virtual objects into the physical environment, allowing for more immersive and intuitive experiences. The execution category focuses on the implementation and performance of XR applications, including performance optimisation, context adaptation, and personalisation of the experience based on user and environment features, resulting in smoother, more efficient, and adaptive XR experiences that enhance user satisfaction and the adoption of these technologies in various fields and applications. Finally, the creation category refers to the generation of content and visual elements for XR environments using CNNs, addressing texture synthesis, 3D model generation, character animation, and the creation of realistic virtual environments, facilitating and accelerating the content production process and allowing for more detailed and customised results.

Table 3 displays the classification of the selected articles, separated by rows into interaction, execution, and creation sections and by columns into the different XR technologies. It is important to clarify that the classification into interaction, execution, and creation is mutually exclusive, while that into the VR, AR, MR, and XR technologies is non-exclusive. Additionally, Fig. 7 shows the number of articles per year in the mentioned classifications, and Fig. 8 displays their percentage distribution, with the execution category having the highest concentration.

In order to detail the articles included in interaction, execution, and creation, an internal classification of each of these sections is proposed, using the characteristics of the different articles and maintaining the classification of XR. In Fig. 6, the different categories into which the articles were classified can be seen. The following sections provide details about each of these categories.
Fig. 8 is divided into three main categories. Creation, represented by the colour orange, accounts for 24% of the total number of articles. Execution, represented by yellow, is the largest category, representing 43% of the total articles reviewed; it includes research focused on the implementation and performance of XR applications, covering topics such as performance optimisation; identification, segmentation, and tracking of objects or contextual elements; object motion and pose; and personalisation of the experience based on user and environment characteristics. The interaction category, represented by the colour green, makes up 33% of the total number of articles.
5.1 Interaction classification
In the context of CNNs in XR, the category ’interaction’ refers to the dynamic exchange between users and virtual environments, where CNNs process and respond to user inputs, enabling engagement with XR content. Considering this fundamental category, a classification in this area (and sub-areas) is proposed. In Table 4, the articles related to the interaction between users and AR/VR/MR/XR environments using CNNs can be observed, with the highest concentration in the classification of HCI and gestures (see Fig. 9).
Fig. 9 shows that 41% of the articles focus on human-computer interaction and gestures, followed by hand detection with 15% and gaze tracking with 11%. The subcategories of body recognition, use of sensors, facial recognition, emotions, tools, and audio represent between 5% and 7% each, while the brain-computer interface is the lowest with 2%. This indicates that most research prioritises improving the interface and natural interaction with XR technologies, with significant interest in methods such as hand detection and eye tracking.
5.1.1 Brain-computer interface
In the BCI subcategory, articles that use brain signals within the field of study of this research were classified. In this subcategory, it was found that CNNs are used to decode movement; for example, Achanccaray and Hayashibe (2020) decode EEG signals to obtain hand movement. In this field, it is also possible to detect the level of acrophobia in VR environments using EEG and a ResNet network: in Wang et al. (2021), the levels of a user’s acrophobia in VR are classified.
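As a hedged illustration of how EEG windows are typically fed to a CNN in such BCI studies, the following sketch (assuming PyTorch) classifies a channels-by-time window into movement classes; the electrode count, window length, and class labels are placeholders, not values taken from the cited works.

```python
import torch
import torch.nn as nn

eeg_cnn = nn.Sequential(
    nn.Conv1d(in_channels=32, out_channels=16, kernel_size=7, padding=3),  # temporal filters over EEG channels
    nn.ReLU(),
    nn.MaxPool1d(4),                       # downsample along the time axis
    nn.Conv1d(16, 16, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 3),                      # e.g., left hand / right hand / rest
)

window = torch.randn(8, 32, 256)           # batch of 8 windows: 32 electrodes x 256 time samples
print(eeg_cnn(window).shape)               # torch.Size([8, 3])
```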
5.1.2 HCI and gestures
One of the most relevant categories in the interaction classification is the HCI and gestures subcategory. Here, we find articles that use CNNs to improve the interface and immersion, such as "Deep Spherical Harmonics Light" (Mohammed et al. 2019). This technique estimates the lighting configuration of the real environment from a single RGB image, without prior knowledge of the scene. It helps solve inconsistencies in lighting that could occur when integrating the user’s hands and virtual elements, enhancing the user’s immersion in the MR application. In Sen et al. (2022), Sagayam et al. (2022), He et al. (2022), Ge et al. (2022), Yao and Qiu (2021), Alam et al. (2022), Jia et al. (2021), Bose and Kumar (2021), Kang et al. (2020), Polap et al. (2020), Xu et al. (2020), Li and Fan (2020), the focus is on hand gestures, improving interaction processes in virtual environments or using heat sensors to capture hand movements and enable writing (Kim et al. 2017). Additionally, in Liu and Pan (2022), a system is proposed that enhances the perception of the virtual world, capturing and transmitting user movement information to a supervised model. This system quickly collects and analyses data from multiple sources and provides feedback to the mobile device about the ongoing activity. This allows the user to move and make simple gestures in a specially designed space, among other things, distinguishing the player’s change of direction in real time. In Karambakhsh et al. (2019), AR and CNNs are used to enhance teaching in medical education, specifically in anatomy teaching. The authors propose a CNN for gesture recognition, which interprets human gestures as specific instructions. They use AR technology to simulate scenarios where students can learn anatomy using HoloLens, instead of real specimens that can be difficult to obtain. Their approach is not only more accurate, but it also has greater potential to incorporate new gestures across different models.
5.1.3 Foveated and ocular visualisation
In the subcategory of foveated and ocular visualisation, the focus is on eye tracking, attention identification, and gaze recognition. In Hu et al. (2020), the authors propose training a CNN to directly segment the full elliptical structures of the eye. They argue that this framework is more robust against obstructions than previous ones and offers higher performance in tracking the pupil and iris. Compared to using standard segmentation of parts of the eye, the authors claim that their method improves the detection rate of the pupil and iris centres by at least 10% and 24%, respectively (within a margin of error of two pixels). Segmentation, biometric verification (Boutros et al. 2020), attention (Dai et al. 2021; Lee et al. 2019), and gaze tracking (Hu et al. 2021) or its prediction (Yuan et al. 2017) are all part of the use of CNNs in AR and VR. In Dai et al. (2022), the authors propose a new gaze-tracking method based on the fusion of binocular features and a CNN. This method integrates both local (LBSAM) and global (GBSAM) binocular spatial attention mechanisms into the network model to improve accuracy. LBSAM is a mechanism used to distinguish the importance of different regions of the two eyes, aiming to enhance gaze-tracking accuracy. GBSAM spatially weighs the head, face, and image angle to include global variables in gaze tracking. Additionally, the authors validated this method using the GazeCapture database. In Olszewski et al. (2016), CNNs are used to map images of a user’s mouth region to the parameters controlling a digital avatar in VR. However, the authors demonstrate that their approach can also track expressions in the user’s eye region using an internal infrared camera, allowing for complete facial tracking. In Kothari et al. (2021), facial recognition is performed considering AR filters applied to the face, using distance algorithms and different facial landmarks, especially the eyes. In Huong et al. (2022), the authors develop a machine learning model for assessing the quality of 360\(^{\circ }\) images in VR, utilising foveated technologies. Foveated technologies leverage the focusing feature of the human eye, which significantly reduces the data required for transmission and the computational complexity of rendering. This is important because 360\(^{\circ }\) images, a key component of VR systems, are typically large and therefore require efficient transmission and rendering solutions.
5.1.4 Hand detection
In the subcategory of hand detection, the articles that use CNNs in VR, AR, MR, or XR to identify and track hands are grouped. It is important to note the overlap between this category and the gestures one. Hand detection allows for a more fluid and natural interaction between the user and the virtual, augmented, or mixed environment. For example, a user might interact with virtual objects using hand gestures (Wang et al. 2022; Cofer et al. 2022; Yuan et al. 2021; Achanccaray and Hayashibe 2020; Aly and Aly 2020; Zhou et al. 2019; Emporio et al. 2022; Caputo et al. 2021; Li and Zhao 2021; Li et al. 2020; Zhang and Chi 2020; Malik et al. 2019; Gomez-Donoso et al. 2019; Marques et al. 2018; Li et al. 2022; Sen et al. 2022; He et al. 2022; Ge et al. 2022; Yao and Qiu 2021; Polap et al. 2020; Xu et al. 2020; Mohammed et al. 2019), instead of relying on traditional controllers or input devices. Estimating the hand pose with realism is crucial in VR/AR; Deng et al. (2021) stands out as it manages to determine the hand pose with great precision and realism using CNNs, and the authors additionally generate a dataset for future applications. In Liu et al. (2021), the authors propose a 3D micro-gesture recognition system based on a 3D holoscopic image sensor. Due to the lack of 3D holoscopic datasets, they created a comprehensive 3D holoscopic micro-gesture database (HoMG) that is used to develop a robust 3D micro-gesture recognition method. They improve performance by using multiple viewpoints from a single holoscopic image and apply a CNN model with an attention-based residual block to each hand viewpoint image. In Wu et al. (2020), a system is proposed that uses depth images to accurately estimate a hand’s position in 3D space for XR applications. The CNN is designed with a skeleton difference loss function that allows it to effectively learn the physical constraints of a hand. This enables accurate prediction of hand joint positions even in challenging environments or with occlusions.
5.1.5 Facial recognition
In the facial recognition subcategory, segmentation (Liang et al. 2019) is fundamental, as is recognition for the identification of expressions (Albraikan et al. 2022; Alashhab et al. 2022), emotions, similarities, and structures or features. In Zhou and Feng (2022), the authors propose M3SPCANet, a CNN that uses multiscale PCA filters to obtain facial feature maps and enhance detection and recognition. In Refat et al. (2022), the conventional steps of face detection, face alignment, feature extraction, and similarity computation are combined into a single cohesive process. Face detection also has applications in animation for VR: in Olszewski et al. (2016), recordings and expressions generated by artists are used to simulate facial expressions.
5.1.6 Body analysis for interactivity
The body analysis for interactivity subcategory focuses on the identification of the body, its actions, and the representations these can generate for interaction, such as the localisation of anthropometric landmarks from 3D body scans (Kozbial et al. 2020). In Kozbial et al. (2020), they propose an approach to detect and provide real-time feedback on body movement errors in physical training conducted in virtual reality. In Zherdev et al. (2021), they estimate the 3D human pose from a single image. Instead of relying on a single complex estimator, they use multiple partial hypotheses. With this approach, they select several joint groups from a human joint model and estimate the 3D pose of each joint group separately using CNNs. These pose estimates are then combined to obtain the final 3D pose.
5.1.7 Use of sensors
In the use of sensors subcategory, the selected works use sensor signals to improve interaction. In Wang et al. (2022), a wrist sensor is used and different movement features are combined to classify 12 gestures, highlighting its utility in detecting gestures based on a wrist sensor (for example, through a smartwatch). In Smith et al. (2021), a safety control framework for human-robot collaboration is proposed, and both image analysis and the robot’s sensors are used to monitor safety in a digital twin using CNNs. In Brandolt Baldissera and Vargas (2020), the accuracy of manual operations in VR training systems is evaluated using gloves with sensors that collect precise data on hand movements. The authors assert that datasets from multiple sensors are seldom leveraged to assess actions in VR training systems. In Tao et al. (2020), worker activity recognition is carried out in AR in intelligent manufacturing systems using sensors and vision. The authors developed a dataset of worker activity, which includes six common activities in assembly tasks: grabbing a tool/part, hammering a nail, using an electric screwdriver, resting the arms, turning a screwdriver, and using a wrench. In Liu et al. (2020), the authors propose a mirror therapy system based on the recognition of multichannel signal patterns and mobile augmented reality. The overall accuracy of the SVM is 93.07%, while that of the CNN reaches up to 97.8%. These results suggest that machine learning techniques can play a crucial role in enhancing the effectiveness of mirror therapy for the rehabilitation of post-stroke hemiparesis, an alteration in the functioning of one side of the body.
5.1.8 Auditory processing
In the auditory processing subcategory, articles identify different sound inputs or outputs to enhance interaction. This includes voice recognition or the spatial application of sound in immersive technologies. In Siyaev and Jo (2021), the authors propose an MR-based solution for education and training in aircraft maintenance, specifically the Boeing 737, using smart glasses. The solution includes a deep learning-based voice interaction module that allows trainee engineers to control virtual assets and workflows through voice commands, freeing their hands for other tasks. In Lopez Ibanez et al. (2021), researchers developed a head gesture recognition system to identify and interpret human emotions, specifically fear. This system is designed to be integrated into an adaptive music system (LitSens) in virtual reality applications, aiming to improve immersion and virtual presence. In Amjad et al. (2022), emotions are identified from audio signals using CNNs and LSTMs. These models learned audio representations from deep segments of Mel spectrograms, which are visual representations of the spectral energy distribution of an audio signal. The models were trained using raw voice data, as well as Mel spectrogram segments from different perspectives (middle, left, right, and side), allowing the models to learn both local and global features of the audio signals. In Ling et al. (2020), UltraGesture is presented, a system for perceiving and recognising finger movements based on ultrasounds. UltraGesture uses the Channel Impulse Response (CIR), which detects and recognises small finger movements through sound.
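As an illustration of the Mel-spectrogram front end described above, the following sketch (assuming torchaudio) converts raw audio into a mels-by-frames representation and passes it to a small CNN; the parameter values and emotion classes are illustrative rather than those of Amjad et al. (2022).

```python
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=512, hop_length=160, n_mels=64
)
waveform = torch.randn(1, 16_000)            # one second of placeholder speech at 16 kHz
spec = mel(waveform).unsqueeze(0)            # shape (1, 1, 64 mel bands, ~101 frames)

emotion_cnn = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 4),                        # e.g., neutral / happy / sad / fear
)
print(emotion_cnn(spec).shape)               # torch.Size([1, 4])
```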
5.1.9 Emotional analysis
In the emotional analysis subcategory, the articles that manage to predict or categorise feelings, as well as disorders or certain diseases, are grouped. In Liang et al. (2019), a CNN is proposed that efficiently classifies human facial expressions of emotions. Thus, from the face, different emotions can be classified (Albraikan et al. 2022; Xiao et al. 2020; Song et al. 2020; Izountar et al. 2022; Martínez et al. 2021; Chirra et al. 2021), or emotions can be recognised through speech (Mustaqeem Sajjad and Kwon 2020; Amjad et al. 2022). Consequently, in Chiu et al. (2020), an innovative and easy-to-use emotionally aware virtual assistant for university campus environments is presented, which improves efficiency in semantic interpretation and emotion identification through voice. In Hedman et al. (2022), the authors predict the degree of motion-induced dizziness when viewing a 360\(^{\circ }\) stereoscopic video. The method is based on the use of three-dimensional convolutional neural networks (3D-CNN) and considers the movement of the user’s eye as a new feature, added to the speed and depth-of-motion characteristics of the video.
5.1.10 Interaction tools
In the interaction tools subcategory, articles are grouped that indirectly affect or use interaction for a specific purpose. In Liu (2021), the focus is on developing a motion detection system using a CNN for the recognition of high-difficulty sports movements. In this case, the CNN is used to extract images and perform computational preprocessing for the recognition of each human motion image. In Mukhopadhyay et al. (2022), the authors propose an application of detection technologies to improve workplace safety by measuring social distance between individuals. Using data visualisation techniques based on intermittent layers, heat maps, CNNs, and digital twins, they recognise proximities between people in VR. In Vaughan and Gabrys (2020), methods of scoring for personalised haptic virtual reality medical training simulators are proposed and evaluated. A novel approach called dynamic time warp multivariate prototypes (DTW-MP) was proposed, and the VR data was classified into experience-level categories: Beginner, Intermediate, or Expert. Various algorithms were used for classification, achieving different levels of accuracy: dynamic time warp with one nearest neighbour (DTW-1NN) achieved 60%, the SoftDTW nearest centroid classification achieved 77.5%, and a deep learning approach (ResNet) achieved 85%, demonstrating the use of CNNs for VR interaction assessment. In Tai et al. (2021), an approach is proposed to improve the accuracy of image-guided lung biopsy in patients with COVID-19 through the combination of AR, custom haptic surgical tools, and CNNs. The authors propose a personalised surgical navigation system that can adapt to the individual needs of each patient. The system’s performance was evaluated by 24 thoracic surgeons through objective and subjective tests. The results show that the use of AR with the deep learning model outperforms existing navigation techniques as of the year 2021, offering significantly better performance. In Liu et al. (2022), user interaction is used to predict potential hacker intrusions, through an intrusion detection simulation training system with model identification based on CNNs and LSTMs in VR.
Thanks to the analysis of the articles included in Table 4, it can be said that the use of CNNs plays a crucial role in improving interaction in virtual, augmented, mixed, and extended reality applications. They allow users to communicate and interact with their environment in a more natural, fluid, and accessible way. Additionally, they enable the identification, tracking, and real-time following of people or parts of their body in these environments, enhancing the interaction between the user and their virtual or augmented surroundings. Furthermore, they are used to track the position of the user’s eyes and determine where they are looking. This allows VR and AR applications to provide information and interaction options based on the user’s gaze direction, enhancing the interaction experience. They can also be used to recognise the user’s voice or translate language or gestures in real time. This allows for a more accessible and natural interaction with VR, AR, MR, and XR applications, especially for users with hearing or speech disabilities. Finally, CNNs are used in behaviour analysis and emotion recognition in virtual environments. This allows applications to adapt to user emotions and reactions, even detecting their dizziness or fear of heights, leading to offering adapted experiences and improving interaction.
5.2 Execution
Table 5 shows the articles that address the use of CNNs in the execution of AR/VR/MR/XR applications and systems. Research in this category focuses on topics such as performance optimisation; the identification, segmentation, and tracking of objects or context elements; the movement and pose of objects; and the adaptation of the experience based on user and environment characteristics. As seen in Fig. 10, 50% of the articles in this category are distributed between recognition and segmentation.
Fig. 10 shows that 29% of the articles focus on advanced object and scene recognition, followed by 21% on semantic and image segmentation. Object detection and tracking takes up 18%, while optimisation accounts for 13%. Motion tracking and recognition accounts for 10%, pose estimation and tracking for 6%, and 360-degree content for 3%. This indicates a primary focus on improving the analysis and understanding of scenes and objects to optimise functionality in XR technologies.
5.2.1 360 degree content processing and analysis
The subcategory of 360-degree content processing and analysis refers to content covering the full, or perigonal, angle; here, the articles dealing with immersive content are found. In Yang et al. (2018), the creation of a 360\(^{\circ }\) dataset is presented, and the study proposes an end-to-end 3D convolutional neural network to rate the quality of VR videos without needing a reference VR video. This method can extract spatio-temporal features, eliminating the need for manually designed features. In Irfan and Munsif (2022), the quality of panoramic videos and stereoscopic panoramic videos is evaluated. The proposed method combines spherical CNNs and non-local neural networks, enabling effective extraction of complex spatio-temporal information from the panoramic video. In Adhuran et al. (2022), researchers propose a new 360\(^{\circ }\) video encoding framework that leverages user-observed viewing information to reduce pixel redundancy in 360\(^{\circ }\) videos. By optimising areas with greater attention in 360\(^{\circ }\) content, the experience in VR is improved (Zhu et al. 2021; Su and Grauman 2022). In Su and Grauman (2021), visual recognition in spherical images produced by 360\(^{\circ }\) cameras is addressed. The authors propose learning a Spherical Convolution Network (SphConv) that translates a flat CNN to the equirectangular projection of 360\(^{\circ }\) images. Given an original CNN for perspective images as input, SphConv learns to reproduce the outputs of the flat filters on 360\(^{\circ }\) data, considering the variable distortion effects on the viewing sphere. Additionally, the authors present a Faster R-CNN model based on SphConv and demonstrate that it is possible to use a spherical object detector without any object annotations in 360\(^{\circ }\) images.
5.2.2 Object detection and tracking
In the subcategory of object detection and tracking, the primary focus is on identifying and following objects. In Hoang et al. (2019), a rapid object detection approach based on deep learning is proposed to identify and recognise types of obstacles on the road, as well as to interpret and predict complex traffic situations. A single CNN directly predicts regions of interest and class probabilities from full images in a single evaluation. In Huang and Yan (2022), the use of MR headset-mounted cameras is proposed for artificial-vision-based object detection related to diet activities, followed by the display of real-time visual interventions to support the choice of healthy foods. In Thiel et al. (2022), the focus is on the classification and retrieval of 3D objects. The authors propose a novel method that combines a Global Point Signature Plus (GPSPlus) with a CNN. GPSPlus is a descriptor that can capture more shape information from a 3D object for a single 2D view. First, the original 3D model is converted into a coloured one using GPSPlus. Next, the 2D projection obtained from this coloured 3D model is stored in a 32 \(\times\) 32 \(\times\) 3 matrix, which is used as input data for a Deep Residual Network with a unique CNN structure. In You et al. (2021), the use of object detection algorithms to enrich visitors’ experiences at a cultural site is proposed, through the implementation of these algorithms on wearable devices, such as smart glasses. In Yu et al. (2022), the authors proposed a solution for 3D object localisation with mobile devices. The proposed method combines a CNN model for 2D object detection with AR technologies to recognise objects in the environment and determine their coordinates in the real world. In Lai et al. (2020), a methodology is introduced to address visual target tracking tasks. It involves using a CNN capable of classifying a set of patches based on how well the target is centred or framed. To counteract potential interferences, the network is fed patches located around the object detected in the previous frame, and of different sizes, to account for potential scale changes and detect the shift. One of the most recent studies is Zhang et al. (2022), in which machine learning and synthetically generated data are used to create object tracking configurations exclusively from this data. The data is highly optimised for training a CNN, providing reliable and robust results in real-world applications while using only simple RGB cameras.
5.2.3 Advanced object and scene recognition
The subcategory of advanced object and scene recognition addresses the activities that follow the detection of an object or its possible movement. This is the section where the object, its attributes, or their representation in a context are recognised, allowing the entire scenario to be recognised (Tang et al. 2022). In Polap et al. (2017), the relationship between the scene and the associated objects in everyday activities is explored from an egocentric vision perspective, that is, from the observer’s point of view. The authors argue that daily activities tend to occur in prototypical scenes that share many visual features, regardless of who recorded the video or where it was recorded, thus recognising the context. In Su and Grauman (2021), a lightweight but powerful CNN called the Efficient Feature Reconstruction Network (EFRNet) is presented for real-time scene recognition. The central idea breaks the process down into two stages: (i) bottom-up dictionary learning/encoding and (ii) top-down feature reconstruction. In Bai et al. (2021), RoadNet-RT is introduced, a lightweight, high-speed CNN architecture specifically designed for road recognition and segmentation. This architecture has been optimised for autonomous driving and virtual reality, where real-time processing speed is essential. In Nambu et al. (2022), MR is used to offer immersive and enriched experiences to the visitors of the Taxila Museum in Pakistan. It recognises museum artefacts using DL in real time and retrieves supporting multimedia information for visitors. To provide the user with the exact content, CNNs are applied to correctly recognise the artefacts. In Ko and Lee (2020), the authors propose an approach to improve 3D object recognition using a view-weighted CNN (VWN); the hypothesis is that different projections of the same 3D object have distinct discriminatory characteristics, and therefore some images are more meaningful than others for object recognition. In Zhang et al. (2020), researchers propose a CNN model for classifying geometric figures. This is achieved by optimising hyperparameters through random search, which optimises the image recognition and classification process.
5.2.4 Segmentation
In the subcategory of segmentation, semantic segmentation and image segmentation are incorporated. In the XR context, segmentation allows for the identification and classification of different objects and elements in a scene, giving the system a deeper understanding of the environment. This process is essential to augment the scene with relevant information, allowing a more natural and precise interaction between the user and the virtual or augmented environment. In Yi et al. (2019), researchers train a neural network for semantic segmentation in different scenarios to process images taking into account the category of the scene, yielding better results. In Zadeh et al. (2020), deep learning-based semantic segmentation is used in gynaecology to detect and locate a structure in an image at the pixel level, providing augmented information to the specialist. In Zou et al. (2020), the focus is on semantic segmentation and depth completion, central tasks for scene understanding that are vital in AR/VR applications. In Zhang et al. (2020), the authors propose a curriculum-based learning approach that seeks to bridge the domain gap in the semantic segmentation of urban scenes. This approach is based on first solving simpler tasks that allow inferring important properties about the target domain. Specifically, they learn global and local label distributions in the images, referencing superpixels. Once these properties are inferred, a segmentation network is trained and its predictions in the target domain are adjusted to fit the inferred properties. In Han et al. (2020), the focus is on challenges and solutions in the semantic understanding of 3D environments. The authors propose a sparse convolution scheme based on fragments to reuse neighbouring points within each spatially organised fragment. By implementing semantic and geometric 3D reconstruction simultaneously on a portable tablet device, the authors demonstrate a foundational platform for AR applications. In Tanzi et al. (2021), different architectures for semantic segmentation are compared in the task of identifying and locating a catheter in medical images and its possible application in AR. In Al-Sabbag et al. (2022), the authors propose a method for visual inspection of structural defects using an XR device. They allow for the interactive detection and quantification of defects using this device and image segmentation, which can overlay graphical information on a real environment. In Liu et al. (2022), a morphological diagnostic system is established for the detection of bone marrow cells based on a Faster R-CNN object detection model. The system is trained to perform pixel-level image segmentation, automatically detecting bone marrow cells and determining their types. The information is visualised and integrated into a microscope with AR. In Zhang and Aliaga (2022), the authors use the CNN RFCNet, which applies regularisation, fusion, and completeness to improve urban segmentation accuracy. Their approach exploits the fact that urban structures often present regular patterns, resulting in improved accuracy. In Jurado-Rodríguez et al. (2022), an automatic procedure is proposed for the generation and semantic segmentation of 3D cars obtained from UAV-based photogrammetric image processing. The authors recognise that deep learning architectures, coupled with the wide availability of image datasets, offer new opportunities for 3D model segmentation. One of the most notable papers on segmentation is Hu and Gong (2022).
The authors present a Lightweight Asymmetric Refinement Fusion Network (LARFNet), designed to perform real-time semantic segmentation on mobile devices. LARFNet is a CNN with an asymmetric encoder-decoder structure, incorporating a depth-separable asymmetric interaction module (DSAI) in the encoder and a bilateral pyramid attention module (BPPA) along with a multi-stage refinement fusion module (MRF) in the decoder. These modules facilitate effective information extraction and feature map refinement and fusion, respectively. In Park et al. (2020), instance segmentation through Mask R-CNN is used, coupled with markerless AR, to overlay the 3D spatial mapping of a real object onto its surrounding environment. This 3D spatial information with instance segmentation is used to provide 3D guidance and navigation, assisting users in identifying and understanding physical objects as they move through the physical environment.
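As a concrete illustration of the instance-segmentation step on which approaches such as Park et al. (2020) build, the following minimal sketch runs an off-the-shelf Mask R-CNN from torchvision on a single frame. It is not the cited system; the confidence threshold and the placeholder input are assumptions, and a recent torchvision release is assumed.

```python
# Minimal sketch of off-the-shelf instance segmentation with torchvision's
# Mask R-CNN, illustrating the generic detect-and-mask step that approaches
# such as Park et al. (2020) build on; it is not their implementation.
# Assumes torchvision >= 0.13 and an RGB image tensor scaled to [0, 1].
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)                 # placeholder for a camera frame
with torch.no_grad():
    prediction = model([image])[0]              # dict with boxes, labels, scores, masks

keep = prediction["scores"] > 0.7               # confidence threshold (an assumption)
masks = prediction["masks"][keep] > 0.5         # binary masks, shape [N, 1, H, W]
boxes = prediction["boxes"][keep]
print(f"{masks.shape[0]} instances kept for overlay; boxes tensor shape: {boxes.shape}")
```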
5.2.5 Optimisation
In the subcategory of optimisation, articles on performance, architecture, or peripherals are grouped. Advances in execution have allowed for smoother, more efficient, and adaptive AR/VR/MR/XR experiences, enhancing user performance and satisfaction and increasingly enabling access to these technologies in everyday settings. In Qu et al. (2023), a CNN-based metric to evaluate quality, called LFACon, is proposed; it surpasses previous-generation metrics and achieves the best performance for most distortion types with reduced computational time. In Spagnolo et al. (2023), the authors utilise a custom Fast Super-Resolution CNN (FSRCNN) accelerator capable of processing up to 214 ultra-high-definition frames/s with lower energy consumption and without compromising perceptual visual quality, achieving a 55% energy reduction and a performance rate 14 times higher than other devices. In Pinkham et al. (2023), the authors introduce a near-sensor processor architecture, ANSA, that supports flexible processing schemes and data flows to maintain high efficiency for dynamic CNN workloads on devices, improving energy efficiency. In Luo et al. (2023) and Sun et al. (2023), the authors focus on finding a CNN configuration for point cloud classification that increases accuracy while reducing computational cost and execution time.
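A recurring ingredient of such lightweight, real-time CNNs is the depthwise-separable convolution, which factorises a standard convolution to reduce parameters and compute. The sketch below is a generic illustration of this building block, not the architecture of any accelerator or network cited above.

```python
# Illustrative sketch of a depthwise-separable convolution block, a common way
# to cut the parameter count and compute of CNNs on resource-constrained XR
# hardware. Generic building block only; not any specific cited architecture.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel spatial filter) followed by a 1x1 pointwise conv."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())


standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 128)
print(count_params(standard), "vs", count_params(separable))  # ~73k vs ~9k parameters
```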
5.2.6 Movement estimation
In the subcategory of movement estimation, articles are framed that focus on tracking and movement recognition. In Dai et al. (2020), a "Motion Estimation Network" (MEN), consisting of an object-tracking motion and observation model, is used. This network seeks the most probable locations of the target and creates an additional path from the target’s previous position, optimising movement estimation by generating a small number of candidates close to two possible positions. These candidates are fed into a Siamese network trained to identify the most probable candidate. Each candidate is compared to an adaptable buffer that is updated according to a predefined condition. To adapt to changes in the target’s appearance, a weighting CNN is used, which adaptively assigns weights to the final similarity scores of the Siamese network using sequence-specific information, allowing it to identify and predict movement (Zeng et al. 2021). In Shariati et al. (2020), the authors introduce a solution to estimate ego-motion in a way that preserves user privacy. Ego-motion estimation is a key concept in robotic systems and augmented and virtual reality applications, referring to the ability to estimate one’s own movement from sensory perception. They use a very low-resolution monocular camera and a CNN, named SRFNet, to recover ego-motion. The results of this study indicate that ego-motion can be robustly recovered from very low-resolution images when camera orientations and metric scales are retrieved from inertial sensors and fused with the estimated translations.
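The candidate-scoring idea behind Siamese trackers can be sketched as follows: a shared CNN embeds the template and the candidate patches, and the candidate with the highest similarity is selected. This is a generic illustration of the principle, not the MEN or weighting network of Dai et al. (2020); the network layout and patch sizes are assumptions.

```python
# Minimal sketch of Siamese-style similarity scoring for candidate selection in
# tracking. Illustrative of the general principle only; network layout, patch
# sizes, and the cosine-similarity score are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Embedder(nn.Module):
    """Shared CNN branch that embeds an image patch into a unit-norm feature vector."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(self.features(x).flatten(1)), dim=1)


embed = Embedder()
template = torch.rand(1, 3, 64, 64)        # target exemplar from the appearance buffer
candidates = torch.rand(5, 3, 64, 64)      # patches sampled near predicted positions

with torch.no_grad():
    scores = embed(candidates) @ embed(template).T   # cosine similarity per candidate
best = scores.squeeze(1).argmax().item()
print("most probable candidate:", best)
```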
5.2.7 Pose estimation techniques
The pose estimation techniques subcategory covers articles that estimate position and orientation, from XYZ coordinates to full degrees of freedom. In Kim and Lee (2019), the authors present a marker-less localisation method for live broadcasts. This approach uses two CNNs to automatically detect the target object and estimate its initial 3D pose. As a result, the 3D model can be aligned on a global map without the need for manual intervention or markers.
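The two-network structure described above, detection followed by initial pose regression, can be sketched generically as below. The detector is an off-the-shelf torchvision model and the 6-DoF pose head is a hypothetical placeholder; neither corresponds to the networks used by Kim and Lee (2019).

```python
# Generic sketch of a two-network pipeline: an off-the-shelf detector localises
# the target object and a second CNN regresses an initial pose from the crop.
# The 6-DoF pose head is a hypothetical placeholder with random weights; this
# is not the system of Kim and Lee (2019). Assumes torchvision >= 0.13.
import torch
import torch.nn as nn
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

pose_head = nn.Sequential(                      # assumed output: translation + axis-angle rotation
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 6),
)

frame = torch.rand(3, 480, 640)                 # placeholder for a broadcast frame
with torch.no_grad():
    det = detector([frame])[0]
    if det["boxes"].numel() > 0:
        x1, y1, x2, y2 = det["boxes"][0].round().int().tolist()
        crop = frame[:, y1:y2, x1:x2].unsqueeze(0)
        crop = nn.functional.interpolate(crop, size=(128, 128))  # fixed-size input for the pose head
        print("initial 6-DoF pose estimate:", pose_head(crop).squeeze(0))
    else:
        print("no object detected in this placeholder frame")
```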
The analysis of the articles included in Table 5 under the category of execution highlights various uses of CNNs in AR/VR/MR/XR technologies. Most of these works focus on aspects such as the identification and semantic segmentation of objects and context elements, as well as their tracking, fundamental aspects for determining and enriching the scene. In addition, emphasis is placed on the analysis of movement and pose of objects in both closed and open environments. This section of the research illustrates the role of CNNs in understanding and improving immersive reality experiences, emphasising their potential to provide richer, more intuitive experiences adapted to the user’s context and underscoring their importance in the development and advancement of extended reality technologies.
5.3 Creation classification
In Table 6, we can see the articles that address the use of CNNs in AR/VR/MR/XR from the point of view of creation. Key topics in this category include automatic 3D model generation, texture synthesis, character animation, and the creation of realistic virtual environments. Fig. 11 shows a more uniform distribution for this classification; however, 3D modelling and reconstruction stand out. CNNs have been used to speed up and simplify the content creation process, while allowing for more detailed and personalised results.
Advances in this area have driven the development of content creation tools and platforms for VR/AR/MR/XR, allowing more users and developers to access and create immersive experiences ever faster and of higher quality. Each of the subcategories in which the articles in Table 6 were catalogued is described below:
Fig. 11 shows that 26% of the articles focus on the creation of virtual environments, followed by 23% on the reconstruction of 3D scenes and objects. The generation of 3D models accounts for 14%, while mapping and lighting simulation take up 10% and 9%, respectively. Texture generation is the smallest subcategory, with 5%. This indicates that most of the research in the creation category is focused on developing realistic and detailed environments and reconstructions to improve functionality and user experience in XR technologies.
5.3.1 Texture
In the texture subcategory, articles associated with the generation and synthesis of textures are grouped. Texture generation is a crucial component in creating compelling and realistic AR/VR/MR/XR environments. CNNs have been employed to model and generate realistic textures that adapt to the environment and virtual objects. The use of CNNs not only improves the visual quality and realism of textures but can also optimise the performance of applications by allowing for the efficient generation and rendering of textures. In Liu et al. (2021), an improvement in texture generation in video synthesis is proposed, taking into account fine-scale details, such as wrinkles in clothing that depend on posture. The method is based on the combination of two convolutional neural networks. Using posture information, the first CNN predicts a dynamic texture map containing high-frequency details that are temporally coherent. The second CNN conditions the generation of the final video on the temporally coherent output of the first CNN. In Rodriguez-Pardo et al. (2019), the detection and replication of repetitive texture patterns from a single image are proposed. This technology has significant implications in graphic processes, where repetitive texture patterns are a key tool for creating realistic visual representations. A relevant point is its ability to determine the minimum repeated pattern size in an image and replicate it so that the resulting image is as close as possible to the original.
5.3.2 Mapping
In the subcategory of mapping, articles use mapping as a component of XR. Mapping allows for the construction of environment models that can be used for the location and orientation of virtual objects. CNNs have proven effective in mapping, allowing for precise and robust registration of objects and the environment. This results in a better integration of virtual and physical elements and a more consistent and compelling experience for the user. In Yang et al. (2022), the authors introduce a CNN-based model called SDF-SLAM; this model can estimate the camera’s position in a broad indoor environment and can also perform depth estimation and semantic segmentation on monocular images, thereby constructing a comprehensive and precise three-dimensional map. In Liu and Miura (2021), the authors propose a fundamental technology for augmented reality, RDMO-SLAM, a real-time vSLAM that combines RDS-SLAM (Liu and Miura 2021) and Mask R-CNN. RDMO-SLAM estimates the speed of each feature point and uses this information as a constraint to minimise the influence of dynamic objects on tracking, reducing error and increasing precision in simultaneous localisation and mapping processes. In Su and Yu (2022), CNNs are used to enhance deep image reconstruction by working with dense three-channel colour images (red, green, and blue), focusing on the transformation of multi-layer image-invariant features.
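The core idea of dynamic-object handling in such SLAM variants, discarding feature points that fall inside the masks of potentially moving objects before tracking, can be sketched in a few lines. The mask and keypoints below are synthetic placeholders, not output of RDMO-SLAM.

```python
# Minimal sketch of the masking idea behind dynamic-SLAM variants such as
# RDMO-SLAM: feature points falling inside the mask of a potentially dynamic
# object are discarded before tracking. Mask and keypoints are synthetic
# placeholders, not output of the actual system.
import numpy as np

h, w = 480, 640
dynamic_mask = np.zeros((h, w), dtype=bool)
dynamic_mask[100:300, 200:400] = True           # e.g. a person segmented by Mask R-CNN

xs = np.random.randint(0, w, size=500)          # synthetic feature point coordinates
ys = np.random.randint(0, h, size=500)
keypoints = np.stack([xs, ys], axis=1)          # (x, y) pairs

static = ~dynamic_mask[ys, xs]                  # True for points outside dynamic regions
static_keypoints = keypoints[static]
print(f"kept {static_keypoints.shape[0]} of {keypoints.shape[0]} keypoints for tracking")
```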
5.3.3 Reconstruction
In the subcategory of reconstruction, articles are grouped that frame the reconstruction of environments or objects. Reconstruction allows for the creation of virtual replicas from real-world data. CNNs have been successfully applied in reconstruction, enabling the generation of precise and detailed 3D models that can be used in a variety of applications, from VR content creation to augmented reality for the construction industry. In Bi et al. (2020), CNNs are used to extract information about the materials, geometry, and lighting of an object from a single RGB image and reconstruct its appearance. In Song et al. (2022), ACINR-MVSNet is introduced, a framework for multi-view stereo reconstruction that features group adaptive correlation and implicit neural enhancement to refine the depth map and the reconstruction guided by a corresponding reference image, achieving the recovery of finer details. In Manni et al. (2021), an Android application is proposed that tracks the phone’s position relative to the world, captures an RGB image for each exposed object, and estimates the scene’s depth, together with a server program that classifies the captured objects, retrieves the corresponding 3D models from a database, and estimates their position, rotation, and scale in AR. In addition, the process of joining images (image stitching or image mosaicking) can be considered part of reconstruction, as it creates 360\(^{\circ }\) mosaics in AR/VR. Despite its relevance, maintaining homogeneity between the input image sequences during stitching is a significant challenge. In Chilukuri et al. (2021), the authors propose a methodology for image stitching, called the left-right stitching unit (L,r-Stitch), which handles multiple non-homogeneous image sequences to generate a homogeneous panoramic view. L,r-Stitch consists of a CNN named l,r-PanoED. The l,r-PanoED encoder extracts semantically rich feature maps from the inputs to perform the stitching in a broad panoramic domain, while the decoder reconstructs the output panoramic view from the feature maps.
5.3.4 Environment
In the subcategory of environment, articles utilise CNNs to understand and model the user’s environment, allowing for a smoother and more natural interaction by incorporating elements such as depth (Li et al. 2022) or scene lighting. By better understanding the environment, XR applications can offer more immersive and safe experiences. In Ye et al. (2022), there is an effort to improve the efficiency and accuracy of stereo-matching algorithms. The authors propose a stereo network that uses the prior consistency of local disparity to enhance the performance of real-time disparity estimation. An initial disparity estimate is calculated by a lightweight pyramid matching network, and two new modules are introduced: the Spatial Consistency Refinement (SCR) module and the Temporal Consistency Refinement (TCR) module. The SCR module utilises high-confidence predictions from sparse neighbourhoods to refine the less reliable regions of disparity; it incorporates a single-layer dynamic local filter to adapt the propagation to the contents, which enhances disparity quality without significantly increasing the computation and memory burden. The TCR module, on the other hand, is used to refine the disparity estimation of consecutive frames based on disparity consistency over time. In Wu and Wang (2022), the Rich Global Feature Guided Network (RGFN) is proposed for monocular depth estimation using CNNs and Transformers (Vaswani et al. 2017).
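The essence of CNN-based stereo matching discussed above, building a cost volume from learned features and selecting the best-matching disparity, can be sketched as follows. This is a bare-bones illustration under assumed image sizes and disparity range, not the SCR/TCR network of Ye et al. (2022).

```python
# Bare-bones sketch of CNN-based stereo matching: a shared CNN extracts features
# from the left and right images, a cost volume is built by shifting the right
# features over candidate disparities, and a winner-takes-all disparity is chosen.
# Image size and disparity range are assumptions; this is not the SCR/TCR network.
import torch
import torch.nn as nn

feature_net = nn.Sequential(                    # shared lightweight feature extractor
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)

left = torch.rand(1, 3, 64, 128)                # placeholder rectified stereo pair
right = torch.rand(1, 3, 64, 128)
max_disp = 16

with torch.no_grad():
    fl, fr = feature_net(left), feature_net(right)
    costs = []
    for d in range(max_disp):
        shifted = torch.roll(fr, shifts=d, dims=3)  # shift right features by d pixels (wrap-around ignored for brevity)
        costs.append((fl * shifted).mean(dim=1))    # correlation score per pixel
    cost_volume = torch.stack(costs, dim=1)         # [batch, max_disp, H, W]
    disparity = cost_volume.argmax(dim=1)           # winner-takes-all disparity map
print(disparity.shape)                              # torch.Size([1, 64, 128])
```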
5.3.5 Light
In the light subcategory, articles that modify or interpret light are grouped. Light shaping is essential to creating compelling virtual and augmented reality experiences. CNNs have been used to estimate ambient illumination and consistently apply it to virtual objects. This improves immersion by making virtual objects look as if they were present in the scene, with the same lighting as the physical environment. In Chalmers et al. (2021), the authors present a method for ambient lighting reconstruction, essential for improving spatial presence in AR and MR applications. This illumination is encoded as a reflection map generated from a conventional photograph. The method uses a stacked CNNs to predict roughness and light levels from a low dynamic range photograph with a limited field of view. The reflection maps are predicted with different degrees of roughness, corresponding to those of the virtual objects that are rendered, from the most diffuse to the brightest.
5.3.6 Three dimensional modelling
In the three-dimensional modelling subcategory, articles are grouped that reconstruct 3D models or use this domain for reconstruction. CNNs have proven to be effective in generating and manipulating 3D models. This includes generating 3D models from 2D images, synthesising realistic 3D models, and manipulating 3D models for animation or interaction. The ability of CNNs to work with 3D data has expanded the possibilities of XR, allowing the creation of richer and more realistic content. In Amara et al. (2022), the authors use a CNN, called O-Net, designed to automatically segment COVID-19-infected chest CT scans. This information is part of a 3D modelling process that is used in a virtual reality platform, COVIR, to visualise and manipulate 3D lungs and segmented COVID-19 lesions. In Ye et al. (2020), HAO-CNN is proposed, a network to reconstruct 3D hair models from a single image; it presents an advance in detail and direction. However, its authors see a challenge in reconstructing curly hair and receding hairlines that present occlusions.
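The generic step of turning a segmented volume into a 3D model that can be inspected in VR, as platforms like COVIR do for CT data, can be sketched with marching cubes. The binary volume below is synthetic, a recent scikit-image release is assumed, and the code illustrates the general pipeline rather than the cited system.

```python
# Minimal sketch of turning a binary segmentation volume (e.g. lesions segmented
# from CT slices) into a 3D surface mesh with marching cubes. The spherical
# "lesion" is synthetic; this illustrates the generic modelling step only.
import numpy as np
from skimage import measure

z, y, x = np.mgrid[:64, :64, :64]               # synthetic 64^3 volume grid
volume = ((x - 32) ** 2 + (y - 32) ** 2 + (z - 32) ** 2) < 15 ** 2

verts, faces, normals, _ = measure.marching_cubes(volume.astype(np.float32), level=0.5)
print(f"mesh with {verts.shape[0]} vertices and {faces.shape[0]} triangles")
# verts/faces can then be exported (e.g. as an OBJ file) and loaded into a VR scene.
```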
5.3.7 Point cloud
In the point cloud subcategory, all the articles are based on the generation of a point cloud or carry out a reconstruction of objects and the environment from one. The point cloud is a popular method for representing 3D data in the field of XR. CNNs have been used to process and manipulate point clouds, enabling object identification and classification, pose estimation, and 3D reconstruction. In Zhao et al. (2022), a context-aware deep network called PCUNet is presented; this network generates point clouds in a stepwise manner, from coarse to fine. PCUNet employs an encoder-decoder structure, with the encoder following a relation-shape convolutional neural network (RS-CNN) design and the decoder consisting of fully connected layers and two stacked decoder modules to predict complete point clouds, thus achieving more accurate models with related point clouds. In Jia et al. (2021), a point cloud is generated by performing a geometric decomposition and using CNNs to learn the conformation of compressed 2D points that can be propagated to 3D point cloud frames.
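A minimal, PointNet-style sketch of how CNN operations are applied to point clouds, a shared per-point MLP followed by an order-invariant pooling, is shown below. It is illustrative only and does not correspond to PCUNet or RS-CNN; the sizes and class count are assumptions.

```python
# Minimal PointNet-style sketch of CNN-based point cloud processing: a shared
# per-point MLP (implemented as 1D convolutions) followed by symmetric
# max-pooling over points and a classification head. Illustrative only; it is
# not PCUNet or RS-CNN, and all sizes are assumptions.
import torch
import torch.nn as nn


class TinyPointNet(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.point_mlp = nn.Sequential(          # applied to every point independently
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: [batch, 3, num_points]
        features = self.point_mlp(points)            # [batch, 128, num_points]
        global_feature = features.max(dim=2).values  # order-invariant pooling over points
        return self.head(global_feature)


cloud = torch.rand(2, 3, 1024)                   # two synthetic clouds of 1024 points
logits = TinyPointNet()(cloud)
print(logits.shape)                              # torch.Size([2, 10])
```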
6 Discussion
This review considers articles based on the potential use of CNNs in the fields of XR. In this work, we reviewed the related works from 2010 to 2023, with the aim of capturing a transcendental stage in the development and evolution of both CNNs and AR/VR/MR/XR. This range is based on the rapid progression and swift adoption of these technologies observed between 2021 and 2022, characterised by significant advances in the capabilities of CNNs and their implementation in AR/VR/MR/XR. In 2023, research in this area saw a decline. Furthermore, it was anticipated that including the years 2010 and 2011 in the review would increase the number of publications before 2015, providing a more comprehensive and enriched overview of advancements in these areas over time. As a result of the search and classification in the selected databases, it was evident that during the period between 2010 and 2015, no applicable articles were found in the domains of this research. It is possible that expanding the database selection might uncover articles from this period. However, several reasons are proposed that might explain the lack of research in this area during those years: between 2010 and 2015, CNNs were still in an early stage of development and adoption. Although they were already being used and popularised in some computer vision applications (Schmidhuber 2015), their application in XR was not common. Additionally, during that period, XR devices were still under development, and the available hardware could not support complex CNN models with high resource demands. One of the most popular CNNs of this period took between 10 and 30 min to classify 8 million pixels using four GPUs (Ciresan et al. 2012) (Fig. 12).
It is possible that during those years, researchers were more focused on other aspects of CNNs and XR technologies, such as algorithm optimisation and the development of specific applications, rather than exploring the intersection of both technologies. As CNNs and XR technologies evolved and matured, research in these fields became more specialised. Events such as Facebook’s transition to Meta or the crisis caused by SARS-CoV-2 might have drawn more attention to the intersection of these technologies.
Among the various areas of CNN use in different XR technologies, the "execution" category stands out as the one that has received the most attention in research and development, given the number of works in this area. The main reason could lie in the inherent nature of CNNs, designed for image analysis and processing. The execution category encompasses image processing, and these networks have demonstrated exceptional efficiency and precision in image and object recognition and classification. They excel in segmentation, pose, and movement estimation, which is essential for a smooth and consistent XR experience. By identifying and tracking objects and shapes in real time, CNNs enable XR systems to understand and respond to the user’s environment and actions, ensuring a more natural and precise immersion.
6.1 Impact of CNNs in VR, AR, MR, and XR: interaction, creation, and execution
When describing the technologies of VR, AR, MR, and XR separately, we can say that in the case of VR, CNNs have been used in interaction to provide precise tracking of the user’s hands and body, as well as to recognise and process gestures and actions in real time. This enhances immersion and interaction in entirely virtual environments. In terms of creation, CNNs have been applied in generating detailed and realistic virtual environments from images or data captured from the real world, as well as in animating virtual characters and objects. Regarding execution, CNNs have contributed to optimising the performance of VR applications, such as reducing latency and increasing frame rate, which is crucial for an optimal user experience.
In AR, CNNs have been used in interaction for object recognition and understanding the physical environment, allowing precise and coherent integration of virtual elements in the real world. In creation, CNNs have been applied to generate realistic textures and 3D models from images or data captured from the environment. In execution, CNNs have aided in developing more efficient AR applications with lower resource consumption, enabling their use on mobile devices and systems with limited hardware.
In MR, which combines elements of AR and VR, CNNs have been used in interaction to enhance the fusion of virtual and physical environments, providing a more immersive and coherent experience. In creation, CNNs have been applied in generating 3D models and textures that adapt to the physical environment and lighting conditions. In execution, CNNs have been employed to optimise performance and real-time adaptation of MR applications, considering environment characteristics and hardware capabilities.
Regarding XR, which encompasses all the previous technologies, CNNs have been used in interaction to provide more natural and accessible interfaces, such as voice and gesture recognition, or adapting the information presented to the user based on their context. In creation, CNNs have been applied to generate personalised and adaptive content and experiences. In execution, CNNs have been employed to optimise and adapt XR applications to various devices and platforms, ensuring a smooth and efficient experience across a wide range of scenarios and applications.
The inherent limitations of this study include, first and foremost, temporality. While this study offers valuable insights into the current applications and potential of CNNs in XR technologies, it is important to note that both fields are rapidly advancing. The relevance and accuracy of the information could be affected by this temporal limitation. Furthermore, the study relies on the consulted databases, which means that relevant literature or research not indexed in them, or contained only in databases that were not considered, may have been excluded. Future research needs to be aware of these limitations when interpreting and applying the findings of this study.
The quality assessment process led to the exclusion of several studies that did not meet our criteria, mainly due to methodological shortcomings or studies that explicitly did not mention that their research could be applied in XR technologies. This filtering ensured that our analysis was based on studies that provide reliable and unbiased data, thus reinforcing our conclusions about the effectiveness and applicability of CNNs in extended reality technologies.
6.2 Analysis of distributions
In Fig. 8, the execution category is the most representative with 43%. This could be due to the fact that most of the uses of CNNs in XR are in real time and while they are running. Segmentation and recognition are critical areas to ensure that XR applications work efficiently and effectively on a variety of devices and situations. Interaction takes up 33%, highlighting the importance of improving how users interact with XR technologies through gestures, eye tracking, and hand detection. Creation, although crucial, has a lower focus at 24%, which might indicate less research into content generation and detailed virtual environments. In Fig. 9, the subcategory of Human-Computer Interaction is the most significant with 41%. This could be because improving the naturalness and fluidity of interactions is fundamental to making XR technologies intuitive and accessible. In Fig. 10, recognition is the most representative subcategory with 29%. This could be due to the need for advanced understanding of the environment to improve the functionality of XR applications. Segmentation (21%) and object detection and tracking (18%) are also critical areas, facilitating the identification and classification of objects within a scene for more detailed and accurate interactions. In Fig. 11, the creation of virtual environments is the most prominent subcategory with 26%, followed by the reconstruction of 3D scenes and objects with 23%. Together, they account for 50% of the creation research, which could be due to the crucial need to develop more realistic and detailed environments, as well as to accurately represent the virtual environment. The distributions in these graphs may reflect research priorities and approaches to the use of CNNs in XR technologies. The execution category predominates due to the need to segment and recognise the real-time environment in XR applications. Interaction focuses on developing more natural interfaces and interaction methods, with a particular emphasis on gestures and human-computer interfaces due to their importance for accessibility and usability. Creation seeks to improve the quality and realism of virtual content, with a strong focus on environments and reconstruction to provide more detailed and accurate immersive experiences. These integrated efforts are essential to advance the functionality and user experience of extended reality technologies.
7 Conclusions
In the work carried out, the research question has been answered: in the context of XR, CNNs are used to enhance the user experience by processing and analysing images, movements, signals, and videos, even in real-time. From this literature review, it can be identified that one of the uses of CNNs in the field of VR is to improve image quality and performance in generating virtual worlds by tracking the user’s gaze, optimising the points the user is focusing on. In AR, CNNs are used for object and marker recognition, allowing the overlay of virtual information onto the real world. In XR, CNNs are used to combine real elements with virtual reality, creating more realistic immersive experiences. In MR, CNNs are used to recognise and track objects in the real world and overlay virtual content on them. In summary, CNNs are essential to provide a richer and more realistic reality experience in these fields.
Additionally, it was concluded that CNNs can be classified within the VR/AR/MR/XR domain into three major groups: interaction, execution, and creation.
In terms of interaction, CNNs can be employed for recognising gestures and movements of users in VR, AR, and MR, enabling more natural and accurate interaction with virtual content. They are also utilised for tracking head and eye movements to adapt the perspective in real-time in VR.
In execution, CNNs are harnessed for object and marker recognition in the real world for AR and MR. They further provide a more realistic and smooth experience in VR, AR, and MR. This encompasses the use of CNNs to generate more detailed virtual worlds, enhance image quality, and reduce latency. They are also used for tracking objects and individuals in the real world for AR and MR, facilitating precise overlay of virtual content. Additionally, they are applied to optimise performance and image quality in VR, heighten efficiency in virtual world generation, and minimise loading time and latency.
Regarding creation, CNNs are employed for the reconstruction of objects and 3D scenes in VR, AR, and MR. This enables the crafting of more detailed and realistic virtual content and also allows for the accurate creation of three-dimensional models of the real world for augmented and extended reality applications. They are even used for real-time scene reconstruction captured by cameras to enhance the accuracy of virtual content overlay in AR and MR. Furthermore, they can be utilised for the reconstruction of objects and 3D scenes from captured images and videos, facilitating the creation of more precise immersive experiences.
The implications of our findings extend across various domains, influencing researchers, developers, and practitioners involved in the technology and application of extended reality. This detailed breakdown provides a clear roadmap for each stakeholder group to harness the potential of CNNs in their respective fields, thus enhancing the utility and impact of the review. Table 7 offers a detailed view of the use of CNNs in extended reality for each stakeholder, focusing on the proposed classification, specifically on interaction, execution, and creation.
The future work horizon of this research includes a detailed study of CNN architectures and the analysis of advanced Deep Learning methods in the field of XR, such as Generative Adversarial Networks (GANs) (Bau et al. 2018), among others. This exploration would enrich the study of artificial intelligence techniques in XR. It is vital to underline the importance of areas such as semantic segmentation, HCI, and 3D reconstruction in the application of CNNs in XR. Their detailed study not only remains a priority but also opens doors to more specialised research in each subdomain. This work could be further extended by adding new databases and delving into the aforementioned areas.
Availability of data and materials
This work has not used any data set that should be published.
References
Abdi L, Meddeb A (2018) Driver information system: a combination of augmented reality, deep learning and vehicular ad-hoc networks. Multimed Tools Appl. https://doi.org/10.1007/s11042-017-5054-6
Abolfazli Esfahani M, Wu K, Yuan S, Wang H (2019) Deepdsair: deep 6-DOF camera relocalization using deblurred semantic-aware image representation for large-scale outdoor environments. Image Vis Comput 89:120–130. https://doi.org/10.1016/j.imavis.2019.06.014
Achanccaray D, Hayashibe M (2020) Decoding hand motor imagery tasks within the same limb from EEG signals using deep learning. IEEE Trans Med Robot Bion 2(4):692–699. https://doi.org/10.1109/TMRB.2020.3025364
Adhuran J, Kulupana G, Fernando A (2022) Deep learning and bidirectional optical flow based viewport predictions for 360° video coding. IEEE Access 10:118380–118396
Afsar MM, Saqib S, Aladfaj M, Alatiyyah MH, Alnowaiser K, Aljuaid H, Jalal A, Park J (2023) Body-worn sensors for recognizing physical sports activities in exergaming via deep learning model. IEEE Access. https://doi.org/10.1109/ACCESS.2023.3239692
Al Koutayni MR, Rybalkin V, Malik J, Elhayek A, Weis C, Reis G, Wehn N, Stricker D (2020) Real-time energy efficient hand pose estimation: a case study. Sensors. https://doi.org/10.3390/s20102828
Alam MM, Islam MT, Rahman SMM (2022) Unified learning approach for egocentric hand gesture recognition and fingertip detection. Pattern Recognit. https://doi.org/10.1016/j.patcog.2021.108200
Alam MM, Rahman SMM (2020) Affine transformation of virtual 3d object using 2d localization of fingertips. Virtual Real Intell Hardware 2:534–555. https://doi.org/10.1016/j.vrih.2020.10.001
Alashhab S, Gallego AJ, Lozano M (2022) Efficient gesture recognition for the assistance of visually impaired people using multi-head neural networks. Eng Appl Artif Intell 114:105188. https://doi.org/10.1016/j.engappai.2022.105188
Albraikan AA, Alzahrani JS, Alshahrani R, Yafoz A, Alsini R, Hilal AM, Alkhayyat A, Gupta D (2022) Intelligent facial expression recognition and classification using optimal deep transfer learning model. Image Vis Comput 128:104583. https://doi.org/10.1016/j.imavis.2022.104583
Alemayoh TT, Lee JH, Okamoto S (2023) Leg-joint angle estimation from a single inertial sensor attached to various lower-body links during walking motion \(\dagger\). Appl Sci. https://doi.org/10.3390/app13084794
Alharthi AS, Casson AJ, Ozanyan KB (2021) Spatiotemporal analysis by deep learning of gait signatures from floor sensors. IEEE Sens J 21(15):16904–16914. https://doi.org/10.1109/JSEN.2021.3078336
Alhejri A, Bian N, Alyafeai E, Alsharabi M (2022) Reconstructing real object appearance with virtual materials using mobile augmented reality. Comput Graph 108:1–10. https://doi.org/10.1016/j.cag.2022.08.001
Al-Sabbag ZA, Yeum CM, Narasimhan S (2022) Interactive defect quantification through extended reality. Adv Eng Inf 51:101473. https://doi.org/10.1016/j.aei.2021.101473
Al-Sabbag ZA, Yeum CM, Narasimhan S (2022) Enabling human-machine collaboration in infrastructure inspections through mixed reality. Adv Eng Inform 53:101709. https://doi.org/10.1016/j.aei.2022.101709
Aly S, Aly W (2020) Deeparslr: a novel signer-independent deep learning framework for isolated Arabic sign language gestures recognition. IEEE Access 8:83199–83212. https://doi.org/10.1109/ACCESS.2020.2990699
Al-Zoube MA (2022) Efficient vision-based multi-target augmented reality in the browser. Multimed Tools App 81(10):14303–14320. https://doi.org/10.1007/s11042-022-12206-6
Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L (2021) Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8:53. https://doi.org/10.1186/s40537-021-00444-8
Amara K, Aouf A, Kennouche H, Djekoune AO, Zenati N, Kerdjidj O, Ferguene F (2022) Covir: a virtual rendering of a novel NN architecture o-net for COVID-19 CT-scan automatic lung lesions segmentation. Comput Graph 104:11–23. https://doi.org/10.1016/j.cag.2022.03.003
Amjad A, Khan L, Ashraf N, Mahmood MB, Chang HT (2022) Recognizing semi-natural and spontaneous speech emotions using deep neural networks. IEEE Access 10:37149–37163
Ansari MF, Kasprowski P, Peer P (2023) Person-specific gaze estimation from low-quality webcam images. Sensors. https://doi.org/10.3390/s23084138
Anvari T, Park K, Kim G (2023) Upper body pose estimation using deep learning for a virtual reality avatar. Appl Sci. https://doi.org/10.3390/app13042460
Apicella A, Arpaia P, De Benedetto E, Donato N, Duraccio L, Giugliano S, Prevete R (2022) Enhancement of SSVEPS classification in BCI-based wearable instrumentation through machine learning techniques. IEEE Sens J 22(9):9087–9094
Asish SM, Kulshreshth AK, Borst CW (2022) Detecting distracted students in educational vr environments using machine learning on eye gaze data. Comput Graphs 109:75–87. https://doi.org/10.1016/j.cag.2022.10.007
Azuma RT (1997) A survey of augmented reality. Presence Teleop Virt 6:355–385. https://doi.org/10.1162/PRES.1997.6.4.355
Bai L, Lyu Y, Huang X (2021) Roadnet-rt: high throughput CNN architecture and SOC design for real-time road segmentation. IEEE Trans Circuits Syst I Regul Pap 68(2):704–714. https://doi.org/10.1109/TCSI.2020.3038139
Bailenson JN, Yee N, Merget D (2006) The effect of behavioral realism and form realism of real-time avatar faces on verbal disclosure, nonverbal disclosure, emotion recognition, and copresence in dyadic interaction. Presence Teleop Virt 15:359–372
Balachandran G, Krishnan JVG (2022) Machine learning based video segmentation of moving scene by motion index using IO detector and shot segmentation. Image Vis Comput 122:104443. https://doi.org/10.1016/j.imavis.2022.104443
Bamps K, Buck SD, Ector J (2022) Deep learning based tracked x-ray for surgery guidance. Comput Methods Biomech Biomed Eng Imag Vis. https://doi.org/10.1080/21681163.2021.2002193
Bau D, Zhu J-Y, Strobelt H, Zhou B, Tenenbaum JB, Freeman WT, Torralba A (2018) GAN dissection: visualizing and understanding generative adversarial networks. https://doi.org/10.48550/arXiv.1811.10597
Bermejo B, Juiz C, Cortes D, Oskam J, Moilanen T, Loijas J, Govender P, Hussey J, Schmidt AL, Burbach R, King D, Connor C, Dunlea D (2023) Ar/vr teaching-learning experiences in higher education institutions (HEI): a systematic literature review. Informatics. https://doi.org/10.3390/informatics10020045
Bernal-Berdun E, Martin D, Gutierrez D, Masia B (2022) Sst-sal: a spherical spatio-temporal approach for saliency prediction in 360\(^\circ\) videos. Comput Graph 106:200–209. https://doi.org/10.1016/j.cag.2022.06.002
Bharadwaj AG, Starly B (2022) Knowledge graph construction for product designs from large cad model repositories. Adv Eng Inform 53:101680. https://doi.org/10.1016/j.aei.2022.101680
Bhatt D, Patel C, Talsania H, Patel J, Vaghela R, Pandya S, Modi K, Ghayvat H (2021) Cnn variants for computer vision: history, architecture, application, challenges and future scope. Electronics. https://doi.org/10.3390/electronics10202470
Bi Z, Huang W (2021) Human action identification by a quality-guided fusion of multi-model feature. Fut Generat Comput Syst Int J E-Sci 116:13–21. https://doi.org/10.1016/j.future.2020.10.011
Bi T, Ma J, Liu Y, Weng D, Wang Y (2020) Sir-net: self-supervised transfer for inverse rendering via deep feature fusion and transformation from a single image. IEEE Access 8:201861–201873. https://doi.org/10.1109/ACCESS.2020.3035213
Billinghurst M, Nebeling M (2021) Rapid prototyping of XR experiences. In: Conference on human factors in computing systems—proceedings. https://doi.org/10.1145/3411763.3445002
Bimbraw K, Nycz CJ, Schueler M, Zhang Z, Zhang HK (2023) Simultaneous estimation of hand configurations and finger joint angles using forearm ultrasound. IEEE Trans Med Rob Bionics. https://doi.org/10.1109/TMRB.2023.3237774
Bose SR, Kumar VS (2021) In-situ identification and recognition of multi-hand gestures using optimized deep residual network. J Intell Fuzzy Syst 41(6):6983–6997. https://doi.org/10.3233/JIFS-210875
Boutros F, Damer N, Raja K, Ramachandra R, Kirchbuchner F, Kuijper A (2020) Iris and periocular biometrics for head mounted displays: segmentation, recognition, and synthetic data generation. Image Vis Comput 104:104007. https://doi.org/10.1016/j.imavis.2020.104007
Brandolt Baldissera F, Vargas FL (2020) A light implementation of a 3d convolutional network for online gesture recognition. IEEE Lat Am Trans 18(02):319–326. https://doi.org/10.1109/TLA.2020.9085286
Bu X (2020) Human motion gesture recognition algorithm in video based on convolutional neural features of training images. IEEE Access 8:160025–160039. https://doi.org/10.1109/ACCESS.2020.3020141
Burdea GC, Coiffet P (2017) Virtual reality technology, vol 464, second edition. Wiley, New Jersey
Cadena C, Carlone L, Carrillo H, Latif Y, Scaramuzza D, Neira J, Reid I, Leonard JJ (2016) Past, present, and future of simultaneous localization and mapping: towards the robust-perception age. IEEE Trans Rob 32:1309–1332. https://doi.org/10.1109/TRO.2016.2624754
Caglayan A, Imamoglu N, Nakamura R (2022) Mmsnet: multi-modal scene recognition using multi-scale encoded features. Image Vis Comput 122:104453. https://doi.org/10.1016/j.imavis.2022.104453
Cao L, Fan C, Wang H, Zhang G (2019) A novel combination model of convolutional neural network and long short-term memory network for upper limb evaluation using kinect-based system. IEEE Access 7:145227–145234. https://doi.org/10.1109/ACCESS.2019.2944652
Caputo A, Giachetti A, Giannini F, Lupinetti K, Monti M, Pegoraro M, Ranieri A (2020) Sfinge 3d: a novel benchmark for online detection and recognition of heterogeneous hand gestures from 3d fingers’ trajectories. Comput Graph 91:232–242. https://doi.org/10.1016/j.cag.2020.07.014
Caputo A, Giachetti A, Soso S, Pintani D, D’Eusanio A, Pini S, Borghi G, Simoni A, Vezzani R, Cucchiara R, Ranieri A, Giannini F, Lupinetti K, Monti M, Maghoumi M Jr, Le MQ, Nguyen HD, Tran MT (2021) Shrec 2021: skeleton-based hand gesture recognition in the wild. Comput Graph 99:201–211. https://doi.org/10.1016/j.cag.2021.07.007
Cha Y-W, Price T, Wei Z, Lu X, Rewkowski N, Chabra R, Qin Z, Kim H, Su Z, Liu Y, Ilie A, State A, Xu Z, Frahm J-M, Fuchs H (2018) Towards fully mobile 3d face, body, and environment capture using only head-worn cameras. IEEE Trans Visual Comput Graph 24(11):2993–3004. https://doi.org/10.1109/TVCG.2018.2868527
Cha G, Lee M, Cho J, Oh S (2019) Deep pose consensus networks. Comput Vis Image Underst 182:64–70. https://doi.org/10.1016/j.cviu.2019.03.004
Chalmers A, Zhao J, Medeiros D, Rhee T (2021) Reconstructing reflection maps using a stacked-CNN for mixed reality rendering. IEEE Trans Visual Comput Graph 27(10):4073–4084. https://doi.org/10.1109/TVCG.2020.3001917
Chang C, Wang D, Zhu D, Li J, Xia J, Zhang X (2022) Deep-learning-based computer-generated hologram from a stereo image pair. Opt Lett 47(6):1482–1485. https://doi.org/10.1364/OL.453580
Charco JL, Sappa AD, Vintimilla BX, Velesaca HO (2021) Camera pose estimation in multi-view environments: from virtual scenarios to the real world. Image Vis Comput 110:104182. https://doi.org/10.1016/j.imavis.2021.104182
Chartier D, Dellinger MB, Evans JR, Budzynski HK (2009) Introduction to quantitative EEG and neurofeedback, vol 550. Elsevier, Amsterdam
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2016) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFS. IEEE Trans Pattern Anal Mach Intell 40:834–848. https://doi.org/10.1109/TPAMI.2017.2699184
Chen TY, Ting PW, Wu MY, Fu LC (2018) Learning a deep network with spherical part model for 3d hand pose estimation. Pattern Recogn 80:1–20. https://doi.org/10.1016/j.patcog.2018.02.029
Chen Y, Hu S, Mao H, Deng W, Gao X (2020) Application of the best evacuation model of deep learning in the design of public structures. Image Vis Comput 102:103975. https://doi.org/10.1016/j.imavis.2020.103975
Chen R, Hei L, Lai Y (2020) Image recognition and safety risk assessment of traffic sign based on deep convolution neural network. IEEE Access 8:201799–201805. https://doi.org/10.1109/ACCESS.2020.3032581
Chen Y, Tian Y, He M (2020) Monocular human pose estimation: a survey of deep learning-based methods. Comput Vis Image Underst 192:102897
Chen Q, Wang Y, Wang H, Yang X (2021) Data-driven simulation in fluids animation: a survey. Virtual Real Intell Hardware 3(2):87–104
Cheng D, Shi J, Chen Y, Deng X, Zhang X (2018) Learning scene illumination by pairwise photos from rear and front mobile cameras. Comput Graph Forum 37(7):213–221. https://doi.org/10.1111/cgf.13561
Cheng Q, Zhang S, Bo S, Chen D, Zhang H (2020) Augmented reality dynamic image recognition technology based on deep learning algorithm. IEEE Access 8:137370–137384. https://doi.org/10.1109/ACCESS.2020.3012130
Cheng J, Li H, Li D, Hua S, Sheng VS (2023) A survey on image semantic segmentation using deep learning techniques. Comput Mater Continua. https://doi.org/10.32604/cmc.2023.032757
Chilukuri PK, Padala P, Padala P, Desanamukula VS, Pvgd PR (2021) l, r-stitch unit: encoder-decoder-CNN based image-mosaicing mechanism for stitching non-homogeneous image sequences. IEEE Access 9:16761–16782. https://doi.org/10.1109/ACCESS.2021.3052474
Chilukuri DM, Yi S, Seong Y (2022) A robust object detection system with occlusion handling for mobile devices. Comput Intell 38(4):1338–1364. https://doi.org/10.1111/coin.12511
Chirra VRR, Uyyala SR, Kolli VKK (2021) Virtual facial expression recognition using deep CNN with ensemble learning. J Ambient Intell Hum Comput 12(12):10581–10599. https://doi.org/10.1007/s12652-020-02866-3
Chitty-Venkata KT, Somani AK (2022) Neural architecture search survey: a hardware perspective. ACM Comput Surveys. https://doi.org/10.1145/3524500
Chiu P-S, Chang J-W, Lee M-C, Chen C-H, Lee D-S (2020) Enabling intelligent environment by the design of emotionally aware virtual assistant: a case of smart campus. IEEE Access 8:62032–62041. https://doi.org/10.1109/ACCESS.2020.2984383
Cho SM, Choi BJ (2020) Cnn-based recognition algorithm for four classes of roads. Int J Fuzzy Logic Intell Syst 20(2):114–118. https://doi.org/10.5391/IJFIS.2020.20.2.114
Cho Y, Kim J (2021) Production of mobile english language teaching application based on text interface using deep learning. Electronics. https://doi.org/10.3390/electronics10151809
Ciresan D, Giusti A, Gambardella L, Schmidhuber J (2012) Deep neural networks segment neuronal membranes in electron microscopy images
Cleveland LJ, Wartman J (2006) Principles and applications of digital photogrammetry for geotechnical engineering. Am Soc Civil Eng. https://doi.org/10.1061/40861(193)16
Cofer S, Chen TN, Yang JJ, Follmer S (2022) Detecting touch and grasp gestures using a wrist-worn optical and inertial sensing network. IEEE Robot Automat Lett 7(4):10842–10849
Cruz S, Chan A (2019) Is that my hand? an egocentric dataset for hand disambiguation. Image Vis Comput 89:131–143. https://doi.org/10.1016/j.imavis.2019.06.002
Dai L, Liu J, Ju Z, Gao Y (2021) Attention-mechanism-based real-time gaze tracking in natural scenes with residual blocks. IEEE Trans Cognit Develop Syst 14(2):696–707
Dai L, Liu J, Ju Z (2022) Binocular feature fusion and spatial attention mechanism based gaze tracking. IEEE Trans Hum Mach Syst 52(2):302–311
Dai S, Liu W, Yang W, Fan L, Zhang J (2020) Cascaded hierarchical cnn for rgb-based 3d hand pose estimation. Math Probl Eng. https://doi.org/10.1155/2020/8432840
Dangxiao W, Yuan G, Shiyi L, Zhang Y, Weiliang X, Jing X (2019) Haptic display for virtual reality: progress and challenges. Virtual Real Intell Hardware 1(2):136–162
Dash AK, Behera SK, Dogra DP, Roy PP (2018) Designing of marker-based augmented reality learning environment for kids using convolutional neural network architecture. Displays 55(SI):46–54. https://doi.org/10.1016/j.displa.2018.10.003
De Gregorio D, Tonioni A, Palli G, Di Stefano L (2020) Semiautomatic labeling for deep learning in robotics. IEEE Trans Autom Sci Eng 17(2):611–620. https://doi.org/10.1109/TASE.2019.2938316
Dede MA, Genc Y (2022) Direct pose estimation from RGB images using 3d objects. Pamukkale University J Eng Sci Pamukkale Universitesi Muhendislik bilimleri dergisi 28(2):277–285. https://doi.org/10.5505/pajes.2021.08566
Dede MA, Genc Y (2022) Object aspect classification and 6dof pose estimation. Image Vis Comput 124:104495. https://doi.org/10.1016/j.imavis.2022.104495
Deng X, Zhang Y, Shi J, Zhu Y, Cheng D, Zuo D, Cui Z, Tan P, Chang L, Wang H (2021) Hand pose understanding with large-scale photo-realistic rendering dataset. IEEE Trans Image Process 30:4275–4290. https://doi.org/10.1109/TIP.2021.3070439
Deng A, Wu Y, Zhang P, Lu Z, Li W, Su Z (2022) A weakly supervised framework for real-world point cloud classification. Comput Graph 102:78–88. https://doi.org/10.1016/j.cag.2021.12.008
Deng Y, Han S-Y, Li J, Rong J, Fan W, Sun T (2020) The design of tourism product cad three-dimensional modeling system using VR technology. Plos one 15(12). https://doi.org/10.1371/journal.pone.0244205
Desmarais Y, Mottet D, Slangen P, Montesinos P (2021) A review of 3d human pose estimation algorithms for markerless motion capture. Comput Vis Image Underst 212:103275
Dong L, Yang Z, Cai X, Zhao Y, Ma Q, Miao X (2022) Wave: edge-device cooperated real-time object detection for open-air applications. IEEE Trans Mob Comput. https://doi.org/10.1109/TMC.2022.3150401
D’Orazio T, Marani R, Renò V, Cicirelli G (2016) Recent trends in gesture recognition: how depth data has improved classical approaches. Image Vis Comput 52:56–72
Doughty M, Ghugre NR (2022) HMD-EGOPOSE: head-mounted display-based egocentric marker-less tool and hand pose estimation for augmented surgical guidance. Int J Comput Assisted Radiol Surg 17(12, SI):2253–2262. https://doi.org/10.1007/s11548-022-02688-y
Duan P, Wang T, Cui M, Sang H, Sun Q (2019) Multi-person pose estimation based on a deep convolutional neural network. J Vis Commun Image Represent 62:245–252. https://doi.org/10.1016/j.jvcir.2019.05.010
Du M, Cui H, Wang Y, Duh HBL (2023) Learning from deep stereoscopic attention for simulator sickness prediction. IEEE Trans Vis Comput Graph. https://doi.org/10.1109/TVCG.2021.3115901
Duong ND, Soladié C, Kacete A, Richard PY, Royan J (2020) Efficient multi-output scene coordinate prediction for fast and accurate camera relocalization from a single RGB image. Comput Vis Image Underst 190:102850. https://doi.org/10.1016/j.cviu.2019.102850
Egger J, Wild D, Weber M, Bedoya CAR, Karner F, Prutsch A, Schmied M, Dionysio C, Krobath D, Jin Y, Gsaxner C, Li J, Pepe A (2022) Studierfenster: an open science cloud-based medical imaging analysis platform. J Dig Imag. https://doi.org/10.1007/s10278-021-00574-8
Emporio M, Caputo A, Giachetti A, Cristani M, Borghi G, D’Eusanio A, Le M-Q, Nguyen H-D, Tran M-T, Ambellan F, Hanik M, Nava-Yazdani E, Tycowicz C (2022) Shrec 2022 track on online detection of heterogeneous gestures. Comput Graph 107:241–251. https://doi.org/10.1016/j.cag.2022.07.015
Ertugrul E, Zhang H, Zhu F, Lu P, Li P, Sheng B, Wu E (2020) Embedding 3d models in offline physical environments. Comput Animat Virtual Worlds. https://doi.org/10.1002/cav.1959
Fahim G, Amin K, Zarif S (2021) Single-view 3d reconstruction: a survey of deep learning methods. Comput Graph 94:164–190
Fahim G, Amin K, Zarif S (2022) Enhancing single-view 3d mesh reconstruction with the aid of implicit surface learning. Image Vis Comput 119:104377. https://doi.org/10.1016/j.imavis.2022.104377
Fan S, Ng T-T, Koenig BL, Herberg JS, Jiang M, Shen Z, Zhao Q (2018) Image visual realism: from human perception to machine computation. IEEE Trans Pattern Anal Mach Intell 40(9):2180–2193. https://doi.org/10.1109/TPAMI.2017.2747150
Fang L, Zhong W, Ye L, Li R, Zhang Q (2020) Light field reconstruction with a hybrid sparse regularization-pseudo 4dcnn framework. IEEE Access 8:171009–171020. https://doi.org/10.1109/ACCESS.2020.3023505
Francois T, Calvet L, Madad Zadeh S, Saboul D, Gasparini S, Samarakoon P, Bourdel N, Bartoli A (2020) Detecting the occluding contours of the uterus to automatise augmented laparoscopy: score, loss, dataset, evaluation and user study. Int J Comput Assisted Radiol Surg 15(7, SI):1177–1186. https://doi.org/10.1007/s11548-020-02151-w
Fu Q, Lv J, Tang S, Xie Q (2020) Optimal design of virtual reality visualization interface based on Kansei engineering image space research. Symmetry. https://doi.org/10.3390/sym12101722
Fuchs K, Haldimann M, Grundmann T, Fleisch E (2020) Supporting food choices in the internet of people: automatic detection of diet-related activities and display of real-time interventions via mixed reality headsets. Futur Gener Comput Syst 113:343–362. https://doi.org/10.1016/j.future.2020.07.014
Gamra MB, Akhloufi MA (2021) A review of deep learning techniques for 2d and 3d human pose estimation. Image Vis Comput 114:104282
Gao Q, Shen X (2021) Thickseg: efficient semantic segmentation of large-scale 3d point clouds using multi-layer projection. Image Vis Comput 108:104161. https://doi.org/10.1016/j.imavis.2021.104161
Ge H, Zhu Z, Dai Y, Wang B, Wu X (2022) Facial expression recognition based on deep learning. Comput Methods Programs Biomed. https://doi.org/10.1016/j.cmpb.2022.106621
Girshick R, Donahue J, Darrell T, Malik J (2013) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition. https://doi.org/10.1109/CVPR.2014.81
Gomez-Donoso F, Orts-Escolano S, Cazorla M (2019) Large-scale multiview 3d hand pose dataset. Image Vis Comput 81:25–33. https://doi.org/10.1016/j.imavis.2018.12.001
Gonzalez M, Kacete A, Murienne A, Marchand E (2021) L6dnet: light 6 DOF network for robust and precise object pose estimation with small datasets. IEEE Robot Automat Lett 6(2):2914–2921. https://doi.org/10.1109/LRA.2021.3062605
Gu X, Yang B, Gao S, Gao H, Yan L, Xu D, Wang W (2022) BCI+ VR rehabilitation design of closed-loop motor imagery based on the degree of drug addiction. China Commun 19(2):62–72
Gu W, Bai S, Kong L (2022) A review on 2d instance segmentation based on deep neural networks. Image Vis Comput 104401
Guenter B, Finch M, Drucker S, Tan D, Snyder J (2012) Foveated 3d graphics. ACM Trans. Graph. https://doi.org/10.1145/2366145.2366183
Gugenheimer J, Tseng WJ, Mhaidli AH, Rixen JO, McGill M, Nebeling M, Khamis M, Schaub F, Das S (2022) Novel challenges of safety, security and privacy in extended reality. In: Conference on human factors in computing systems—proceedings. https://doi.org/10.1145/3491101.3503741
Guo YC, Weng TH, Fischer R, Fu LC (2022) 3d semantic segmentation based on spatial-aware convolution and shape completion for augmented reality applications. Comput Vis Image Underst 224:103550. https://doi.org/10.1016/j.cviu.2022.103550
Gupta YP, Mukul Gupta N (2023) Deep learning model based multimedia retrieval and its optimization in augmented reality applications. Multimed Tools Appl 82(6):8447–8466. https://doi.org/10.1007/s11042-022-13555-y
Gupta N, Khan NM (2022) Efficient and scalable object localization in 3d on mobile device. J Imaging. https://doi.org/10.3390/jimaging8070188
Hadfield S, Lebeda K, Bowden R (2017) Stereo reconstruction using top-down cues. Comput Vis Image Underst 157:206–222. https://doi.org/10.1016/j.cviu.2016.08.001. (Large-Scale 3D Modeling of Urban Indoor or Outdoor Scenes from Images and Range Scans)
Hamza R, Dao MS (2022) Privacy-preserving deep learning techniques for wearable sensor-based big data applications. Virtual Real Intell Hardware, 1–13
Han P, Zhao G (2019) A review of edge-based 3d tracking of rigid objects. Virtual Real Intell Hardware 1(6):580–596
Han F, Reily B, Hoff W, Zhang H (2017) Space-time representation of people based on 3d skeletal data: a review. Comput Vis Image Underst 158:85–105
Han L, Zheng T, Zhu Y, Xu L, Fang L (2020) Live semantic 3d perception for immersive augmented reality. IEEE Trans Visual Comput Graphics 26(5):2012–2022. https://doi.org/10.1109/TVCG.2020.2973477
Han B, Zhang X, Ren S (2022) Pu-gacnet: graph attention convolution network for point cloud upsampling. Image Vis Comput 118:104371. https://doi.org/10.1016/j.imavis.2021.104371
Hasan MK, Calvet L, Rabbani N, Bartoli A (2021) Detection, segmentation, and 3d pose estimation of surgical tools using convolutional neural networks and algebraic geometry. Med Image Anal. https://doi.org/10.1016/j.media.2021.101994
He H, Li G, Ye Z, Mao A, Xian C, Nie Y (2019) Data-driven 3d human head reconstruction. Comput Graph 80:85–96. https://doi.org/10.1016/j.cag.2019.03.008
He Y, Ren J, Yu G, Cai Y (2020) Optimizing the learning performance in mobile augmented reality systems with CNN. IEEE Trans Wireless Commun 19(8):5333–5344. https://doi.org/10.1109/TWC.2020.2992329
Hedman P, Skepetzis V, Hernandez-Diaz K, Bigun J, Alonso-Fernandez F (2022) On the effect of selfie beautification filters on face detection and recognition. Pattern Recogn Lett 163:104–111. https://doi.org/10.1016/j.patrec.2022.09.018
He F, Liu Y, Zhan W, Xu Q, Chen X (2022) Manual operation evaluation based on vectorized spatio-temporal graph convolutional for virtual reality training in smart grid. Energies. https://doi.org/10.3390/en15062071
Ho N, Wong P-M, Hoang N-S, Koh D-K, Chua MCH, Chui C-K (2021) Cps-based manufacturing workcell for the production of hybrid medical devices. J Ambient Intell Hum Comput 12(12):10865–10879. https://doi.org/10.1007/s12652-020-02798-y
Hoang L, Lee SH, Kwon KR (2020) A 3d shape recognition method using hybrid deep learning network CNN-SVM. Electronics. https://doi.org/10.3390/electronics9040649
Hoang L, Lee SH, Kwon KR (2021) A deep learning method for 3d object classification and retrieval using the global point signature plus and deep wide residual network. Sensors. https://doi.org/10.3390/s21082644
Hoang L, Lee SH, Lee EJ, Kwon KR (2022) Gsv-net: a multi-modal deep learning network for 3d point cloud classification. Appl Sci. https://doi.org/10.3390/app12010483
Hoang L, Lee SH, Kwon OH, Kwon KR (2019) A deep learning method for 3d object classification using the wave kernel signature and a center point of the 3d-triangle mesh. Electronics. https://doi.org/10.3390/electronics8101196
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–80. https://doi.org/10.1162/neco.1997.9.8.1735
Hoeller B, Mossel A, Kaufmann H (2021) Automatic object annotation in streamed and remotely explored large 3d reconstructions. Comput Vis Med 7(1):71–86. https://doi.org/10.1007/s41095-020-0194-4
Hoque S, Arafat MY, Xu S, Maiti A, Wei Y (2021) A comprehensive review on 3d object detection and 6d pose estimation with deep learning. IEEE Access 9:143746–143770
Hossain MA, Assiri B (2022) Facial expression recognition based on active region of interest using deep learning and parallelism. Peerj Comput Sci. https://doi.org/10.7717/peerj-cs.894
Hu X, Gong J (2022) Larfnet: lightweight asymmetric refining fusion network for real-time semantic segmentation. Comput Graph 109:55–64. https://doi.org/10.1016/j.cag.2022.10.002
Hu Z, Hu Y, Liu J, Wu B, Han D, Kurfess T (2018) 3d separable convolutional neural network for dynamic hand gesture recognition. Neurocomputing 318:151–161. https://doi.org/10.1016/j.neucom.2018.08.042
Hu Z, Li S, Zhang C, Yi K, Wang G, Manocha D (2020) Dgaze: Cnn-based gaze prediction in dynamic scenes. IEEE Trans Visual Comput Graphics 26(5):1902–1911. https://doi.org/10.1109/TVCG.2020.2973473
Hu Z, Zhang D, Li S, Qin H (2020) Attention-based relation and context modeling for point cloud semantic segmentation. Comput Graph 90:126–134. https://doi.org/10.1016/j.cag.2020.06.001
Hu Z, Bulling A, Li S, Wang G (2021) Fixationnet: forecasting eye fixations in task-oriented virtual environments. IEEE Trans Vis Comput Graphics 27(5):2681–2690. https://doi.org/10.1109/TVCG.2021.3067779
Hu F, Wang H, Wang Q, Feng N, Chen J, Zhang T (2021) Acrophobia quantified by EEG based on CNN incorporating granger causality. Int J Neural Syst. https://doi.org/10.1142/S0129065720500690
Hu H, Liu Y, Yue K, Wang Y (2022) Navigation in virtual and real environment using brain computer interface: a progress report. Virtual Real Intell Hardware 4(2):89–114
Huang Q, Wang Y, Yin Z (2020) View-based weight network for 3d object recognition. Image Vis Comput 93:103828. https://doi.org/10.1016/j.imavis.2019.11.006
Huang Y, Shum HPH, Ho ESL, Aslam N (2020) High-speed multi-person pose estimation with deep feature transfer. Comput Vis Image Underst 197–198:103010. https://doi.org/10.1016/j.cviu.2020.103010
Huang L, Zhang B, Guo Z, Xiao Y, Cao Z, Yuan J (2021) Survey on depth and RGB image-based 3d hand shape and pose estimation. Virtual Real Intell Hardware 3(3):207–234
Huang Z, Yan Z (2022) Digital twins model of industrial product management and control based on lightweight deep learning. Comput Intell Neurosci 2022. https://doi.org/10.1155/2022/4452128
Hülsmann F, Göpfert JP, Hammer B, Kopp S, Botsch M (2018) Classification of motor errors to provide real-time feedback for sports coaching in virtual reality—a case study in squats and tai chi pushes. Comput Graph 76:47–59. https://doi.org/10.1016/j.cag.2018.08.003
Huong TT, Tran HT, Viet ND, Tien BD, Thanh NH, Thang TC, Nam PN et al (2022) An effective foveated 360° image assessment based on graph convolution network. IEEE Access 10:98165–98178
Im D, Park G, Ryu J, Li Z, Kang S, Han D, Lee J, Park W, Kwon H, Yoo H-J (2023) Dspu: an efficient deep learning-based dense RGB-D data acquisition with sensor fusion and 3-d perception SOC. IEEE J Solid-State Circuits. https://doi.org/10.1109/JSSC.2022.3218278
Irfan M, Munsif M (2022) Deepdive: A learning-based approach for virtual camera in immersive contents. Virtual Real Intell Hardware 4:247–262. https://doi.org/10.1016/j.vrih.2022.05.001. (Advances in Wireless Sensor Networks under AI-SG for Augmented Reality Special Issue)
Irfan M, Muhammad K, Sajjad M, Malik KM, Cheikh FA, Rodrigues JJPC, Albuquerque VHCD (2023) Deepview: deep-learning-based users field of view selection in 360° videos for industrial environments. IEEE Internet Things J 10:1. https://doi.org/10.1109/JIOT.2021.3118003
Izountar Y, Benbelkacem S, Otmane S, Khababa A, Masmoudi M, Zenati N (2022) Vr-peer: a personalized exer-game platform based on emotion recognition. Electronics. https://doi.org/10.3390/electronics11030455
Izquierdo-Domenech J, Linares-Pellicer J, Orta-Lopez J (2023) Towards achieving a high degree of situational awareness and multimodal interaction with AR and semantic AI in industrial applications. Multimed Tools Appl 82(10):15875–15901. https://doi.org/10.1007/s11042-022-13803-1
Jang JW, Kwon YC, Lim H, Choi O (2019) Cnn-based denoising, completion, and prediction of whole-body human-depth images. IEEE Access 7:175842–175856. https://doi.org/10.1109/ACCESS.2019.2957862
Jeong J, Yoon TS, Park JB (2018) Multimodal sensor-based semantic 3d mapping for a large-scale environment. Expert Syst Appl 105:1–10. https://doi.org/10.1016/j.eswa.2018.03.051
Ji Z, Qi X, Wang Y, Xu G, Du P, Wu X, Wu Q (2019) Human body shape reconstruction from binary silhouette images. Comput Aided Geomet Des 71:231–243. https://doi.org/10.1016/j.cagd.2019.04.019
Ji X, Fang Q, Dong J, Shuai Q, Jiang W, Zhou X (2020) A survey on monocular 3d human pose estimation. Virtual Real Intell Hardw 2(6):471–500
Jia S (2023) Multi-modal human-computer virtual fusion interaction in mixed reality. J Appl Sci Eng. https://doi.org/10.6180/jase.202311_26(11).0010
Jia W, Li L, Li Z, Liu S (2021) Deep learning geometry compression artifacts removal for video-based point cloud compression. Int J Comput Vis 129(11):2947–2964. https://doi.org/10.1007/s11263-021-01503-6
Jia Y, Ding R, Ren W, Shu J, Jin A (2021) Gesture recognition of somatosensory interactive acupoint massage based on image feature deep learning model. Traitement Du Signal 38(3):565–572. https://doi.org/10.18280/ts.380304
Jiang D, Li G, Tan C, Huang L, Sun Y, Kong J (2021) Semantic segmentation for multiscale target based on object recognition using the improved faster-RCNN model. Futur Gener Comput Syst 123:94–104. https://doi.org/10.1016/j.future.2021.04.019
Jiang Z, Wang X, Huang X, Li H (2021) Triangulate geometric constraint combined with visual-flow fusion network for accurate 6dof pose estimation. Image Vis Comput 108:104127. https://doi.org/10.1016/j.imavis.2021.104127
Jiang L, Lee C, Teotia D, Ostadabbas S (2022) Animal pose estimation: a closer look at the state-of-the-art, existing gaps and opportunities. Comput Vis Image Underst 103483
Jin X, Sun X, Zhang X, Sun H, Xu R, Zhou X, Li X, Liu R (2019) Sun orientation estimation from a single image using short-cuts in DCNN. Opt Laser Technol 110(SI):191–195. https://doi.org/10.1016/j.optlastec.2018.08.009
Jinyu L, Bangbang Y, Danpeng C, Nan W, Guofeng Z, Hujun B (2019) Survey and evaluation of monocular visual-inertial slam algorithms for augmented reality. Virtual Real Intell Hardware 1(4):386–410
Joardar BK, Doppa JR, Li H, Chakrabarty K, Pande PP (2023) Realprune: Reram crossbar-aware lottery ticket pruning for CNNs. IEEE Trans Emerg Topics Comput. https://doi.org/10.1109/TETC.2022.3223630
Jurado-Rodríguez D, Jurado JM, Pádua L, Neto A, Muñoz-Salinas R, Sousa JJ (2022) Semantic segmentation of 3d car parts using UAV-based images. Comput Graph 107:93–103. https://doi.org/10.1016/j.cag.2022.07.008
Kalaivani K, Chinnadurai M (2021) A hybrid deep learning intrusion detection model for fog computing environment. Intell Automat Soft Comput 30(1):1–15. https://doi.org/10.32604/iasc.2021.017515
Kang T, Chae M, Seo E, Kim M, Kim J (2020) Deephandsvr: hand interface using deep learning in immersive virtual reality. Electronics. https://doi.org/10.3390/electronics9111863
Karambakhsh A, Kamel A, Sheng B, Li P, Yang P, Feng DD (2019) Deep gesture interaction for augmented anatomy learning. Int J Inf Manage 45:328–336. https://doi.org/10.1016/j.ijinfomgt.2018.03.004
Karambakhsh A, Sheng B, Li P, Li H, Kim J, Jung Y, Chen CLP (2023) Sparsevoxnet: 3-d object recognition with sparsely aggregation of 3-d dense blocks. IEEE Trans Neural Networks Learn Syst. https://doi.org/10.1109/TNNLS.2022.3175775
Kashiani H, Shokouhi SB (2019) Visual object tracking based on adaptive siamese and motion estimation network. Image Vis Comput 83–84:17–28. https://doi.org/10.1016/j.imavis.2019.02.003
Khan MA, Israr S, Almogren AS, Din IU, Almogren A, Rodrigues JJPC (2021) Using augmented reality and deep learning to enhance taxila museum experience. J Real-Time Image Proc 18(2, SI):321–332. https://doi.org/10.1007/s11554-020-01038-y
Khan D, Cheng Z, Uchiyama H, Ali S, Asshad M, Kiyokawa K (2022) Recent advances in vision-based indoor navigation: a systematic literature review. Comput Graph
Kim YH, Lee KH (2019) Pose initialization method of mixed reality system for inspection using convolutional neural network. J Adv Mech Des Syst Manuf. https://doi.org/10.1299/jamdsm.2019jamdsm0093
Kim S, Ban Y, Lee S (2017) Tracking and classification of in-air hand gesture based on thermal guided joint filter. Sensors. https://doi.org/10.3390/s17010166
Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S (2009) Systematic literature reviews in software engineering-a systematic literature review. Inf Softw Technol 51(1):7–15
Ko TY, Lee SH (2020) Novel method of semantic segmentation applicable to augmented reality. Sensors. https://doi.org/10.3390/s20061737
Koch T, Liebel L, Körner M, Fraundorfer F (2020) Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset. Comput Vis Image Underst 191:102877. https://doi.org/10.1016/j.cviu.2019.102877
Kothari RS, Chaudhary AK, Bailey RJ, Pelz JB, Diaz GJ (2021) Ellseg: an ellipse segmentation framework for robust gaze tracking. IEEE Trans Vis Comput Graphics 27(5):2757–2767. https://doi.org/10.1109/TVCG.2021.3067765
Kozbial M, Markiewicz L, Sitnik R (2020) Algorithm for detecting characteristic points on a three-dimensional, whole-body human scan. Appl Sci. https://doi.org/10.3390/app10041342
Kraus S, Kanbach DK, Krysta PM, Steinhoff MM, Tomini N (2022) Facebook and the creation of the metaverse: radical business model innovation or incremental transformation? Int J Entrepreneurial Behav Res. https://doi.org/10.1108/IJEBR-12-2021-0984
Ku T, Veltkamp RC, Boom B, Duque-Arias D, Velasco-Forero S, Deschaud J-E, Goulette F, Marcotegui B, Ortega S, Trujillo A, Suárez JP, Santana JM, Ramírez C, Akadas K, Gangisetty S (2020) Shrec 2020: 3d point cloud semantic segmentation for street scenes. Comput Graph 93:13–24. https://doi.org/10.1016/j.cag.2020.09.006
Kumar D, Raut S, Shimasaki K, Senoo T, Ishii I (2021) Projection-mapping-based object pointing using a high-frame-rate camera-projector system. Robomech J. https://doi.org/10.1186/s40648-021-00197-2
Kushwaha M, Choudhary J, Singh DP (2022) Enhancement of human 3d pose estimation using a novel concept of depth prediction with pose alignment from a single 2d image. Comput Graph 107:172–185. https://doi.org/10.1016/j.cag.2022.07.021
Laga H, Jospin LV, Boussaid F, Bennamoun M (2020) A survey on deep learning techniques for stereo-based depth estimation. IEEE Trans Pattern Anal Mach Intell 44(4):1738–1764
Lai Z-H, Tao W, Leu MC, Yin Z (2020) Smart augmented reality instructional system for mechanical assembly towards worker-centered intelligent manufacturing. J Manuf Syst 55:69–81. https://doi.org/10.1016/j.jmsy.2020.02.010
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324. https://doi.org/10.1109/5.726791
Lee SM, Trimi S (2021) Convergence innovation in the digital age and in the COVID-19 pandemic crisis. J Bus Res 123:14–22. https://doi.org/10.1016/j.jbusres.2020.09.041
Lee TM, Yoon J-C, Lee I-K (2019) Motion sickness prediction in stereoscopic videos using 3d convolutional neural networks. IEEE Trans Visual Comput Graphics 25(5):1919–1927. https://doi.org/10.1109/TVCG.2019.2899186
Li H, Fan L (2020) A flexible technique to select objects via convolutional neural network in VR space. Sci China Inf Sci. https://doi.org/10.1007/s11432-019-1517-3
Li X, Kong D (2023) SRIF-RCNN: sparsely represented inputs fusion of different sensors for 3d object detection. Appl Intell 53(5):5532–5553. https://doi.org/10.1007/s10489-022-03594-1
Li Y, Zhao K (2021) Sports motional characteristics modeling by leveraging multi-modal image technique. Futur Gener Comput Syst 119:37–42. https://doi.org/10.1016/j.future.2021.01.031
Li C, Sun X, Li Y (2019) Information hiding based on augmented reality. Math Biosci Eng 16(5):4777–4787. https://doi.org/10.3934/mbe.2019240
Li M, An L, Yu T, Wang Y, Chen F, Liu Y (2020) Neural hand reconstruction using a single RGB image. Virtual Real Intell Hardw 2:276–289. https://doi.org/10.1016/j.vrih.2020.05.001. (3D Visual Processing and Reconstruction Special Issue)
Li Z, Zhang X, Wang K, Jiang H, Wang Z (2021) High accuracy and geometry-consistent confidence prediction network for multi-view stereo. Comput Graph 97:148–159. https://doi.org/10.1016/j.cag.2021.04.020
Li X, Yang F, Luo A, Jiao Z, Cheng H, Liu Z (2021) Efrnet: efficient feature reconstructing network for real-time scene parsing. IEEE Trans Multimed 24:2852–2865
Li Z, Liu F, Yang W, Peng S, Zhou J (2022) A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans Neural Netw Learn Syst 33(12):6999–7019. https://doi.org/10.1109/TNNLS.2021.3084827
Li C, Yi R, Ali SG, Ma L, Wu E, Wang J, Mao L, Sheng B (2022) Radepthnet: reflectance-aware monocular depth estimation. Virtual Real Intell Hardw 4:418–431. https://doi.org/10.1016/j.vrih.2022.08.005
Li H, Ma W, Wang H, Liu G, Wen X, Zhang Y, Yang M, Luo G, Xie G, Sun C (2022) A framework and method for human-robot cooperative safe control based on digital twin. Adv Eng Inform 53:101701. https://doi.org/10.1016/j.aei.2022.101701
Li W, Wang J, Liu M, Zhao S, Ding X (2023) Integrated registration and occlusion handling based on deep learning for augmented-reality-assisted assembly instruction. IEEE Trans Indus Inf. https://doi.org/10.1109/TII.2022.3189428
Liang H, Yuan J, Lee J, Ge L, Thalmann D (2019) Hough forest with optimized leaves for global hand pose estimation with arbitrary postures. IEEE Trans Cyber 49(2):527–541. https://doi.org/10.1109/TCYB.2017.2779800
Liao X, Chen X (2021) Construction of prediction model for multi-feature fusion time sequence data of internet of things under VR and LSTM. IEEE Access 9:153027–153036
Ling K, Dai H, Liu Y, Liu AX, Wang W, Gu Q (2020) Ultragesture: fine-grained gesture sensing and recognition. IEEE Trans Mob Comput 21(7):2620–2636
Linse C, Alshazly H, Martinetz T (2022) A walk in the black-box: 3d visualization of large neural networks in virtual reality. Neural Comput Appl 34(23):21237–21252. https://doi.org/10.1007/s00521-022-07608-4
Liu W (2022) Simulation training auxiliary model based on neural network and virtual reality technology. Comput Intell Neurosci. https://doi.org/10.1155/2022/2636877
Liu L (2021) Objects detection toward complicated high remote basketball sports by leveraging deep CNN architecture. Futur Gener Comput Syst 119:31–36. https://doi.org/10.1016/j.future.2021.01.020
Liu Y, Miura J (2021) Rdmo-slam: real-time visual slam for dynamic environments using semantic label prediction with optical flow. IEEE Access 9:106981–106997
Liu Y, Miura J (2021) Rds-slam: real-time dynamic slam using semantic segmentation methods. IEEE Access 9:23772–23785. https://doi.org/10.1109/ACCESS.2021.3050617
Liu F, Wang S, Ding D, Yuan Q, Yao Z, Pan Z, Li H (2018) Retrieving indoor objects: 2d–3d alignment using single image and interactive ROI-based refinement. Comput Graph 70:108–117. https://doi.org/10.1016/j.cag.2017.07.029. (CAD/Graphics 2017)
Liu Y, Peng M, Swash MR, Chen T, Qin R, Meng H (2021) Holoscopic 3d microgesture recognition by deep neural network model based on viewpoint images and decision fusion. IEEE Trans Hum Mach Syst 51(2):162–171. https://doi.org/10.1109/THMS.2020.3047914
Liu L, Xu W, Habermann M, Zollhöfer M, Bernard F, Kim H, Wang W, Theobalt C (2021) Learning dynamic textures for neural rendering of human actors. IEEE Trans Vis Comput Graph 27(10):4009–4022. https://doi.org/10.1109/TVCG.2020.2996594
Liu Y, Yan X, Liu X, Wang X, Jing T, Lin M, Chen S, Li P, Jiang X (2021) Fusion coding of 3d real and virtual scenes information for augmented reality-based holographic stereogram. Front Phys. https://doi.org/10.3389/fphy.2021.736268
Liu X, Deng Y, Han C, Di Renzo M (2021) Learning-based prediction, rendering and transmission for interactive virtual reality in RIS-assisted terahertz networks. IEEE J Sel Areas Commun 40(2):710–724
Liu X, Wang M, Wang A, Hua X, Liu S (2022) Depth-guided learning light field angular super-resolution with edge-aware inpainting. Vis Comput 38(8):2839–2851. https://doi.org/10.1007/s00371-021-02159-6
Liu Y, Li J, Huang K, Li X, Qi X, Chang L, Long Y, Zhou J (2022) Mobilesp: an FPGA-based real-time keypoint extraction hardware accelerator for mobile Vslam. IEEE Trans Circuits Syst I Regular Papers 69(12):4919–4929. https://doi.org/10.1109/TCSI.2022.3190300
Liu Z, Xue J, Wang N, Bai W, Mo Y (2023) Intelligent damage assessment for post-earthquake buildings using computer vision and augmented reality. Sustainability. https://doi.org/10.3390/su15065591
Liu L, Cui J, Niu J, Duan N, Yu X, Li Q, Yeh S-C, Zheng L-R (2020) Design of mirror therapy system base on multi-channel surface-electromyography signal pattern recognition and mobile augmented reality. Electronics. https://doi.org/10.3390/electronics9122142
Liu X, Pan H (2022) The path of film and television animation creation using virtual reality technology under the artificial intelligence. Sci Programm. https://doi.org/10.1155/2022/1712929
Liu J, Yuan R, Li Y, Zhou L, Zhang Z, Yang J, Xiao L (2022) A deep learning method and device for bone marrow imaging cell detection. Annals Transl Med. https://doi.org/10.21037/atm-22-486
Liu C, Zhu H, Tang D, Nie Q, Zhou T, Wang L, Song Y (2022) Probing an intelligent predictive maintenance approach with deep learning and augmented reality for machine tools in IoT-enabled manufacturing. Robot Comput Integr Manuf. https://doi.org/10.1016/j.rcim.2022.102357
Lohr D, Komogortsev OV (2022) Eye know you too: towards viable end-to-end eye movement biometrics for user authentication. IEEE Trans Inf Forensics Secur 17:3151–3164
Lopez Ibanez M, Miranda M, Alvarez N, Peinado F (2021) Using gestural emotions recognised through a neural network as input for an adaptive music system in virtual reality. Entertain Comput. https://doi.org/10.1016/j.entcom.2021.100404
Lotte F (2014) A tutorial on EEG signal-processing techniques for mental-state recognition in brain-computer interfaces. In: Guide to Brain-Computer Music Interfacing. https://doi.org/10.1007/978-1-4471-6584-2_7
Lu F, He L, You S, Chen X, Hao Z (2017) Identifying surface BRDF from a single 4-d light field image via deep neural network. IEEE J Selected Top Signal Process 11(7):1047–1057. https://doi.org/10.1109/JSTSP.2017.2728001
Lu L, Ma J, Qu S (2020) Value of virtual reality technology in image inspection and 3d geometric modeling. IEEE Access 8:139070–139083. https://doi.org/10.1109/ACCESS.2020.3012207
Lu Z, Chen X, Chung VYY, Liu S (2021) Lfi-augmenter: intelligent light field image editing with interleaved spatial-angular convolution. IEEE Multimed 28(4):84–95. https://doi.org/10.1109/MMUL.2021.3069912
Lu Y, Wang H, Feng N, Jiang D, Wei C (2022) Online interaction method of mobile robot based on single-channel EEG signal and end-to-end CNN with residual block model. Adv Eng Inform 52:101595. https://doi.org/10.1016/j.aei.2022.101595
Lu Y, Li H (2019) Automatic lip-reading system based on deep convolutional neural network and attention-based long short-term memory. Appl Sci. https://doi.org/10.3390/app9081599
Luo G, He B, Xiong Y, Wang L, Wang H, Zhu Z, Shi X (2023) An optimized convolutional neural network for the 3d point-cloud compression. Sensors. https://doi.org/10.3390/s23042250
Luo H, Yin D, Zhang S, Xiao D, He B, Meng F, Zhang Y, Cai W, He S, Zhang W, Hu Q, Guo H, Liang S, Zhou S, Liu S, Sun L, Guo X, Fang C, Liu L, Jia F (2020) Augmented reality navigation for liver resection with a stereoscopic laparoscope. Comput Methods Prog Biomed. https://doi.org/10.1016/j.cmpb.2019.105099
Maiwald F, Lehmann C, Lazariv T (2021) Fully automated pose estimation of historical images in the context of 4d geographic information systems utilizing machine learning methods. ISPRS Int J Geo-inf. https://doi.org/10.3390/ijgi10110748
Maldonado-Romo J, Aldape-Perez M (2021) Interoperability between real and virtual environments connected by a GAN for the path-planning problem. Appl Sci. https://doi.org/10.3390/app112110445
Malekijoo A, Fadaeieslam MJ (2019) Convolution-deconvolution architecture with the pyramid pooling module for semantic segmentation. Multimed Tools Appl 78(22):32379–32392. https://doi.org/10.1007/s11042-019-07990-7
Malik J, Elhayek A, Nunnari F, Stricker D (2019) Simple and effective deep hand shape and pose regression from a single depth image. Comput Graph 85:85–91. https://doi.org/10.1016/j.cag.2019.10.002
Manni A, Oriti D, Sanna A, Pace FD, Manuri F (2021) Snap2cad:3d indoor environment reconstruction for AR/VR applications using a smartphone device. Comput Graph 100:116–124. https://doi.org/10.1016/j.cag.2021.07.014
Marques BAD, Clua EWG, Vasconcelos CN (2018) Deep spherical harmonics light probe estimator for mixed reality games. Comput Graph 76:96–106. https://doi.org/10.1016/j.cag.2018.09.003
Marques BAD, Clua EWG, Montenegro AA, Vasconcelos CN (2022) Spatially and color consistent environment lighting estimation using deep neural networks for mixed reality. Comput Graph 102:257–268. https://doi.org/10.1016/j.cag.2021.08.007
Martínez A, Belmonte LM, García AS, Fernández-Caballero A, Morales R (2021) Facial emotion recognition from an unmanned flying social robot for home care of dependent people. Electronics. https://doi.org/10.3390/electronics10070868
Martinez-Diaz S (2021) 3d distance measurement from a camera to a mobile vehicle, using monocular vision. J Sensors. https://doi.org/10.1155/2021/5526931
Mhaidli A, Schaub F (2021) Identifying manipulative advertising techniques in XR through scenario construction. In: Conference on human factors in computing systems—proceedings. https://doi.org/10.1145/3411764.3445253
Milgram P, Kishino F (1994) A taxonomy of mixed reality visual displays. IEICE Trans Inf Syst E77-D(12):1321–1329
Miltiadous A, Tzimourta KD, Giannakeas N, Tsipouras MG, Glavas E, Kalafatakis K, Tzallas AT (2023) Machine learning algorithms for epilepsy detection based on published EEG databases: a systematic review. IEEE Access. https://doi.org/10.1109/ACCESS.2022.3232563
Minaee S, Boykov YY, Porikli F, Plaza AJ, Kehtarnavaz N, Terzopoulos D (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell
Mishra P, Sarawadekar KP (2021) Fingertips detection with nearest-neighbor pose particles from a single RGB image. IEEE Trans Circuits Syst Video Technol 32(5):3001–3011
Mitra S, Acharya T (2007) Gesture recognition: A survey. IEEE Trans Syst Man Cybern C Appl Rev 37:311–324. https://doi.org/10.1109/TSMCC.2007.893280
Modi N, Singh J (2022) Real-time camera-based eye gaze tracking using convolutional neural network: a case study on social media website. Virtual Real 26(4):1489–1506. https://doi.org/10.1007/s10055-022-00642-6
Mohammed AAQ, Lv J, Islam MS (2019) A deep learning-based end-to-end composite system for hand detection and gesture recognition. Sensors. https://doi.org/10.3390/s19235282
Mohanto B, Islam AT, Gobbetti E, Staadt O (2022) An integrative view of foveated rendering. Comput Graph 102:474–501
Mondejar-Guerra V, Garrido-Jurado S, Munoz-Salinas R, Marin-Jimenez MJ, Medina-Carnicer R (2018) Robust identification of fiducial markers in challenging conditions. Expert Syst Appl 93:336–345. https://doi.org/10.1016/j.eswa.2017.10.032
Muhammad K, Mustaqeem, Ullah A, Imran AS, Sajjad M, Kiran MS, Sannino G, Albuquerque VHC (2021) Human action recognition using attention based LSTM network with dilated CNN features. Futur Gener Comput Syst 125:820–830. https://doi.org/10.1016/j.future.2021.06.045
Mukhopadhyay A, Reddy GSR, Saluja KS, Ghosh S, Peña-Rios A, Gopal G, Biswas P (2022) Virtual-reality-based digital twin of office spaces with social distance measurement feature. Virtual Real Intell Hardw 4:55–75. https://doi.org/10.1016/j.vrih.2022.01.004
Mukthineni V, Mukthineni R, Sharma O, Narayanan SJ (2020) Face authenticated hand gesture based human computer interaction for desktops. Cybernet Inf Technol 20(4):74–89. https://doi.org/10.2478/cait-2020-0048
Mustaqeem, Sajjad M, Kwon S (2020) Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 8:79861–79875. https://doi.org/10.1109/ACCESS.2020.2990405
Nambu Y, Mariya T, Shinkai S, Umemoto M, Asanuma H, Sato I, Hirohashi Y, Torigoe T, Fujino Y, Saito T (2022) A screening assistance system for cervical cytology of squamous cell atypia based on a two-step combined CNN algorithm with label smoothing. Cancer Med 11(2):520–529. https://doi.org/10.1002/cam4.4460
Nousi P, Tefas A, Pitas I (2020) Dense convolutional feature histograms for robust visual object tracking. Image Vis Comput 99:103933. https://doi.org/10.1016/j.imavis.2020.103933
Nousias S, Arvanitis G, Lalos AS, Pavlidis G, Koulamas C, Kalogeras A, Moustakas K (2020) A saliency aware CNN-based 3d model simplification and compression framework for remote inspection of heritage sites. IEEE Access 8:169982–170001. https://doi.org/10.1109/ACCESS.2020.3023167
Olszewski K, Lim JJ, Saito S, Li H (2016) High-fidelity facial and speech animation for VR HMDs. ACM Trans Graph 35(6). https://doi.org/10.1145/2980179.2980252
Oñoro-Rubio D, López-Sastre RJ, Redondo-Cabrera C, Gil-Jiménez P (2018) The challenge of simultaneous object detection and pose estimation: a comparative study. Image Vis Comput 79:109–122. https://doi.org/10.1016/j.imavis.2018.09.013
O’Shea K, Nash R (2015) An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458
Ouali I, Halima MB, Wali A (2023) An augmented reality for an Arabic text reading and visualization assistant for the visually impaired. Multimed Tools Appl. https://doi.org/10.1007/s11042-023-14880-6
Pang S, Coz JJ, Yu Z, Luaces O, Díez J (2017) Deep learning to frame objects for visual target tracking. Eng Appl Artif Intell 65:406–420. https://doi.org/10.1016/j.engappai.2017.08.010
Pang J, Zhang J, Li Y, Sun W (2020) A marker-less assembly stage recognition method based on segmented projection contour. Adv Eng Inform 46:101149. https://doi.org/10.1016/j.aei.2020.101149
Park KB, Kim M, Choi SH, Lee JY (2020) Deep learning-based smart task assistance in wearable augmented reality. Robot Comput Integr Manuf. https://doi.org/10.1016/j.rcim.2019.101887
Pasqualino G, Furnari A, Signorello G, Farinella GM (2021) An unsupervised domain adaptation scheme for single-stage artwork recognition in cultural sites. Image Vis Comput 107:104098. https://doi.org/10.1016/j.imavis.2021.104098
Pasqualino G, Furnari A, Farinella GM (2022) A multi camera unsupervised domain adaptation pipeline for object detection in cultural sites through adversarial learning and self-training. Comput Vis Image Underst 222:103487. https://doi.org/10.1016/j.cviu.2022.103487
Ping G, Esfahani MA, Chen J, Wang H (2022) Visual enhancement of single-view 3d point cloud reconstruction. Comput Graph 102:112–119. https://doi.org/10.1016/j.cag.2022.01.001
Pinkham R, Erhardt J, Salvo BD, Berkovich A, Zhang Z (2023) Ansa: Adaptive near-sensor architecture for dynamic DNN processing in compact form factors. IEEE Trans Circ Syst I Regular Papers. https://doi.org/10.1109/TCSI.2022.3228725
Polap D, Kesik K, Ksiazek K, Wozniak M (2017) Obstacle detection as a safety alert in augmented reality models by the use of deep learning techniques. Sensors. https://doi.org/10.3390/s17122803
Polap D, Kesik K, Winnicka A, Wozniak M (2020) Strengthening the perception of the virtual worlds in a virtual reality environment. ISA Trans 102:397–406. https://doi.org/10.1016/j.isatra.2020.02.023
Qu Q, Chen X, Chung YY, Cai W (2023) Lfacon: introducing anglewise attention to no-reference quality assessment in light field space. IEEE Trans Vis Comput Graph. https://doi.org/10.1109/TVCG.2023.3247069
Quon JL, Chen LC, Kim L, Grant GA, Edwards MSB, Cheshier SH, Yeom KW (2020) Deep learning for automated delineation of pediatric cerebral arteries on pre-operative brain magnetic resonance imaging. Front Surgery. https://doi.org/10.3389/fsurg.2020.517375
Rad M, Roth PM, Lepetit V (2020) Alcn: adaptive local contrast normalization. Comput Vis Image Underst 194:102947. https://doi.org/10.1016/j.cviu.2020.102947
Rafique AA, Ghadi YY, Alsuhibany SA, Chelloug SA, Jalal A, Park J (2022) Cnn based multi-object segmentation and feature fusion for scene recognition. CMC-Comput Materials Continua 73(3):4657–4675. https://doi.org/10.32604/cmc.2022.027720
Raina P, Mudur S, Popa T (2019) Sharpness fields in point clouds using deep learning. Comput Graph 78:37–53. https://doi.org/10.1016/j.cag.2018.11.003
Ratcliffe J, Soave F, Bryan-Kinns N, Tokarchuk L, Farkhatdinov I (2021) Extended reality (XR) remote research: A survey of drawbacks and opportunities. In: Conference on human factors in computing systems—proceedings. https://doi.org/10.1145/3411764.3445170
Ravi A, Lu J, Pearce S, Jiang N (2022) Enhanced system robustness of asynchronous bci in augmented reality using steady-state motion visual evoked potential. IEEE Trans Neural Syst Rehabil Eng 30:85–95
Refat MAR, Singh BC, Rahman MM (2022) Sentinet: a nonverbal facial sentiment analysis using convolutional neural network. Int J Pattern Recognit Artif Intell. https://doi.org/10.1142/S0218001422560079
Restrepo Rodriguez AO, Casas Mateus DE, Gaona Garcia PA, Montenegro Marin CE, Gonzalez Crespo R (2018) Hyperparameter optimization for image recognition over an ar-sandbox based on convolutional neural networks applying a previous phase of segmentation by color-space. Symmetry. https://doi.org/10.3390/sym10120743
Restrepo Rodriguez AO, Ariza Riano M, Alonso Gaona-Garcia P, Enrique Montenegro-Marin C, Sarria I (2019) Image classification methods applied in immersive environments for fine motor skills training in early education. Int J Interact Multimed Artif Intell 5(7):151–158. https://doi.org/10.9781/ijimai.2019.10.004
Rodriguez-Pardo C, Suja S, Pascual D, Lopez-Moreno J, Garces E (2019) Automatic extraction and synthesis of regular repeatable patterns. Comput Graph 83:33–41. https://doi.org/10.1016/j.cag.2019.06.010
Rogers Y (2005) New theoretical approaches for human-computer interaction. Annual Rev Inf Sci Technol. https://doi.org/10.1002/aris.1440380103
Roy SD, Bhowmik MK (2022) Awdmc-net: classification of adversarial weather degraded multiclass scenes using a convolution neural network. Comput Vis Image Underst 222:103498. https://doi.org/10.1016/j.cviu.2022.103498
Sabeti S, Shoghli O, Baharani M, Tabkhi H (2021) Toward ai-enabled augmented reality to enhance the safety of highway work zones: feasibility, requirements, and challenges. Adv Eng Inf 50:101429. https://doi.org/10.1016/j.aei.2021.101429
Sagayam KM, Andrushia AD, Ghosh A, Deperlioglu O, Elngar AA (2022) Recognition of hand gesture image using deep convolutional neural network. Int J Image Graph. https://doi.org/10.1142/S0219467821400088
Sahin C, Garcia-Hernando G, Sock J, Kim TK (2020) A review on object pose recovery: from 3d bounding box detectors to full 6d pose estimators. Image Vis Comput 96:103898
Samet N, Akbas E (2021) Hprnet: hierarchical point regression for whole-body human pose estimation. Image Vis Comput 115:104285. https://doi.org/10.1016/j.imavis.2021.104285
Sarfraz Z, Sarfraz A, Iftikar HM, Akhund R (2021) Is covid-19 pushing us to the fifth industrial revolution (society 5.0)? Pakistan J Med Sci. https://doi.org/10.12669/pjms.37.2.3387
Schissler C, Loftin C, Manocha D (2018) Acoustic classification and optimization for multi-modal rendering of real-world scenes. IEEE Trans Visual Comput Graphics 24(3):1246–1259. https://doi.org/10.1109/TVCG.2017.2666150
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117. https://doi.org/10.1016/j.neunet.2014.09.003
Sen A, Mishra TK, Dash R (2022) A novel hand gesture detection and recognition system based on ensemble-based convolutional neural network. Multimed Tools Appl 81(28):40043–40066. https://doi.org/10.1007/s11042-022-11909-0
Sexton JP, Simiscuka AA, Mcguinness K, Muntean GM (2021) Automatic CNN-based enhancement of 360° video experience with multisensorial effects. IEEE Access 9:133156–133169
Shariati A, Holz C, Sinha S (2020) Towards privacy-preserving ego-motion estimation using an extremely low-resolution camera. IEEE Robot Autom Lett 5(2):1223–1230. https://doi.org/10.1109/LRA.2020.2967307
Sharma A, Nett R, Ventura J (2020) Unsupervised learning of depth and ego-motion from cylindrical panoramic video with applications for virtual reality. Int J Semant Comput 14(3):333–356. https://doi.org/10.1142/S1793351X20400139
Shi Y, Zhang L (2020) Design of Chinese character coded targets for feature point recognition under motion-blur effect. IEEE Access 8:124467–124475. https://doi.org/10.1109/ACCESS.2020.3006020
Shi L, Li B, Kim C, Kellnhofer P, Matusik W (2021) Towards real-time photorealistic 3d holography with deep neural networks. Nature 591(7849):234. https://doi.org/10.1038/s41586-020-03152-0
Silva LJS, Silva DLS, Raposo AB, Velho L, Lopes HCV (2019) Tensorpose: real-time pose estimation for interactive applications. Comput Graph 85:1–14. https://doi.org/10.1016/j.cag.2019.08.013
Siyaev A, Jo GS (2021) Towards aircraft maintenance metaverse using speech interactions with virtual objects in mixed reality. Sensors. https://doi.org/10.3390/s21062066
Smith JW, Thiagarajan S, Willis R, Makris Y, Torlak M (2021) Improved static hand gesture classification on deep convolutional neural networks using novel sterile training technique. IEEE Access 9:10893–10902. https://doi.org/10.1109/ACCESS.2021.3051454
Song G, Zheng J, Cai J, Cham TJ (2020) Recovering facial reflectance and geometry from multi-view images. Image Vis Comput 96:103897. https://doi.org/10.1016/j.imavis.2020.103897
Song X, Zhu J, Fan J, Ai D, Yang J (2021) Topological distance-constrained feature descriptor learning model for vessel matching in coronary angiographies. Virtual Real Intell Hardware 3:287–301. https://doi.org/10.1016/j.vrih.2021.08.003
Song B, Hu X, Xiao J, Zhang G, Chen T (2022) Implicit neural refinement based multi-view stereo network with adaptive correlation. Image Vis Comput 124:104511. https://doi.org/10.1016/j.imavis.2022.104511
Sorokin MI, Zhdanov DD, Zhdanov AD, Potemin IS, Bogdanov NN (2020) Restoration of lighting parameters in mixed reality systems using convolutional neural network technology based on RGBD images. Program Comput Softw 46(3):207–216. https://doi.org/10.1134/S0361768820030093
Spagnolo F, Corsonello P, Frustaci F, Perri S (2023) Design of a low-power super-resolution architecture for virtual reality wearable devices. IEEE Sens J. https://doi.org/10.1109/JSEN.2023.3256524
Su Y-C, Grauman K (2022) Learning spherical convolution for 360° recognition. IEEE Trans Pattern Anal Mach Intell 44(11):8371–8386. https://doi.org/10.1109/TPAMI.2021.3113612
Su Z, Zhou T, Li K, Brady D, Liu Y (2020) View synthesis from multi-view RGB data using multilayered representation and volumetric estimation. Virtual Real Intell Hardw 2:43–55. https://doi.org/10.1016/j.vrih.2019.12.001
Su Y, Rambach J, Pagani A, Stricker D (2021) Synpo-net-accurate and fast CNN-based 6dof object pose estimation using synthetic training. Sensors. https://doi.org/10.3390/s21010300
Sun W, Min X, Zhai G, Gu K, Duan H, Ma S (2020) Mc360iqa: a multi-channel CNN for blind 360-degree image quality assessment. IEEE J Select Top Signal Process 14(1):64–77. https://doi.org/10.1109/JSTSP.2019.2955024
Sun H, Wang T, Yu E (2022) A dynamic keypoint selection network for 6dof pose estimation. Image Vis Comput 118:104372. https://doi.org/10.1016/j.imavis.2022.104372
Sun Q, Xu Y, Sun Y, Yao C, Lee JSA, Chen K (2023) Gn-cnn: a point cloud analysis method for metaverse applications. Electronics. https://doi.org/10.3390/electronics12020273
Su Y, Yu L (2022) A dense RGB-D slam algorithm based on convolutional neural network of multi-layer image invariant feature. Measur Sci Technol. https://doi.org/10.1088/1361-6501/ac38f1
Tai Y, Qian K, Huang X, Zhang J, Jan MA, Yu Z (2021) Intelligent intraoperative haptic-ar navigation for COVID-19 lung biopsy using deep hybrid model. IEEE Trans Industr Inf 17(9):6519–6527. https://doi.org/10.1109/TII.2021.3052788
Tan J, Wang K, Chen L, Zhang G, Li J, Zhang X (2021) Hcfs3d: hierarchical coupled feature selection network for 3d semantic and instance segmentation. Image Vis Comput 109:104129. https://doi.org/10.1016/j.imavis.2021.104129
Tang Q, Liu F, Zhang T, Jiang J, Zhang Y (2021) Attention-guided chained context aggregation for semantic segmentation. Image Vis Comput 115:104309. https://doi.org/10.1016/j.imavis.2021.104309
Tang Z, Chen G, Han Y, Liao X, Ru Q, Wu Y (2022) Bi-stage multi-modal 3d instance segmentation method for production workshop scene. Eng Appl Artif Intell 112:104858. https://doi.org/10.1016/j.engappai.2022.104858
Tanzi L, Piazzolla P, Porpiglia F, Vezzetti E (2021) Real-time deep learning semantic segmentation during intra-operative surgery for 3d augmented reality assistance. Int J Comput Assisted Radiol Surg 16(9):1435–1445. https://doi.org/10.1007/s11548-021-02432-y
Tanzi L, Piazzolla P, Moos S, Vezzetti E (2022) Exploiting deep learning and augmented reality in fused deposition modeling: a focus on registration. Int J Interact Des Manuf—IJIDEM 17(1):103–114. https://doi.org/10.1007/s12008-022-01107-5
Tao W, Leu MC, Yin Z (2020) Multi-modal recognition of worker activity for human-centered intelligent manufacturing. Eng Appl Artif Intell 95:103868. https://doi.org/10.1016/j.engappai.2020.103868
Sainath TN, Vinyals O, Senior A, Sak H (2015) Convolutional, long short-term memory, fully connected deep neural networks. In: ICASSP, IEEE international conference on acoustics, speech and signal processing—proceedings. https://doi.org/10.1109/ICASSP.2015.7178838
Thiel KK, Naumann F, Jundt E, Günnemann S, Klinker G (2022) C.dot-convolutional deep object tracker for augmented reality based purely on synthetic data. IEEE Trans Vis Comput Graph 28(12):4434–4451. https://doi.org/10.1109/TVCG.2021.3089096
Tong K, Wu Y (2022) Deep learning-based detection from the perspective of small or tiny objects: a survey. Image Vis Comput 104471
Tu Z, Weng D, Liang B, Luo L (2022) Expression retargeting from images to three-dimensional face models represented in texture space. J Soc Inf Disp 30(10):775–788. https://doi.org/10.1002/jsid.1165
Ullah H, Afzal S, Khan IU (2022) Perceptual quality assessment of panoramic stitched contents for immersive applications: a prospective survey. Virtual Real Intell Hardware 4(3):223–246
Vaca-Castano G, Das S, Sousa JP, Lobo ND, Shah M (2017) Improved scene identification and object detection on egocentric vision of daily activities. Comput Vis Image Underst 156:92–103. https://doi.org/10.1016/j.cviu.2016.10.016. (Image and Video Understanding in Big Data)
VanHorn K, Cobanoglu MC (2022) Democratizing AI in biomedical image classification using virtual reality. Virtual Real 26(1):159–171. https://doi.org/10.1007/s10055-021-00550-1
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Vaughan N, Gabrys B (2020) Scoring and assessment in medical VR training simulators with dynamic time series classification. Eng Appl Artif Intell 94:103760. https://doi.org/10.1016/j.engappai.2020.103760
Wang Y, Shi Y, Du J, Lin Y, Wang Q (2020) A CNN-based personalized system for attention detection in wayfinding tasks. Adv Eng Inform 46:101180. https://doi.org/10.1016/j.aei.2020.101180
Wang C, Wen C, Dai Y, Yu S, Liu M (2020) Urban 3d modeling with mobile laser scanning: a review. Virtual Real Intell Hardw 2(3):175–212
Wang K, Zhang G, Zheng H, Yang J (2021) Learning dense correspondences for non-rigid point clouds with two-stage regression. IEEE Trans Image Process 30:8468–8482
Wang H, Kim B, Xie J, Han Z (2021) Energy drain of the object detection processing pipeline for mobile devices: analysis and implications. IEEE Trans Green Commun Netw 5(1):41–60. https://doi.org/10.1109/TGCN.2020.3041666
Wang C, Zhang F, Ge SS (2021) A comprehensive survey on 2d multi-person pose estimation methods. Eng Appl Artif Intell 102:104260
Wang H, Kang P, Gao Q, Jiang S, Shull PB (2022) A novel PPG-FMG-ACC wristband for hand gesture recognition. IEEE J Biomed Health Inform 26(10):5097–5108
Wang P, Yang WA, You Y (2023) A cyber-physical prototype system in augmented reality using RGB-D camera for CNC machining simulation. J Intell Manuf. https://doi.org/10.1007/s10845-022-02021-z
Wang S, Guo C, Yang R, Zhang Q, Ren H (2023) A lightweight vision-based measurement for hand gesture information acquisition. IEEE Sens J. https://doi.org/10.1109/JSEN.2022.3204641
Wang J, Mueller F, Bernard F, Sorli S, Sotnychenko O, Qian N, Otaduy MA, Casas D, Theobalt C (2020) Rgb2hands: real-time tracking of 3d hand interactions from monocular RGB video. ACM Trans Graph. https://doi.org/10.1145/3414685.3417852
Wang Q, Wang H, Hu F, Hua C, Wang D (2021) Using convolutional neural networks to decode eeg-based functional brain network with different severity of acrophobia. J Neural Eng. https://doi.org/10.1088/1741-2552/abcdbd
Wang D, Wang X, Ren B, Wang J, Zeng T, Kang D, Wang G (2022) Vision-based productivity analysis of cable crane transportation using augmented reality-based synthetic image. J Comput Civil Eng. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000994
Wei Y, Akinci B (2019) A vision and learning-based indoor localization and semantic mapping framework for facility operations and management. Autom Constr. https://doi.org/10.1016/j.autcon.2019.102915
Wei X, Yang Z, Liu Y, Wei D, Jia L, Li Y (2019) Railway track fastener defect detection based on image processing and deep learning techniques: a comparative study. Eng Appl Artif Intell 80:66–81. https://doi.org/10.1016/j.engappai.2019.01.008
Wei L, Zhong Z, Lang C, Yi Z (2019) A survey on image and video stitching. Virtual Real Intell Hardware 1(1):55–83
Wei M, Tang J, Tang H, Zhao R, Gai X, Lin R (2021) Adoption of convolutional neural network algorithm combined with augmented reality in building data visualization and intelligent detection. Complexity. https://doi.org/10.1155/2021/5161111
Wen D, Liang B, Li J, Wu L, Wan X, Dong X, Lan X, Song H, Zhou Y (2023) Feature extraction method of EEG signals evaluating spatial cognition of community elderly with permutation conditional mutual information common space model. IEEE Trans Neural Syst Rehabil Eng. https://doi.org/10.1109/TNSRE.2023.3273119
Wu Q (2021) Construction and 3d simulation of virtual animation instant network communication system based on convolution neural networks. Comput Intell Neurosci. https://doi.org/10.1155/2021/7277733
Wu B, Wang Y (2022) Rich global feature guided network for monocular depth estimation. Image Vis Comput 125:104520. https://doi.org/10.1016/j.imavis.2022.104520
Wu MY, Ting PW, Tang YH, Chou ET, Fu LC (2020) Hand pose estimation in object-interaction based on deep learning for virtual reality applications. J Vis Commun Image Represent. https://doi.org/10.1016/j.jvcir.2020.102802
Wu F, Yan F, Shi W, Zhou Z (2022) 3d scene graph prediction from point clouds. Virtual Real Intell Hardw 4:76–88. https://doi.org/10.1016/j.vrih.2022.01.005
Xiao M, Feng Z, Yang X, Xu T, Guo Q (2020) Multimodal interaction design and application in augmented reality for chemical experiment. Virtual Real Intell Hardw 2:291–304. https://doi.org/10.1016/j.vrih.2020.07.005
Xiao D, Niu J, Feng J (2022) A football training method based on improved tiny-yolov3 and virtual reality. Multimed Tools Appl. https://doi.org/10.1007/s11042-022-12404-2
Xiu H, Liang Y, Zeng H, Li Q, Liu H, Fan B, Li C (2022) Robust self-supervised monocular visual odometry based on prediction-update pose estimation network. Eng Appl Artif Intell 116:105481. https://doi.org/10.1016/j.engappai.2022.105481
Xu H, Xu J, Xu W (2019) Survey of 3d modeling using depth cameras. Virtual Real Intell Hardware 1(5):483–499
Xu Y, Arai S, Tokuda F, Kosuge K (2020) A convolutional neural network for point cloud instance segmentation in cluttered scene trained by synthetic data without color. IEEE Access 8:70262–70269. https://doi.org/10.1109/ACCESS.2020.2978506
Xu Y, Liu J, Zhai Y, Gan J, Zeng J, Cao H, Scotti F, Piuri V, Labati RD (2020) Weakly supervised facial expression recognition via transferred DAL-CNN and active incremental learning. Soft Comput 24(8, SI):5971–5985. https://doi.org/10.1007/s00500-019-04530-1
Xue Y, Zhang D, Li L, Li S, Wang Y (2022) Lightweight multi-scale convolutional neural network for real time stereo matching. Image Vis Comput 124:104510. https://doi.org/10.1016/j.imavis.2022.104510
Xu H, Li F (2022) Multilevel pyramid network for monocular depth estimation based on feature refinement and adaptive fusion. Electronics. https://doi.org/10.3390/electronics11162615
Yan Z, Zha H (2019) Flow-based slam: from geometry computation to learning. Virtual Real Intell Hardware 1(5):435–460
Yang J, Liu T, Jiang B, Song H, Lu W (2018) 3d panoramic virtual reality video quality assessment based on 3d convolutional neural networks. IEEE Access 6:38669–38682. https://doi.org/10.1109/ACCESS.2018.2854922
Yang L, Huang J, Feng T, Hong-An W, Guo-Zhong D (2019) Gesture interaction in virtual reality. Virtual Real Intell Hardware 1(1):84–112
Yang L, Song Q, Wang Z, Hu M, Liu C (2021) Hier r-CNN: instance-level human parts detection and a new benchmark. IEEE Trans Image Process 30:39–54. https://doi.org/10.1109/TIP.2020.3029901
Yang J, Liu T, Jiang B, Lu W, Meng Q (2021) Panoramic video quality assessment based on non-local spherical CNN. IEEE Trans Multimed 23:797–809. https://doi.org/10.1109/TMM.2020.2990075
Yang C, Chen Q, Yang Y, Zhang J, Wu M, Mei K (2022) Sdf-slam: A deep learning based highly accurate slam using monocular camera aiming at indoor map reconstruction with semantic and depth fusion. IEEE Access 10:10259–10272
Yao F, Qiu L (2021) Facial expression recognition based on convolutional neural network fusion sift features of mobile virtual reality. Wirel Commun Mob Comput. https://doi.org/10.1155/2021/5763626
Ye X, Yan B, Liu B, Wang H, Qi S, Chen D, Wang P, Wang K, Sang X (2022) Improved real-time three-dimensional stereo matching with local consistency. Image Vis Comput 124:104509. https://doi.org/10.1016/j.imavis.2022.104509
Ye Z, Li G, Yao B, Xian C (2020) Hao-cnn: Filament-aware hair reconstruction based on volumetric vector fields. Comput Animat Virtual Worlds. https://doi.org/10.1002/cav.1945
Yi Z, Chang T, Li S, Liu R, Zhang J, Hao A (2019) Scene-aware deep networks for semantic segmentation of images. IEEE Access 7:69184–69193. https://doi.org/10.1109/ACCESS.2019.2918700
You JK, Hsu CCJ, Wang WY, Huang SK (2021) Object pose estimation incorporating projection loss and discriminative refinement. IEEE Access 9:18597–18606. https://doi.org/10.1109/ACCESS.2021.3054493
Yu L, Qiao B, Zhang H, Yu J, He X (2022) Ltst: long-term segmentation tracker with memory attention network. Image Vis Comput 119:104374. https://doi.org/10.1016/j.imavis.2022.104374
Yuan X, Tang D, Liu Y, Ling Q, Fang L (2017) Magic glasses: from 2d to 3d. IEEE Trans Circuits Syst Video Technol 27(4):843–854. https://doi.org/10.1109/TCSVT.2016.2556439
Yuan H, Zhang D, Wang W, Li Y (2020) A sampling-based 3d point cloud compression algorithm for immersive communication. Mobile Netw Appl 25(5, SI):1863–1872. https://doi.org/10.1007/s11036-020-01570-y
Yuan G, Liu X, Yan Q, Qiao S, Wang Z, Yuan L (2021) Hand gesture recognition using deep feature fusion network based on wearable sensors. IEEE Sens J 21(1):539–547. https://doi.org/10.1109/JSEN.2020.3014276
Yuanyuan S, Yunan L, Xiaolong F, Kaibin M, Qiguang M (2021) Review of dynamic gesture recognition. Virtual Real Intell Hardware 3(3):183–206
Yue M, Fu G, Wu M, Zhang X, Gu H (2022) Self-supervised monocular depth estimation in dynamic scenes with moving instance loss. Eng Appl Artif Intell 112:104862. https://doi.org/10.1016/j.engappai.2022.104862
Yu P, Guo J, Huang F, Chen Z, Wang C, Zhang Y, Guo Y (2023) Shadowmover: automatically projecting real shadows onto virtual object. IEEE Trans Vis Comput Graph 29. https://doi.org/10.1109/TVCG.2023.3247066
Zadeh SM, Francois T, Calvet L, Chauvet P, Canis M, Bartoli A, Bourdel N (2020) Surgai: deep learning for computerized laparoscopic image understanding in gynaecology. Surg Endoscopy Other Intervent Tech 34(12):5377–5383. https://doi.org/10.1007/s00464-019-07330-8
Zeng Z, Wu M, Zeng W, Fu C-W (2020) Deep recognition of vanishing-point-constrained building planes in urban street views. IEEE Trans Image Process 29:5912–5923. https://doi.org/10.1109/TIP.2020.2986894
Zeng H, He X, Pan H (2021) Implementation of escape room system based on augmented reality involving deep convolutional neural network. Virtual Real 25(3):585–596. https://doi.org/10.1007/s10055-020-00476-0
Zhang X, Aliaga D (2022) Rfcnet: enhancing urban segmentation using regularization, fusion, and completion. Comput Vis Image Underst 220:103435. https://doi.org/10.1016/j.cviu.2022.103435
Zhang H, Cao Q (2019) Holistic and local patch framework for 6d object pose estimation in RGB-D Images. Comput Vis Image Underst 180:59–73. https://doi.org/10.1016/j.cviu.2019.01.005
Zhang H, Chi L (2020) End-to-end spatial transform face detection and recognition. Virtual Real Intell Hardw 2:119–131. https://doi.org/10.1016/j.vrih.2020.04.002. (Special issue on Visual interaction and its application)
Zhang Y, Fei G (2019) Overview of 3d scene viewpoints evaluation method. Virtual Real Intell Hardw 1(4):341–385
Zhang S, Xiao N (2021) Detailed 3d human body reconstruction from a single image based on mesh deformation. IEEE Access 9:8595–8603. https://doi.org/10.1109/ACCESS.2021.3049548
Zhang X, Jiang Z, Zhang H (2019) Real-time 6d pose estimation from a single RGB image. Image Vis Comput 89:1–11. https://doi.org/10.1016/j.imavis.2019.06.013
Zhang X, Jiang Z, Zhang H (2020) Out-of-region keypoint localization for 6d pose estimation. Image Vis Comput 93:103854. https://doi.org/10.1016/j.imavis.2019.103854
Zhang Y, Fei G, Yang G (2020) 3d viewpoint estimation based on aesthetics. IEEE Access 8:108602–108621. https://doi.org/10.1109/ACCESS.2020.3001230
Zhang W, Su C, He C (2020) Rehabilitation exercise recognition and evaluation based on smart sensors with deep learning framework. IEEE Access 8:77561–77571. https://doi.org/10.1109/ACCESS.2020.2989128
Zhang Z, Hu L, Deng X, Xia S (2020) Weakly supervised adversarial learning for 3d human pose estimation from point clouds. IEEE Trans Visual Comput Graphics 26(5):1851–1859. https://doi.org/10.1109/TVCG.2020.2973076
Zhang Y, David P, Foroosh H, Gong B (2020) A curriculum domain adaptation approach to the semantic segmentation of urban scenes. IEEE Trans Pattern Anal Mach Intell 42(8):1823–1841. https://doi.org/10.1109/TPAMI.2019.2903401
Zhang Z, Dai Y, Sun J (2020) Deep learning based point cloud registration: an overview. Virtual Real Intell Hardw 2(3):222–246
Zhang J, Liu J, Liu X, Wei J, Cao J, Tang K (2021) Feature interpolation convolution for point cloud analysis. Comput Graph 99:182–191. https://doi.org/10.1016/j.cag.2021.06.015
Zhang T, Jin B, Jia W (2022) An anchor-free object detector based on soften optimized bi-directional FPN. Comput Vis Image Underst 218:103410. https://doi.org/10.1016/j.cviu.2022.103410
Zhang Z, Hu Y, Yu G, Dai J (2023) Deeptag: a general framework for fiducial marker design and detection. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2022.3174603
Zhang A, Li S, Wu J, Li S, Zhang B (2023) Exploring semantic information extraction from different data forms in 3d point cloud semantic segmentation. IEEE Access. https://doi.org/10.1109/ACCESS.2023.3287940
Zhang T, Li N, Gong G, Yang C, Hou G, Lin X (2023) Ccvo: Cascaded cnns for fast monocular visual odometry towards the dynamic environment. IEEE Robot Automat Lett. https://doi.org/10.1109/LRA.2022.3214790
Zhao X, Tang F, Wu Y (2019) Real-time human segmentation by Bowtienet and a slam-based human AR system. Virtual Real Intell Hardw 1:511–524. https://doi.org/10.1016/j.vrih.2019.08.002. (3D Vision)
Zhao G, Hu J, Xiao W, Zou J (2021) A mask r-CNN based method for inspecting cable brackets in aircraft. Chin J Aeronaut 34(12):214–226. https://doi.org/10.1016/j.cja.2020.09.024
Zhao J, Chalmers A, Rhee T (2021) Adaptive light estimation using dynamic filtering for diverse lighting conditions. IEEE Trans Visual Comput Graphics 27(11):4097–4106
Zhao M, Xiong G, Zhou M, Shen Z, Liu S, Han Y, Wang F-Y (2022) Pcunet: a context-aware deep network for coarse-to-fine point cloud completion. IEEE Sens J 22(15):15098–15110
Zheng L, Liu X, An Z, Li S, Zhang R (2020) A smart assistance system for cable assembly by combining wearable augmented reality with portable visual inspection. Virtual Real Intell Hardw 2:12–27. https://doi.org/10.1016/j.vrih.2019.12.002
Zherdev D, Zherdeva L, Agapov S, Sapozhnikov A, Nikonorov A, Chaplygin S (2021) Producing synthetic dataset for human fall detection in AR/VR environments. Appl Sci. https://doi.org/10.3390/app112411938
Zhou D, Feng S (2022) M3spcanet: a simple and effective convnets with unsupervised predefined filters for face recognition. Eng Appl Artif Intell 113:104936. https://doi.org/10.1016/j.engappai.2022.104936
Zhou W, Jiang X, Liu Y-H (2019) Mvpointnet: multi-view network for 3d object based on point cloud. IEEE Sens J 19(24):12145–12152. https://doi.org/10.1109/JSEN.2019.2937089
Zhou W, Jiang W, Bian W, Jie B (2019) Webvr human-centered indoor layout design framework using a convolutional neural network and deep q-learning. IEEE Access 7:185773–185785. https://doi.org/10.1109/ACCESS.2019.2961368
Zhou W, Jia J, Huang C, Cheng Y (2020) Web3d learning framework for 3d shape retrieval based on hybrid convolutional neural networks. Tsinghua Sci Technol 25(1):93–102. https://doi.org/10.26599/TST.2018.9010113
Zhou W, Liu G, Shi J, Zhang H, Dai G (2020) Depth-guided view synthesis for light field reconstruction from a single image. Image Vis Comput 95:103874. https://doi.org/10.1016/j.imavis.2020.103874
Zhou M, Chen W, He T, Zhang Q, Shen J (2021) Scan-free end-to-end new approach for snapshot camera spectral sensitivity estimation. Opt Lett 46(23):5806–5809. https://doi.org/10.1364/OL.440549
Zhu Y, Zhai G, Yang Y, Duan H, Min X, Yang X (2021) Viewing behavior supported visual saliency predictor for 360 degree videos. IEEE Trans Circuits Syst Video Technol 32(7):4188–4201
Zhu F, Xu J, Yao C (2022) Local information fusion network for 3d shape classification and retrieval. Image Vis Comput 121:104405. https://doi.org/10.1016/j.imavis.2022.104405
Zhu L, Chen Z, Wang B, Tian G, Ji L (2023) Sfss-net: shape-awared filter and sematic-ranked sampler for voxel-based 3d object detection. Neural Comput Appl. https://doi.org/10.1007/s00521-023-08382-7
Zou J, Zhang H (2019) New key point detection technology under real-time eye tracking. Mechatron Syst Control 47(2):71–76. https://doi.org/10.2316/J.2019.201-2969
Zou Z, Chen K, Shi Z, Guo Y, Ye J (2023) Object detection in 20 years: a survey. Proc IEEE 111(3):257–276. https://doi.org/10.1109/JPROC.2023.3238524
Zou N, Xiang Z, Chen Y, Chen S, Qiao C (2020) Simultaneous semantic segmentation and depth completion with constraint of boundary. Sensors 20(3). https://doi.org/10.3390/s20030635
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Contributions
All the authors contributed equally to this work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Cortes, D., Bermejo, B. & Juiz, C. The use of CNNs in VR/AR/MR/XR: a systematic literature review. Virtual Reality 28, 154 (2024). https://doi.org/10.1007/s10055-024-01044-6