1 Introduction

Virtual reality (VR)-based training is increasing in popularity and has been explored in recent years across domains like education (Radianti et al. 2020), rehabilitation (Howard 2017), and various industries targeting adult learners (Abich et al. 2021; Radhakrishnan et al. 2021b; Renganayagalu et al. 2021; Xie et al. 2021). VR-based skill training offers several advantages, such as allowing learners to practice procedures safely and repeatedly with consistent feedback (Hamilton et al. 2021). For example, a Cochrane meta-analysis of studies investigating the effectiveness of VR training in endoscopy skills found that VR training was more effective than no training and as effective as physical training (Khan et al. 2019). The advantages of VR training are being further enhanced by the increasingly widespread availability of immersive VR (IVR) technologies, which make use of CAVE (Cave Automatic Virtual Environment) systems or head-mounted displays (HMDs) to offer high-fidelity audiovisuals to the user (Makransky et al. 2019). The immersion and presence offered by IVR further enhance its effectiveness, particularly when the affordances of IVR are matched with the teaching/training method (Makransky and Petersen 2021). It must be noted that IVR still has limitations in comparison to physical reality, notably differences in visual acuity and field of view, and the presence of cybersickness, the latter possibly linked to differences in vestibular response (Ashiri et al. 2020). As the evidence for the effectiveness of IVR over other methods is mixed (Abich et al. 2021; Radhakrishnan et al. 2021b), one may ask: how can IVR training be improved?

IVR training primarily makes use of easily observable training/test performance metrics like task completion time and the number of errors (Abich et al. 2021; Radhakrishnan et al. 2021b). In addition to such objective measures, the literature on skill training outside of IVR has also investigated the links between arousal and performance (Storbeck and Clore 2008; Yerkes and Dodson 1908). The term arousal refers to many related phenomena like an increase in alertness, attention, emotion, or the ability to respond to stimuli through motor movements (Calderon et al. 2016). Arousal levels are measured using both subjective (questionnaires) and objective methods (sensors). Existing biosensing technologies can measure pupil dilation, heart rate, electro-dermal activity, brain activity, skin temperature, respiration rate, and other measures of the body’s autonomic arousal. IVR literature provides several examples where arousal levels are incorporated into studies on social anxiety (Owens and Beidel 2015), treatment of phobias (Diemer et al. 2016), presence (Terkildsen and Makransky 2019), and other studies of emotions and behavior (Marín-Morales et al. 2018; Syrjämäki et al. 2020). However, there are only a few instances in immersive and non-immersive VR training literature where arousal levels are measured and then linked to performance (Parong and Mayer 2021; Wu et al. 2010). Such research would open up new avenues for advancing the state of the art, particularly aided by the increasing availability of cost-effective biosensors that can measure physiological arousal and their integration with commercial IVR technologies (e.g., HP Reverb, OpenBCI Galea). If such links can be established, IVR training itself may be further enhanced with adaptation (Zahabi and Abdul Razak 2020) by changing the parameters of the training environment to increase or decrease the trainee’s arousal levels and performance.

This paper adds to the body of the literature on motor skill training in IVR with a between-subjects fine motor skill training experiment. With the aid of N = 87 participants, we compared the effectiveness of IVR against physical training conditions with a focus on performance and arousal. The latter is achieved with the use of wearable biosensors which measure physiological arousal in the form of electro-dermal activity (EDA) and electrocardiogram (ECG) signals. These were recorded from all participants across the two conditions. Furthermore, the study investigated improvements in performance after training along with subjective measures of immersion, presence, enjoyment, self-efficacy, and task load.

2 Related works

2.1 Training in virtual reality

Virtual reality has been described as a collection of technologies that creates synthetic and interactive three-dimensional environments (Mikropoulos and Natsis 2011). These technologies range from highly immersive ones like head-mounted displays (HMDs) and CAVEs to devices providing a comparatively lower level of immersion like desktops and smartphone displays. Technological advances have resulted in HMDs becoming more popular in recent years, which in turn has increased interest in their applications in education and training (Checa and Bustillo 2020; Makransky and Petersen 2021). However, research suggests that IVR training should not simply be implemented as a one-size-fits-all solution; instead, it works best when the design factors of the training environment complement the capabilities provided by the IVR hardware (Jensen and Konradsen 2018).

Learning/training in immersive virtual environments extends across many domains like school/university education, rehabilitation training for patients, professional training for doctors, and office/industrial workers, where it focuses on diverse kinds of cognitive, affective, and motor skills (Jensen and Konradsen 2018). For this study, we limit the discussion to training literature focusing on teaching various cognitive and motor skills to healthy individuals. The literature on cognitive skills taught in IVR primarily relates to school and college education (Hamilton et al. 2021), as well as teaching procedural and safety knowledge primarily for industrial training purposes (Feng et al. 2018; Patle et al. 2019). On the other hand, motor skill training literature in IVR has been dominated by medical use cases, particularly in the surgical and dental domains which require fine motor skills (Radhakrishnan et al. 2021b). IVR-based motor skill training researchers have investigated the relative advantages that IVR-based training has over other training media (physical training, video training, etc.) or over variations within IVR, like different levels of visual/haptic fidelity (Huber et al. 2018; Jain et al. 2020), participant characteristics (Shakur et al. 2015), and training methods (Harvey et al. 2019). The results of these studies have been varied. For example, Pulijala et al. (2018) found IVR to be more effective than video/presentation training, and Hooper et al. showed IVR to be more effective than physical training for hip arthroplasty surgery. Butt et al. (2018) observed the same advantage of IVR over physical training for catheter insertion training, but the advantage disappeared after a week. Huber et al. (2018) found IVR to be as effective as an ‘augmented’ VR condition. In a comparison of IVR to desktop VR training, Frederiksen et al. (2020) found that IVR was inferior in its effectiveness and caused more cognitive load among students of laparoscopic surgery. Thus, whether IVR training can be as effective as or more effective than other types of training is inconclusive so far and remains an open research topic (Checa and Bustillo 2020), more so in the case of IVR-based motor skill training (Coban et al. 2022). This need inspired the first research question addressed in this work: RQ 1: Is IVR training as effective as physical training in improving task performance?

In order to answer this research question, it is important to include observable measures signifying training effectiveness (Magill and Anderson 2016); for example, performance metrics like time for task completion, and quality metrics like the number of mistakes/errors (Abich et al. 2021; Radhakrishnan et al. 2021b; Wulf et al. 2010). While measuring such performance metrics, trainees may be tested before and after training to measure their performance improvement (Magill and Anderson 2016, p. 269). When the tests are performed in a physical setting, they provide a measure of the transfer of skills from the virtual to the real environment, which has been argued in the literature to be crucial in establishing the effectiveness of IVR training (Jensen and Konradsen 2018; Levac et al. 2019).

Subjective measures have been linked to the effectiveness of learning/training in IVR environments in the Cognitive Affective Model of Immersive Learning (CAMIL) (Makransky and Petersen 2021). The CAMIL framework suggests that there are two affordances to learning in immersive VR, namely presence (arising from immersion) and agency (arising from interactivity), which affect six other factors, i.e., interest, motivation, self-efficacy, embodiment, cognitive load, and self-regulation, which in turn affect the effectiveness of IVR training. Popular subjective measures from IVR training literature include measures of cognitive load, like the NASA Task Load Index (NASA-TLX) (Hart and Staveland 1988), measures of immersion, like the Immersive Tendencies Questionnaire (ITQ), measures of presence, like the presence questionnaire (Witmer and Singer 1998), measures of usability, like the System Usability Scale (SUS) (Brooke 1996), measures of cyber/motion sickness, like the Simulator Sickness Questionnaire (SSQ) (Kennedy et al. 1993), and measures of self-efficacy (Lehikko 2021; Pintrich 1991). It should be noted that while ‘Immersion’ is an objective measure of how vivid the VR technology can be made (for example, IVR is more immersive than desktop VR), ‘Presence’ is understood to be a subjective measure of experience by users which arises from both immersion and interactivity in VR (Makransky and Petersen 2021). Cognitive load is also crucial to understanding the effectiveness of VR in comparison to other media, as it is negatively correlated with learning/training effectiveness (Koumaditis et al. 2020; Van Merriënboer and Sweller 2010). Another subjective measure of importance is ‘self-efficacy,’ defined as the subjective belief people have about their own ability to fulfill a task (Bandura 1986). Self-efficacy measures are gaining more attention in the literature, as self-efficacy has been positively linked to the IVR modality and learning outcomes (Shu et al. 2019; Tai et al. 2022). Therefore, it is important to measure the subjective perception of trainees in different training modalities in order to investigate their relationship with training effectiveness. This need generates the second research question: RQ 2: Is there a significant difference in the enjoyment, presence, immersion, task load, and changes in self-efficacy reported by participants in IVR compared to physical training?

IVR training is used in various contexts of motor skills. These can be broadly categorized as context-specific or context-independent. Many examples of context-specific IVR training are found in the medical and surgical domains, where the procedure being trained can easily be used for the same procedure in the real world but rarely in other contexts. An example from the non-medical domain is Winther et al. (2020), who explored the effectiveness of IVR-based training vs conventional training for a pump maintenance task. Such context-specific explorations result in findings that can be applied in the real world easily but have limited external validity, i.e., they are hard to generalize to other contexts. An advantage of studies employing context-independent scenarios is therefore that the results are often easier to generalize and transfer to related domains. Examples exist in the IVR motor skill training literature that use more context-independent scenarios like puzzle assembly (Carlson et al. 2015; Koumaditis et al. 2020; Murcia-Lopez and Steed 2018). Though such examples are not related to real-world tasks or scenarios, it can be argued that such studies and skill training scenarios may generate results that are more generalizable and transferable to related domains. Inspiration can be found in laparoscopy surgical training literature, where the use of box trainers, which are highly simplified representations of the tasks involved in laparoscopy, is widespread (Aggarwal et al. 2004). In this paper, we identify a fine motor skill task (buzz-wire or wire loop game) inspired by the literature, where it was previously investigated in ergonomics research (Shafti et al. 2016) and in the domains of motor control (Luvizutto et al. 2022; Read et al. 2013) and rehabilitation (Budini et al. 2014; Christou et al. 2018). In this task, the aim is to move a metallic loop along a wire without the two coming into contact. Immediate feedback is provided when a mistake is made in the form of a loud ‘buzz’ and, in some cases, a blinking red light in the background. The wire is bent at different locations, which makes the task challenging to perform while maintaining a steady hand (Shafti et al. 2016). Read et al. (2013) found that a buzz-wire setup was effective in assessing the relation between manual dexterity and binocular vision. Budini et al. (2014) used buzz-wire training along with hand postural exercises for patients with hand tremors in their experiment and found improvements in goal-directed tasks. Christou et al. (2018) present the only example of research using the buzz-wire setup in an IVR environment, designed as an exercise tool for patients who have suffered stroke and other brain trauma. Similar to Read et al. (2013), they found that the presence of binocular viewing is correlated with increased performance and also that they could distinguish between dominant and non-dominant hand performance. Furthermore, the details provided by Christou et al. (2018) on designing increasing levels of buzz-wire task complexity inspired the current study.

2.2 Arousal and learning

Though the terms ‘arousal’ and ‘emotion’ have been used interchangeably in the literature, arousal is one aspect of emotion, along with valence (ranging from negative to positive), according to dimensional models of emotion (Posner et al. 2005; Rubin and Talarico 2009). Similarly, the terms ‘stress’ and ‘anxiety’ have also been used to denote high arousal states with a negative valence (Janelle 2002; Pakarinen et al. 2019). Multiple methods have been used to measure arousal levels, both subjective (Bradley and Lang 1994) and objective (Cacioppo et al. 2007). Among subjective techniques, subjects report their degree of arousal using instruments like the Self-Assessment Manikin (SAM) (Bradley and Lang 1994) and the Stress Arousal Checklist (Mackay et al. 1978). Such questionnaires are usually administered post-exposure and depend on the user’s knowledge of their own arousal levels, their memory of the task, and comprehension of the questions. On the other hand, objective measures of arousal are a function of the body’s autonomic nervous system, which produces measurable responses reflecting the user’s emotional and cognitive state. These include changes in skin conductivity (electro-dermal activity/EDA due to sweating), heart rate parameters (heart rate variability/HRV), respiration, skin temperature, pupil dilation, and brain activity (Cacioppo et al. 2007). These biosignals can be measured by sensors placed on the body (usually non-invasive) to provide measures of physiological arousal. Objective biosignal data also allow for a more fine-grained look at variations in the subject’s arousal levels during a study using measures like Event-Related Potentials (ERPs) in EEG, Skin Conductance Responses (SCRs) in EDA, and Inter-beat Intervals (or R–R intervals) in heart rate variability data, among many others, where each signal can be used in isolation or be coupled with others in order to increase accuracy (Cacioppo et al. 2007).

Arousal levels may have links to performance and learning outcomes, but limited empirical support is to be found. It has been hypothesized that an individual’s experience of arousal affects attention, perception of time, and memory (Storbeck and Clore 2008), and that there is a non-linear ‘inverted U-shaped’ relationship between arousal levels and performance (Yerkes and Dodson 1908). However, the results have been inconclusive in validating this hypothesis (Storbeck and Clore 2008). Some examples from the literature point to a link between high arousal and better training performance (Homer et al. 2019; Matthews and Margetts 1991; Ünal et al. 2013). On the other hand, some explorations related to training have found that low arousal leads to better improvements in performance (Kuan et al. 2018; Pavlidis et al. 2019; Prabhu et al. 2010; Quick et al. 2017). The link between arousal and learning/training adds a further layer of complexity since the effectiveness of training is measured not by task performance alone but by changes in performance across different periods, usually as a change in performance before and after training (learning gain). Movahedi et al. (2007) illustrate this complexity in a sports training context where they found that participants performed worse during a retention test when their arousal levels during the test were mismatched with the arousal levels (either high or low) during training.

The use of physiological data to measure arousal levels in IVR literature is rare; however, some representative examples that use heart rate-related metrics for measuring arousal include Muñoz et al. (2019), where HRV metrics (along with EEG data) were used to detect calmness states among participants using an IVR target shooting simulator, Cebeci et al. (2019), where eye tracking and heart rate were used to measure the impact of different virtual environments on factors like cybersickness and emotions among study participants, and Larmuseau et al. (2020), where HRV along with EDA and skin temperature were used to measure cognitive load among students learning statistics online. In the use of EDA data, some illustrative examples include understanding how soldiers respond to threatening stimuli during IVR training (Binsch et al. 2021), detecting student stress levels during a physics course (non-VR) (Pijeira-Díaz et al. 2018), and measuring EDA responses to insights made by participants in an IVR learning environment (Collins et al. 2019). There are currently only a few examples in IVR literature on the exploration of physiological arousal levels and their connection to fine motor skill training in virtual reality. One example is from a science education scenario where it was shown that learning in IVR leads to higher arousal and subsequently lower scores on a retention test (Parong and Mayer 2021). Another example is from non-immersive VR, where a Stroop interference task induced arousal in participants during a virtual driving task and the authors then identified the optimal arousal levels related to increased performance (Wu et al. 2010). Therefore, a research gap exists in the literature for understanding the link between motor skill training in IVR, improvements in performance due to the training, and physiological arousal levels of the trainees. The following research questions were generated in order to address this gap: RQ 3: Is there a significant difference between the physiological arousal levels of participants in IVR training compared to physical training? RQ 4: Is there a link between physiological arousal during training and improvements in performance after training? The next section details the design of the experiment that helps address these questions.

3 Methods

The experiment contained three phases, as depicted in Fig. 1: a pre-training phase common to all conditions, in which a pre-test of the motor skill was performed; a training phase, in which the participants were randomly assigned to either the VR or the physical training condition; and a post-training phase, in which a post-test of the motor skill was performed for participants from both training conditions. The following sub-sections detail the motor skill task, the two experimental conditions, the pre-test and post-test tasks, the physiological and performance data measured during the experiment, as well as the subjective data reported by the participants. The section ends with a detailed description of the experimental procedure shown in Fig. 1.

Fig. 1 Overview of experiment procedure

3.1 Motor skill task

In this study, the trainee is asked to grab the apparatus as shown in Fig. 2 and guide the metallic loop along a wire as fast as possible with the least amount of contact between the loop and the wire. There are two variations of the task, which differ in the feedback provided when the loop touches the wire, i.e., when a mistake is made. In the training task, when the participant makes such a mistake, three kinds of feedback are provided simultaneously:

  • Haptic feedback in the form of vibration in the Oculus Quest’s Touch controller. Vibration is set to the maximum frequency and amplitude available in the Oculus SDK and delivered for 1/10th of a second.

  • Auditory feedback is provided by playing a continuous 1000 Hz sine wave tone at 39 dB over the headphones worn by the participant (Sony WH-CH710N). Sound levels were verified and maintained across participants using the NIOSH iPhone app (National Institute for Occupational Safety and Health Sound Level Meter App).

  • Visual feedback is provided by switching on a red LED (Fig. 2) placed at eye level behind the wire.

Fig. 2 Physical training condition. Left: participant moving loop across the wire in level 4. When the loop touches the wire, the participant receives audio, haptic, and visual feedback. Right: The four levels of training

The training task in the physical and VR conditions is spread across four levels of increasing difficulty, with difficulty being specified as an increase in the complexity of the wrist movements needed to complete a level (see Table 1). For example, a wire with fewer bends requires less wrist movement, which in turn may produce fewer mistakes (i.e., the loop touching the wire), and the task may be completed (moving from start to finish) more quickly than with a wire with more bends. This was verified in a previously published pilot study (Radhakrishnan et al. 2021a). These four levels were intended to help the participants train themselves, i.e., to develop the skills required to perform the test task more effectively. It should be noted that no instructions were provided in either condition, so as to facilitate the training by letting participants construct their own strategies for improving their skill level, subject to the constraints of the environment.

Table 1 Training levels

3.1.1 Training in physical condition

The wire in each training level rests on two 20-cm tall pillars to provide better task ergonomics for participants (verified in a pilot test). Two black vertical wooden panels are placed at right angles on the wooden base (Fig. 2), and the entire setup is painted black to reduce visual distractions. The start and end positions are shaped like cylinders with grooves inside in which the loop can be placed. An Arduino Uno placed in a microcontroller box is used to detect contact between the loop and the wire (denoting mistakes) using a simple switch circuit. A ‘mistake’ signal is transmitted serially to the PC when the loop touches the wire. Two similar switch circuits are used to detect contact between the loop and the grooves on both the start and finish positions. When the participant lifts the loop off the start position, a ‘start task’ signal is transmitted by the contact circuit to the PC; similarly, an ‘end task’ signal is transmitted when the loop is placed in the end position. The loop is made by bending a 1-mm-thick metal wire into a ring with a diameter of 2.5 cm. The loop is then screwed to a 3D printed handle (adapted from Lagos (2019)) that houses an Oculus controller (Fig. 2) to provide haptic feedback.
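As an illustration of how the three contact signals could be turned into performance metrics, the following is a minimal Python sketch. It assumes that the receiving software has already attached a timestamp (in seconds) to each signal; the tuple layout, the function name `summarize_trial`, and the rule of counting every ‘mistake’ signal between start and end as one mistake are assumptions made for illustration and are not taken from the study's software.

```python
from dataclasses import dataclass


@dataclass
class TrialMetrics:
    completion_time_s: float  # time between 'start task' and 'end task'
    mistakes: int             # number of 'mistake' signals within the trial


def summarize_trial(events):
    """Summarize one trial from time-ordered (timestamp_s, signal_name) tuples.

    Hypothetical helper: signals seen before 'start task' are ignored, and
    processing stops at the first 'end task' that follows the start.
    """
    start = None
    end = None
    mistakes = 0
    for timestamp_s, name in events:
        if name == "start task" and start is None:
            start = timestamp_s
        elif name == "mistake" and start is not None:
            mistakes += 1
        elif name == "end task" and start is not None:
            end = timestamp_s
            break
    if start is None or end is None:
        raise ValueError("trial needs both a 'start task' and an 'end task' signal")
    return TrialMetrics(completion_time_s=end - start, mistakes=mistakes)
```

A real contact circuit may emit several signals for a single touch (contact bounce), so a practical implementation would probably also need some form of debouncing, which this sketch omits.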

3.1.2 Training in IVR condition

Participants in the VR condition wore an Oculus Quest (1st generation) head-mounted display (HMD) (Fig. 3) connected to a PC and running in Rift mode. The VR environment was developed in the Unity3D (version 2019.4) game engine to closely resemble the physical environment. The wires (for each training level) and the loop were designed using the Blender3D design software. The participants were presented with the same four levels in VR as in the physical condition. They held a physical handle containing the loop and the Oculus controller (like those in the test and physical training tasks). The position and rotation of both the controller and the HMD are provided by the Oculus SDK and are then used to move the virtual loop and the participant’s viewpoint in the three-dimensional space of the virtual environment (see Fig. 3).

Fig. 3 a VR training environment. Virtual loop moving across level 2. b Ghost loop appears when contact is made, and the ‘real’ loop goes outside the wire. It disappears when the loop is placed back inside the wire. Visual feedback in form of a red ‘X’ mark in the background also turns on during contact. c Participant in VR condition wearing an Oculus Quest HMD (Rift mode)

The ‘Measurements’ asset from the Unity Asset Store was used to scale and position objects identically to their real-world counterparts (Vrchewal 2020). Both haptic and audio feedback modalities used the same parameters as the physical condition, and the visual feedback was in the form of a red 3D light behind each wire turning on during contact between the virtual wire and the virtual loop (Fig. 3b). Like the physical condition, ‘start task,’ ‘end task,’ and ‘mistake’ signals were sent to the data collection module (Fig. 5). Physics collision meshes were defined on the 3D models of the loop, the start and end positions, and the wires across the four levels. Collision tests were performed by Unity’s inbuilt physics engine at 60 Hz.

Though the VR condition mimics the physical, there are unavoidable differences between the two conditions:

  • Ghost effect during mistakes: When the participant makes a mistake, i.e., the loop touches the wire, there is nothing to physically restrict the participant’s hand, unlike the physical condition where there is an actual wire to provide resistance. Though there is haptic vibration when contact is made, by the time the mistake is made, the loop would have passed through the wire, creating an unrealistic effect for the participant which could potentially break their feeling of immersion (i.e., ‘being there’). To solve this, a ‘ghost effect’ has been programmed to show a blue translucent loop at the contact position where the actual loop passes through the virtual wire (Fig. 3b). This helps the participant understand how to bring their loop back into the wire, at which point the blue translucent ‘ghost’ disappears.

  • VR familiarization: Participants were first exposed to a VR task to help them familiarize themselves with the movement of the virtual loop before starting the actual training. This is to avoid any negative outcomes from the novelty effect of using IVR among novice users (Hamilton et al. 2021). They were encouraged to intentionally make mistakes to learn the functionality of the ghost effect. The task is in the form of a straight wire which has no bends so that there is no unintended extra ‘training effect’ for participants in the VR condition.

  • Differences in media: In addition to the above two features, which distinguish the VR condition from the physical one, there are other differences arising from the nature of the VR medium itself. For example, the field of view and the visual acuity provided by the Quest HMD are lower than those of healthy human vision (Adhanom et al. 2021; Cuervo et al. 2018). Additionally, the weight of the HMD has not been replicated in the physical condition.
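The ghost-effect behaviour described in the list above can be sketched as a small per-frame state tracker. The study's implementation runs inside Unity; the Python sketch below is only a language-neutral illustration. The three state labels, the function name `track_ghost`, and the rule that one mistake is counted per excursion away from the ‘inside’ state are all assumptions, not details reported for the actual software.

```python
# Hypothetical per-frame states of the virtual loop relative to the wire.
INSIDE = "inside"    # wire passes through the loop without touching it
CONTACT = "contact"  # loop collider overlaps the wire collider
OUTSIDE = "outside"  # loop has passed through the wire


def track_ghost(frames):
    """Return (mistake_count, ghost_visible_per_frame) for a state sequence.

    A mistake is counted once whenever the loop leaves the INSIDE state;
    the ghost stays visible until the loop is brought back INSIDE.
    """
    mistakes = 0
    ghost_visible = []
    previous = INSIDE
    for state in frames:
        if previous == INSIDE and state != INSIDE:
            mistakes += 1  # new excursion: contact (or tunnelling) just began
        ghost_visible.append(state != INSIDE)
        previous = state
    return mistakes, ghost_visible
```

Counting on the transition out of ‘inside’, rather than on every contact frame, keeps a single prolonged touch from being registered as many mistakes when collisions are evaluated at a fixed rate.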

3.1.3 Test task

The wire in the test task is 52 cm long with eleven 90° bends in all three axes (x, y, and z) between the start and finish positions (see Fig. 4). Contact circuits like those used in the physical training setup are used here to detect contact between the loop and the wire as well as the corresponding start and end positions. The three contact signals ‘start task,’ ‘mistake,’ and ‘end task’ are serially transmitted to the PC, similar to the physical training setup (see 3.1.1). There is no ‘augmented’ feedback provided when the participant makes a mistake in the test condition, i.e., there is no haptic, visual, or auditory feedback other than the natural feedback of two metal pieces touching each other. Like the training setup, all parts of the test setup are painted black to provide a consistent background with fewer visual distractions. An Oculus controller is placed inside the handle containing the loop to mimic the weight of the controller in the physical and VR training setups, but it provides no haptic feedback. The same test task is used before and after the VR/physical training task as an objective measure of training effectiveness.

Fig. 4 Test task setup along with the loop attached to 3D printed handle containing a Quest controller

3.2 Sensors and data collection

Data collected during the experiment come from three kinds of sources: the biosensors, the task-related signals coming from the test and training setups, and subjective data recorded in an online survey (at the end of the experiment). The first two types of data are facilitated by:

  • iMotions: iMotions is a commercial software platform that supports data collection from commercial biosensors across many modalities (iMotions A/S, Copenhagen, Denmark). In this study, iMotions was used as the endpoint for storing all data coming through the dataflow pipeline shown in Fig. 5, as it integrates timestamped data from the two biosensors alongside performance-related data coming from the data collection module.

  • Data collection module: A data collection module was developed in C# on the Unity3D game engine, which collected task-related signals from the hardware setups (test and training) and the VR training software. Data from the hardware were read from two serial connections with a transmission rate of 9600 baud. The data collection module then transmitted the collected signals in real time to the iMotions biosensor platform via a TCP socket connection (Fig. 5).
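The forwarding step of the pipeline described above can be illustrated with a short Python sketch. The study's module is written in C# within Unity, and the message layout expected by iMotions is not described here; the semicolon-delimited line format and the function names `encode_event` and `forward_events` below are therefore purely hypothetical stand-ins. Only Python's standard `socket` module is used.

```python
import socket


def encode_event(timestamp_s, name):
    """Encode one task signal as a text line (hypothetical wire format)."""
    return f"{timestamp_s:.3f};{name}\r\n".encode("ascii")


def forward_events(events, sock):
    """Send time-ordered (timestamp_s, signal_name) tuples over a connected socket."""
    for timestamp_s, name in events:
        sock.sendall(encode_event(timestamp_s, name))


if __name__ == "__main__":
    # Local demonstration with a connected socket pair instead of a real server.
    sender, receiver = socket.socketpair()
    forward_events([(1.0, "start task"), (2.5, "mistake")], sender)
    sender.close()
    print(receiver.recv(1024))
    receiver.close()
```

In a real deployment the socket would instead be created with `socket.create_connection((host, port))`, where the host and port are whatever the receiving platform is configured to listen on.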

Fig. 5 Software architecture

Subsequent sub-sections discuss the biosensors used for measuring electro-dermal and heart rate signals, associated arousal metrics (3.2.1), performance metrics for measuring the effectiveness of training (3.3.2), and survey data to measure the subjective experience of training using online questionnaires (3.3.3).

3.2.1 Physiological sensing

For measuring the participant’s physiological arousal levels, the Polar H10 (heart rate) and the Shimmer GSR+ (skin conductance) sensors were used. Table 4 in the Appendix details all the physiological metrics used, their source, and their relationship with arousal according to the literature (Table 2).

Table 2 Physiological metrics, their source, and their relation to changes in arousal

3.2.1.1 Electrocardiogram (ECG) signals

The Polar H10 (Polar Electro Oy, Kempele, Finland) is an electrocardiogram (ECG)-based heart rate (HR) monitor designed for athletes. It has been clinically validated to be as effective as medical-grade ECG hardware (Gilgen-Ammann et al. 2019) and has been used in recent VR literature (Muñoz et al. 2019; Ventura et al. 2021). It is worn around the chest with electrodes placed in contact with the skin. The data in the form of heart rate and Inter-beat Intervals (R–R intervals) are transmitted via a Bluetooth Low Energy (BLE) connection at a rate of 1–2 Hz to the iMotions application running on a PC. Measures of heart rate variability including time and frequency domain metrics have been calculated using the hrv-analysis Python library (Champseix 2021).

Increases in arousal are indicated by increases in heart rate (time-domain) and frequency-domain measures like LF/HF (Low Frequency/High Frequency) ratio (Orsila et al. 2008; Slater et al. 2006). On the other hand, decreases in time-domain HRV measures like IBI (Inter-Beat Interval), SDNN (Standard Deviation of NN Intervals), RMSSD (Root Mean Square of Successive Difference), and the frequency domain measure HFN (Normalized High-Frequency Component) indicate an increase in arousal (Shaffer and Ginsberg 2017). All HRV metrics have been baseline corrected by subtracting from them the corresponding mean baseline values (Healey and Picard 2005; Wulfert et al. 2005).
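As a minimal sketch of the baseline correction described above, the following Python snippet computes a few time-domain measures from R–R intervals and subtracts the corresponding baseline values. It uses only the standard library rather than the hrv-analysis package used in the study; the function names and interval values are hypothetical, and the use of the sample standard deviation for SDNN is an assumption.

```python
import math
import statistics


def sdnn(rr_ms):
    """Standard deviation of the R-R (NN) intervals in ms (sample SD, an assumption)."""
    return statistics.stdev(rr_ms)


def rmssd(rr_ms):
    """Root mean square of successive differences between adjacent intervals, in ms."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_hr(rr_ms):
    """Mean heart rate in beats per minute, derived from the mean interval."""
    return 60000.0 / statistics.mean(rr_ms)


def baseline_corrected(task_rr_ms, baseline_rr_ms):
    """Subtract each baseline metric from the same metric measured during the task."""
    metrics = {"IBI": statistics.mean, "HR": mean_hr, "SDNN": sdnn, "RMSSD": rmssd}
    return {name: fn(task_rr_ms) - fn(baseline_rr_ms) for name, fn in metrics.items()}


# Hypothetical R-R intervals in milliseconds (illustrative values only).
baseline = [1000, 1020, 980, 1010, 990]
task = [800, 810, 790, 805, 795]
corrected = baseline_corrected(task, baseline)
```

In this made-up example, the corrected HR is positive while the corrected IBI, SDNN, and RMSSD are negative, which is the pattern the text associates with increased arousal.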

3.2.1.2 Electro-dermal activity (EDA) signals

The Shimmer GSR+ (Galvanic Skin Response) unit (Shimmer Research Ltd., Dublin, Ireland) measures EDA by passing a small current through electrodes placed in two locations on the body. The locations for the electrodes were verified in a pilot study where Shimmer electrodes were placed on the foot, the forehead, and the fingers of two participants, and the signals generated in response to stimuli were examined for signal quality and consistency. It was found that the index and middle fingers were the most reliable locations for sensing skin conductance, which matched recommendations from the literature on skin conductance sensing (van Dooren and Janssen 2012). The index and middle fingers of the left hand were chosen to allow study participants to use their right hand alone for moving the loop across the wires.

Popular EDA measures include SC (Skin Conductance), measured in microsiemens, which increases in response to an increase in arousal (Collet et al. 2005). An increase in arousal also leads to a higher rate of skin conductance response (SCR) peaks, which are peaks in the SC amplitude lasting between 1 and 5 s after onset (Krogmeier et al. 2019; Terkildsen and Makransky 2019). The SCRPeaks measure is calculated as the number of skin conductance response peaks per minute. Similarly, the mean peak amplitude of all SCR peaks (SCRAmp) is also a positive measure of arousal (Khalfa et al. 2002; Krogmeier et al. 2019). SC values were baseline corrected by subtracting the mean baseline value from them (Potter and Bolls 2012). All EDA signals were processed using the Neurokit2 Python library (Makowski et al. 2021).
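The three EDA measures defined above reduce to simple arithmetic once the SCR peaks have been detected. The sketch below assumes that peak detection has already been done (in the study, by Neurokit2) and that the peak amplitudes are available as a plain list; the function names and all values are hypothetical.

```python
import statistics


def scr_metrics(peak_amplitudes_us, duration_s):
    """Return SCR peaks per minute (SCRPeaks) and mean peak amplitude (SCRAmp)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    peaks_per_min = len(peak_amplitudes_us) * 60.0 / duration_s
    mean_amp = statistics.mean(peak_amplitudes_us) if peak_amplitudes_us else 0.0
    return {"SCRPeaks": peaks_per_min, "SCRAmp": mean_amp}


def baseline_corrected_sc(task_sc_us, baseline_sc_us):
    """Mean skin conductance during the task minus the mean baseline value."""
    return statistics.mean(task_sc_us) - statistics.mean(baseline_sc_us)


# Hypothetical example: six detected peaks (amplitudes in microsiemens) in 30 s.
metrics = scr_metrics([0.2, 0.1, 0.3, 0.2, 0.25, 0.15], duration_s=30.0)
sc_change = baseline_corrected_sc([2.5, 2.7, 2.6], [2.0, 2.2, 2.1])
```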

3.2.2 Improvement in performance

The data collection module collects signals generated from both physical and IVR setups, namely the ‘Start task,’ ‘End task,’ and ‘Mistake’ signals. These are used to calculate the following two measures of performance:

  • Task completion time (TCT) The time taken to move the loop from start to end.

  • Contact time (CT) The total time the loop is in contact with the wire during the task, which serves as a measure of the participant’s mistakes.

These two measures are then used to calculate the following measures of performance improvement:

  • Improvement in task completion time (TCT-I) This is calculated by subtracting the post-test TCT from the pre-test TCT for each participant. A positive value indicates an improvement in this performance metric.

  • Improvement in contact time (CT-I) This is calculated by subtracting the post-test CT from the pre-test CT for each participant. A positive value indicates an improvement in this performance metric.

  • Improvement Score (IS) Since the participants are asked to complete the test task by satisfying two potentially competing goals (minimizing both task completion time and contact time), participants may choose to prioritize one over the other. For example, a participant can choose to complete the task very slowly to minimize the chances of contact with the wire or vice versa. To balance out these two metrics, it is necessary to create a combined score metric that considers both improvements in task completion time (TCT-I) and contact time (CT-I). To calculate this measure, we first divide the two performance improvement measures, TCT-I and CT-I, into 10 equal-sized deciles for all participants across both conditions, transforming the values into scores from 1 to 10 where 1 denotes the least improvement in performance and 10 the most. Subsequently, IS for a participant is defined as the sum of these two scores. A hypothetical participant who has improved the most in both TCT-I (score = 10) and CT-I (score = 10) metrics would then get a final improvement score (IS) of 20.
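The scoring described in the list above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, the ten equal-sized bins are formed by rank, and ties are broken by input order, which is an assumption the text does not specify.

```python
def decile_scores(values):
    """Map each value to a score from 1 (least) to 10 (most) using ten equal-sized rank bins."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # stable sort: ties keep input order
    scores = [0] * n
    for rank, idx in enumerate(order):
        scores[idx] = rank * 10 // n + 1
    return scores


def improvement_scores(pre_tct, post_tct, pre_ct, post_ct):
    """Return the Improvement Score (IS) per participant as the sum of two decile scores."""
    tct_i = [pre - post for pre, post in zip(pre_tct, post_tct)]  # positive = faster
    ct_i = [pre - post for pre, post in zip(pre_ct, post_ct)]     # positive = less contact
    return [a + b for a, b in zip(decile_scores(tct_i), decile_scores(ct_i))]


# Hypothetical data for ten participants (times in seconds).
pre_tct = [20] * 10
post_tct = [20 - i for i in range(10)]      # participant 9 sped up the most
pre_ct = [10] * 10
post_ct = [10 - (9 - i) for i in range(10)]  # participant 0 reduced contact the most
is_scores = improvement_scores(pre_tct, post_tct, pre_ct, post_ct)
```

In this constructed example the participants who gained most in speed gained least in contact time, so every participant receives the same IS, which illustrates how the combined score balances the two goals.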

3.2.3 Subjective data

Subjective data were collected from all participants toward the end of the experiment using an online survey tool (Microsoft Forms) running on a laboratory PC. The different subjective metrics are listed below.

  • NASA Task Load Index (NASA-TLX) NASA Task Load Index (Hart and Staveland 1988) is a validated measure of workload across six dimensions (Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, and Frustration). The ‘raw’ version of the NASA-TLX without weighted rankings was given to the participants where the answer to each measure was on a scale of range 1–21 (Hart 2006).

  • Immersion Questionnaire The immersion questionnaire from Högberg et al. (2019) was adapted. Participants are asked to give answers on a Likert scale ranging from 1 to 7 (from strongly disagree to strongly agree). A combined Immersion Score is calculated by taking the average of all the responses to items. See Appendix for a list of all items in the questionnaire.

  • Presence Questionnaire The presence questionnaire was adapted from the physical presence subscale of the Multimodal Presence Scale (Makransky et al. 2017) and the telepresence questionnaire (Kim and Biocca 1997). Participants are asked to give answers on a Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree). A combined Presence Score is calculated by taking the average of all the responses to items (after reversing responses to inverse questions). See Appendix for a list of all items in the questionnaire.

  • Enjoyment The participants are asked to rate their agreement with the question ‘The training session was very enjoyable’ on a Likert scale from 1 (strongly disagree) to 7 (strongly agree).

  • Self-efficacy The participants are asked ‘How confident are you that you can perform a similar task effectively (go from start to finish as fast as you can with minimal mistakes) on a scale from 1 to 7?’ to measure self-efficacy, once before the training and once after training. Details are provided in the next section.
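The combined Immersion and Presence Scores described in the list above are averages of 7-point Likert responses, with inverse items reversed first. A minimal sketch follows; the item identifiers, the ratings, and the choice of which item is reversed are all hypothetical.

```python
import statistics

LIKERT_MIN, LIKERT_MAX = 1, 7


def scale_score(responses, reversed_items=()):
    """Average the ratings (item id -> rating), reversing inverse items first."""
    adjusted = []
    for item, rating in responses.items():
        if not LIKERT_MIN <= rating <= LIKERT_MAX:
            raise ValueError(f"rating for {item} is outside the 1-7 scale")
        if item in reversed_items:
            rating = LIKERT_MIN + LIKERT_MAX - rating  # 1 <-> 7, 2 <-> 6, ...
        adjusted.append(rating)
    return statistics.mean(adjusted)


# Hypothetical responses; treating "P2" as an inverse item is an assumption.
presence = {"P1": 6, "P2": 2, "P3": 5, "P4": 7}
presence_score = scale_score(presence, reversed_items={"P2"})
```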

3.3 Study procedure

Ethics approval was obtained from the local research ethics committee for experimenting with human subjects. The study was conducted in two rooms, one dedicated to IVR training and the other to physical training. Participants signed up for the study using the lab’s online participant recruitment system. The system automatically filtered the participants using the following criteria based on self-reported data (i.e., they were not medically certified or independently verified): (a) right-handed, (b) normal vision or corrected to normal vision with contact lenses, and (c) no mental illnesses or sensitivity to nausea. The requirement for right-handedness was added to eliminate variation in the setup. Participants signed up for 45 min timeslots of their choosing and were paid the equivalent of 15 Euros. Each condition/room was run by one researcher at a time. The researchers switched between them regularly to reduce investigator effects. The timeslots for both conditions were open from 9 a.m. to 5 p.m. on weekdays.

At the beginning of a session, the participants were asked to read and sign the consent form. They were then briefly familiarized with the experiment procedure by allowing them to practice on the first level of the physical training setup. Thus, all participants, independent of condition, were provided a chance to experience the physical setup (Fig. 2), the cue for starting each task (when they hear the word ‘Go’), and the proper way to lift the handle from the start position and to rest it on the end position. Thereafter, they were given the privacy to wear the Polar H10 around their chest as the researchers left the room. After this, the Shimmer GSR electrodes were placed on the index and middle fingers of the participant’s left hand. The participant placed her/his left hand on a Styrofoam support pad placed toward the left side of the table with the palm facing upwards and the fingers kept relaxed. The participant was asked not to move or flex her/his hand to minimize the noise in the recorded signals. The signal quality for both sensors was checked and verified in the iMotions software before the experiment started.

Baseline biosensor data were then measured by asking the participants to remain seated quietly and still with their eyes closed, without heavy breathing. The baseline HR and GSR data were then used to normalize subsequent signals since the baseline HR and GSR values for each person varied considerably. The participants were then presented with the test task before training began (detailed in Sect. 3.1.1). They started the test task after hearing the word ‘Go’ from the researcher. Upon completion, they were asked the question on self-efficacy. Following this, they were trained on four levels of increasing complexity in either VR or physical conditions (depending on the random assignment at the beginning of the experiment). In the physical condition, after each level of training, the researcher would rotate the wooden base by 90° (Fig. 2) so that the next level was facing the participant. This process took 10 to 15 s and was absent in the VR condition, where the switch to the next level was instantaneous. At the beginning of each level, they were asked to relax for 30 s by resting their right hand on their lap and to start the task only when they heard the word ‘Go,’ this time from the headphones. After the training, the participant was asked the self-efficacy question again. They were then presented with a distractor task in the form of a maze to reduce the recency effect (Bjork and Whitten 1974; Carlson et al. 2015; Winther et al. 2020). They were asked to spend about a minute first visualizing the solution and then picking up the maze with their right hand to solve it. They were finally given the test task and again asked to perform it as quickly as possible with the least number of mistakes possible. Following this, the participant was asked to remove the sensors and to fill out an online questionnaire containing the NASA-TLX questionnaire and questions on enjoyment, presence, and immersion. Whenever the participant started performing either the test or training task, the researcher stepped behind a panel to reduce biases in performance due to the Hawthorne effect (Demetriou et al. 2019).

No personal information was recorded, except for that required for compensating the study participants, which was handled according to university data protection policies. The researchers followed COVID-19 safety protocols, including sanitizing the sensors, table, and buzz-wire handles after every participant completed the experiment.

4 Results

The statistical analysis was performed using the statistical methods available in SciPy (Scientific Python) and Pingouin packages (Vallat 2018; Virtanen et al. 2020), and plots were generated using the Seaborn and Matplotlib Python packages (Hunter 2007; Waskom 2021). 87 participants were part of the study, divided between the physical (N = 42) and VR training (N = 45) conditions. 48 participants identified themselves as male, 37 as female, and 2 as other. 46 participants indicated their age group in the 18–24 range, and 36 indicated theirs in the range 25–34. The majority of participants in the VR condition (69%, N = 31) indicated that they had tried a VR head-mounted display 1–5 times, 1 reported trying IVR 5–10 times, and 6 reported trying IVR more than 10 times, whereas 7 had never tried VR before. Data from eight participants had to be excluded from the analysis of performance metrics because of data loss arising from VR headset tracking errors, and the biosignals from 15 participants had to be excluded from analysis due to sensor errors. Shapiro–Wilk tests for normality were applied to all the variables, and if a variable was found to violate assumptions of normality, non-parametric statistical tests were used: the Wilcoxon signed-rank test (Wilcoxon 1945) for paired samples and the Mann–Whitney U test (Mann and Whitney 1947) for independent samples, with the related W and U statistics reported. When both variables used for a comparison followed normal distributions, Student’s t-test or Welch’s t-test (for unequal variances) was used to compare the independent samples, and the related t-statistic and Cohen’s d are reported. A significance level of 0.05 was selected while interpreting the results of the statistical tests. The datasets analyzed during the current study are available from the corresponding author on reasonable request.
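For readers unfamiliar with the parametric statistics reported below, the following sketch shows how Welch’s t statistic, its Welch–Satterthwaite degrees of freedom (which are generally non-integer), and Cohen’s d can be computed. It uses only the Python standard library rather than the SciPy and Pingouin routines used in the study; the function names and sample values are hypothetical, and the pooled-standard-deviation form of Cohen’s d is an assumption.

```python
import math
import statistics


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    se2 = vx / nx + vy / ny  # squared standard error of the mean difference
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


def cohens_d(x, y):
    """Cohen's d using the pooled standard deviation (an assumed convention)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)


# Hypothetical samples (illustrative values only).
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(group_a, group_b)
effect = cohens_d(group_a, group_b)
```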

4.1 Improvement in performance

Figure 6 depicts the three performance metrics for the VR and physical conditions: contact time (Fig. 6a), task completion time (Fig. 6b), and improvement score (Fig. 6c) (see Sect. 3.2.2 for definitions). The task completion time and contact time metrics were analyzed to see if there were changes from the pre-training task to the post-training task. Analyses were also performed to see if there were statistically significant differences between the improvement scores of the two conditions.

Fig. 6 Change in performance metrics within VR (N = 45) and physical conditions (N = 42) for a contact time, b task completion time, and c between the conditions for improvement score

4.1.1 Within-condition changes

In terms of contact time (CT), a statistically significant decrease of 1.21 s from pre- to post-training (p < 0.001, W = 126.0) was observed among participants in the VR condition (N = 40). For the same group, a decrease of 1.33 s that approached statistical significance was observed in the task completion time (TCT) from pre-training to post-training phases (p = 0.062, W = 352.0). In the physical condition (N = 39), there was a statistically significant decrease of 1.07 s in CT from pre- to post-training phases (p < 0.001, W = 114.0). On the other hand, though a slight deterioration of TCT may be observed in Fig. 6b for the physical condition from pre-training to post-training phases, this was not statistically significant (0.83 s, p = 0.412, W = 387.0).

4.1.2 Between conditions

To compare performance in VR (N = 40) and physical (N = 39) conditions, improvements in task completion time (TCT-I), contact time (CT-I), and improvement scores (IS) were calculated (see Sect. 3.2.2). Since the metrics from both these conditions were non-normally distributed, Mann–Whitney U independent samples tests were performed. The results showed no statistically significant differences between the improvement scores in the two conditions (p = 0.353, t(77) = −0.38, d = 0.085). Regarding improvement in task completion time (TCT-I), though it can be seen from Fig. 6b that the task completion time for participants in the VR condition shows a visible improvement (i.e., decreases), the difference between conditions was not statistically significant (p = 0.2864, U = 722). CT-I showed a similar trend, with no statistically significant difference between participants in the VR condition and those in the physical condition (p = 0.4746, U = 773).

4.2 Improvement in self-efficacy

As indicated in Fig. 7, in the VR condition (N = 45), there is a statistically significant increase in the reported self-efficacy from the pre-training phase (3.8) to the post-training phase (4.24; p = 0.016, W = 120.5). Though a slight increase in reported self-efficacy in the physical condition (N = 42) from the pre-training phase (4.38) to the post-training phase (4.48) can be observed in Fig. 7, this difference was found not to be statistically significant (p = 0.545, W = 191.5). It was also observed that the change in self-efficacy in the VR condition (0.44) was greater than the change in self-efficacy in the physical condition (0.095). This difference approaches statistical significance (p = 0.0585, U = 767.5).

Fig. 7 Left: Self-efficacy levels from pre-training to post-training phases (on a scale of 1–7). Right: Change in self-efficacy levels across VR (N = 45) and physical conditions (N = 42)

4.3 Task load

Figure 8 shows the item-wise scores for NASA-TLX between the VR (N = 45) and physical (N = 42) conditions. Participants reported their perceived task load on six dimensions, i.e., mental, physical, and temporal demand, along with frustration, effort, and performance (Hart and Staveland 1988). Across these six dimensions, VR and physical training resulted in similar task load values except for temporal demand, where participants in the physical condition reported a mean score of 11.71 ± 2.87 (on a scale from 1 to 21), which is significantly higher than what participants in the VR condition reported (9.16 ± 2.49; p = 0.012, U = 738.5). There was no statistically significant difference in the combined NASA-TLX score between the physical (11.62 ± 2.87) and VR conditions (11.53 ± 2.49; p = 0.436, t(81.5) = 0.161, d = 0.03).

Fig. 8 NASA-TLX scores across VR (N = 45) and physical conditions (N = 42). ** denotes significant difference at α = 0.05

4.4 Immersion, presence and enjoyment

Figure 9 shows the immersion, presence, and enjoyment scores between the VR (N = 45) and physical (N = 42) conditions. Cronbach’s alpha coefficients were calculated for both questionnaires and found to be 0.88 for Immersion and 0.69 for Presence, indicating an acceptable internal consistency of the scales. An analysis of the Immersion Score (which is the mean of all items on the Immersion questionnaire) shows that participants in the VR condition report higher immersion on average (4.94 ± 0.99) as compared to participants in the physical condition (4.54 ± 0.98) and that this difference is statistically significant (p = 0.031, t(84.47) = −1.88, d = 0.404), with statistical significance also being observed for items I2, I4, and I9. Analysis of the combined Presence Score shows participants reporting a higher score on average for VR (4.61 ± 0.93) compared to physical (4.4 ± 0.79). This difference approaches statistical significance (p = 0.0736, U = 774), with statistical significance also being observed for items P5, P6, P10, and P14. See Tables 5, 6 in the Appendix for item-wise statistics for both Immersion and Presence questionnaires. Finally, participants report higher enjoyment for the VR condition (6.02 ± 1.23) as compared to the physical condition (5.52 ± 1.15; p = 0.0175, U = 696.5).
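Cronbach’s alpha, used above to assess internal consistency, compares the sum of the individual item variances with the variance of the participants’ total scores. The sketch below implements the standard formula with the Python standard library; the function name and the response matrix are hypothetical and do not reproduce the study data.

```python
import statistics


def cronbach_alpha(item_scores):
    """Cronbach's alpha for rows of item ratings (one row per participant).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # one tuple of ratings per questionnaire item
    item_var_sum = sum(statistics.variance(col) for col in columns)
    total_var = statistics.variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical ratings: four participants answering three items identically,
# which gives perfectly consistent items.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
alpha = cronbach_alpha(consistent)
```

Perfectly consistent items yield an alpha of 1; less consistent items give lower values.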

Fig. 9 Immersion, presence, and enjoyment scores across VR (N = 45) and physical conditions (N = 42). ** denotes significant difference at α = 0.05

4.5 Physiological arousal

4.5.1 Arousal levels between conditions

Table 3 lists all the physiological arousal metrics recorded during the training session. Only data points recorded between the start and finish points for each training level have been considered and then averaged to generate arousal metrics that represent the whole training phase. The metrics listed have been adjusted to each participant’s baseline where appropriate.

Table 3 Physiological arousal metrics across physical (N = 39) and VR training conditions (N = 39 for HRV, N = 33 for EDA)

Among EDA metrics, in the VR condition (N = 33), the mean SCRPeaks of 9.9 was found to be significantly lower than the SCRPeaks of 12.04 in the physical condition (N = 39), denoting higher arousal among participants in the physical condition (p = 0.0032, U = 885). Among HRV measures, mean baseline-corrected HR was lower in VR (−1.3) than physical (0.05), and the difference approaches statistical significance (p = 0.066, t(73.7) = 1.52, d = 0.34). Showing a similar trend, the mean baseline-corrected IBI was found to be higher in VR (16.86) than physical (1.33), but the difference is not statistically significant (p = 0.087, t(75.6) = −1.37, d = 0.31).

Comparisons between other EDA and HRV metrics showed no statistically significant differences, though they mostly align with the findings in the SCRPeaks and IBI metrics, with higher arousal in physical than in VR. Among EDA measures, the mean SC across all training levels in the VR condition (N = 33) is 2.31, which is lower than the mean SC from the physical condition (N = 39), 2.63. However, this difference is not statistically significant (p = 0.1221, U = 747). Mean SCRAmp for VR (0.19) is lower than physical (0.21), but the difference is not statistically significant (p = 0.445, U = 676). Among time-domain HRV measures, the mean baseline-corrected RMSSD for VR (−22.22) is lower than physical (−8.6) with no statistically significant difference (p = 0.614, U = 732), and the mean baseline-corrected SDNN in VR (−18.34) is lower than physical (−9.0) with no statistically significant difference (p = 0.3821, U = 791). Among frequency domain HRV metrics, mean baseline-corrected HFN in VR (5.83) is lower than physical (9.72), where the difference is not statistically significant (p = 0.195, t(71.8) = 0.865, d = 0.196), and mean baseline-corrected LF/HF ratio in VR (−1.02) is greater than physical with no statistically significant difference (p = 0.3086, U = 710).

4.6 Arousal level and performance

To assess the link between arousal levels and performance, data from both IVR and physical groups were combined, and Spearman rank correlation tests (for non-normal data) were performed between the physiological arousal metrics and performance improvement metrics. The tests showed almost no correlation between arousal and improvement in performance, with most ρ values between −0.1 and 0.1. Statistically significant but weak correlations were found between TCT-I and SCRAmp (ρ = −0.24, p = 0.041) and between TCT-I and SC (ρ = −0.24, p = 0.0434); correlations approaching statistical significance were found between TCT-I and RMSSD (ρ = −0.21, p = 0.068) and between IS and SCRAmp (ρ = −0.19, p = 0.098).
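Spearman’s ρ is the Pearson correlation computed on the ranks of the two variables, with tied values sharing their average rank. The following standard-library sketch illustrates the computation; it is not the routine used in the study, and the function names and example values are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: a monotonic but non-linear relationship.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 4, 8, 16, 32])
```

Because only ranks enter the calculation, any strictly increasing relationship yields ρ = 1 regardless of its shape.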

As part of a post hoc analysis to explore the relationship between arousal levels and performance, we defined two groups of participants, high and low improvement, in terms of their improvement score (IS), as denoted in Fig. 10. Those participants whose IS was greater than the upper bound of the IQR (inter-quartile range), i.e., the top 25%, were defined to be in the high improvement group (N = 14). Similarly, those participants whose IS was less than the lower bound of the IQR (the bottom 25%) were defined to be in the low improvement group (N = 19). Table 7 shows the results of Mann–Whitney U tests to compare the physiological arousal metrics between these two groups. Among statistically significant differences, the mean SCRAmp of the low improvement group (0.25) was greater than that of the high improvement group (0.12) (p = 0.0298, U = 63), and the mean SC of the low improvement group (3.49) was greater than that of the high improvement group (1.59) (p = 0.0252, U = 55).
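The grouping described above can be sketched as a split at the first and third quartiles of the improvement score. The quartile method is not stated in the text, so the default of Python’s `statistics.quantiles` is used here as an assumption; the function name, participant identifiers, and scores are hypothetical.

```python
import statistics


def split_by_improvement(scores):
    """Return (high, low) participant ids: IS above Q3 and IS below Q1."""
    values = list(scores.values())
    q1, _median, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    high = sorted(pid for pid, s in scores.items() if s > q3)
    low = sorted(pid for pid, s in scores.items() if s < q1)
    return high, low


# Hypothetical improvement scores, one participant per value from 2 to 20.
scores = {f"p{value:02d}": value for value in range(2, 21)}
high, low = split_by_improvement(scores)
```

With tied scores at the quartile boundaries, the strict inequalities can produce groups of unequal size, which may be one reason the two groups in the study differ in N.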

Fig. 10 The participants (from both conditions) were divided into high and low improvement groups. The high improvement group lies above the 75th percentile of the improvement score. Similarly, the low improvement group lies below the 25th percentile. Participants who showed the highest improvement had lower arousal than those who had the lowest improvement

5 Discussion

This section discusses the results and is structured around each of the four research questions formulated in the related works section.

5.1 Is IVR training as effective as physical training in improving task performance?

Both IVR and physical training result in statistically significant improvements in contact time (CT) from pre-training to post-training phases. This shows that participants from both training conditions made fewer mistakes while performing the task. Participants in the IVR group showed improvements in task completion time which neared statistical significance. However, for participants who underwent physical training, the task completion time did not show a statistically significant change. Overall, the results suggest that training in fine motor skills results in quantifiable performance improvements for participants in both IVR and physical training. This is expected and consistent with the literature on IVR-based skill training (Radhakrishnan et al. 2021b).

To compare the effectiveness of the two training modalities, three metrics to quantify improvement were defined: improvements in task completion time (TCT-I), improvements in contact time (CT-I), and an Improvement Score (IS) which combines the first two metrics. Statistical tests comparing these three metrics between IVR and physical conditions showed no statistically significant differences. Thus, the results indicate that IVR training is as effective as physical training for training in the buzz-wire task, supporting similar findings in other IVR skill training literature (Murcia-Lopez and Steed 2018; Schwarz et al. 2020). One can argue that the novelty effect of IVR might have played a role in its effectiveness, as 31 participants in the IVR condition had tried VR only 1–5 times before the study, and 7 had never tried VR before. In a review, Merchant et al. (2014) found a link between the novelty effect of desktop VR-based high school education and learning outcomes and that the latter may even decrease as the number of VR sessions increases. Thus, novelty in VR use can play a role; however, as the current study used a short familiarization task prior to the actual experimental task, this effect is likely to account for only a small part of the observed effectiveness.

The current finding that IVR training is as good as physical training should also be considered in terms of the potential for further enhancement of this training modality. The literature suggests different methods to do this: the inclusion of haptic feedback (Frederiksen et al. 2020; Winther et al. 2020) and the inclusion of body representation and movements (other than the head and controllers) (Jensen and Konradsen 2018). Inspirations for improving IVR training might also be taken from motor skill training literature which suggests techniques like decreasing the frequency of feedback as the skill level of the participant increases during training (Hebert and Coker 2021), allowing participants to choose whether they want to receive feedback or not (Chiviacowsky and Wulf 2005), or the IVR simulation adapting aspects of the training to the individual in real-time using physiological arousal levels and/or performance metrics (Zahabi and Abdul Razak 2020).

5.2 Is there a significant difference in the enjoyment, presence, immersion, task load, and changes in self-efficacy reported by participants in IVR compared to physical training?

Participants in the IVR condition reported on average significantly higher enjoyment than participants in the physical condition. This finding is consistent with IVR literature (Makransky et al. 2019). One parameter that is typically associated with frustration and lack of enjoyment during a VR experience is cybersickness. Herein, there were no incidents of cybersickness reported by the participants, probably due to the seated arrangement. Participants in the IVR condition reported on average more immersion (with statistical significance) than those in the physical condition. Similar trends exist for the presence measure, with participants in IVR training reporting more presence than those in physical training, where the difference was found to approach statistical significance. Though the IVR condition showed higher presence and immersion scores compared to the physical condition, it should be kept in mind that results from such metrics gain more importance when all subjects experience the same environment (Usoh et al. 2000). Nevertheless, the results are encouraging and as expected, as participants did not feel less immersed or present in the IVR environment as compared to the physical.

The NASA-TLX results show that in all parameters except temporal demand, IVR training induces roughly the same workload on participants as physical training. This was expected, as all kinds of visual noise and other confounding variables were tightly controlled across both conditions. However, VR, if not designed properly, may cause more cognitive load due to the possible complexity and novelty of the VR interactions involved. The one task load parameter where IVR training shows a statistically significant advantage over physical training is temporal demand. However, one cannot draw clear conclusions from this finding and further research is needed, for example, to compare the total training time across both conditions (which was not part of the research questions) along with the perceived temporal demand. This opens up interesting possibilities, due to the presence of a ‘time compression’ effect in IVR as observed by Mullen and Davidenko (2021), where subjects experienced time to speed up while using VR compared to those in the control condition.

Participants in both physical and IVR training conditions reported an increase in self-efficacy, though a statistically significant increase was found only for the IVR group. Increases in self-efficacy levels have been found to correlate positively with learning outcomes (Makransky et al. 2019; Shu et al. 2019) and motor skill performance (Bandura 1986). However, VR and physical training in the current study did not differ in their levels of improvement in performance. It is possible that the novelty effects of VR caused participants in the VR condition to start initially with a lower self-efficacy in spite of the VR familiarization, but they ended up with self-efficacy levels similar to the physical condition by the end of the training. Further research is required to understand the links between self-efficacy and familiarity with the IVR medium. Additionally, participants in the VR condition were observed to have lower physiological arousal along with their increased self-efficacy. According to Bandura (1986)’s model of self-efficacy, there is a possible interaction between self-efficacy and arousal which merits further research in the context of IVR skill training.

5.3 Is there a significant difference between the physiological arousal levels of participants in IVR training compared to physical training?

Analysis of EDA and HRV metrics from the physiological arousal data revealed that IVR training caused less arousal than physical training, with a significant difference found for the SCRPeaks (EDA) metric and a near significant difference found for the HRV metrics Heart Rate and Inter-Beat Intervals. However, the frequency domain HRV measures, i.e., HFN, LF/HF ratio, and the time domain HRV measures SDNN and RMSSD showed no statistically significant difference.

Though there is no literature on the comparison of arousal between IVR and non-IVR conditions for skill training, some indicative literature from other domains exists. Tian et al. (2021) found more physiological arousal (EDA, EEG measures) in participants being emotionally stimulated through videos in the IVR condition as compared to those in the 2D condition. Egan et al. (2016) in a comparative quality of experience study found greater HR in the IVR condition compared to the non-IVR 2D condition, while they found that EDA showed the opposite trend to our finding. We discuss possible causes for these seemingly contradictory trends toward the end of this section.

5.4 Is there a link between physiological arousal during training and improvements in performance after training?

A post hoc analysis was performed to compare the physiological data from participants with the highest improvement to those with the lowest improvement. This revealed greater arousal in two EDA measures (mean amplitude of skin conductance responses and mean skin conductance) for those participants who improved the least as compared to those who improved the most. This result is in alignment with findings from the literature; for example, in surgical simulation training (non-IVR), it has been found that lower performance is correlated with increased stress (higher arousal) levels (Prabhu et al. 2010; Quick et al. 2017). When correlation analysis was performed to compare the different arousal metrics with performance metrics for the whole study sample, we found statistically significant but weak correlations between improvement in task completion time (TCT-I) and two EDA metrics: mean amplitude of skin conductance responses during training (SCRAmp) and mean skin conductance (SC). Further research should investigate the link between performance and arousal for participants across all levels of performance improvement.
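The two analyses described above can be sketched as follows. This is an illustrative outline only: the choice of the Pearson coefficient, the group size `k`, and all function names are assumptions made for the example and are not taken from the study's analysis.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def extreme_groups(records, k):
    """Split (improvement, arousal) pairs into the k lowest and k highest improvers.

    k is a hypothetical parameter; the study's grouping rule is not assumed here.
    """
    if k < 1 or 2 * k > len(records):
        raise ValueError("k must allow two non-overlapping groups")
    ordered = sorted(records, key=lambda r: r[0])
    return ordered[:k], ordered[-k:]


def mean_arousal(group):
    """Mean of the arousal values in a group of (improvement, arousal) pairs."""
    return sum(arousal for _, arousal in group) / len(group)
```

Comparing `mean_arousal` for the two groups returned by `extreme_groups` mirrors the post hoc contrast, while `pearson_r` applied to the full sample mirrors the whole-sample correlation.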

For the last two research questions (links between arousal and training condition, and between arousal and performance improvements), we found significant differences only in EDA metrics but not in HRV. This might be because EDA is purely a measure of sympathetic activity, as skin conductance levels are not counteracted by the parasympathetic nervous system. On the other hand, heart rate activity is controlled by both the sympathetic system (which causes heart activity to increase) and the parasympathetic system (which causes heart rate activity to decrease back to the baseline) (Cacioppo et al. 2007). Some literature finds EDA measures to be superior in terms of measuring changes in arousal (Dawson et al. 2016), even above HRV (Healey and Picard 2005).

6 Limitations

Motor skill learning literature indicates the possibility that short-term performance might misrepresent learning (Magill and Anderson 2016). Although a distractor task (see Sect. 3.4) was used in the current study to compensate for the short-term nature of the retention test, it may be necessary to perform the retention tests after longer intervals to give a more precise understanding of the relationship between training conditions and retention. IVR skill training literature points to many comparative studies where retention tests after long intervals show better or the same retention in performance for the IVR condition as compared to non-immersive VR and physical conditions (Butt et al. 2018; Buttussi and Chittaro 2018; Sakowitz et al. 2019). An illustrative example is in the burr-puzzle solving task by Carlson et al. (2015), where participants in a physical training condition initially outperformed those in IVR in terms of knowledge retention, but after two weeks, this effect was reversed. These examples suggest that such results may be expected in contexts similar to the current study; however, further research is still required.

The study was also purposefully limited in terms of the ‘training’ provided. Here, participants were not given instructions during or after the training (knowledge of results) but received only automated feedback during the training when mistakes were made (knowledge of performance). Further research may build upon the design of the experiment and incorporate different training strategies or instructions. Also, the study is limited to people who self-reported as right-handed, to better control the setup and minimize variations, but future research might consider designing buzz-wire arrangements that are compatible with left-handed participants.

Regarding the arousal metrics, comparisons using HRV metrics in the current study showed a lack of significant results. This could potentially be explained if it is assumed that the main cause for arousal in the current task was contact feedback (audio-visual-haptic). Since the time spent by a participant in contact with the wire (i.e., committing mistakes) will only be a proportion of the total duration of the training, any short-term increases in HRV metrics (which are accompanied by a rapid return to normal) may get averaged out by variations in HRV metrics during the rest of the training, where the participant does not make any mistakes. Another potential confounder, which could cause variation in HRV, is the physical aspect of the activity, where the participant has the freedom to choose any configuration of hand-arm-shoulder movement to complete the task with their right hand. Controlling this was beyond the scope of the current setup. Future studies may require a more fine-grained analysis of the relation between different stimuli (feedback during mistakes, difficulty in navigating certain parts of the wires) and physiological signals. Inspirations from the literature include Liebold et al. (2017), where a post-stimuli window of 10 s was used for heart rate metrics, and Boucsein (2012), which recommends a 1–5 s post-stimuli window to detect event-related skin conductance responses (ER-SCRs).
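A fine-grained, event-related analysis of the kind suggested above could start from a windowing step such as the one sketched below. This is a hypothetical illustration, not a method used in the study: it assumes time-stamped samples sorted in ascending order, takes the signal value at the last sample at or before the event as the reference level, and reports the largest rise within the post-stimulus window. The default 1–5 s window follows the recommendation cited from Boucsein (2012); all names are assumptions.

```python
from bisect import bisect_left, bisect_right


def post_event_window(times, values, event_time, start_s, end_s):
    """Return samples recorded between start_s and end_s seconds after the event.

    times must be sorted in ascending order and aligned with values.
    """
    lo = bisect_left(times, event_time + start_s)
    hi = bisect_right(times, event_time + end_s)
    return values[lo:hi]


def event_related_rise(times, values, event_time, start_s=1.0, end_s=5.0):
    """Largest increase within the window relative to the level at event onset.

    Returns None when there is no sample at or before the event, or when the
    window contains no samples.
    """
    onset_idx = bisect_right(times, event_time) - 1
    if onset_idx < 0:
        return None
    window = post_event_window(times, values, event_time, start_s, end_s)
    if not window:
        return None
    return max(window) - values[onset_idx]
```

For heart rate metrics, the same helper could be called with `start_s=0.0` and `end_s=10.0` to approximate the 10 s window reported by Liebold et al. (2017).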

Regarding the choice of sensors used, the study is limited to only two measures of physiological signals (EDA and HRV). There is a multitude of physiological sensors that can be used to detect physiological arousal, such as electroencephalogram (EEG), skin temperature, and eye-tracking. Additional sensors were not used as they might have made the experimental procedure more complex and affected the behavior of the participants. However, additional sources of biosignals merit further exploration in IVR training research as there are indications that some signals may make others redundant; for example, pupil dilation (from eye-tracking sensors) has been found to be correlated with both EDA and HRV (Wang et al. 2018). It is known that melatonin (which is correlated with the time of day) and temperature affect HRV and EDA metrics (Boucsein 2012; Schachinger et al. 2008), but these factors were not controlled for in the experiment. On the other hand, these effects may have been reduced by the baseline correction applied to the various arousal metrics. Though arousal in this study is averaged across all the training sessions, the long recovery periods lasting several minutes for HRV signals to return to baseline levels (Moses et al. 2007) might potentially result in arousal from one level of training affecting the next. However, this issue may not affect EDA metrics, as a half recovery period of 2 to 10 s is found in the literature (Dawson et al. 2016), which is well within the 30 s rest interval between levels. The current study did not control for color blindness, and the self-reported normal vision of the participants was not medically certified, both of which might have caused differences in performance between the conditions.
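As an illustration of the baseline correction mentioned above, the sketch below assumes a simple subtractive scheme in which the mean of a metric during the baseline phase is subtracted from its mean during training. The exact correction applied in the study is not restated here, so both the scheme and the function name should be read as assumptions.

```python
from statistics import mean


def baseline_corrected(training_values, baseline_values):
    """Mean training-phase value minus mean baseline-phase value.

    Assumes a subtractive correction; other schemes (e.g. ratio-based) exist.
    """
    return mean(training_values) - mean(baseline_values)
```

Under this scheme, a participant who starts the session with elevated skin conductance (for instance because of room temperature) is compared against their own resting level rather than against other participants' absolute values.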

A related factor affecting our study is the inherent difference between the haptic feedback available in the IVR and physical conditions. Though the vibration aspect is identical in both conditions, in the physical condition there is the added feel of the physical wire, though the vibration masks this feeling to a certain degree. We propose further experimentation in the IVR modality alone, with conditions being varied for various haptic feedback modalities like portable, grounded, and wearable as observed by Radhakrishnan et al. (2021b) in their analysis of the use of haptics in industrial skills training. The investigation of possible links between haptic feedback modality, physiological arousal, and improvements in performance holds promise for improving the state of the art in IVR-based skills training.

7 Implications for researchers

Taking the findings and lessons learned from this study as a point of departure, one may consider:

  • IVR and other training modalities must be designed to minimize distractions. This study tries to achieve this by using black panels covering the peripheral view of the participant and using headphones which, in addition to providing audio feedback, also minimize external noise. In their review of motor skill learning literature, Wulf et al. (2010) found that performance is increased when there is an ‘external focus’ directed at the effect of the movement itself instead of an ‘internal focus’ directed at the trainee’s body movements. Therefore, it is recommended that such complexities be minimized unless there are reliable methods of representing hands, arms, and other relevant parts of the body realistically. The coherence principle from the Cognitive Theory of Multimedia Learning further supports this by stating that removing stimuli irrelevant to the training context can improve learning outcomes (Parong and Mayer 2021).

  • VR hardware: The use of the Oculus Quest often requires minor calibrations related to the setting of tracking boundaries. This may be avoided by making sure the study environment is consistent between sessions or by using external trackers.

  • Polar H10: This cost-effective yet highly accurate and reliable ECG heart monitor is a useful tool for measuring arousal levels (Polar Electro Oy, Kempele, Finland). Researchers should, however, take into consideration the time taken to set up the device and the need for the study setup to give participants privacy and instructions for wearing the device properly.

  • Shimmer GSR+: This is a cost-effective and reliable device for measuring electro-dermal activity (EDA) (Shimmer Research Ltd., Dublin, Ireland). However, measuring high-quality EDA signals from the fingers restricts the training task from involving bimanual skills (use of both hands). Alternative but less accurate/convenient locations on the body can be considered if a training task demands the use of both hands (van Dooren and Janssen 2012).

  • Buzz-wire task: This task allows for one-hand use, making it convenient for studies using EDA. The training task itself provides immediate feedback and allows for variations, for example, different types of audio, visual, or haptic feedback.

8 Conclusion

The study suggests that for the fine motor skill training presented, IVR training is as effective as physical training in improving task performance. Participants in the IVR condition reported an improvement in self-efficacy and significantly more enjoyment and immersion than those in the physical condition. Also, participants in the IVR condition on average displayed lower arousal than those in the physical condition. Though clear indications on the relationship between arousal and improvements in performance could not be found, EDA metrics hold potential for further investigation of this question, as they showed differences in arousal between high and low improvement groups. It is our understanding that such findings add to the IVR training field and can potentially pave the way to user-adaptive training systems (Zahabi and Abdul Razak 2020).

Future work could incorporate subjective measures of arousal (like the Self-Assessment Manikin) into the immersive VR training as an additional layer to confirm findings from the physiological arousal signals. Additional measures like EEG could be employed to investigate the effect of the different types of stimuli on different brain regions, resultant cognitive load, and their relationship with arousal and performance (Hofmann et al. 2021; Tian et al. 2021). However, this should be implemented in a manner that does not break immersion/presence. It should also be noted that the current study does not explore the origins of the physiological arousal observed during the study but only its effects on performance improvement. It is reasonable to assume that the arousal observed may have been primarily caused by the direct feedback provided (visual, audio, and haptic), but other factors may also play a role. The study tries to control such extraneous factors through features of the study design, such as an initial baseline phase for the users to relax and rest periods between training levels. The present study does not go into a fine-grained analysis of the relationship between arousal and stimuli like feedback from mistakes or challenging parts like bends in the wire, but rather looks at arousal across the whole training phase. There could be merit in understanding the short-term changes in arousal for various kinds of stimuli, for example, haptic feedback, which is increasingly becoming a major focus of IVR research as it affects task performance and presence (Kreimeier et al. 2019) and is crucial for many fine motor skill training tasks in VR, such as surgery (Rangarajan et al. 2020).
This study also considers averaged performance metrics across the entire training session to answer the primary research questions, but future work might consider variations during the motor skill training, particularly in understanding different control strategies and stages of learning (Sternad 2018). Future studies may also try to incorporate a cross-over study methodology in order to control for differences between groups, by exposing the same group of participants to counterbalanced exposures to VR and physical training with appropriate time intervals in between to reduce cross-over effects, similar to Yin et al. (2019).
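A counterbalanced cross-over assignment of the kind proposed above could be generated along the following lines. This is a minimal sketch under stated assumptions: two conditions only, a seeded shuffle for reproducibility, and alternating order assignment; the function name, the condition labels, and the seed are illustrative and do not describe the procedure of Yin et al. (2019).

```python
import random


def counterbalance(participant_ids, seed=0):
    """Assign each participant an order of conditions, balanced across the sample.

    Participants are shuffled with a fixed seed and then alternately assigned
    to IVR-first or physical-first, so the two orders differ in size by at most one.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    orders = (("IVR", "physical"), ("physical", "IVR"))
    return {pid: orders[i % 2] for i, pid in enumerate(ids)}
```

The wash-out interval between the two exposures is a design decision that this sketch does not address.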