Abstract
We investigate whether the region of hand gesture interaction with AR glasses can be expanded using a regular camera. We measure the accuracy of gesture recognition in images obtained by two conventional cameras: a wide-angle and an ultra-wide-angle camera. For both of them, gestures typically used in AR scenarios, such as pinch, palm, pointer, and grab, are recognized with >90% accuracy. The accuracy improves on the periphery of the ultra-wide camera field of view. A usability study confirms that performing gestures in the periphery of the ultra-wide camera view is as convenient as in its center. Two gestures, zoom and swipe, are distinctly more convenient when made in the peripheral zone. Our findings pave the way for expanding the region of hand gesture interaction with AR glasses using purely computer vision-based techniques.
1 Introduction
Augmented reality (AR) applications are ubiquitous in various fields of human activity, from gaming and entertainment to professional training and industrial work support. AR overlays computer-generated content onto the view of the real world shown to the user. Head-mounted wearables such as AR glasses (smart glasses) are especially practical for technical and industrial workers, as they can provide user assistance while leaving both hands free for real-world tasks [1]. Since conventional touch/keyboard interaction is unavailable in AR glasses, vendors have adopted other input methods. Popular wearables such as Microsoft HoloLens 2 and Magic Leap One provide hand gesture (HG) interaction using depth cameras and motion sensors [2, 3]. They employ a variety of gestures, including static gestures, also known as hand poses, and dynamic gestures. The region of space where these wearables are sensitive to gestures, the so-called gesture frame, is limited. The gesture frame of Microsoft HoloLens 2 spans from nose to waist and between the user's shoulders [4], restricting interaction on the sides. The user's hands may also obscure the content shown in the AR display; expanding the gesture frame therefore broadens the range of applications for AR glasses.
We investigate whether the gesture frame can be expanded using just a regular camera with a wide field of view (FOV). Figure 1 illustrates the difference between the FOVs of two conventional cameras used in mobile devices, the ultra-wide-angle (UWA) camera (left) and the wide-angle (WA) camera. The FOV of the latter, depicted with a red rectangle in the UWA camera picture, is slightly wider than the FOV of modern AR glasses. Notably, HGs captured in the broad areas on the sides of the UWA camera image would not obscure the content in the AR display.
Previous studies have found that information from AR glasses cameras can be used for hazard warning [5] and navigation [6]. The feasibility of camera-based gesture control on a mobile device was demonstrated already five years ago [7]. Since then, the versatility and accuracy of computer vision-based HG recognition algorithms have substantially improved. Modern HG recognition methods achieve precision of 90% and better in recognizing hand gestures used for human-computer interaction [8]. In the first part of our study we measure the recognition accuracy for gestures typically used in AR scenarios for navigation and manipulation. We compare the accuracy measured in two zones of the WA and UWA camera FOVs, central and peripheral. Finally, we study whether HGs made in the peripheral zones of the UWA camera are convenient for users.
2 Methodology
Typical AR tasks consist in the manipulation of virtual objects (grabbing, zooming, rotating, etc.) and interaction with the user interface (UI) (pointing, selecting, swiping, etc.) [1, 9, 10]. We investigate whether these gestures can be comfortably performed and accurately recognized in the center of the UWA and WA camera FOVs, as well as on their periphery. Our study is therefore twofold: to estimate the precision of HG recognition and to find out whether gestures performed on the periphery of the UWA camera FOV are convenient for users. In our experiments, the WA and UWA cameras are represented by the Wide-angle Camera and the Ultra Wide Camera of the Samsung Galaxy S10 5G smartphone, respectively. Their properties are listed in Table 1. As the FOV of the UWA camera is almost twice as wide, the user's hand appears about half the size in the UWA camera image as compared to the WA camera image (Fig. 1).
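Under a simple pinhole-camera approximation (an assumption of ours, not a calibration of the cameras in Table 1), the apparent hand width in pixels falls off with the tangent of half the horizontal FOV:

\[
w_{\text{px}} \approx \frac{W_{\text{hand}}}{d}\, f_{\text{px}},
\qquad
f_{\text{px}} = \frac{W_{\text{img}}}{2\tan\!\left(\mathrm{HFOV}/2\right)},
\]

where \(W_{\text{hand}}\) is the physical hand width, \(d\) the hand-to-camera distance, \(W_{\text{img}}\) the image width in pixels, and \(f_{\text{px}}\) the focal length in pixels. For the same \(d\) and \(W_{\text{img}}\), the larger HFOV of the UWA camera roughly halves \(f_{\text{px}}\), which is consistent with the hand appearing about half the size in the UWA image.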
2.1 Gesture Recognition Solution
In the current study we use our in-house HG recognition solution, which detects, tracks, and recovers the skeleton and segmentation of two hands in real time on off-the-shelf smartphones. Detection is performed by a deep convolutional neural network (CNN) resembling the single-shot detector architecture [11]. The locations of 21 hand joints in the 2D frame, predicted using a modified convolutional pose machines algorithm [12], are used to reconstruct the 3D hand skeleton as proposed in [13]. A dedicated CNN with upsampling layers derives a hand segmentation map. Hand pose recognition is performed through a set of empirically derived rules applied to the relative positions and orientations of the hand joints. Such an implementation potentially allows any custom hand pose or dynamic gesture to be defined.
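As an illustration of how such joint-based rules can look, the sketch below classifies a single frame from the 21 predicted joint positions. The joint indices follow the common 21-keypoint hand layout, while the thresholds and rules are hypothetical choices of ours, not the rules used in our actual solution.

```python
import numpy as np

# Common 21-keypoint hand layout assumed here:
# 0 = wrist, 4 = thumb tip, 8 = index tip, 12/16/20 = middle/ring/pinky tips.
WRIST, THUMB_TIP, INDEX_TIP = 0, 4, 8
OTHER_TIPS = (12, 16, 20)

def hand_scale(joints):
    """Rough hand size: distance between index and pinky knuckles (joints 5 and 17)."""
    return np.linalg.norm(joints[5] - joints[17])

def classify_pose(joints):
    """joints: (21, 2) array of 2D joint coordinates for one hand (pixels).
    Returns a pose label using simple relative-distance rules."""
    scale = hand_scale(joints)
    thumb_index = np.linalg.norm(joints[THUMB_TIP] - joints[INDEX_TIP]) / scale
    thumb_other = min(np.linalg.norm(joints[THUMB_TIP] - joints[t]) / scale
                      for t in OTHER_TIPS)

    if thumb_index < 0.3:          # thumb and index fingertips touch
        return "pinch"
    if thumb_other < 0.3:          # thumb touches some other fingertip
        return "grab"

    index_extended = np.linalg.norm(joints[INDEX_TIP] - joints[WRIST]) > 1.5 * scale
    others_folded = all(np.linalg.norm(joints[t] - joints[WRIST]) < 1.1 * scale
                        for t in OTHER_TIPS)
    if index_extended and others_folded:
        return "pointer"
    return "palm"                  # front/back would be resolved from palm orientation
```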
2.2 Gesture Recognition Accuracy Measurements
As we focus on practical usage of HGs, we ask users to perform gestures in front of the camera and measure how accurately each hand pose is recognized in the pictures. Each dynamic gesture is essentially a continuous sequence of hand poses (a sketch illustrating this is given after the list below); therefore, in this experiment we consider the following hand poses, commonly used by AR devices:
1. 'Pinch' – formed by the thumb and the index finger, as employed by HoloLens.
2. 'Grab' – essentially a pinch formed by the thumb and any other finger(s).
3. 'Palm Front' – palm facing the user, resembling the HoloLens Start gesture.
4. 'Palm Back' – palm directed away from the user, resembling the HoloLens 'hand ray'.
5. 'Pointer' – the index finger pointed up and into the scene, corresponding to Magic Leap One's Point gesture.
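The sketch below illustrates the point made above: a dynamic gesture can be assembled from the stream of per-frame pose detections. It recognizes a hypothetical swipe from the horizontal travel of the hand while a palm pose is held; the pose labels, window length, and travel threshold are illustrative assumptions rather than parameters of our solution.

```python
from collections import deque

class SwipeDetector:
    """Detects a swipe as sustained horizontal motion of an open palm
    over a short window of frames. All thresholds are illustrative."""

    def __init__(self, window=10, min_travel=0.25):
        self.window = window          # number of consecutive palm frames required
        self.min_travel = min_travel  # horizontal travel as a fraction of image width
        self.history = deque(maxlen=window)

    def update(self, pose, hand_center_x, image_width):
        """Feed one frame: the recognized pose and the hand center x coordinate (pixels)."""
        if pose not in ("palm_front", "palm_back"):
            self.history.clear()      # the palm was released, restart
            return None
        self.history.append(hand_center_x / image_width)
        if len(self.history) < self.window:
            return None
        travel = self.history[-1] - self.history[0]
        if abs(travel) >= self.min_travel:
            self.history.clear()
            return "swipe_right" if travel > 0 else "swipe_left"
        return None
```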
We asked 15 volunteers (3 females) aged 24–51 years (32 on average) to perform these five gestures in front of the smartphone cameras. The smartphone was mounted on a tripod so that participants saw its screen and were able to comfortably position their hands in the camera FOV. Users showed gestures in the center of the camera FOV and then moved their hands to the side until they reached the edge of the FOV. To exclude a possible dependence of HG recognition accuracy on the distance to the camera, each user repeated the procedure first with the hand closer to the camera and then farther from it. Images were recorded and processed one by one using our HG recognition solution. Examples of images captured by the two cameras are shown in Fig. 1. For generality, gestures were filmed against different backgrounds, such as street views, painted walls, office equipment, etc.
2.3 Usability Study
It is natural to assume that showing gestures directly in front of the camera, in the center of its FOV, is more intuitive and thus more convenient for users. Hands shown in the center of the camera FOV, however, might obscure visual content in the AR display. Certain operations, like page flipping or panning, do not necessarily require seeing the hands and can be triggered by HGs made in the periphery of the camera FOV. We conduct a dedicated usability study to find out which gestures are convenient for users when performed in the peripheral zones of the FOV.
Various factors, including the user's habits, skills, and physique, determine how convenient gestures are when shown in different regions relative to the head-mounted camera. Some users gesticulate with elevated arms, while others are more comfortable keeping them low. Some prefer showing HGs in the center of the gesture frame, while others make broad hand waves. Horizontal expansion of the gesture frame is especially important, as continuously raising or lowering the hands is exhausting in work assistance scenarios. Therefore, we address the dependence of gesture convenience on the horizontal distance to the center of the FOV.
Seventeen volunteers (4 females) aged 23–40 years (30 on average) took part in the usability study. The experiment is set up with the smartphone mounted at the height of the user's forehead, mimicking the location of an AR glasses camera (Fig. 2). The smartphone is installed in front of a whiteboard with a special markup used to control the hand position in the camera FOV. As most interaction with virtual objects in AR happens within arm's reach, the distance to the whiteboard is adjusted to the participant's arm length. This allows free gesticulation in front of the whiteboard and more accurate control of the hand position.
Participants were asked to show 5 hand poses and 5 dynamic gestures. In addition to the five hand poses described above, we include the following dynamic gestures:
1. 'Finger Push' – resembles a button push made with the index finger.
2. 'Palm Push' – a button push made with the whole palm.
3. 'Zoom' – performed with two hands in the 'Pinch' poses.
4. 'Palm Zoom' – the palm moves back and forth to imitate zooming in/out.
5. 'Swipe' – a conventional page swipe made with the whole hand.
In this study the central zone was bounded by the horizontal FOV of the WA camera, while the peripheral zones on both sides were visible only to the UWA camera. Two pairs of lines on the whiteboard marked the FOVs of the WA camera (green) and the UWA camera (blue). The central zone was thus bounded by the green lines, and each peripheral zone lay between the adjacent green and blue lines on either side of the center. Assistants used the whiteboard markup to guide participants, who, by design, could not see the smartphone screen (Fig. 2). Participants were asked to perform the requested gestures in the most convenient way and to evaluate which zone, central or peripheral, was more comfortable for each gesture.
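For such a setup, the positions of the boundary lines on the whiteboard follow directly from the camera FOVs and the camera-to-board distance. The helper below computes them under an undistorted pinhole assumption; the FOV values in the example are placeholders, not the specifications from Table 1.

```python
import math

def zone_boundaries(distance_m, hfov_wa_deg, hfov_uwa_deg):
    """Horizontal offsets (meters) from the FOV center to the lines marking the
    WA FOV edge (green) and the UWA FOV edge (blue) on a board `distance_m`
    in front of the camera. Pinhole model, lens distortion ignored."""
    green = distance_m * math.tan(math.radians(hfov_wa_deg) / 2)
    blue = distance_m * math.tan(math.radians(hfov_uwa_deg) / 2)
    return green, blue

# Example with placeholder FOVs and an arm's-length board distance:
green, blue = zone_boundaries(distance_m=0.7, hfov_wa_deg=77.0, hfov_uwa_deg=123.0)
print(f"green lines at ±{green:.2f} m, blue lines at ±{blue:.2f} m")
# The central zone lies between the green lines; each peripheral zone lies
# between a green line and the adjacent blue line.
```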
3 Results
3.1 Gesture Recognition Accuracy Measurements
The HG recognition accuracy is computed as the ratio of the number of correct predictions of a hand pose to the total number of frames showing this pose. During the experiment, 7000 images of each of the five hand poses were collected by each camera. Recognition accuracy, averaged over hands detected in three zones (center, WA periphery, and UWA periphery), is summarized in Fig. 3. Zone extents are defined with respect to the average hand width \( W \), computed over all images collected from the two cameras. We have found \( W_{\text{WA}} = 200 \) and \( W_{\text{UWA}} = 128 \) pixels for the WA and UWA cameras, respectively. The central zone spans \( 1.5W_{\text{WA}} \) (\( 1.5W_{\text{UWA}} \)) on both sides of the FOV center for the WA (UWA) camera; the WA and UWA peripheries are bands of width \( 1.5W_{\text{WA}} \) and \( 1.5W_{\text{UWA}} \), respectively, adjacent to the edges of the corresponding camera FOV.
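A minimal sketch of how a detection can be binned into zones and how per-zone accuracy is then accumulated is shown below. It follows the \( 1.5W \) convention for a single camera frame; the intermediate region between the central zone and the periphery, as well as the camera-specific values of \( W \), are handled here in a simplified, illustrative way.

```python
def assign_zone(hand_x, frame_width, hand_width_px):
    """Bin a hand detection (center x, in pixels) into a zone of one camera frame,
    using 1.5 * average hand width as the zone half-width."""
    band = 1.5 * hand_width_px
    if abs(hand_x - frame_width / 2) <= band:
        return "center"
    if hand_x <= band or hand_x >= frame_width - band:
        return "periphery"
    return "intermediate"

def per_zone_accuracy(records):
    """records: iterable of (zone, predicted_pose, true_pose) tuples.
    Returns the fraction of correctly predicted frames in each zone."""
    totals, correct = {}, {}
    for zone, predicted, true in records:
        totals[zone] = totals.get(zone, 0) + 1
        correct[zone] = correct.get(zone, 0) + int(predicted == true)
    return {zone: correct[zone] / totals[zone] for zone in totals}
```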
Pinch is correctly predicted in 99.5% of cases in the WA camera images. Recognition accuracy of Pinch in the UWA periphery is equally good but degrades to 95% in the two inner zones. Several hand joints are obscured in the Pinch pose, which makes it harder to reconstruct the hand skeleton. However, as the hand moves towards the UWA periphery, more joints are revealed, and it becomes easier for the HG recognition algorithm to predict the pose. The near-perfect recognition by the WA camera is attributed to a larger hand image, which provides more information to the HG recognition algorithm and makes hand skeleton reconstruction easier even in the FOV center.
Both Palm Front and Palm Back are detected with >98% accuracy in all zones, with one exception. The accuracy of Palm Front recognition in the center of the WA images is slightly worse, 95%. Examination of the collected WA images suggests that Palm Front misdetections happen in pictures where the hand is too large and does not fit in the frame. Apparently, many users found it more comfortable to show this pose rather close to the camera. Notably, the UWA camera does not suffer from this problem, and the recognition of both Palm poses is excellent in all zones.
Although Pointer is the worst recognized hand pose in our experiment, its recognition accuracy is still very good: 94–95% and 90–95% for the WA and UWA cameras, respectively. In the Pointer pose, as well as in Pinch and Grab, multiple hand joints are obscured, making hand skeleton reconstruction more difficult. Pointer recognition accuracy in the UWA images is better in the peripheral zones, where more hand joints are visible.
The Grab hand pose is defined less strictly than Pinch and Pointer. For instance, we allow it to be shown with two or more fingers. Therefore, its recognition accuracy does not suffer from obscured joints as much as Pinch and Pointer do. Grab is correctly predicted in 95% of images from the UWA camera, and in 97–99% of WA images.
To summarize, the recognition accuracy of the five considered hand poses is better than 90% in all parts of the WA and UWA camera FOVs. The accuracy is better on the periphery, where more hand joints are visible to the camera. In the UWA camera periphery, Grab is recognized in 95% of images, while the four other gestures (Pinch, Palm Front, Palm Back, Pointer) exhibit an excellent 99% accuracy.
3.2 Usability Study
Users assessed the convenience of each gesture on a scale from −3 (more comfortable in the central zone) to +3 (more comfortable on the periphery). The scores spread broadly (Fig. 4), confirming that gesticulation preferences are very individual and depend on many factors. For some gestures, however, the median score was negative. Users showed more inclination to perform 4 out of 10 gestures in the central zone: Palm Front, Pinch, Palm Back, and Finger Push. The periphery is slightly preferred for Palm Push, Palm Zoom, Grab, and Pointer. However, for all these 8 gestures the median is rather close to zero, which allows us to conclude that expanding the gesture frame to the periphery would not cause much discomfort.
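With the responses grouped by gesture, the per-gesture medians discussed here reduce to a one-liner. The sketch below uses made-up scores purely to illustrate the computation; the actual responses are summarized in Fig. 4.

```python
from statistics import median

# Made-up convenience scores from -3 (central zone preferred) to +3 (periphery preferred).
scores = {
    "Pinch": [-2, -1, 0, -1, 1, -2, 0],
    "Zoom":  [2, 3, 1, 2, 3, 0, 2],
    "Swipe": [1, 3, 2, 2, 3, 1, 2],
}

medians = {gesture: median(values) for gesture, values in scores.items()}
for gesture, m in sorted(medians.items(), key=lambda item: item[1]):
    print(f"{gesture:>6}: median preference {m:+}")
```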
For two gestures, Zoom and Swipe, users showed a clear preference for the peripheral zone of the UWA camera FOV. Both gestures are quite intuitive and span a wide angle relative to the body, hence gesture frame expansion benefits them. Moreover, the expanded gesture frame allows for a smoother and more precise Zoom and lets Swipe be shown entirely in the periphery, where it does not obscure the display.
4 Conclusions
Five hand poses, typically used for gesture interaction with AR glasses, are correctly recognized in more than 90% of frames obtained from conventional WA and UWA cameras. Recognition is harder for the poses in which multiple hand joints are obscured, such as Pointer and Grab. However, when these gestures are shown on the periphery of the UWA camera FOV, the accuracy improves because more joints are revealed. In the periphery of the UWA camera FOV, the recognition precision is 99% for four out of five hand poses. Therefore, the gesture frame can be effectively expanded using the UWA camera.
The usability study addressed five hand poses accompanied by five dynamic gestures, commonly used in AR glasses. For eight out of ten gestures, no strong preference was found towards the central or peripheral zone of the UWA camera FOV. The periphery, however, was more convenient for making Zoom and Swipe gestures.
Our findings show that expanding the gesture frame using a UWA camera is beneficial for the precision and convenience of user interaction with AR glasses.
References
Kim, M., Choi, S.H., Park, K.-B., Lee, J.Y.: User interactions for augmented reality smart glasses: a comparative evaluation of visual contexts and interaction gestures. Appl. Sci. 9(15), 3171 (2019)
Microsoft Docs: Multimodal interaction models. https://docs.microsoft.com/en-us/windows/mixed-reality/interaction-fundamentals. Accessed 13 Mar 2020
Magic Leap: Magic Leap 1. https://www.magicleap.com/magic-leap-1. Accessed 13 Mar 2020
Microsoft Docs. Gaze and Commit. https://docs.microsoft.com/en-us/windows/mixed-reality/gaze-and-commit#composite-gestures. Accessed 13 Mar 2020
Younis, O., Al-Nuaimy, W., Alomari, M., Rowe, F.: A hazard detection and tracking system for people with peripheral vision loss using smart glasses and augmented reality. Int. J. Adv. Comput. Sci. Appl. 10, 1–9 (2019)
Chaturvedi, I., Bijarbooneh, F.H., Braud, T., Hui, P.: Peripheral vision: a new killer app for smart glasses. In: Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 625–636, March 2019
Song, J., et al.: In-air gestures around unmodified mobile devices. In: Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology, pp. 319–329. Association for Computing Machinery (2014)
Mohammed, A.A.Q., Lv, J., Islam, M.S.: A deep learning-based end-to-end composite system for hand detection and gesture recognition. Sensors 19(23), 5282 (2019)
Goh, E.S., Sunar, M.S., Ismail, A.W.: 3D object manipulation techniques in handheld mobile augmented reality interface: a review. IEEE Access 7, 40581–40601 (2019)
Vuletic, T., Duffy, A., Hay, L., McTeague, C., Campbell, G., Grealy, M.: Systematic literature review of hand gestures used in human computer interaction interfaces. Int. J. Hum.-Comput. Stud. 129, 74–94 (2019)
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732. IEEE (2016)
Panteleris, P., Oikonomidis, I., Argyros, A.: Using a single RGB frame for real time 3D hand pose estimation in the wild. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 436–445. IEEE (2018)