
Articulated distance fields for ultra-fast tracking of hands interacting

Published: 20 November 2017

Abstract

The state of the art in articulated hand tracking has been greatly advanced by hybrid methods that fit a generative hand model to depth data, leveraging both temporally and discriminatively predicted starting poses. In this paradigm, the generative model defines an energy function, and a local iterative optimization is run from these starting poses in order to find a "good local minimum" (i.e. a local minimum close to the true pose). Performing this optimization quickly is key to exploring more starting poses, performing more iterations and, crucially, exploiting high frame rates that keep temporally predicted starting poses within the basin of convergence of a good local minimum. At the same time, a detailed and accurate generative model tends to deepen the good local minima and widen their basins of convergence. Recent work, however, has largely had to trade off such a detailed hand model against one that permits rapid optimization. We present a new implicit model of hand geometry that largely avoids this compromise, and we leverage it to build an ultra-fast hybrid hand tracking system. Specifically, we construct an articulated signed distance function that, for any pose, yields a closed-form calculation of both the distance to the detailed surface geometry and the derivatives necessary for gradient-based optimization. There is no need to introduce or update any explicit "correspondences", yielding a simple algorithm that maps well to parallel hardware such as GPUs. As a result, our system can run at extremely high frame rates (e.g. up to 1000 fps). Furthermore, we demonstrate how to detect, segment and optimize for two strongly interacting hands, recovering complex interactions at extremely high frame rates. In the absence of publicly available datasets of sufficiently high frame rate, we leverage a multiview capture system to create a new 180 fps dataset of one and two hands interacting together or with objects.
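The core idea of an articulated signed distance function can be illustrated with a minimal sketch: each bone carries a primitive SDF in its local frame, a query point is mapped into each frame by the inverse of that bone's posed rigid transform, and the model distance is the minimum over bones. This is only an illustrative assumption-laden toy (capsule primitives, finite-difference gradients, invented names), not the paper's method, which uses detailed surface geometry and closed-form derivatives.

```python
import numpy as np

def capsule_sdf(p, a, b, r):
    """Signed distance from point p to a capsule with axis segment a->b and radius r."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)  # closest point on segment
    return np.linalg.norm(p - (a + t * ab)) - r

def articulated_sdf(p, bones):
    """Distance to a posed articulated model.

    Each bone is (R, t, a, b, r): a world-space rigid pose (R, t) plus a
    capsule defined in the bone's local frame. The query point is pulled
    back into each local frame and the minimum primitive distance is taken.
    """
    dists = []
    for (R, t, a, b, r) in bones:
        p_local = R.T @ (p - t)  # inverse rigid transform into the bone frame
        dists.append(capsule_sdf(p_local, a, b, r))
    return min(dists)

def sdf_gradient(p, bones, eps=1e-5):
    """Central finite differences; in the paper this role is played by
    closed-form derivatives, which is what enables fast gradient-based
    optimization without explicit correspondences."""
    g = np.zeros(3)
    for i in range(3):
        dp = np.zeros(3)
        dp[i] = eps
        g[i] = (articulated_sdf(p + dp, bones) -
                articulated_sdf(p - dp, bones)) / (2 * eps)
    return g
```

In a tracking energy, each depth point contributes its (robustified) distance under this function, so one gradient step moves the pose parameters without ever assigning point-to-surface correspondences; that absence of correspondence bookkeeping is what makes the evaluation map cleanly onto a GPU.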

Supplementary Material

MP4 File (a244-taylor.mp4)




    Published In

    ACM Transactions on Graphics, Volume 36, Issue 6
    December 2017, 973 pages
    ISSN: 0730-0301
    EISSN: 1557-7368
    DOI: 10.1145/3130800

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. tracking
    2. volumetric deformation

    Qualifiers

    • Research-article

