

EgoHDM: A Real-time Egocentric-Inertial Human Motion Capture, Localization, and Dense Mapping System

Published: 19 November 2024

Abstract

We present EgoHDM, an online egocentric-inertial human motion capture (mocap), localization, and dense mapping system. Our system uses six inertial measurement units (IMUs) and a commodity head-mounted RGB camera. EgoHDM is the first human mocap system that offers dense scene mapping in near real-time. Further, it is fast and robust to initialize, and it fully closes the loop between physically plausible, map-aware global human motion estimation and mocap-aware 3D scene reconstruction. To achieve this, we design a tightly coupled mocap-aware dense bundle adjustment and a physics-based body pose correction module that leverages a local body-centric elevation map. The latter introduces a novel terrain-aware contact PD controller, which enables characters to make physical contact with the given local elevation map, thereby reducing human floating and penetration. We demonstrate the performance of our system on established synthetic and real-world benchmarks. The results show that our method reduces human localization, camera pose, and mapping error by 41%, 71%, and 46%, respectively, compared to the state of the art. Our qualitative evaluations on newly captured data further demonstrate that EgoHDM can handle challenging scenarios on non-flat terrain, including stepping over stairs and outdoor scenes in the wild. Our project page: https://handiyin.github.io/EgoHDM/
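The terrain-aware contact PD controller mentioned in the abstract can be illustrated with a minimal sketch: a proportional-derivative term drives a foot's height toward the local elevation-map height, which counteracts both floating (foot above terrain) and penetration (foot below terrain). This is an illustrative toy example, not the authors' implementation; the function name, gains, and nearest-cell map lookup are all assumptions.

```python
import numpy as np

def pd_contact_correction(foot_pos, foot_vel, elevation_map, cell_size,
                          kp=120.0, kd=12.0):
    """Toy terrain-aware PD correction (hypothetical, not from the paper).

    foot_pos: (x, y, z) foot position in metres.
    foot_vel: (vx, vy, vz) foot velocity.
    elevation_map: 2D array of terrain heights; cell_size maps metres to cells.
    Returns a vertical corrective force pushing the foot toward the terrain.
    """
    # Look up the terrain height under the foot (nearest cell, clamped to bounds).
    i = int(np.clip(round(foot_pos[0] / cell_size), 0, elevation_map.shape[0] - 1))
    j = int(np.clip(round(foot_pos[1] / cell_size), 0, elevation_map.shape[1] - 1))
    terrain_h = elevation_map[i, j]

    # PD law on the vertical height error: positive when the foot penetrates
    # (pushes up), negative when it floats (pulls down); damped by foot velocity.
    err = terrain_h - foot_pos[2]
    return kp * err - kd * foot_vel[2]

# Example: on flat ground at height 0, a foot 5 cm below the surface
# receives an upward (positive) corrective force.
flat = np.zeros((10, 10))
f = pd_contact_correction(np.array([0.5, 0.5, -0.05]), np.zeros(3), flat, 0.1)
```

In the full system such a correction would act inside the physics-based pose optimization rather than as an isolated force, but the sketch captures the core idea: contact is enforced against the local elevation map instead of an assumed flat ground plane.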




    Published In

    ACM Transactions on Graphics  Volume 43, Issue 6
    December 2024
    1828 pages
EISSN: 1557-7368
DOI: 10.1145/3702969

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published in TOG Volume 43, Issue 6


    Author Tags

    1. IMUs
    2. SLAM
    3. real-time
    4. pose estimation

    Qualifiers

    • Research-article
