research-article

VoCAPTER: Voting-based Pose Tracking for Category-level Articulated Object via Inter-frame Priors

Authors:

Rujing Wang

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Pages 8942 - 8951

https://doi.org/10.1145/3664647.3681131

Published: 28 October 2024

Abstract

Articulated objects are common in daily life. However, existing category-level articulated pose methods mostly focus on predicting 9D poses from static point cloud observations. In this paper, we address category-level, online, robust 9D pose tracking of articulated objects and propose VoCAPTER, a novel 3D Voting-based Category-level Articulated object Pose TrackER. VoCAPTER efficiently updates poses between adjacent frames using partial observations from the current frame and the estimated per-part 9D poses from the previous frame. Specifically, by incorporating prior knowledge of continuous motion between frames, we first canonicalize the input point cloud, which casts pose tracking as an inter-frame pose increment estimation problem. Then, to make tracking robust, our main idea is to leverage features that remain SE(3)-invariant during motion. This is achieved through a voting-based articulation tracking algorithm, which identifies keyframes as reference states for accurate pose updating throughout the entire video sequence. We evaluate VoCAPTER on a synthetic dataset and in real-world scenarios, demonstrating its ability to generalize to diverse and complicated scenes. These experiments provide evidence of VoCAPTER's superiority and robustness in multi-frame pose tracking of articulated objects. We believe this work can facilitate progress in various fields, including robotics, embodied intelligence, and augmented reality. All code will be made publicly available.
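The canonicalization step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`rot_z`, `canonicalize`, `compose`) are hypothetical, the increment is assumed to be a rigid per-part transform, and the per-part scale is assumed to stay fixed between frames, so only rotation and translation are updated. The sketch expresses the current frame's points in the previous frame's part coordinates, so that any remaining motion is exactly the inter-frame increment, and then folds an estimated increment back into the pose.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z-axis by theta radians (nested lists)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def canonicalize(points, r_prev, t_prev):
    """Map camera-frame points into the previous frame's part coordinates.

    Assumption: x = R_prev^T (p - t_prev). If the part did not move, x lies
    on the (scaled) canonical shape; otherwise x differs from it by the
    inter-frame increment (dR, dt).
    """
    r_inv = transpose(r_prev)  # the inverse of a rotation is its transpose
    return [matvec(r_inv, [p[i] - t_prev[i] for i in range(3)]) for p in points]

def compose(r_prev, t_prev, d_r, d_t):
    """Fold an increment estimated in the canonicalized frame into the pose."""
    r_cur = matmul(r_prev, d_r)
    shift = matvec(r_prev, d_t)
    t_cur = [t_prev[i] + shift[i] for i in range(3)]
    return r_cur, t_cur

if __name__ == "__main__":
    # Illustrative numbers only; in a tracker, (d_r, d_t) would come from
    # an estimator, here they are simply chosen by hand.
    r_prev, t_prev = rot_z(0.3), [0.1, -0.2, 0.5]
    d_r, d_t = rot_z(0.05), [0.01, 0.0, -0.02]
    r_cur, t_cur = compose(r_prev, t_prev, d_r, d_t)
    print(r_cur, t_cur)
```

Because the increment is defined relative to the previous pose, it stays small under continuous motion, which is the usual reason for estimating a pose update rather than an absolute pose in each frame.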

Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision tasks → Vision for robotics

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

October 2024

11719 pages

ISBN:9798400706868

DOI:10.1145/3664647

General Chairs: Jianfei Cai (Monash University, Australia), Mohan Kankanhalli (NUS, Singapore), Balakrishnan Prabhakaran (UT Dallas, USA), Susanne Boll (University of Oldenburg, Germany)

Program Chairs: Ramanathan Subramanian (University of Canberra & IIT Ropar, Australia), Liang Zheng (Australian National University, Australia), Vivek K. Singh (Rutgers University, USA), Pablo Cesar (Centrum Wiskunde & Informatica, Netherlands), Lexing Xie (Australian National University, Australia), Dong Xu (University of Hong Kong, Hong Kong)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

2019YFE0125700

Conference

MM '24

Sponsor:

SIGMM

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
