EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation

Published: 01 January 2024

Abstract

Estimating the 6-D poses of objects from RGB-D images holds great potential for several applications. However, because 6-D pose estimation accuracy is significantly degraded by occlusion and noise among the objects in an image, this paper proposes a novel 6-D pose estimation method based on Efficient feature extraction and Point-pair feature matching. Specifically, we develop the Efficient channel attention Convolutional Neural Network (ECNN) and SO(3)-Encoder modules to extract 2-D features from the RGB image and SO(3)-equivariant features from the depth image, respectively. These features are fused in the DenseFusion module to obtain 3-D features in the camera space. Meanwhile, we exploit CAD model priors to obtain 3-D features in the model space through the model feature encoder, and then we globally regress the 3-D features in the camera and model spaces. From these features, we generate oriented point clouds in each space and conduct point-pair feature matching to obtain pose information. Finally, we perform direct pose regression on the 3-D features in the camera and model spaces, and the resulting point-pair feature matching pose information is combined with the direct point-wise pose regression information to enhance pose prediction accuracy. Experimental results on three widely used benchmark datasets demonstrate that our method achieves state-of-the-art performance, particularly in severely occluded scenes.
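The pipeline described above can be pictured with a short sketch. The PyTorch-style code below is only an illustrative assumption of how the two feature branches and the fusion step might be wired together: the class names ECNN, SO3Encoder, and DenseFusion echo the abstract, but their internals here are simplified stand-ins (the point encoder is a plain MLP rather than a true SO(3)-equivariant network, and the head regresses a point-wise quaternion and translation instead of performing the paper's point-pair feature matching).

# Minimal sketch of a two-branch RGB-D fusion pipeline (assumed, not the paper's code).
import torch
import torch.nn as nn

class ECNN(nn.Module):
    """Extracts a per-pixel 2-D feature map from the RGB image (simplified stand-in)."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_dim, 3, padding=1), nn.ReLU(),
        )

    def forward(self, rgb):                      # (B, 3, H, W)
        return self.net(rgb)                     # (B, C, H, W)

class SO3Encoder(nn.Module):
    """Stand-in for the SO(3)-equivariant point encoder: a plain point-wise MLP."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, points):                   # (B, N, 3) lifted from the depth map
        return self.mlp(points)                  # (B, N, C)

class DenseFusion(nn.Module):
    """Fuses per-point RGB and geometric features, then regresses a pose per point."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 7),                   # quaternion (4) + translation (3)
        )

    def forward(self, rgb_feat, geo_feat, pix_idx):
        B, C, H, W = rgb_feat.shape
        flat = rgb_feat.view(B, C, H * W)
        # Gather the RGB feature at the pixel each 3-D point projects to.
        per_point_rgb = torch.gather(
            flat, 2, pix_idx.unsqueeze(1).expand(-1, C, -1)
        ).permute(0, 2, 1)                       # (B, N, C)
        fused = torch.cat([per_point_rgb, geo_feat], dim=-1)
        return self.head(fused)                  # (B, N, 7) point-wise pose predictions

# Toy forward pass with random inputs.
rgb = torch.randn(2, 3, 120, 160)
points = torch.randn(2, 500, 3)
pix_idx = torch.randint(0, 120 * 160, (2, 500))
poses = DenseFusion()(ECNN()(rgb), SO3Encoder()(points), pix_idx)
print(poses.shape)  # torch.Size([2, 500, 7])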

Cited By

  • (2025) Homologous multimodal fusion network with geometric constraint keypoints selection for 6D pose estimation, Expert Systems with Applications: An International Journal, vol. 266, no. C. https://doi.org/10.1016/j.eswa.2024.126022. Online publication date: 25-Mar-2025.

Published In

IEEE Transactions on Multimedia, Volume 26, 2024
11427 pages

Publisher

IEEE Press

Qualifiers

  • Research-article
