EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation

Published: 01 January 2024


Estimating the 6-D poses of objects from RGB-D images holds great potential for several applications. However, 6-D pose estimation accuracy is significantly affected by occlusion between objects and by noise in the image. To address this, this paper proposes a novel 6-D pose estimation method based on Efficient feature extraction and Point-pair feature Matching. Specifically, we develop the Efficient channel attention Convolutional Neural Network (ECNN) and SO(3)-Encoder modules to extract 2-D features from the RGB image and SO(3)-equivariant features from the depth image, respectively. These features are fused in the DenseFusion module to obtain 3-D features in the camera space. Meanwhile, we exploit CAD model priors to obtain 3-D features in the model space through the model feature encoder, and then we globally regress the 3-D features in the camera and model spaces. From these features, we generate oriented point clouds in each space and conduct point-pair feature matching to obtain pose information. Finally, we perform direct point-wise pose regression on the 3-D features in the camera and model spaces, and combine the result with the pose information from point-pair feature matching to enhance pose prediction accuracy. Experimental results on three widely used benchmark datasets demonstrate that our method achieves state-of-the-art performance, particularly for severely occluded scenes.
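To make the point-pair feature step more concrete, here is a minimal sketch of the classical point-pair feature for two oriented points (a 3-D position plus a surface normal). It assumes the standard four-component definition from the point-pair matching literature: the distance between the points and three angles. The abstract does not give the exact variant used in EPM-Net, so the function names and the toy data below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def angle_between(a, b):
    """Angle in radians, in [0, pi], between two non-zero vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))


def point_pair_feature(p1, n1, p2, n2):
    """Classical 4-D point-pair feature for two oriented points.

    Returns (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)) with d = p2 - p1.
    The two points are assumed to be distinct, so d is non-zero.
    """
    d = p2 - p1
    return np.array([
        np.linalg.norm(d),
        angle_between(n1, d),
        angle_between(n2, d),
        angle_between(n1, n2),
    ])


# Toy oriented points (hypothetical data, chosen for easy checking).
p1, n1 = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
p2, n2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
feature_model = point_pair_feature(p1, n1, p2, n2)

# Apply an arbitrary rigid transform: rotation about z plus a translation.
theta = 0.7
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
t = np.array([0.3, -1.2, 2.0])
feature_camera = point_pair_feature(R @ p1 + t, R @ n1, R @ p2 + t, R @ n2)
```

Because the feature depends only on relative distances and angles, `feature_model` and `feature_camera` agree up to rounding. That invariance under rigid motion is what lets pairs from an oriented point cloud in the model space be matched against pairs in the camera space.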


Cited By

  • (2025) Homologous multimodal fusion network with geometric constraint keypoints selection for 6D pose estimation. Expert Systems with Applications: An International Journal, 266:C. DOI: 10.1016/j.eswa.2024.126022. Online publication date: 25-Mar-2025.





Published In

IEEE Transactions on Multimedia  Volume 26, Issue
11427 pages


IEEE Press



  • Research-article

