EPM-Net: Efficient Feature Extraction, Point-Pair Feature Matching for Robust 6-D Pose Estimation

Published: 01 January 2024

Abstract

Estimating the 6-D poses of objects from RGB-D images holds great potential for several applications. However, because 6-D pose estimation accuracy is significantly degraded by occlusion and noise among the objects in an image, this paper proposes a novel 6-D pose estimation method based on Efficient feature extraction and Point-pair feature matching. Specifically, we develop the Efficient channel attention Convolutional Neural Network (ECNN) and SO(3)-Encoder modules to extract 2-D features from the RGB image and SO(3)-equivariant features from the depth image, respectively. These features are fused in the DenseFusion module to obtain 3-D features in the camera space. Meanwhile, we exploit CAD model priors to obtain 3-D features in the model space through the model feature encoder, and then we globally regress the 3-D features in the camera and model spaces. From these features, we generate oriented point clouds in each space and conduct point-pair feature matching to obtain pose information. Finally, we perform direct pose regression on the 3-D features in the camera and model spaces, and the resulting point-pair feature matching pose information is combined with the direct point-wise pose regression information to enhance pose prediction accuracy. Experimental results on three widely used benchmark datasets demonstrate that our method achieves state-of-the-art performance, particularly in severely occluded scenes.
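The pipeline described above can be pictured with a short sketch. The PyTorch-style code below is only an illustrative assumption of how the two feature branches and the fusion step might be wired together: the class names ECNN, SO3Encoder, and DenseFusion echo the abstract, but their internals here are simplified stand-ins (the point encoder is a plain MLP rather than a true SO(3)-equivariant network, and the head regresses a point-wise quaternion and translation instead of performing the paper's point-pair feature matching).

# Minimal sketch of a two-branch RGB-D fusion pipeline (assumed, not the paper's code).
import torch
import torch.nn as nn

class ECNN(nn.Module):
    """Extracts a per-pixel 2-D feature map from the RGB image (simplified stand-in)."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_dim, 3, padding=1), nn.ReLU(),
        )

    def forward(self, rgb):                      # (B, 3, H, W)
        return self.net(rgb)                     # (B, C, H, W)

class SO3Encoder(nn.Module):
    """Stand-in for the SO(3)-equivariant point encoder: a plain point-wise MLP."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, points):                   # (B, N, 3) lifted from the depth map
        return self.mlp(points)                  # (B, N, C)

class DenseFusion(nn.Module):
    """Fuses per-point RGB and geometric features, then regresses a pose per point."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 7),                   # quaternion (4) + translation (3)
        )

    def forward(self, rgb_feat, geo_feat, pix_idx):
        B, C, H, W = rgb_feat.shape
        flat = rgb_feat.view(B, C, H * W)
        # Gather the RGB feature at the pixel each 3-D point projects to.
        per_point_rgb = torch.gather(
            flat, 2, pix_idx.unsqueeze(1).expand(-1, C, -1)
        ).permute(0, 2, 1)                       # (B, N, C)
        fused = torch.cat([per_point_rgb, geo_feat], dim=-1)
        return self.head(fused)                  # (B, N, 7) point-wise pose predictions

# Toy forward pass with random inputs.
rgb = torch.randn(2, 3, 120, 160)
points = torch.randn(2, 500, 3)
pix_idx = torch.randint(0, 120 * 160, (2, 500))
poses = DenseFusion()(ECNN()(rgb), SO3Encoder()(points), pix_idx)
print(poses.shape)  # torch.Size([2, 500, 7])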

Cited By

  • (2025) Homologous multimodal fusion network with geometric constraint keypoints selection for 6D pose estimation, Expert Systems with Applications: An International Journal, vol. 266, no. C. https://doi.org/10.1016/j.eswa.2024.126022. Online publication date: 25-Mar-2025.

Published In

IEEE Transactions on Multimedia, Volume 26, 2024
11427 pages

Publisher

IEEE Press

Qualifiers

  • Research-article
