Point-voxel dual stream transformer for 3D point cloud learning

Published: 09 October 2023

Abstract

Recently, the success of the Transformer in natural language processing and image processing has inspired researchers to apply it to point cloud processing. However, existing point cloud Transformer methods suffer from massive parameter counts, heavy computation, and a lack of local features owing to their use of global self-attention. To address these problems, this paper presents a novel point-voxel dual stream Transformer (PVDST) network, which combines voxel-based convolution with point-based local attention to extract the local and contextual features of point clouds simultaneously. To reduce the parameters and computation of self-attention and enrich the contextual features with position information, we design a local-aware attention module with explicit position encoding and neighbor embedding, which computes attention within local neighborhoods. Building on this local-aware attention module and the cross-attention mechanism, we design a unique way to adaptively fuse the local and contextual features. Extensive experiments on shape classification, object part segmentation, and semantic segmentation demonstrate that PVDST achieves competitive performance compared with other methods.
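The abstract describes attention computed within local neighborhoods, with relative positions encoded explicitly. The following PyTorch sketch illustrates that general idea only; the module name, the k-nearest-neighbor neighborhood size, the MLP shapes, and the additive position encoding on keys and values are all illustrative assumptions, not the authors' exact design.

    # Minimal sketch of local attention with explicit relative position
    # encoding over k-nearest neighbors. Module name, neighborhood size k,
    # and MLP shapes are illustrative assumptions, not the paper's design.
    import torch
    import torch.nn as nn

    class LocalAttentionSketch(nn.Module):
        def __init__(self, dim: int, k: int = 16):
            super().__init__()
            self.k = k
            self.to_q = nn.Linear(dim, dim)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)
            # Explicit position encoding: relative coordinates -> feature space.
            self.pos_mlp = nn.Sequential(
                nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim)
            )

        def forward(self, xyz: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
            # xyz: (B, N, 3) point coordinates; feats: (B, N, C) point features.
            B, N, C = feats.shape
            # k-nearest neighbors in Euclidean space (brute force for clarity).
            idx = torch.cdist(xyz, xyz).topk(self.k, largest=False).indices  # (B, N, k)
            batch = torch.arange(B, device=xyz.device).view(B, 1, 1)
            n_xyz = xyz[batch, idx]      # (B, N, k, 3) neighbor coordinates
            n_feats = feats[batch, idx]  # (B, N, k, C) neighbor features
            # Relative position encoding, added to both keys and values.
            pos = self.pos_mlp(n_xyz - xyz.unsqueeze(2))  # (B, N, k, C)
            q = self.to_q(feats).unsqueeze(2)             # (B, N, 1, C)
            key = self.to_k(n_feats) + pos                # (B, N, k, C)
            val = self.to_v(n_feats) + pos                # (B, N, k, C)
            # Attention restricted to each point's local neighborhood.
            attn = torch.softmax((q * key).sum(-1) / C ** 0.5, dim=-1)  # (B, N, k)
            return (attn.unsqueeze(-1) * val).sum(dim=2)  # (B, N, C)

In a scheme like this, each query attends only to its k nearest neighbors, so the softmax cost scales with N·k rather than N², which matches the efficiency argument the abstract makes against global self-attention; the brute-force cdist neighbor search above is kept only for brevity.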


    Published In

The Visual Computer: International Journal of Computer Graphics, Volume 40, Issue 8
    Aug 2024
    782 pages

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Publication History

    Published: 09 October 2023
    Accepted: 11 September 2023

    Author Tags

    1. Cross-attention
    2. Multi-representation fusion
    3. Point cloud learning
    4. Transformer
    5. Voxel convolution
    6. Local attention

    Qualifiers

    • Research-article
