PointUR-RL: Unified Self-Supervised Learning Method Based on Variable Masked Autoencoder for Point Cloud Reconstruction and Representation Learning
Figure 1. The structure of PointUR-RL.
Figure 2. Truncated Gaussian distribution map.
Figure 3. Embedding module.
Figure 4. The structure of the encoder and decoder.
Figure 5. The flowchart of contrastive learning.
Figure 6. Visualization of feature distributions. t-SNE visualizations of the feature vectors learned by our model at various stages: (a) after pre-training on ModelNet10; (b) after fine-tuning on ModelNet10; (c) after pre-training on ModelNet40; (d) after fine-tuning on ModelNet40; (e) after fine-tuning on ModelNet40 by Point-MAE; and (f) after fine-tuning on ModelNet40 by Point-BERT.
Figure 7. Visualization of point cloud reconstruction.
Abstract
1. Introduction
- We introduce PointUR-RL, a self-supervised method that, unlike prior approaches, unifies point cloud reconstruction and representation learning through a variable masked autoencoder. A contrastive learning module further strengthens representation learning, improving the separability of the learned features and safeguarding the quality of both tasks.
- Optimized for point cloud processing, PointUR-RL adapts smoothly to point cloud data under varying masking ratios during pre-training and naturally supports class-unconditional point cloud reconstruction (a minimal sketch of the variable-ratio masking follows this list).
- Experimental results demonstrate that the pre-trained PointUR-RL model is effective and generalizes well to downstream tasks. It achieves high classification accuracy and high-quality point cloud reconstruction on the public ModelNet and ShapeNet datasets, and it also generalizes well to the real-world ScanObjectNN dataset.
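To make the variable-ratio masking concrete, here is a minimal, hypothetical sketch of MAGE-style masking over point patches: a masking ratio is drawn from a Gaussian truncated to a fixed interval (cf. the truncated Gaussian in Figure 2) and applied to a random subset of patch indices. The distribution parameters, patch count, and function names are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def sample_mask_ratio(mean=0.55, std=0.25, lo=0.5, hi=1.0, rng=None):
    """Draw a masking ratio from a Gaussian truncated to [lo, hi] via rejection sampling."""
    rng = rng if rng is not None else np.random.default_rng()
    while True:
        r = rng.normal(mean, std)
        if lo <= r <= hi:
            return float(r)

def mask_patches(num_patches, ratio, rng=None):
    """Randomly pick round(ratio * num_patches) patch indices to mask."""
    rng = rng if rng is not None else np.random.default_rng()
    num_masked = int(round(ratio * num_patches))
    perm = rng.permutation(num_patches)
    return perm[:num_masked], perm[num_masked:]  # (masked indices, visible indices)

# Example: 64 point patches per cloud, a fresh ratio drawn for every training sample.
ratio = sample_mask_ratio()
masked_idx, visible_idx = mask_patches(64, ratio)
print(f"ratio={ratio:.2f}  masked={len(masked_idx)}  visible={len(visible_idx)}")
```

Drawing a fresh ratio for every sample is what lets a single pre-trained model cover both high-ratio (generation-like) and low-ratio (representation-like) regimes, which is the property the unified design relies on.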
2. Related Work
2.1. Self-Supervised Learning
2.2. Autoencoder
2.3. Transformer
3. Methods
3.1. Point Patch Generation
3.2. Masking Strategy
3.3. Embedding Module
3.4. Encoder-Decoder Design
3.5. Reconstruction
3.6. Contrastive Co-Training
4. Experiments and Results
4.1. Pretrain Setting
4.2. Downstream Tasks
4.2.1. Object Classification on ModelNet40
4.2.2. Object Classification on Real-World Dataset
4.2.3. Object Reconstruction on ShapeNet55
4.2.4. Part Segmentation
4.2.5. Semantic Segmentation
4.3. Ablation Studies
4.3.1. Ablation Studies on Variable Masked Autoencoders
4.3.2. Ablation Experiments with Contrastive Learning
5. Conclusions and Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep learning for 3d point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4338–4364. [Google Scholar] [CrossRef] [PubMed]
- Zhang, R.; Tan, J.; Cao, Z.; Xu, L.; Liu, Y.; Si, L.; Sun, F. Part-Aware Correlation Networks for Few-shot Learning. IEEE Trans. Multimed. 2024; Early Access. [Google Scholar]
- Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J.; Tang, J. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 2021, 35, 857–876. [Google Scholar] [CrossRef]
- Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
- Zhang, R.; Li, L.; Zhang, Q.; Zhang, J.; Xu, L.; Zhang, B.; Wang, B. Differential feature awareness network within antagonistic learning for infrared-visible object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 6735–6748. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Ye, M.; Zhang, X.; Yuen, P.C.; Chang, S.-F. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6210–6219. [Google Scholar]
- Achlioptas, P.; Diamanti, O.; Mitliagkas, I.; Guibas, L. Representation learning and adversarial generation of 3d point clouds. arXiv 2017, arXiv:1707.02392. [Google Scholar]
- Poursaeed, O.; Jiang, T.; Qiao, H.; Xu, N.; Kim, V.G. Self-supervised learning of point clouds via orientation estimation. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020; pp. 1018–1028. [Google Scholar]
- Li, R.; Li, X.; Fu, C.-W.; Cohen-Or, D.; Heng, P.-A. Pu-gan: A point cloud upsampling adversarial network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7203–7212. [Google Scholar]
- Sarmad, M.; Lee, H.J.; Kim, Y.M. Rl-gan-net: A reinforcement learning agent controlled gan network for real-time point cloud shape completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5898–5907. [Google Scholar]
- Yang, G.; Huang, X.; Hao, Z.; Liu, M.-Y.; Belongie, S.; Hariharan, B. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4541–4550. [Google Scholar]
- Li, T.; Chang, H.; Mishra, S.; Zhang, H.; Katabi, D.; Krishnan, D. Mage: Masked generative encoder to unify representation learning and image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2142–2152. [Google Scholar]
- Xiao, A.; Huang, J.; Guan, D.; Zhang, X.; Lu, S.; Shao, L. Unsupervised point cloud representation learning with deep neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11321–11339. [Google Scholar] [CrossRef] [PubMed]
- Eckart, B.; Yuan, W.; Liu, C.; Kautz, J. Self-supervised learning on 3d point clouds by learning discrete generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–24 June 2021; pp. 8248–8257. [Google Scholar]
- Chhipa, P.C.; Upadhyay, R.; Saini, R.; Lindqvist, L.; Nordenskjold, R.; Uchida, S.; Liwicki, M. Depth contrast: Self-supervised pretraining on 3dpm images for mining material classification. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 212–227. [Google Scholar]
- Afham, M.; Dissanayake, I.; Dissanayake, D.; Dharmasiri, A.; Thilakarathna, K.; Rodrigo, R. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9902–9912. [Google Scholar]
- Huang, S.; Xie, Y.; Zhu, S.-C.; Zhu, Y. Spatio-temporal self-supervised representation learning for 3d point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6535–6545. [Google Scholar]
- Liu, K.; Xiao, A.; Zhang, X.; Lu, S.; Shao, L. Fac: 3d representation learning via foreground aware feature contrast. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9476–9485. [Google Scholar]
- Wang, H.; Liu, Q.; Yue, X.; Lasenby, J.; Kusner, M.J. Unsupervised point cloud pre-training via occlusion completion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9782–9792. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Yu, X.; Tang, L.; Rao, Y.; Huang, T.; Zhou, J.; Lu, J. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19313–19322. [Google Scholar]
- Pang, Y.; Wang, W.; Tay, F.E.; Liu, W.; Tian, Y.; Yuan, L. Masked autoencoders for point cloud self-supervised learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 604–621. [Google Scholar]
- Wu, X.; Jiang, L.; Wang, P.-S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler Faster Stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 4840–4851. [Google Scholar]
- Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. (TOG) 2019, 38, 1–12. [Google Scholar] [CrossRef]
- Chen, X.; Liu, Z.; Xie, S.; He, K. Deconstructing denoising diffusion models for self-supervised learning. arXiv 2024, arXiv:2401.14404. [Google Scholar]
- Chen, X.; Ding, M.; Wang, X.; Xin, Y.; Mo, S.; Wang, Y.; Han, S.; Luo, P.; Zeng, G.; Wang, J. Context autoencoder for self-supervised representation learning. Int. J. Comput. Vis. 2024, 132, 208–223. [Google Scholar] [CrossRef]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
- Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9653–9663. [Google Scholar]
- Li, Z.; Gao, Z.; Tan, C.; Ren, B.; Yang, L.T.; Li, S.Z. General Point Model Pretraining with Autoencoding and Autoregressive. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 20954–20964. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
- Zhang, R.; Xu, L.; Yu, Z.; Shi, Y.; Mu, C.; Xu, M. Deep-IRTarget: An automatic target detector in infrared imagery using dual-domain feature extraction and allocation. IEEE Trans. Multimed. 2021, 24, 1735–1749. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A survey of visual transformers. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 7478–7498. [Google Scholar] [CrossRef] [PubMed]
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41. [Google Scholar] [CrossRef]
- Guo, M.-H.; Cai, J.-X.; Liu, Z.-N.; Mu, T.-J.; Martin, R.R.; Hu, S.-M. Pct: Point cloud transformer. Comput. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
- Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3d object detection with pointformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–24 June 2021; pp. 7463–7472. [Google Scholar]
- Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 16259–16268. [Google Scholar]
- Zhang, Y.; Lin, J.; Li, R.; Jia, K.; Zhang, L. Point-MA2E: Masked and Affine Transformed AutoEncoder for Self-supervised Point Cloud Learning. arXiv 2022, arXiv:2211.06841. [Google Scholar]
- Kolodiazhnyi, M.; Vorontsova, A.; Konushin, A.; Rukhovich, D. Oneformer3d: One transformer for unified point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 20943–20953. [Google Scholar]
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
- Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. arXiv 2018, arXiv:1801.07791. [Google Scholar]
- Qiu, S.; Anwar, S.; Barnes, N. Geometric back-projection network for point cloud classification. IEEE Trans. Multimed. 2021, 24, 1943–1955. [Google Scholar] [CrossRef]
- Cheng, S.; Chen, X.; He, X.; Liu, Z.; Bai, X. Pra-net: Point relation-aware network for 3d point cloud analysis. IEEE Trans. Image Process. 2021, 30, 4436–4448. [Google Scholar] [CrossRef] [PubMed]
Object classification results on ModelNet40 (overall accuracy).

| Training Category | Methods | Acc. |
|---|---|---|
| Supervised methods | PointNet | 89.2% |
| | PointNet++ | 90.7% |
| | PointCNN [43] | 92.5% |
| | DGCNN | 92.9% |
| | [ST] Transformer | 91.4% |
| | [T] PCT [37] | 93.2% |
| | [T] Point Transformer [39] | 93.7% |
| Self-supervised methods | OcCo | 93.0% |
| | STRL [19] | 93.1% |
| | [ST] Transformer + OcCo | 92.1% |
| | [ST] Point-BERT | 93.2% |
| | [ST] Point-BERT (rec.) | 93.1% |
| | [ST] Point-MAE | 93.8% |
| | [ST] Point-MAE (rec.) | 93.11% |
| | Ours | 93.31% |
Object classification on the ScanObjectNN real-world dataset (overall accuracy, %).

| Methods | OBJ-BG | OBJ-ONLY | PB-T50-RS |
|---|---|---|---|
| PointNet | 73.3 | 79.2 | 68.0 |
| SpiderCNN | 77.1 | 79.5 | 73.7 |
| PointNet++ | 82.3 | 84.3 | 77.9 |
| DGCNN | 82.8 | 86.2 | 78.1 |
| BGA-DGCNN | - | - | 79.7 |
| GBNet [44] | - | - | 80.5 |
| PRANet [45] | - | - | 81.0 |
| Transformer | 79.86 | 80.55 | 77.24 |
| Transformer + OcCo | 84.85 | 85.54 | 78.79 |
| Point-BERT | 87.43 | 88.12 | 83.07 |
| Point-BERT (rec.) | 87.43 | 86.91 | 83.10 |
| Point-MAE | 90.02 | 88.29 | 85.18 |
| Point-MAE (rec.) | 88.98 | 88.29 | 84.31 |
| Ours | 89.67 | 88.81 | 84.35 |
Object reconstruction on ShapeNet55 (per-category reconstruction error; lower is better).

| Methods | Table | Chair | Plane | Car | Sofa |
|---|---|---|---|---|---|
| FoldingNet | 2.53 | 2.81 | 1.43 | 1.98 | 2.48 |
| PCN | 2.13 | 2.29 | 1.02 | 1.85 | 2.06 |
| TopNet | 2.21 | 2.53 | 1.14 | 2.18 | 2.36 |
| PFNet | 3.95 | 4.24 | 1.81 | 2.53 | 3.34 |
| GRNet | 2.53 | 2.81 | 1.43 | 1.98 | 2.48 |
| Ours | 2.35 | 2.48 | 1.09 | 3.66 | 3.56 |
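The reconstruction baselines above (FoldingNet, PCN, TopNet, PFNet, GRNet) are conventionally evaluated with Chamfer distance; assuming the reported numbers are L2 Chamfer distances (typically scaled by 10^3, lower is better), a minimal reference implementation looks like the following sketch.

```python
import numpy as np

def chamfer_l2(pred, gt):
    """Symmetric Chamfer distance between point sets pred (N, 3) and gt (M, 3):
    average squared nearest-neighbour distance in each direction, summed."""
    d2 = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Toy usage with random 1024-point clouds; papers often report CD x 1000.
rng = np.random.default_rng(0)
pred, gt = rng.random((1024, 3)), rng.random((1024, 3))
print(f"CD x 1000: {1000 * chamfer_l2(pred, gt):.2f}")
```

Note that some works average the two directional terms or use unsquared distances, so absolute values are only comparable under a single convention.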
Part segmentation on ShapeNetPart: mean IoU and per-category IoU (%).

| Methods | mIoU | Airplane | Bag | Cap | Car | Chair | Earphone | Guitar | Knife | Lamp | Laptop | Motor | Mug | Pistol | Rocket | Skateboard | Table |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointNet | 83.7 | 83.4 | 78.7 | 82.5 | 74.9 | 89.6 | 73.0 | 91.5 | 85.9 | 80.8 | 95.3 | 65.2 | 93.0 | 81.2 | 57.9 | 72.8 | 80.6 |
| PointNet++ | 85.1 | 82.4 | 79.0 | 86.7 | 77.3 | 90.8 | 71.8 | 91.0 | 85.9 | 83.7 | 95.3 | 71.6 | 94.1 | 81.3 | 58.7 | 76.4 | 82.6 |
| DGCNN | 85.2 | 84.0 | 83.4 | 86.7 | 77.8 | 90.6 | 74.7 | 91.2 | 87.5 | 82.8 | 95.7 | 66.3 | 94.9 | 81.1 | 63.5 | 74.5 | 82.6 |
| Transformer | 85.1 | 82.9 | 85.4 | 87.7 | 78.8 | 90.5 | 80.8 | 91.1 | 87.7 | 85.3 | 95.6 | 73.9 | 94.9 | 83.5 | 61.2 | 74.9 | 80.6 |
| Point-BERT | 85.6 | 84.3 | 84.8 | 88.0 | 79.8 | 91.0 | 81.7 | 91.6 | 87.9 | 85.2 | 95.6 | 75.6 | 94.7 | 84.3 | 63.4 | 76.3 | 81.5 |
| Point-MAE | 86.1 | 84.3 | 85.0 | 88.3 | 80.5 | 91.3 | 78.5 | 92.1 | 87.7 | 86.1 | 96.1 | 75.2 | 94.6 | 84.7 | 63.5 | 77.1 | 82.4 |
| Ours | 85.66 | 84.7 | 85.2 | 87.9 | 80.9 | 91.2 | 79.7 | 92.1 | 87.9 | 85.4 | 96.0 | 76.2 | 94.9 | 84.9 | 62.2 | 75.9 | 80.7 |
Semantic segmentation results (mIoU and mAcc, %).

| Training Category | Methods | mIoU | mAcc |
|---|---|---|---|
| Supervised methods | PointNet | 41.4 | 49.0 |
| | PointNet++ | 53.5 | - |
| | PointCNN | 57.4 | 63.9 |
| | KPConv | 67.1 | 72.8 |
| | SegGCN | 63.6 | 70.4 |
| | MKConv | 67.7 | 75.1 |
| Self-supervised methods | Point-BERT | 68.9 | 76.1 |
| | MaskPoint | 68.6 | 74.2 |
| | Point-MAE | 68.4 | 76.2 |
| | Ours | 69.0 | 76.2 |
Ablation study on the variable masked autoencoder: classification accuracy and average reconstruction error across the masking configurations compared in Section 4.3.1.

| Metric | | | | | | | |
|---|---|---|---|---|---|---|---|
| Object classification on ModelNet40 | 92.99% | 93.23% | 93.11% | 93.19% | 92.94% | 93.31% | 92.94% |
| Average reconstruction on ShapeNet55 | 2.87 | 2.75 | 2.64 | 2.80 | 2.66 | 2.57 | 2.80 |
Ablation study on the contrastive learning module and the class fake token.

| Model Configuration | Variable Ratio | Inclusion of Contrastive Learning | Use of Class Fake Token [C] | Object Classification on ModelNet40 |
|---|---|---|---|---|
| Model A | √ | | | 92.58% |
| Model B | √ | √ | | 93.19% |
| Model C | √ | | √ | 92.74% |
| Model D | √ | √ | √ | 93.31% |
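For readers unfamiliar with the contrastive component toggled in the ablation above, the following is a generic InfoNCE-style loss of the kind used by MoCo-like methods; the temperature, feature dimension, and the choice of which features to pair (e.g., a pooled encoder output or the class fake token output) are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """InfoNCE loss for a batch of paired embeddings z1, z2 of shape (B, D).
    Row i of z1 and row i of z2 form a positive pair; every other row in the
    batch acts as a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                    # (B, B) scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage, e.g. embeddings of two views of the same point cloud.
z1, z2 = torch.randn(8, 384), torch.randn(8, 384)
print(info_nce(z1, z2).item())
```

Pulling the learned features apart in this way is what the t-SNE plots in Figure 6 visualize qualitatively.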
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).