Article
DOI: 10.1007/978-3-031-73347-5_27

Equivariant Spatio-temporal Self-supervision for LiDAR Object Detection

Published: 29 October 2024

Abstract

Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks such as object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Pre-training loss functions that encourage feature equivariance under certain transformations provide a strong self-supervision signal while also retaining information about the geometric relationships between transformed feature representations. This can improve performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework that considers spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flip, rotation, and scene flow. For spatial augmentations, we find that, depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields the best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We show that our pre-training method for 3D object detection outperforms existing equivariant and invariant approaches in many settings.



Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXVI
September 2024, 578 pages
ISBN: 978-3-031-73346-8
DOI: 10.1007/978-3-031-73347-5
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag, Berlin, Heidelberg


Author Tags

1. LiDAR
2. 3D object detection
3. Self-supervised learning

