URFormer: Unified Representation LiDAR-Camera 3D Object Detection with Transformer

Guoxin Zhang, Jun Xie, Lin Liu, Zhepeng Wang, Kuihe Yang & Ziying Song

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14427)

Included in the following conference series: Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Current LiDAR-camera 3D detectors adopt a 3D-2D design pattern. However, this paradigm ignores the dimensional gap between heterogeneous modalities (e.g., coordinate system, data distribution), which makes it difficult to combine the geometric and semantic information of the two modalities. Moreover, the conventional 3D convolutional neural network (3D CNN) backbone has a limited receptive field, which hinders interaction between multi-modal features, especially when capturing long-range object context. To this end, we propose URFormer, a Unified Representation Transformer-based multi-modal 3D detector with a better representation scheme and cross-modality interaction. It consists of three crucial components. First, we propose the Depth-Aware Lift Module (DALM), which exploits depth information in the 2D modality and lifts the 2D representation into 3D at the pixel level, naturally unifying the inconsistent multi-modal representations. Second, we design a Sparse Transformer (SPTR) to enlarge the effective receptive field and capture long-range object semantic features for better interaction between multi-modal features. Finally, we design Unified Representation Fusion (URFusion) to integrate cross-modality features in a fine-grained manner. Extensive experiments on the KITTI benchmark demonstrate the effectiveness of our method, which shows remarkable performance compared with state-of-the-art methods.
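
The abstract does not spell out how DALM performs its pixel-level lift, so the following is only a minimal sketch of the generic idea of lifting 2D pixels into 3D with depth. It uses standard pinhole back-projection and is not the authors' module. The function name `lift_pixels_to_3d`, the dense depth map input, and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def lift_pixels_to_3d(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project every pixel with a valid depth into camera-frame 3D.

    Hypothetical helper: `depth[v][u]` is the depth (in metres) at image
    row v and column u; non-positive values are treated as missing.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # skip pixels without a depth estimate
            # Invert the pinhole projection u = fx * x / z + cx (same for v).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with two valid pixels.
toy_depth = [[2.0, 0.0],
             [0.0, 4.0]]
toy_points = lift_pixels_to_3d(toy_depth, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

Once image pixels are expressed as 3D points in this way, they share a coordinate system with LiDAR points (after an extrinsic transform, omitted here), which is the sense in which a lift of this kind can unify the two representations.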

Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities (2023YJS019) and the STI 2030-Major Projects under Grant 2021ZD0201404 (Funded by Jun Xie and Zhepeng Wang from Lenovo Research).

Author information

Authors and Affiliations

School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, 050018, China
Guoxin Zhang & Kuihe Yang
Lenovo Research, Beijing, 100094, China
Jun Xie & Zhepeng Wang
School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, China
Lin Liu & Ziying Song

Corresponding author

Correspondence to Zhepeng Wang.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

About this paper

Cite this paper

Zhang, G., Xie, J., Liu, L., Wang, Z., Yang, K., Song, Z. (2024). URFormer: Unified Representation LiDAR-Camera 3D Object Detection with Transformer. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14427. Springer, Singapore. https://doi.org/10.1007/978-981-99-8435-0_32

DOI: https://doi.org/10.1007/978-981-99-8435-0_32
Published: 24 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8434-3
Online ISBN: 978-981-99-8435-0
eBook Packages: Computer Science, Computer Science (R0)
