Abstract
Camera-based Bird’s-Eye-View (BEV) perception often faces a trade-off between 3D-to-2D and 2D-to-3D view transformation (VT). The 3D-to-2D VT typically employs a resource-intensive Transformer to establish robust correspondences between 3D and 2D features, while the 2D-to-3D VT relies on the Lift-Splat-Shoot (LSS) pipeline for real-time applications, potentially missing distant information. To address these limitations, we propose DualBEV, a unified framework that applies a shared feature transformation, incorporating three probabilistic measurements, to both strategies. By considering dual-view correspondences in a single stage, DualBEV effectively bridges the gap between the two strategies, harnessing their individual strengths. Our method achieves state-of-the-art performance without a Transformer, delivering efficiency comparable to the LSS approach, with 55.2% mAP and 63.4% NDS on the nuScenes test set. Code is available at https://github.com/PeidongLi/DualBEV.
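The LSS-style 2D-to-3D lifting mentioned in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, tensor shapes, and the precomputed pixel-to-BEV-cell mapping are all assumptions; a real pipeline derives that mapping from camera intrinsics and extrinsics and uses an optimized pooling kernel (e.g. BEVPoolv2).

```python
import numpy as np

def lss_lift_splat(feats, depth_logits, cell_xy, grid_size):
    """Illustrative LSS-style 2D-to-3D view transformation.

    feats:        (HW, C)    image features, one row per pixel
    depth_logits: (HW, D)    per-pixel depth-distribution logits
    cell_xy:      (HW, D, 2) BEV cell index (x, y) for each (pixel, depth-bin)
                             pair; assumed precomputed from camera geometry
    grid_size:   (X, Y)      BEV grid shape
    returns:     (X, Y, C)   BEV feature map
    """
    # "Lift": softmax over depth bins gives a categorical depth
    # distribution per pixel.
    d = np.exp(depth_logits - depth_logits.max(-1, keepdims=True))
    d /= d.sum(-1, keepdims=True)                       # (HW, D)

    # Outer product: weight each pixel feature by its depth probability,
    # producing a frustum of 3D feature points.
    frustum = d[..., None] * feats[:, None, :]          # (HW, D, C)

    # "Splat": sum-pool every frustum point into its BEV cell.
    C = feats.shape[1]
    bev = np.zeros((*grid_size, C))
    flat_idx = cell_xy[..., 0] * grid_size[1] + cell_xy[..., 1]  # (HW, D)
    np.add.at(bev.reshape(-1, C), flat_idx.ravel(),
              frustum.reshape(-1, C))
    return bev
```

With uniform depth logits, each pixel spreads its feature evenly over its depth bins, so the total mass splatted into the BEV grid equals the number of pixels times their feature magnitude; this is the property that lets distant, low-probability bins fade out in practice.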
Acknowledgements
The authors would like to thank the reviewers for their detailed comments and instructive suggestions. This research was supported by the National Key R&D Program of China (2022YFB4300300).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, P., Shen, W., Huang, Q., Cui, D. (2025). DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15142. Springer, Cham. https://doi.org/10.1007/978-3-031-72907-2_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72906-5
Online ISBN: 978-3-031-72907-2
eBook Packages: Computer Science (R0)