Abstract
Camera-based Bird’s-Eye-View (BEV) perception often faces a trade-off between 3D-to-2D and 2D-to-3D view transformation (VT). The 3D-to-2D VT typically employs a resource-intensive Transformer to establish robust correspondences between 3D and 2D features, while the 2D-to-3D VT relies on the Lift-Splat-Shoot (LSS) pipeline for real-time applications, potentially missing distant information. To address these limitations, we propose DualBEV, a unified framework that applies a shared feature transformation, incorporating three probabilistic measurements, to both strategies. By considering dual-view correspondences in a single stage, DualBEV effectively bridges the gap between the two strategies, harnessing their individual strengths. Our method achieves state-of-the-art performance without a Transformer, delivering efficiency comparable to the LSS approach, with 55.2% mAP and 63.4% NDS on the nuScenes test set. Code is available at https://github.com/PeidongLi/DualBEV.
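The LSS-style 2D-to-3D lifting mentioned in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, tensor shapes, and the precomputed pixel-to-BEV-cell mapping are all assumptions; a real pipeline derives that mapping from camera intrinsics and extrinsics and uses an optimized pooling kernel (e.g. BEVPoolv2).

```python
import numpy as np

def lss_lift_splat(feats, depth_logits, cell_xy, grid_size):
    """Illustrative LSS-style 2D-to-3D view transformation.

    feats:        (HW, C)    image features, one row per pixel
    depth_logits: (HW, D)    per-pixel depth-distribution logits
    cell_xy:      (HW, D, 2) BEV cell index (x, y) for each (pixel, depth-bin)
                             pair; assumed precomputed from camera geometry
    grid_size:   (X, Y)      BEV grid shape
    returns:     (X, Y, C)   BEV feature map
    """
    # "Lift": softmax over depth bins gives a categorical depth
    # distribution per pixel.
    d = np.exp(depth_logits - depth_logits.max(-1, keepdims=True))
    d /= d.sum(-1, keepdims=True)                       # (HW, D)

    # Outer product: weight each pixel feature by its depth probability,
    # producing a frustum of 3D feature points.
    frustum = d[..., None] * feats[:, None, :]          # (HW, D, C)

    # "Splat": sum-pool every frustum point into its BEV cell.
    C = feats.shape[1]
    bev = np.zeros((*grid_size, C))
    flat_idx = cell_xy[..., 0] * grid_size[1] + cell_xy[..., 1]  # (HW, D)
    np.add.at(bev.reshape(-1, C), flat_idx.ravel(),
              frustum.reshape(-1, C))
    return bev
```

With uniform depth logits, each pixel spreads its feature evenly over its depth bins, so the total mass splatted into the BEV grid equals the number of pixels times their feature magnitude; this is the property that lets distant, low-probability bins fade out in practice.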
Acknowledgements
The authors would like to thank the reviewers for their detailed comments and instructive suggestions. This research was supported by the National Key R&D Program of China (2022YFB4300300).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, P., Shen, W., Huang, Q., Cui, D. (2025). DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15142. Springer, Cham. https://doi.org/10.1007/978-3-031-72907-2_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72906-5
Online ISBN: 978-3-031-72907-2
eBook Packages: Computer Science (R0)