DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15142))

Abstract

Camera-based Bird’s-Eye-View (BEV) perception often faces a trade-off between 3D-to-2D and 2D-to-3D view transformation (VT). The 3D-to-2D VT typically employs a resource-intensive Transformer to establish robust correspondences between 3D and 2D features, while the 2D-to-3D VT uses the Lift-Splat-Shoot (LSS) pipeline for real-time applications, potentially missing distant information. To address these limitations, we propose DualBEV, a unified framework that applies a shared feature transformation incorporating three probabilistic measurements for both strategies. By considering dual-view correspondences in one stage, DualBEV effectively bridges the gap between these strategies, harnessing their individual strengths. Our method achieves state-of-the-art performance without a Transformer, delivering efficiency comparable to the LSS approach, with 55.2% mAP and 63.4% NDS on the nuScenes test set. Code is available at https://github.com/PeidongLi/DualBEV.
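For readers unfamiliar with the 2D-to-3D strategy the abstract contrasts with Transformer-based 3D-to-2D VT, the sketch below illustrates the core of an LSS-style lift-splat transformation: each pixel's feature is weighted by a predicted categorical depth distribution ("lift") and the resulting frustum features are scatter-summed into a BEV grid ("splat"). All shapes, names, and the random cell assignment standing in for real camera geometry are hypothetical illustrations, not code from DualBEV.

```python
import numpy as np

H, W = 4, 6   # image feature map height/width
C = 8         # feature channels
D = 16        # number of discrete depth bins

rng = np.random.default_rng(0)
feats = rng.standard_normal((H, W, C))         # per-pixel image features
depth_logits = rng.standard_normal((H, W, D))  # per-pixel depth logits

# Lift: softmax the logits into a categorical depth distribution per pixel,
# then take the outer product with the feature vector -> a frustum of
# depth-weighted 3D features of shape (H, W, D, C).
depth_prob = np.exp(depth_logits)
depth_prob /= depth_prob.sum(axis=-1, keepdims=True)
frustum = depth_prob[..., None] * feats[..., None, :]

# Splat: scatter-sum frustum points that fall into the same BEV cell.
# A toy random cell assignment replaces the real camera-geometry projection.
BEV = 5
cell = rng.integers(0, BEV * BEV, size=(H, W, D))
bev = np.zeros((BEV * BEV, C))
np.add.at(bev, cell.reshape(-1), frustum.reshape(-1, C))
bev = bev.reshape(BEV, BEV, C)

print(bev.shape)  # (5, 5, 8)
```

Because the splat is a pure scatter-sum, total feature mass is conserved from frustum to BEV grid, which is what makes pooling-based implementations of this step (e.g. BEVPool-style kernels) fast enough for real-time use.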



Acknowledgements

The authors would like to thank reviewers for their detailed comments and instructive suggestions. This research is supported by National Key R&D Program of China (2022YFB4300300).

Author information

Corresponding author

Correspondence to Dixiao Cui.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, P., Shen, W., Huang, Q., Cui, D. (2025). DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15142. Springer, Cham. https://doi.org/10.1007/978-3-031-72907-2_17

  • DOI: https://doi.org/10.1007/978-3-031-72907-2_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72906-5

  • Online ISBN: 978-3-031-72907-2

  • eBook Packages: Computer Science, Computer Science (R0)
