LabelDistill: Label-Guided Cross-Modal Knowledge Distillation for Camera-Based 3D Object Detection

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Recent advancements in camera-based 3D object detection have introduced cross-modal knowledge distillation to bridge the performance gap with LiDAR 3D detectors, leveraging the precise geometric information in LiDAR point clouds. However, existing cross-modal knowledge distillation methods tend to overlook the inherent imperfections of LiDAR, such as the ambiguity of measurements on distant or occluded objects, which should not be transferred to the image detector. To mitigate these imperfections in the LiDAR teacher, we propose a novel method that leverages aleatoric uncertainty-free features from ground truth labels. In contrast to conventional label guidance approaches, we approximate the inverse function of the teacher’s head to effectively embed label inputs into feature space. This approach provides additional accurate guidance alongside the LiDAR teacher, thereby boosting the performance of the image detector. Additionally, we introduce feature partitioning, which effectively transfers knowledge from the teacher modality while preserving the distinctive features of the student, thereby maximizing the potential of both modalities. Experimental results show that our approach improves mAP and NDS by 5.1 points and 4.9 points, respectively, over the baseline model, demonstrating its effectiveness. The code is available at https://github.com/sanmin0312/LabelDistill.
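
As an illustration only, the feature partitioning idea can be sketched as a loss that constrains some channel groups of the student feature and leaves the rest free. The sketch below is one possible reading of the abstract, not the authors' implementation: the function names (`mse`, `partitioned_distill_loss`), the contiguous channel split, and the use of a plain mean squared error are all assumptions made for this example, and the features are toy Python lists rather than BEV tensors.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def partitioned_distill_loss(student, lidar_feat, label_feat):
    """Hypothetical partitioned distillation loss (assumed layout).

    The student feature vector is split into three contiguous groups:
      - the first len(lidar_feat) channels are matched to the LiDAR teacher,
      - the next len(label_feat) channels are matched to the label feature,
      - any remaining channels are left unconstrained, so the student can
        keep image-specific information there.
    """
    n_lidar = len(lidar_feat)
    n_label = len(label_feat)
    if n_lidar + n_label > len(student):
        raise ValueError("student has too few channels for this partition")

    lidar_part = student[:n_lidar]
    label_part = student[n_lidar:n_lidar + n_label]
    return mse(lidar_part, lidar_feat) + mse(label_part, label_feat)


# Toy example: 6 student channels, 2 guided by LiDAR, 2 by labels, 2 free.
student = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
lidar_feat = [1.0, 2.0]   # stand-in for a LiDAR teacher feature
label_feat = [3.0, 5.0]   # stand-in for a label-derived feature
loss = partitioned_distill_loss(student, lidar_feat, label_feat)
```

In this toy setting, changing the last two student channels leaves the loss unchanged, which is the property the partition is meant to provide: guidance from each source reaches only its own channel group.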

Y. Kim—Work done at Korea Advanced Institute of Science and Technology.

Acknowledgements

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) and the National Research Foundation of Korea (NRF) funded by the Korea government (MSIT) under Grants 2021-0-01176 and 2022R1A2C200494413.

Author information

Corresponding author

Correspondence to Sanmin Kim.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kim, S., Kim, Y., Hwang, S., Jeong, H., Kum, D. (2025). LabelDistill: Label-Guided Cross-Modal Knowledge Distillation for Camera-Based 3D Object Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15114. Springer, Cham. https://doi.org/10.1007/978-3-031-72992-8_2

  • DOI: https://doi.org/10.1007/978-3-031-72992-8_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72991-1

  • Online ISBN: 978-3-031-72992-8

  • eBook Packages: Computer Science, Computer Science (R0)