

MFFNet: multimodal feature fusion network for point cloud semantic segmentation

Original article · Published in The Visual Computer

Abstract

We introduce a multimodal feature fusion network (MFFNet) for 3D point cloud semantic segmentation. Unlike previous methods that learn directly from colored point clouds (XYZRGB), MFFNet transforms point clouds into 2D RGB image and frequency image representations for efficient multimodal feature fusion. For each point, MFFNet learns a weighted orthogonal projection that softly projects the surrounding points onto regular 2D grids, so that standard 2D convolutions can be applied for efficient semantic feature learning. We then fuse the 2D semantic features into the 3D point cloud features with a multimodal feature fusion (MFF) module. The MFF module exploits high-level features from the 2D RGB and frequency images to strengthen the intrinsic correlation and discriminability of the structure features of the point cloud. In particular, discriminative descriptions are quantified and used as a local soft attention mask to further reinforce the structure features of the semantic categories. We evaluate the proposed method on the S3DIS and ScanNet datasets; experimental results and comparisons with four backbone methods show that our framework performs favorably.
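To make the fusion step concrete, here is a minimal sketch of how a soft-attention-based multimodal fusion of the kind described above could look. The module name `MFFModule`, the channel widths, and the specific layers are our assumptions for illustration, not the authors' actual MFF implementation.

```python
import torch
import torch.nn as nn

class MFFModule(nn.Module):
    """Illustrative sketch: fuse per-point 2D features (gathered from the
    RGB and frequency image branches) into 3D point features through a
    learned soft attention mask. Layer choices are assumptions."""

    def __init__(self, c3d: int, c2d: int):
        super().__init__()
        # Map the concatenated 2D features to the 3D channel width.
        self.proj2d = nn.Sequential(nn.Linear(2 * c2d, c3d), nn.ReLU(inplace=True))
        # Per-point, per-channel soft attention mask in [0, 1].
        self.mask = nn.Sequential(nn.Linear(2 * c3d, c3d), nn.Sigmoid())

    def forward(self, f3d, f_rgb, f_freq):
        # f3d:    (N, c3d) 3D point features
        # f_rgb:  (N, c2d) RGB-image features sampled back onto the points
        # f_freq: (N, c2d) frequency-image features sampled back onto the points
        f2d = self.proj2d(torch.cat([f_rgb, f_freq], dim=-1))
        attn = self.mask(torch.cat([f3d, f2d], dim=-1))  # soft attention mask
        return f3d + attn * f2d                          # attention-weighted fusion

# Random tensors stand in for real features of 1024 points.
mff = MFFModule(c3d=64, c2d=128)
fused = mff(torch.randn(1024, 64), torch.randn(1024, 128), torch.randn(1024, 128))
print(fused.shape)  # torch.Size([1024, 64])
```

The residual form `f3d + attn * f2d` lets the mask decide, per point and channel, how much 2D evidence to inject while preserving the original 3D structure features.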




Data availability

The data used to support the findings of this study are available from the corresponding author upon request.


Author information

Correspondence to Jie Guo or Yanwen Guo.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ren, D., Li, J., Wu, Z. et al. MFFNet: multimodal feature fusion network for point cloud semantic segmentation. Vis Comput 40, 5155–5167 (2024). https://doi.org/10.1007/s00371-023-02907-w

