
Exploring the Reliability of Foundation Model-Based Frontier Selection in Zero-Shot Object Goal Navigation

  • Conference paper
Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15330)


Abstract

In this paper, we present a novel method for reliable frontier selection in Zero-Shot Object Goal Navigation (ZS-OGN), enhancing robotic navigation systems with foundation models to improve commonsense reasoning in indoor environments. Our approach introduces a multi-expert decision framework to address the nonsensical or irrelevant reasoning often seen in foundation model-based systems. The method comprises two key components: Diversified Expert Frontier Analysis (DEFA) and Consensus Decision Making (CDM). DEFA utilizes three expert models—furniture arrangement, room type analysis, and visual scene reasoning—while CDM aggregates their outputs, prioritizing unanimous or majority consensus for more reliable decisions. Demonstrating state-of-the-art performance on the RoboTHOR and HM3D datasets, our method excels at navigating towards untrained objects or goals and outperforms various baselines, showcasing its adaptability to dynamic real-world conditions and superior generalization capabilities.
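The Consensus Decision Making (CDM) stage described above, which prioritizes unanimous agreement and then majority agreement among the three expert models, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, frontier IDs, and the tie-handling fallback are hypothetical.

```python
# Hypothetical sketch of CDM over three expert frontier votes, assuming
# each expert (furniture arrangement, room type, visual scene reasoning)
# returns the ID of its preferred frontier.
from collections import Counter


def consensus_decision(expert_choices, fallback=None):
    """Select a frontier by unanimous, then majority, vote among experts.

    expert_choices: list of frontier IDs, one per expert.
    fallback: returned when no majority exists (hypothetical tie handling).
    """
    counts = Counter(expert_choices)
    choice, votes = counts.most_common(1)[0]
    if votes == len(expert_choices):       # unanimous consensus
        return choice
    if votes > len(expert_choices) // 2:   # majority consensus
        return choice
    return fallback                        # no consensus: defer to fallback


# e.g. two of three experts prefer frontier "F2":
print(consensus_decision(["F2", "F2", "F1"]))  # majority -> F2
```

In a full system the fallback branch would presumably invoke some further arbitration (e.g. re-querying the experts or a default exploration heuristic) rather than returning a sentinel.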




Author information

Corresponding author

Correspondence to Yi Fang.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yuan, S., Unlu, H.U., Huang, H., Wen, C., Tzes, A., Fang, Y. (2025). Exploring the Reliability of Foundation Model-Based Frontier Selection in Zero-Shot Object Goal Navigation. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15330. Springer, Cham. https://doi.org/10.1007/978-3-031-78113-1_9


  • DOI: https://doi.org/10.1007/978-3-031-78113-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78112-4

  • Online ISBN: 978-3-031-78113-1

  • eBook Packages: Computer Science (R0)
