Abstract
In this paper, we present a novel method for reliable frontier selection in Zero-Shot Object Goal Navigation (ZS-OGN), enhancing robotic navigation systems with foundation models to improve commonsense reasoning in indoor environments. Our approach introduces a multi-expert decision framework to address the nonsensical or irrelevant reasoning often produced by foundation model-based systems. The method comprises two key components: Diversified Expert Frontier Analysis (DEFA) and Consensus Decision Making (CDM). DEFA employs three expert models (furniture arrangement, room type analysis, and visual scene reasoning), while CDM aggregates their outputs, prioritizing unanimous or majority consensus to reach more reliable decisions. Our method achieves state-of-the-art performance on the RoboTHOR and HM3D datasets: it excels at navigating towards untrained objects or goals and outperforms a range of baselines, demonstrating adaptability to dynamic real-world conditions and strong generalization.
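To make the CDM step concrete, below is a minimal sketch of consensus-based frontier selection as described in the abstract, assuming each expert exposes a scoring function over candidate frontiers. The expert names, scorers, frontier representation, and tie-breaking rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Consensus Decision Making (CDM) over expert frontier
# rankings. Everything here (integer frontier ids, the scorer interface,
# the mean-score tie-break) is a hypothetical illustration.
from collections import Counter
from typing import Callable, Dict, List

Frontier = int  # hypothetical: candidate frontiers identified by integer ids


def consensus_select(
    frontiers: List[Frontier],
    experts: Dict[str, Callable[[Frontier], float]],
) -> Frontier:
    """Pick the frontier that the most experts rank first.

    Each expert scores every candidate and votes for its top choice.
    Unanimous or majority agreement wins outright; otherwise we fall
    back to the best summed score across all experts.
    """
    # Each expert's vote is its highest-scoring frontier.
    votes = {name: max(frontiers, key=score) for name, score in experts.items()}
    tally = Counter(votes.values())
    top, count = tally.most_common(1)[0]
    if count > len(experts) / 2:  # covers unanimous and majority consensus
        return top
    # No majority: aggregate scores from all experts instead.
    return max(
        frontiers,
        key=lambda f: sum(score(f) for score in experts.values()),
    )


# Hypothetical usage with the paper's three expert roles stubbed as callables:
experts = {
    "furniture_arrangement": lambda f: 1.0 if f == 2 else 0.1,
    "room_type_analysis": lambda f: 0.9 if f == 2 else 0.2,
    "visual_scene_reasoning": lambda f: 0.8 if f == 1 else 0.3,
}
print(consensus_select([0, 1, 2], experts))  # -> 2 (majority consensus)
```

In this sketch, requiring a strict majority before committing to a single expert's choice mirrors the abstract's emphasis on prioritizing agreement to filter out an individual expert's nonsensical reasoning.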
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yuan, S., Unlu, H.U., Huang, H., Wen, C., Tzes, A., Fang, Y. (2025). Exploring the Reliability of Foundation Model-Based Frontier Selection in Zero-Shot Object Goal Navigation. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15330. Springer, Cham. https://doi.org/10.1007/978-3-031-78113-1_9
DOI: https://doi.org/10.1007/978-3-031-78113-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78112-4
Online ISBN: 978-3-031-78113-1
eBook Packages: Computer Science, Computer Science (R0)