Abstract
In this paper, we present a novel method for reliable frontier selection in Zero-Shot Object Goal Navigation (ZS-OGN), enhancing robotic navigation systems with foundation models to improve commonsense reasoning in indoor environments. Our approach introduces a multi-expert decision framework to address the nonsensical or irrelevant reasoning often produced by foundation model-based systems. The method comprises two key components: Diversified Expert Frontier Analysis (DEFA) and Consensus Decision Making (CDM). DEFA employs three expert models (furniture arrangement, room type analysis, and visual scene reasoning), while CDM aggregates their outputs, prioritizing unanimous or majority consensus to reach more reliable decisions. Our method achieves state-of-the-art performance on the RoboTHOR and HM3D datasets: it excels at navigating towards untrained objects or goals and outperforms a range of baselines, demonstrating adaptability to dynamic real-world conditions and strong generalization.
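To make the CDM step concrete, below is a minimal sketch of consensus-based frontier selection as described in the abstract, assuming each expert exposes a scoring function over candidate frontiers. The expert names, scorers, frontier representation, and tie-breaking rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Consensus Decision Making (CDM) over expert frontier
# rankings. Everything here (integer frontier ids, the scorer interface,
# the mean-score tie-break) is a hypothetical illustration.
from collections import Counter
from typing import Callable, Dict, List

Frontier = int  # hypothetical: candidate frontiers identified by integer ids


def consensus_select(
    frontiers: List[Frontier],
    experts: Dict[str, Callable[[Frontier], float]],
) -> Frontier:
    """Pick the frontier that the most experts rank first.

    Each expert scores every candidate and votes for its top choice.
    Unanimous or majority agreement wins outright; otherwise we fall
    back to the best summed score across all experts.
    """
    # Each expert's vote is its highest-scoring frontier.
    votes = {name: max(frontiers, key=score) for name, score in experts.items()}
    tally = Counter(votes.values())
    top, count = tally.most_common(1)[0]
    if count > len(experts) / 2:  # covers unanimous and majority consensus
        return top
    # No majority: aggregate scores from all experts instead.
    return max(
        frontiers,
        key=lambda f: sum(score(f) for score in experts.values()),
    )


# Hypothetical usage with the paper's three expert roles stubbed as callables:
experts = {
    "furniture_arrangement": lambda f: 1.0 if f == 2 else 0.1,
    "room_type_analysis": lambda f: 0.9 if f == 2 else 0.2,
    "visual_scene_reasoning": lambda f: 0.8 if f == 1 else 0.3,
}
print(consensus_select([0, 1, 2], experts))  # -> 2 (majority consensus)
```

In this sketch, requiring a strict majority before committing to a single expert's choice mirrors the abstract's emphasis on prioritizing agreement to filter out an individual expert's nonsensical reasoning.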
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yuan, S., Unlu, H.U., Huang, H., Wen, C., Tzes, A., Fang, Y. (2025). Exploring the Reliability of Foundation Model-Based Frontier Selection in Zero-Shot Object Goal Navigation. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15330. Springer, Cham. https://doi.org/10.1007/978-3-031-78113-1_9
DOI: https://doi.org/10.1007/978-3-031-78113-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78112-4
Online ISBN: 978-3-031-78113-1
eBook Packages: Computer Science, Computer Science (R0)