DOI: 10.1145/3688866.3689126 · Research article · Open access

Multimodal Understanding: Investigating the Capabilities of Large Multimodal Models for Object Detection in XR Applications

Published: 28 October 2024

Abstract

Extended Reality (XR), encompassing augmented, virtual, and mixed reality, has the potential to offer unprecedented types of user interaction. An essential requirement is the automated understanding of a user's current scene, for instance, to provide information via visual overlays, to interact with the user through conversational interfaces, to give visual cues on directions, to explain the current scene, or to use the current scene or parts thereof in automated queries. Key to scene understanding, and thus to all these user interactions, is high-quality object detection based on multimodal content such as images, videos, and audio. Large Multimodal Models (LMMs) seamlessly process text in conjunction with such multimodal content. They are therefore an excellent basis for novel XR-based user interactions, provided they deliver the necessary detection quality.
This paper presents a two-stage analysis. In the first stage, the detection quality of two of the most prominent LMMs (LLaVA and KOSMOS-2) is compared with that of the object detector YOLO. The second stage uses Fooocus, a free and open-source AI image generator based on Stable Diffusion, to generate images from the scene descriptions derived in the first stage, and thereby evaluates the quality of those descriptions. The evaluation results show that each of LLaVA, KOSMOS-2, and YOLO can outperform the others depending on the specific research focus: LLaVA achieves the highest recall, KOSMOS-2 delivers the best precision, and YOLO runs much faster and leads with the best F1 score. Fooocus manages to create images containing the requested objects, although it occasionally omits or adds objects. Our study thus confirms our hypothesis that LMMs can be integrated into XR-based systems and used to investigate novel XR-based user interactions.
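For illustration, the comparison in the first stage can be thought of as a comparison between the objects a model reports for a scene and those annotated in the ground truth. The following is a minimal sketch of such a label-level precision/recall/F1 computation; the list-of-labels representation and the function name are assumptions made for illustration, not the paper's exact evaluation protocol.

```python
from collections import Counter

def detection_scores(predicted, ground_truth):
    """Label-level precision, recall, and F1 between two lists of object labels.

    Both arguments are plain lists of class names, e.g. ["chair", "chair", "table"].
    Matching is done on labels only (no bounding boxes), which is one plausible way
    to compare an LMM's textual scene description against annotated objects.
    """
    pred, gt = Counter(predicted), Counter(ground_truth)
    true_pos = sum((pred & gt).values())              # labels present in both lists
    precision = true_pos / max(sum(pred.values()), 1)
    recall = true_pos / max(sum(gt.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

# Example: the model reports an extra "plant" and misses one of the two chairs.
print(detection_scores(["chair", "table", "plant"], ["chair", "chair", "table"]))
# -> (0.666..., 0.666..., 0.666...)
```

In the setting described in the abstract, the predicted lists would come from parsing the LMMs' scene descriptions or from YOLO's detections, while the ground-truth lists would come from the annotated dataset.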

Published In

LGM3A '24: Proceedings of the 2nd Workshop on Large Generative Models Meet Multimodal Applications
October 2024, 41 pages
ISBN: 9798400711930
DOI: 10.1145/3688866
This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. extended reality
      2. llm
      3. lmm
      4. multimedia retrieval
      5. object detection

      Qualifiers

      • Research-article

      Funding Sources

      • Swiss State Secretariat for Education, Research and Innovation (SERI)

      Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

