DOI: 10.1145/3688866.3689126 · Research article · Open access

Multimodal Understanding: Investigating the Capabilities of Large Multimodal Models for Object Detection in XR Applications

Published: 28 October 2024

Abstract

Extended Reality (XR), encompassing augmented, virtual, and mixed reality, has the potential to offer unprecedented types of user interaction. An essential requirement is the automated understanding of a user's current scene, for instance, to provide information via visual overlays, to interact with the user through conversational interfaces, to give visual cues on directions, to explain the current scene, or to use the current scene or parts thereof in automated queries. Key to scene understanding, and thus to all these user interactions, is high-quality object detection based on multimodal content such as images, videos, and audio. Large Multimodal Models (LMMs) seamlessly process text in conjunction with such multimodal content. They are therefore an excellent basis for novel XR-based user interactions, provided they deliver the necessary detection quality.
This paper presents a two-stage analysis. In the first stage, the detection quality of two of the most prominent LMMs (LLaVA and KOSMOS-2) is compared with that of the object detector YOLO. The second stage uses Fooocus, a free and open-source AI image generator based on Stable Diffusion, to generate images from the scene descriptions derived in the first stage, and thereby evaluates the quality of those descriptions. The evaluation results show that each of LLaVA, KOSMOS-2, and YOLO can outperform the others depending on the specific research focus: LLaVA achieves the highest recall, KOSMOS-2 delivers the best precision, and YOLO runs much faster and leads with the best F1 score. Fooocus manages to create images containing the requested objects, although it occasionally omits or adds objects. Our study thus confirms our hypothesis that LMMs can be integrated into XR-based systems and used to investigate novel XR-based user interactions.
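For illustration, the comparison in the first stage can be thought of as a comparison between the objects a model reports for a scene and those annotated in the ground truth. The following is a minimal sketch of such a label-level precision/recall/F1 computation; the list-of-labels representation and the function name are assumptions made for illustration, not the paper's exact evaluation protocol.

```python
from collections import Counter

def detection_scores(predicted, ground_truth):
    """Label-level precision, recall, and F1 between two lists of object labels.

    Both arguments are plain lists of class names, e.g. ["chair", "chair", "table"].
    Matching is done on labels only (no bounding boxes), which is one plausible way
    to compare an LMM's textual scene description against annotated objects.
    """
    pred, gt = Counter(predicted), Counter(ground_truth)
    true_pos = sum((pred & gt).values())              # labels present in both lists
    precision = true_pos / max(sum(pred.values()), 1)
    recall = true_pos / max(sum(gt.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

# Example: the model reports an extra "plant" and misses one of the two chairs.
print(detection_scores(["chair", "table", "plant"], ["chair", "chair", "table"]))
# -> (0.666..., 0.666..., 0.666...)
```

In the setting described in the abstract, the predicted lists would come from parsing the LMMs' scene descriptions or from YOLO's detections, while the ground-truth lists would come from the annotated dataset.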

Published In

LGM3A '24: Proceedings of the 2nd Workshop on Large Generative Models Meet Multimodal Applications
October 2024, 41 pages
ISBN: 9798400711930
DOI: 10.1145/3688866
This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. extended reality
      2. llm
      3. lmm
      4. multimedia retrieval
      5. object detection

      Qualifiers

      • Research-article

      Funding Sources

      • Swiss State Secretariat for Education, Research and Innovation (SERI)

      Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

