
DOI: 10.1145/3656650.3656677
Short paper
Open access

VQAsk: a multimodal Android GPT-based application to help blind users visualize pictures

Published: 03 June 2024

Abstract

VQAsk is an Android application that helps visually impaired users obtain information about images framed by their smartphones. It enables users to interact with their photographs or the surrounding visual environment through a question-and-answer interface that integrates three modalities: speech interaction, haptic feedback that facilitates navigation and interaction, and sight. VQAsk is primarily designed to help visually impaired users mentally visualize what they cannot see, but it can also accommodate users with varying levels of visual ability. To this end, it embeds advanced NLP and Computer Vision techniques to answer user questions about the image on the phone screen. Image processing is enhanced by background removal, using advanced segmentation models that identify the important image elements. The outcomes of a testing phase confirmed the value of this project as a first attempt at using AI-supported multimodality to enhance visually impaired users’ experience.
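The pipeline the abstract describes (transcribe a spoken question, isolate the salient image content via segmentation, query a vision-language model, and return a spoken answer) could be sketched as follows. This is an illustrative skeleton with hypothetical stub stages, not the authors' implementation; in a real app each stub would call an ASR engine, a segmentation model, a GPT-style VQA backend, and a TTS engine respectively.

```python
# Illustrative VQAsk-style pipeline with hypothetical stubs (not the paper's code).

def transcribe_speech(audio: bytes) -> str:
    """Stub for speech-to-text; a real app would call an ASR engine."""
    return "What color is the cup?"

def remove_background(image: bytes) -> bytes:
    """Stub for the segmentation step that isolates important image elements."""
    return image

def answer_question(image: bytes, question: str) -> str:
    """Stub for the vision-language (VQA) model queried with image + question."""
    return "The cup is red."

def synthesize_speech(text: str) -> bytes:
    """Stub for text-to-speech, producing audio for the spoken answer."""
    return text.encode("utf-8")

def vqask_pipeline(image: bytes, audio: bytes) -> dict:
    """Chain the four stages: ASR -> segmentation -> VQA -> TTS."""
    question = transcribe_speech(audio)
    focused = remove_background(image)
    answer = answer_question(focused, question)
    return {
        "question": question,
        "answer": answer,
        "speech": synthesize_speech(answer),
    }

result = vqask_pipeline(b"<jpeg bytes>", b"<pcm audio>")
print(result["answer"])
```

The design choice worth noting is that segmentation runs before the VQA query, so the model attends to the foreground object the user is most likely asking about rather than to background clutter.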



Published In

AVI '24: Proceedings of the 2024 International Conference on Advanced Visual Interfaces
June 2024
578 pages
ISBN:9798400717642
DOI:10.1145/3656650
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Visual Question Answering
  2. natural language processing and computer vision for scene interpretation
  3. visually impaired users

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

AVI 2024

Acceptance Rates

AVI '24 paper acceptance rate: 21 of 82 submissions (26%)
Overall acceptance rate: 128 of 490 submissions (26%)


Article Metrics

  • Total citations: 0
  • Total downloads: 251
  • Downloads (last 12 months): 251
  • Downloads (last 6 weeks): 63
Reflects downloads up to 18 Nov 2024
