DOI: 10.1145/3411763.3451810

Automated Video Description for Blind and Low Vision Users

Published: 08 May 2021

Abstract

Video accessibility is crucial for blind and low vision users to engage equitably in education, employment, and entertainment. Although professional description services and tools for amateur description exist, human-generated description is expensive and time-consuming, and its production rate simply cannot match the pace of video production. To narrow this widening gap in video accessibility, we developed a system that automatically generates descriptions for videos and answers blind and low vision users’ queries about them. Results from a pilot study with eight blind video aficionados indicate the system’s promise for meeting the need for immediate access to videos and validate our approach of developing tools in partnership with the individuals we aim to benefit. Although the results must be interpreted with caution given the small sample size, participants overall reported high satisfaction with the system, and all preferred using it over having no support at all.
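
The abstract describes the system only at this high level. As a purely illustrative sketch of what such a pipeline could look like (not the authors’ implementation, which the poster does not detail), the following Python code samples frames from a video, generates a description for each frame with a pretrained captioning model, and answers a free-form user question with a visual question answering (VQA) model. OpenCV for frame extraction and the Hugging Face BLIP checkpoints are assumptions of this sketch.

```python
# Hypothetical pipeline sketch: per-frame description plus visual question
# answering. Model choices (BLIP) and the sampling strategy are assumptions,
# not details reported in the poster.
import cv2  # pip install opencv-python
from PIL import Image
from transformers import (
    BlipForConditionalGeneration,
    BlipForQuestionAnswering,
    BlipProcessor,
)

CAPTION_CKPT = "Salesforce/blip-image-captioning-base"
VQA_CKPT = "Salesforce/blip-vqa-base"

caption_processor = BlipProcessor.from_pretrained(CAPTION_CKPT)
caption_model = BlipForConditionalGeneration.from_pretrained(CAPTION_CKPT)
vqa_processor = BlipProcessor.from_pretrained(VQA_CKPT)
vqa_model = BlipForQuestionAnswering.from_pretrained(VQA_CKPT)


def sample_frames(video_path, every_n_seconds=5):
    """Yield one RGB PIL frame every `every_n_seconds` of video."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30  # fall back if FPS is missing
    step = max(1, int(fps * every_n_seconds))
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            # OpenCV decodes frames as BGR; convert to RGB for the models.
            yield Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    capture.release()


def describe(frame):
    """Generate a short natural-language description of one frame."""
    inputs = caption_processor(images=frame, return_tensors="pt")
    output = caption_model.generate(**inputs, max_new_tokens=30)
    return caption_processor.decode(output[0], skip_special_tokens=True)


def answer(frame, question):
    """Answer a free-form user question about one frame."""
    inputs = vqa_processor(images=frame, text=question, return_tensors="pt")
    output = vqa_model.generate(**inputs, max_new_tokens=10)
    return vqa_processor.decode(output[0], skip_special_tokens=True)


if __name__ == "__main__":
    # "clip.mp4" and the question are placeholders for illustration.
    for i, frame in enumerate(sample_frames("clip.mp4")):
        print(f"[{i * 5:>4}s] {describe(frame)}")
        print("   Q: What is on the screen? ->",
              answer(frame, "What is on the screen?"))
```

A deployed description system would additionally need to time its output to gaps in the dialogue, as audio description conventions require, and summarize across frames rather than describing each one independently.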

Published In

CHI EA '21: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems
May 2021
2965 pages
ISBN: 9781450380959
DOI: 10.1145/3411763
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Artificial Intelligence
  2. Blind and Low Vision Users
  3. Video Accessibility
  4. Video Description

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

CHI '21

Acceptance Rates

Overall acceptance rate: 6,164 of 23,696 submissions (26%)

Article Metrics

  • Downloads (last 12 months): 117
  • Downloads (last 6 weeks): 9
Reflects downloads up to 18 Nov 2024

Cited By

  • (2024) A Lightweight Visual Understanding System for Enhanced Assistance to the Visually Impaired Using an Embedded Platform. Diyala Journal of Engineering Sciences, pp. 146-162. DOI: 10.24237/djes.2024.17310. Online publication date: 1-Sep-2024.
  • (2024) “It’s Kind of Context Dependent”: Understanding Blind and Low Vision People’s Video Accessibility Preferences Across Viewing Scenarios. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-20. DOI: 10.1145/3613904.3642238. Online publication date: 11-May-2024.
  • (2023) Detecting Deceptive Dark-Pattern Web Advertisements for Blind Screen-Reader Users. Journal of Imaging 9, 11 (239). DOI: 10.3390/jimaging9110239. Online publication date: 6-Nov-2023.
  • (2023) Understanding Challenges and Opportunities in Body Movement Education of People who are Blind or have Low Vision. Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 1-19. DOI: 10.1145/3597638.3608409. Online publication date: 22-Oct-2023.
  • (2023) The Potential of a Visual Dialogue Agent In a Tandem Automated Audio Description System for Videos. Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 1-17. DOI: 10.1145/3597638.3608402. Online publication date: 22-Oct-2023.
  • (2023) Beyond Audio Description: Exploring 360° Video Accessibility with Blind and Low Vision Users Through Collaborative Creation. Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 1-17. DOI: 10.1145/3597638.3608381. Online publication date: 22-Oct-2023.
  • (2023) “Dump it, Destroy it, Send it to Data Heaven”: Blind People’s Expectations for Visual Privacy in Visual Assistance Technologies. Proceedings of the 20th International Web for All Conference, pp. 134-147. DOI: 10.1145/3587281.3587296. Online publication date: 30-Apr-2023.
  • (2023) Object-Based Access: Enhancing Accessibility with Data-Driven Media. Proceedings of the 2023 ACM International Conference on Interactive Media Experiences, pp. 402-406. DOI: 10.1145/3573381.3596500. Online publication date: 12-Jun-2023.
  • (2023) Accessibility Research in Digital Audiovisual Media: What Has Been Achieved and What Should Be Done Next? Proceedings of the 2023 ACM International Conference on Interactive Media Experiences, pp. 94-114. DOI: 10.1145/3573381.3596159. Online publication date: 12-Jun-2023.
