Toward Automatic Audio Description Generation for Accessible Videos

Published: 07 May 2021

Abstract

Video accessibility is essential for people with visual impairments. Audio descriptions narrate what is happening on-screen, e.g., physical actions, facial expressions, and scene changes. Generating high-quality audio descriptions requires substantial manual effort [50]. To address this accessibility obstacle, we built a system that analyzes the audiovisual content of a video and generates audio descriptions. The system consists of three modules: AD insertion time prediction, AD generation, and AD optimization. We evaluated the quality of our system on five types of videos through qualitative studies with 20 sighted users and 12 users who were blind or visually impaired. Our findings revealed how audio description preferences varied with user type and video type. Based on our analysis, we provide recommendations for the development of future audio description generation technologies.
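To make the three-module pipeline named above concrete, the following sketch shows one way the stages could fit together in code. It is a minimal illustration under stated assumptions, not the authors' implementation: the speech-gap heuristic, the placeholder caption generator, the 1.5 s gap threshold, and the 2.5 words-per-second narration budget are all hypothetical.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AudioDescription:
    start: float  # insertion time, in seconds from the start of the video
    text: str     # description to be narrated at that time


def predict_insertion_times(speech_segments: List[Tuple[float, float]],
                            video_length: float,
                            min_gap: float = 1.5) -> List[Tuple[float, float]]:
    """Module 1 (AD insertion time prediction): find pauses between detected
    speech segments that are long enough to host a spoken description."""
    gaps, prev_end = [], 0.0
    for start, end in sorted(speech_segments):
        if start - prev_end >= min_gap:
            gaps.append((prev_end, start))
        prev_end = max(prev_end, end)
    if video_length - prev_end >= min_gap:
        gaps.append((prev_end, video_length))
    return gaps


def generate_description(gap: Tuple[float, float]) -> str:
    """Module 2 (AD generation): caption the visual content around the gap,
    e.g., with a dense video-captioning model. Placeholder text only."""
    return "A person walks across the room and opens the window."


def optimize_description(text: str, max_seconds: float,
                         words_per_second: float = 2.5) -> str:
    """Module 3 (AD optimization): trim a candidate description so that it
    can be narrated within the available gap."""
    budget = max(1, int(max_seconds * words_per_second))
    return " ".join(text.split()[:budget])


def describe(speech_segments: List[Tuple[float, float]],
             video_length: float) -> List[AudioDescription]:
    """End-to-end pass: predict gaps, generate one description per gap, and
    compress each description to fit its gap."""
    results = []
    for start, end in predict_insertion_times(speech_segments, video_length):
        raw = generate_description((start, end))
        results.append(AudioDescription(start, optimize_description(raw, end - start)))
    return results


if __name__ == "__main__":
    # Dialogue from 2-10 s and 14-20 s in a 30 s clip leaves three usable gaps.
    print(describe([(2.0, 10.0), (14.0, 20.0)], 30.0))

In this sketch, insertion points are simply silences between detected speech segments, generation is stubbed where a dense video-captioning model would run, and optimization truncates the candidate sentence to fit the time budget; a real system would replace each placeholder with learned components.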

Supplementary Material

VTT File (3411764.3445347_videopreviewcaptions.vtt)
Supplementary Materials (3411764.3445347_supplementalmaterials.zip)
MP4 File (3411764.3445347_videopreview.mp4)
Preview video

References

[1]
2019. Katna: Tool for automating common video keyframe extraction and Image Autocrop tasks. https://katna.readthedocs.io/.
[2]
2020. Guidelines for Audio Describers. http://www.acb.org/adp/guidelines.html.
[3]
Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, and Mubarak Shah. 2019. Video description: A survey of methods, datasets, and evaluation metrics. ACM Computing Surveys (CSUR) 52, 6 (2019), 1–37.
[4]
Relja Arandjelovic and Andrew Zisserman. 2017. Look, listen and learn. In ICCV. 609–617.
[5]
Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. (2016).
[6]
Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. Soundnet: Learning sound representations from unlabeled video. In Advances in neural information processing systems. 892–900.
[7]
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research 3, Feb (2003), 1137–1155.
[8]
Ann Bigelow. 1991. Spatial mapping of familiar locations in blind children. Journal of Visual Impairment & Blindness 85, 3 (1991), 113–117.
[9]
Sabine Braun. 2011. Creating coherence in audio description. Meta: Journal des traducteurs/Meta: Translators’ Journal 56, 3 (2011), 645–662.
[10]
Richard Brown. 2013. Consciousness inside and out: Phenomenology, neuroscience, and the nature of experience. Springer Science & Business Media.
[11]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[12]
Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems. 3059–3069.
[13]
Katie Ellis. 2015. Netflix closed captions offer an accessible model for the streaming video industry, but what about audio description? Communication, Politics & Culture 47, 3 (2015), 3.
[14]
Deborah I Fels, John Patrick Udo, Jonas E Diamond, and Jeremy I Diamond. 2006. A comparison of alternative narrative approaches to video description for animated comedy. Journal of Visual Impairment & Blindness 100, 5 (2006), 295–305.
[15]
Pat Fletcher. 2002. Seeing with sound: A journey into sight. Retrieved September 21, 2015.
[16]
Louise Fryer. 2016. An introduction to audio description: A practical guide. Routledge.
[17]
Louise Fryer and Jonathan Freeman. 2013. Cinematic language and the description of film: Keeping AD users in the frame. Perspectives 21, 3 (2013), 412–426.
[18]
William W Gaver. 1993. What in the world do we hear?: An ecological approach to auditory event perception. Ecological psychology 5, 1 (1993), 1–29.
[19]
Nicholas A Giudice and Gordon E Legge. 2008. Blind navigation and the role of technology. The engineering handbook of smart technology for aging, disability, and independence 8 (2008), 479–500.
[20]
Cole Gleason, Amy Pavel, Himalini Gururaj, Kris M Kitani, and Jeffrey P Bigham. 2020. Making GIFs Accessible. (2020).
[21]
Cole Gleason, Amy Pavel, Xingyu Liu, Patrick Carrington, Lydia B Chilton, and Jeffrey P Bigham. 2019. Making Memes Accessible. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility. 367–376.
[22]
Cole Gleason, Amy Pavel, Emma McCamey, Christina Low, Patrick Carrington, Kris M Kitani, and Jeffrey P Bigham. 2020. Twitter A11y: A Browser Extension to Make Twitter Images Accessible. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12.
[23]
Darren Guinness, Edward Cutrell, and Meredith Ringel Morris. 2018. Caption crawler: Enabling reusable alternative text descriptions using reverse image search. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–11.
[24]
Di Hu, Feiping Nie, and Xuelong Li. 2019. Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9248–9257.
[25]
Haikun Huang, Michael Solah, Dingzeyu Li, and Lap-Fai Yu. 2019. Audible panorama: Automatic spatial audio generation for panorama imagery. In Proceedings of the 2019 CHI conference on human factors in computing systems. 1–11.
[26]
Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. 2019. Attention on attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision. 4634–4643.
[27]
R Kingett. 2014. The Accessible Netflix Project Advocates Taking Steps to Ensure Netflix Accessibility for Everyone. The Accessible Netflix Project 26 (2014).
[28]
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision. 706–715.
[29]
Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. 2018. Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7492–7500.
[30]
Lajanugen Logeswaran and Honglak Lee. 2018. An Efficient Framework for Learning Sentence Representations. In International Conference on Learning Representations. 1–16.
[31]
John Miers. 1995. Audio description–seeing theater with your ears. Information Technology and Disabilities 2, 2 (1995).
[32]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
[33]
Chris Mikul. 2010. Audio description background paper. Ultimo, NSW: Media Access Australia (2010).
[34]
Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T Freeman, Michael Rubinstein, and Wojciech Matusik. 2019. Speech2face: Learning the face behind a voice. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7539–7548.
[35]
Jaclyn Packer, Katie Vizenor, and Joshua A Miele. 2015. An overview of video description: history, benefits, and guidelines. Journal of Visual Impairment & Blindness 109, 2 (2015), 83–93.
[36]
Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. In NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics.
[37]
Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4594–4602.
[38]
Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. 2017. Video captioning with transferred semantic attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6504–6512.
[39]
Becky Parton. 2016. Video captions for online courses: Do YouTube’s auto-generated captions meet deaf students’ needs? Journal of Open, Flexible, and Distance Learning 20, 1 (2016), 8–18.
[40]
Elisa Perego. 2016. Gains and losses of watching audio described films for sighted viewers. Target. International Journal of Translation Studies 28, 3 (2016), 424–444.
[41]
Zhe Quan, Zhi-Jie Wang, Yuquan Le, Bin Yao, Kenli Li, and Jian Yin. 2019. An efficient framework for sentence similarity modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 4 (2019), 853–865.
[42]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
[43]
Aline Remael and Gert Vercauteren. 2011. Basisprincipes voor audiobeschrijving voor televisie en film [Basics of audio description for television and film]. Antwerp: Departement Vertalers and Tolken, Artesis Hogeschool (2011).
[44]
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. International Journal of Computer Vision 123, 1 (2017), 94–120.
[45]
Pablo Romero-Fresco and Louise Fryer. 2013. Could audio-described films benefit from audio introductions? An audience response study. Journal of Visual Impairment & Blindness 107, 4 (2013), 287–295.
[46]
Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris. 2017. Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind. In HCOMP. 147–156.
[47]
Jaime Sánchez, Mauricio Saenz, and Jose Miguel Garrido. 2010. Usability of a multimodal video game to improve navigation skills for blind children. ACM Transactions on Accessible Computing (TACCESS) 3, 2 (2010), 1–29.
[48]
Emilie Schmeidler and Corinne Kirchner. 2001. Adding audio description: Does it make a difference? Journal of Visual Impairment & Blindness 95, 4 (2001), 197–212.
[49]
Merrie Snell. 2015. Lipsynching: popular song recordings and the disembodied voice. Ph.D. Dissertation. Newcastle University.
[50]
Joel Snyder. 2014. The visual made verbal: A comprehensive training manual and guide to the history and applications of audio description. American Council of the Blind, Incorporated.
[51]
Abigale Stangl, Meredith Ringel Morris, and Danna Gurari. 2020. “Person, Shoes, Tree. Is the Person Naked?” What People with Vision Impairments Want in Image Descriptions. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–13.
[52]
Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. (2018). arXiv:1804.00079
[53]
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497.
[54]
Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision. 4534–4542.
[55]
Agnieszka Walczak and Louise Fryer. 2017. Creative description: The impact of audio description style on presence in visually impaired audiences. British Journal of Visual Impairment 35, 1 (2017), 6–17.
[56]
Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu. 2018. Bidirectional attentive fusion with context gating for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7190–7198.
[57]
Yujia Wang, Wenguan Wang, Wei Liang, and Lap-Fai Yu. 2019. Comic-guided speech synthesis. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
[58]
Yujia Wang, Wei Liang, Wanwan Li, Dingzeyu Li, and Lap-Fai Yu. 2020. Scene-Aware Background Music Synthesis. In ACM Multimedia.
[59]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. (2019). arXiv:1910.03771
[60]
Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. 2018. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8739–8748.



    Information

    Published In

    CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems
    May 2021
    10862 pages
    ISBN:9781450380966
    DOI:10.1145/3411764
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 May 2021


    Author Tags

    1. accessibility
    2. audio description
    3. audio-visual consistency
    4. sentence-level embedding
    5. video captioning
    6. video description

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • NSF CAREER Award

    Conference

    CHI '21

    Acceptance Rates

    Overall Acceptance Rate 6,199 of 26,314 submissions, 24%



    Cited By

    • (2024) Caring for Special Participants in the Digital Media Era: A Study on Enhancing the Blind User Experience on Short Video Platforms Through Auditory Cues. Journal of Information Systems Engineering and Management 9:3 (28013). DOI: 10.55267/iadt.07.14774. Online publication date: 2024.
    • (2024) Making Accessible Movies Easily: An Intelligent Tool for Authoring and Integrating Audio Descriptions to Movies. Proceedings of the 21st International Web for All Conference, 160-164. DOI: 10.1145/3677846.3677855. Online publication date: 13-May-2024.
    • (2024) Musical Performances in Virtual Reality with Spatial and View-Dependent Audio Descriptions for Blind and Low-Vision Users. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-5. DOI: 10.1145/3663548.3688492. Online publication date: 27-Oct-2024.
    • (2024) Exploring The Affordances of Game-Aware Streaming to Support Blind and Low Vision Viewers: A Design Probe Study. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-13. DOI: 10.1145/3663548.3675665. Online publication date: 27-Oct-2024.
    • (2024) A Recipe for Success? Exploring Strategies for Improving Non-Visual Access to Cooking Instructions. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-15. DOI: 10.1145/3663548.3675662. Online publication date: 27-Oct-2024.
    • (2024) Design considerations for photosensitivity warnings in visual media. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-12. DOI: 10.1145/3663548.3675643. Online publication date: 27-Oct-2024.
    • (2024) Towards Accessible Musical Performances in Virtual Reality: Designing a Conceptual Framework for Omnidirectional Audio Descriptions. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-17. DOI: 10.1145/3663548.3675618. Online publication date: 27-Oct-2024.
    • (2024) Audio Description Customization. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-19. DOI: 10.1145/3663548.3675617. Online publication date: 27-Oct-2024.
    • (2024) "I Wish You Could Make the Camera Stand Still": Envisioning Media Accessibility Interventions with People with Aphasia. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-17. DOI: 10.1145/3663548.3675598. Online publication date: 27-Oct-2024.
    • (2024) MIMOSA: Human-AI Co-Creation of Computational Spatial Audio Effects on Videos. Proceedings of the 16th Conference on Creativity & Cognition, 156-169. DOI: 10.1145/3635636.3656189. Online publication date: 23-Jun-2024.
