Toward Automatic Audio Description Generation for Accessible Videos

Published: 07 May 2021

Abstract

Video accessibility is essential for people with visual impairments. Audio descriptions narrate what is happening on-screen, e.g., physical actions, facial expressions, and scene changes. Generating high-quality audio descriptions requires substantial manual effort [50]. To address this accessibility obstacle, we built a system that analyzes the audiovisual content of a video and generates audio descriptions. The system consists of three modules: AD insertion time prediction, AD generation, and AD optimization. We evaluated the quality of our system on five types of videos through qualitative studies with 20 sighted users and 12 users who were blind or visually impaired. Our findings revealed how audio description preferences varied with user type and video type. Based on our analysis, we provide recommendations for the development of future audio description generation technologies.
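To make the three-module pipeline named above concrete, the following sketch shows one way the stages could fit together in code. It is a minimal illustration under stated assumptions, not the authors' implementation: the speech-gap heuristic, the placeholder caption generator, the 1.5 s gap threshold, and the 2.5 words-per-second narration budget are all hypothetical.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AudioDescription:
    start: float  # insertion time, in seconds from the start of the video
    text: str     # description to be narrated at that time


def predict_insertion_times(speech_segments: List[Tuple[float, float]],
                            video_length: float,
                            min_gap: float = 1.5) -> List[Tuple[float, float]]:
    """Module 1 (AD insertion time prediction): find pauses between detected
    speech segments that are long enough to host a spoken description."""
    gaps, prev_end = [], 0.0
    for start, end in sorted(speech_segments):
        if start - prev_end >= min_gap:
            gaps.append((prev_end, start))
        prev_end = max(prev_end, end)
    if video_length - prev_end >= min_gap:
        gaps.append((prev_end, video_length))
    return gaps


def generate_description(gap: Tuple[float, float]) -> str:
    """Module 2 (AD generation): caption the visual content around the gap,
    e.g., with a dense video-captioning model. Placeholder text only."""
    return "A person walks across the room and opens the window."


def optimize_description(text: str, max_seconds: float,
                         words_per_second: float = 2.5) -> str:
    """Module 3 (AD optimization): trim a candidate description so that it
    can be narrated within the available gap."""
    budget = max(1, int(max_seconds * words_per_second))
    return " ".join(text.split()[:budget])


def describe(speech_segments: List[Tuple[float, float]],
             video_length: float) -> List[AudioDescription]:
    """End-to-end pass: predict gaps, generate one description per gap, and
    compress each description to fit its gap."""
    results = []
    for start, end in predict_insertion_times(speech_segments, video_length):
        raw = generate_description((start, end))
        results.append(AudioDescription(start, optimize_description(raw, end - start)))
    return results


if __name__ == "__main__":
    # Dialogue from 2-10 s and 14-20 s in a 30 s clip leaves three usable gaps.
    print(describe([(2.0, 10.0), (14.0, 20.0)], 30.0))

In this sketch, insertion points are simply silences between detected speech segments, generation is stubbed where a dense video-captioning model would run, and optimization truncates the candidate sentence to fit the time budget; a real system would replace each placeholder with learned components.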

Supplementary Material

VTT File (3411764.3445347_videopreviewcaptions.vtt)
Supplementary Materials (3411764.3445347_supplementalmaterials.zip)
MP4 File (3411764.3445347_videopreview.mp4)
Preview video

References

[1]
2019. Katna: Tool for automating common video keyframe extraction and Image Autocrop tasks. https://katna.readthedocs.io/.
[2]
2020. Guidelines for Audio Describers. http://www.acb.org/adp/guidelines.html.
[3]
Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, and Mubarak Shah. 2019. Video description: A survey of methods, datasets, and evaluation metrics. ACM Computing Surveys (CSUR) 52, 6 (2019), 1–37.
[4]
Relja Arandjelovic and Andrew Zisserman. 2017. Look, listen and learn. In ICCV. 609–617.
[5]
Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. (2016).
[6]
Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. Soundnet: Learning sound representations from unlabeled video. In Advances in neural information processing systems. 892–900.
[7]
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research 3, Feb (2003), 1137–1155.
[8]
Ann Bigelow. 1991. Spatial mapping of familiar locations in blind children. Journal of Visual Impairment & Blindness 85, 3 (1991), 113–117.
[9]
Sabine Braun. 2011. Creating coherence in audio description. Meta: Journal des traducteurs/Meta: Translators’ Journal 56, 3 (2011), 645–662.
[10]
Richard Brown. 2013. Consciousness inside and out: Phenomenology, neuroscience, and the nature of experience. Springer Science & Business Media.
[11]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[12]
Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems. 3059–3069.
[13]
Katie Ellis. 2015. Netflix closed captions offer an accessible model for the streaming video industry, but what about audio description? Communication, Politics & Culture 47, 3 (2015), 3.
[14]
Deborah I Fels, John Patrick Udo, Jonas E Diamond, and Jeremy I Diamond. 2006. A comparison of alternative narrative approaches to video description for animated comedy. Journal of Visual Impairment & Blindness 100, 5 (2006), 295–305.
[15]
Pat Fletcher. 2002. Seeing with sound: A journey into sight. Retrieved September 21, 2015.
[16]
Louise Fryer. 2016. An introduction to audio description: A practical guide. Routledge.
[17]
Louise Fryer and Jonathan Freeman. 2013. Cinematic language and the description of film: Keeping AD users in the frame. Perspectives 21, 3 (2013), 412–426.
[18]
William W Gaver. 1993. What in the world do we hear?: An ecological approach to auditory event perception. Ecological psychology 5, 1 (1993), 1–29.
[19]
Nicholas A Giudice and Gordon E Legge. 2008. Blind navigation and the role of technology. The engineering handbook of smart technology for aging, disability, and independence 8 (2008), 479–500.
[20]
Cole Gleason, Amy Pavel, Himalini Gururaj, Kris M Kitani, and Jeffrey P Bigham. 2020. Making GIFs Accessible. (2020).
[21]
Cole Gleason, Amy Pavel, Xingyu Liu, Patrick Carrington, Lydia B Chilton, and Jeffrey P Bigham. 2019. Making Memes Accessible. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility. 367–376.
[22]
Cole Gleason, Amy Pavel, Emma McCamey, Christina Low, Patrick Carrington, Kris M Kitani, and Jeffrey P Bigham. 2020. Twitter A11y: A Browser Extension to Make Twitter Images Accessible. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12.
[23]
Darren Guinness, Edward Cutrell, and Meredith Ringel Morris. 2018. Caption crawler: Enabling reusable alternative text descriptions using reverse image search. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–11.
[24]
Di Hu, Feiping Nie, and Xuelong Li. 2019. Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9248–9257.
[25]
Haikun Huang, Michael Solah, Dingzeyu Li, and Lap-Fai Yu. 2019. Audible panorama: Automatic spatial audio generation for panorama imagery. In Proceedings of the 2019 CHI conference on human factors in computing systems. 1–11.
[26]
Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. 2019. Attention on attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision. 4634–4643.
[27]
R Kingett. 2014. The Accessible Netflix Project Advocates Taking Steps to Ensure Netflix Accessibility for Everyone. The Accessible Netflix Project 26 (2014).
[28]
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision. 706–715.
[29]
Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. 2018. Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7492–7500.
[30]
Lajanugen Logeswaran and Honglak Lee. 2018. An Efficient Framework for Learning Sentence Representations. In International Conference on Learning Representations. 1–16.
[31]
John Miers. 1995. Audio description–seeing theater with your ears. Information Technology and Disabilities 2, 2 (1995).
[32]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
[33]
Chris Mikul. 2010. Audio description background paper. Ultimo, NSW: Media Access Australia (2010).
[34]
Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T Freeman, Michael Rubinstein, and Wojciech Matusik. 2019. Speech2face: Learning the face behind a voice. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7539–7548.
[35]
Jaclyn Packer, Katie Vizenor, and Joshua A Miele. 2015. An overview of video description: history, benefits, and guidelines. Journal of Visual Impairment & Blindness 109, 2 (2015), 83–93.
[36]
Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. In NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics.
[37]
Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4594–4602.
[38]
Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. 2017. Video captioning with transferred semantic attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6504–6512.
[39]
Becky Parton. 2016. Video captions for online courses: Do YouTube’s auto-generated captions meet deaf students’ needs? Journal of Open, Flexible, and Distance Learning 20, 1 (2016), 8–18.
[40]
Elisa Perego. 2016. Gains and losses of watching audio described films for sighted viewers. Target. International Journal of Translation Studies 28, 3 (2016), 424–444.
[41]
Zhe Quan, Zhi-Jie Wang, Yuquan Le, Bin Yao, Kenli Li, and Jian Yin. 2019. An efficient framework for sentence similarity modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 4 (2019), 853–865.
[42]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
[43]
Aline Remael and Gert Vercauteren. 2011. Basisprincipes voor audiobeschrijving voor televisie en film [Basics of audio description for television and film]. Antwerp: Departement Vertalers and Tolken, Artesis Hogeschool (2011).
[44]
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. International Journal of Computer Vision 123, 1 (2017), 94–120.
[45]
Pablo Romero-Fresco and Louise Fryer. 2013. Could audio-described films benefit from audio introductions? An audience response study. Journal of Visual Impairment & Blindness 107, 4 (2013), 287–295.
[46]
Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris. 2017. Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind. In HCOMP. 147–156.
[47]
Jaime Sánchez, Mauricio Saenz, and Jose Miguel Garrido. 2010. Usability of a multimodal video game to improve navigation skills for blind children. ACM Transactions on Accessible Computing (TACCESS) 3, 2 (2010), 1–29.
[48]
Emilie Schmeidler and Corinne Kirchner. 2001. Adding audio description: Does it make a difference? Journal of Visual Impairment & Blindness 95, 4 (2001), 197–212.
[49]
Merrie Snell. 2015. Lipsynching: popular song recordings and the disembodied voice. Ph.D. Dissertation. Newcastle University.
[50]
Joel Snyder. 2014. The visual made verbal: A comprehensive training manual and guide to the history and applications of audio description. American Council of the Blind, Incorporated.
[51]
Abigale Stangl, Meredith Ringel Morris, and Danna Gurari. 2020. “Person, Shoes, Tree. Is the Person Naked?” What People with Vision Impairments Want in Image Descriptions. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–13.
[52]
Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. (2018). arXiv:1804.00079
[53]
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497.
[54]
Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision. 4534–4542.
[55]
Agnieszka Walczak and Louise Fryer. 2017. Creative description: The impact of audio description style on presence in visually impaired audiences. British Journal of Visual Impairment 35, 1 (2017), 6–17.
[56]
Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu. 2018. Bidirectional attentive fusion with context gating for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7190–7198.
[57]
Yujia Wang, Wenguan Wang, Wei Liang, and Lap-Fai Yu. 2019. Comic-guided speech synthesis. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
[58]
Yujia Wang, Wei Liang, Wanwan Li, Dingzeyu Li, and Lap-Fai Yu. 2020. Scene-Aware Background Music Synthesis. In ACM Multimedia.
[59]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. (2019). arXiv:1910.03771
[60]
Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. 2018. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8739–8748.



    Information

    Published In

    CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems
    May 2021
    10862 pages
    ISBN:9781450380966
    DOI:10.1145/3411764
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 May 2021


    Author Tags

    1. accessibility
    2. audio description
    3. audio-visual consistency
    4. sentence-level embedding
    5. video captioning
    6. video description

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • NSF CAREER Award

    Conference

    CHI '21

    Acceptance Rates

    Overall Acceptance Rate 6,199 of 26,314 submissions, 24%



    Cited By

    • (2024) Caring for Special Participants in the Digital Media Era: A Study on Enhancing the Blind User Experience on Short Video Platforms Through Auditory Cues. Journal of Information Systems Engineering and Management 9:3 (28013). DOI: 10.55267/iadt.07.14774. Online publication date: 2024.
    • (2024) Making Accessible Movies Easily: An Intelligent Tool for Authoring and Integrating Audio Descriptions to Movies. Proceedings of the 21st International Web for All Conference, 160-164. DOI: 10.1145/3677846.3677855. Online publication date: 13-May-2024.
    • (2024) Musical Performances in Virtual Reality with Spatial and View-Dependent Audio Descriptions for Blind and Low-Vision Users. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-5. DOI: 10.1145/3663548.3688492. Online publication date: 27-Oct-2024.
    • (2024) Exploring The Affordances of Game-Aware Streaming to Support Blind and Low Vision Viewers: A Design Probe Study. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-13. DOI: 10.1145/3663548.3675665. Online publication date: 27-Oct-2024.
    • (2024) A Recipe for Success? Exploring Strategies for Improving Non-Visual Access to Cooking Instructions. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-15. DOI: 10.1145/3663548.3675662. Online publication date: 27-Oct-2024.
    • (2024) Design considerations for photosensitivity warnings in visual media. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-12. DOI: 10.1145/3663548.3675643. Online publication date: 27-Oct-2024.
    • (2024) Towards Accessible Musical Performances in Virtual Reality: Designing a Conceptual Framework for Omnidirectional Audio Descriptions. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-17. DOI: 10.1145/3663548.3675618. Online publication date: 27-Oct-2024.
    • (2024) Audio Description Customization. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-19. DOI: 10.1145/3663548.3675617. Online publication date: 27-Oct-2024.
    • (2024) "I Wish You Could Make the Camera Stand Still": Envisioning Media Accessibility Interventions with People with Aphasia. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-17. DOI: 10.1145/3663548.3675598. Online publication date: 27-Oct-2024.
    • (2024) MIMOSA: Human-AI Co-Creation of Computational Spatial Audio Effects on Videos. Proceedings of the 16th Conference on Creativity & Cognition, 156-169. DOI: 10.1145/3635636.3656189. Online publication date: 23-Jun-2024.
