
DOI: 10.1145/3242969.3264976
Research Article

Large Vocabulary Continuous Audio-Visual Speech Recognition

Published: 02 October 2018

Abstract

We like to converse with other people using both sound and sight, as our perception of speech is bimodal. Since both modalities essentially convey the same speech structure, we integrate them and often understand the message better than we would with our eyes closed. In this work we aim to learn more about the visual nature of speech, known as lip-reading, and to exploit it to build better automatic speech recognition systems. Recent developments in machine learning, together with the release of audio-visual datasets suited to large vocabulary continuous speech recognition, have revived interest in lip-reading and allow us to address the recurring question of how best to integrate visual and acoustic speech.
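As an illustration of the kind of feature-level integration the abstract alludes to, the sketch below shows a minimal audio-visual fusion model in PyTorch: independent per-modality encoders whose outputs are concatenated and passed to a shared recurrent layer. All module names, feature dimensions, and the concatenation strategy are assumptions made for illustration only, not the method proposed in this paper.

```python
# Minimal sketch of feature-level audio-visual fusion (illustrative only;
# dimensions, module names, and the fusion strategy are assumptions, not
# the approach proposed in this paper).
import torch
import torch.nn as nn


class AVFusionSketch(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, hidden=256, vocab=40):
        super().__init__()
        # Independent per-modality encoders.
        self.audio_enc = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.video_enc = nn.LSTM(video_dim, hidden, batch_first=True)
        # Joint layer over the concatenated modality features.
        self.joint = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab)

    def forward(self, audio, video):
        # audio: (batch, T, audio_dim); video: (batch, T, video_dim),
        # assumed pre-aligned to a common frame rate for simplicity.
        a, _ = self.audio_enc(audio)
        v, _ = self.video_enc(video)
        fused, _ = self.joint(torch.cat([a, v], dim=-1))
        return self.classifier(fused)  # per-frame label scores


if __name__ == "__main__":
    model = AVFusionSketch()
    scores = model(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
    print(scores.shape)  # torch.Size([2, 100, 40])
```

In practice the two streams are rarely frame-synchronous, which is why attention-based alignment and fusion strategies are a central question in this line of work.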




Published In

ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
October 2018
687 pages
ISBN:9781450356923
DOI:10.1145/3242969
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

  • SIGCHI: ACM Special Interest Group on Computer-Human Interaction

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 October 2018


Author Tags

  1. audio-visual fusion
  2. audio-visual speech recognition
  3. lip-reading

Qualifiers

  • Research-article

Funding Sources

  • ADAPT Centre for Digital Content Technology

Conference

ICMI '18
Sponsor:
  • SIGCHI

Acceptance Rates

ICMI '18 paper acceptance rate: 63 of 149 submissions (42%)
Overall acceptance rate: 453 of 1,080 submissions (42%)
