Audio and visual modality combination in speech processing applications

Published: 24 April 2017

Abstract

Chances are that most of us have experienced difficulty listening to our interlocutor during face-to-face conversation in highly noisy environments, such as next to heavy traffic or against a background of high-intensity speech babble or loud music. On such occasions, we may have found ourselves looking at the speaker's lower face while our interlocutor articulates speech, in order to enhance speech intelligibility. In fact, what we resort to in such circumstances is known as lipreading or speechreading, namely the recognition of the so-called "visual speech modality" and its combination (fusion) with the available noisy audio data.
Similar to humans, automatic speech recognition (ASR) systems also face difficulties in noisy environments. In recent years, ASR technology has made remarkable strides following the adoption of deep-learning techniques [Hinton et al. 2012, Yu and Deng 2015]. This has led to advanced ASR systems closing the gap with human performance [Xiong et al. 2017], in contrast to the significant lag documented 20 years earlier by Lippmann [1997]. Nevertheless, the quest for ASR noise robustness, particularly when noise is non-stationary and mismatched to the training data, remains an active research topic [Li et al. 2015].
To mitigate the aforementioned problem, the question naturally arises as to whether machines can be designed to mimic human speech perception in noise: can they successfully incorporate visual speech into the ASR pipeline, especially since it represents an additional information source unaffected by the acoustic environment? At Bell Labs, Petajan [1984] was the first to develop and implement an early audio-visual automatic speech recognition (AVASR) system. Since then, the area has witnessed significant research activity, paralleling the advances in traditional audio-only ASR, while also utilizing progress in the computer vision and machine learning fields. Not surprisingly, the adoption of deep-learning techniques has created renewed interest in the field, resulting in remarkable progress on challenging domains, even surpassing human lipreading performance [Chung et al. 2017].
Since the very early works in the field [Stork and Hennecke 1996], the design of AVASR systems has generally followed the basic architecture of Figure 12.1. There, a visual front-end module extracts speech-informative features from video of the speaker's face; these are subsequently fused with acoustic features within the speech recognition process. Clearly, compared to audio-only ASR, visual speech information extraction and audio-visual fusion (or integration) constitute two additional distinct components on which to focus. Indeed, their robustness under a wide range of audio-visual conditions and their efficient implementation represent significant challenges that, to date, remain the focus of active research. It should be noted that rapid recent advances, leading to so-called "end-to-end" AVASR systems [Assael et al. 2016, Chung et al. 2017], have somewhat blurred the distinction between these two components. Nevertheless, this division remains valuable both for the systematic exposition of the relevant material and for the research and development of new systems.
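To make these two extra components concrete, the sketch below pairs a simple appearance-based visual feature (low-order 2-D DCT coefficients of a grayscale mouth region) with decision-level fusion of per-class audio and visual log-likelihoods through a stream weight, one common family of fusion rules revisited in Section 12.5. The function names, the number of retained coefficients, and the fixed weight value are illustrative assumptions, not the chapter's reference implementation.

```python
import numpy as np
from scipy.fftpack import dct


def visual_features(mouth_roi, block=6):
    """Appearance-based visual features: 2-D DCT of a grayscale mouth
    region of interest, keeping the top-left (low-frequency) block."""
    coeffs = dct(dct(mouth_roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    return coeffs[:block, :block].flatten()


def fuse_log_likelihoods(audio_ll, visual_ll, lam=0.7):
    """Decision-level fusion: weight the per-class log-likelihoods of the
    two streams. A larger lam trusts audio more (e.g., clean acoustics);
    a smaller lam shifts reliance toward the visual stream."""
    return lam * np.asarray(audio_ll) + (1.0 - lam) * np.asarray(visual_ll)


# Toy usage: pick the class with the highest fused score.
mouth_roi = np.random.rand(32, 32)               # stand-in for a tracked mouth ROI
v_feat = visual_features(mouth_roi)              # would feed a visual classifier
audio_scores = np.array([-12.3, -10.1, -15.6])   # hypothetical per-class audio log-likelihoods
visual_scores = np.array([-11.0, -13.2, -12.4])  # hypothetical per-class visual log-likelihoods
best_class = int(np.argmax(fuse_log_likelihoods(audio_scores, visual_scores)))
```

In practice, the stream weight is typically adapted to the estimated reliability of each modality rather than fixed, a point taken up in the fusion discussion of Section 12.5.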
In this chapter, we concentrate on AVASR while also addressing other related problems, namely audio-visual speech activity detection, diarization, and synchrony detection. In order to address such subjects, we first provide additional motivation in Section 12.2, discussing bimodality of human speech perception and production. In Section 12.3, we overview AVASR research in view of its potential application scenarios to multimodal interfaces, visual sensors employed, and audio-visual databases typically used. In Section 12.4, we cover visual feature extraction and, in Section 12.5, we discuss audio-visual fusion for ASR, also providing examples of experimental results achieved by AVASR systems. In Section 12.6, we offer a glimpse into additional audio-visual speech applications. We conclude the chapter by enumerating Focus Questions for further study. In addition, we provide a brief Glossary of the chapter's core terminology, serving as a quick reference.

References

[1]
A. H. Abdelaziz, S. Zeiler, and D. Kolossa. 2015. Learning dynamic stream weights for coupled-HMM-based audio-visual speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(5): 863--876. 511
[2]
A. Adjoudani and C. Benoit. 1996. On the integration of auditory and visual parameters in an HMM-based ASR. In D. G. Stork and M. E. Hennecke, editors, Speechreading by Humans and Machines, pp. 461--471. Springer, Berlin. 505
[3]
R. Ahdid, K. Taifi, S. Safi, and B. Manaut. 2016. A survey on facial feature points detection techniques and approaches. International Journal of Computer, Electrical, Automation, Control and Information Engineering, 10(8): 1508--1515. 501
[4]
A. Aides and H. Aronowitz. 2016. Text-dependent audiovisual synchrony detection for spoofing detection in mobile person recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2125--2129. 520
[5]
P. S. Aleksic and A. K. Katsaggelos. 2006. Audio-visual biometrics. Proceedings of the IEEE, 94(11):2025--2044. 525
[6]
P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos. 2002. Audio-visual speech recognition using MPEG-4 compliant visual features. EURASIP Journal on Applied Signal Processing, 2002(11): 1213--1227. 501
[7]
P. S. Aleksic, G. Potamianos, and A. K. Katsaggelos. 2005. Exploiting visual information in automatic speech processing. In A. Bovik, editor, Handbook of Image and Video Processing, pp. 1263--1289. Elsevier Academic Press, Burlington, MA. 494, 508, 516
[8]
I. Almajai, S. Cox, R. Harvey, and Y. Lan. 2016. Improved speaker independent lip reading using speaker adaptive training and deep neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2722--2726. 512
[9]
E. Alpaydin. 2017. Classifying multimodal data. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan Claypool Publishers, San Rafael, CA. 508
[10]
P. Angkititrakul, J. H. L. Hansen, S. Choi, T. Creek, J. Hayes, J. Kim, D. Kwak, L. T. Noecker, and A. Phan. 2009. UTDrive: The smart vehicle project. In K. Takeda, J. H. L. Hansen, H. Erdogan, and H. Abut, editors, In-Vehicle Corpus and Signal Processing for Driver Behavior, pp. 55--67. Springer Science+Business Media, LLC, New York. 494, 499
[11]
X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals. 2012. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2): 356--370. 517
[12]
I. Anina, Z. Zhou, G. Zhao, and M. Pietikäinen. 2015. OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis. In Proceedings of the IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1--5. 496, 498, 499
[13]
D. F. Armstrong, M. A. Karchmer, and J. V. Van Cleve, editors. 2002. The Study of Signed Languages: Essays in Honor of William C. Stokoe. Gallaudet University Press, Washington, DC. 493
[14]
Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas. 2016. LipNet: End-to-end sentence-level lipreading. Computing Research Repository, arXiv. 491, 504, 506, 514
[15]
S. Asteriadis, N. Nikolaidis, and I. Pitas. 2011. A review of facial feature detection algorithms. In Y.-J. Zhang, editor, Advances in Face Image Analysis: Techniques and Technologies, pp. 42--61. IGI Global, Hershey, PA. 501
[16]
A. J. Aubrey, Y. A. Hicks, and J. A. Chambers. 2010. Visual voice activity detection with optical flow. IET Image Processing, 4(6): 463--472. 504, 517
[17]
H. L. Bear and R. Harvey. 2016. Decoding visemes: Improving machine lip-reading. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2009--2013. 509
[18]
L. E. Bernstein. 2012. Visual speech perception. In G. Bailly, P. Perrier, and E. Vatikiotis-Bateson, editors, Audio-Visual Speech Processing, pp. 21--39. Cambridge University Press, Cambridge, UK. 491
[19]
L. E. Bernstein, M. E. Demorest, and P. E. Tucker. 1998. What makes a good speechreader? First you have to find one. In R. Campbell, B. Dodd, and D. Burnham, editors, Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-visual Speech, pp. 211--227. Psychology Press Ltd. Publishers, Hove, UK. 493
[20]
X. Bost, G. Linarès, and S. Gueye. 2015. Audiovisual speaker diarization of TV series. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4799--4803. 517
[21]
H. A. Bourlard and N. Morgan. 1994. Connectionist Speech Recognition: A Hybrid Approach, vol. SECS 247. Springer Science & Business Media, New York. 511, 512
[22]
G. Bradski and A. Kaehler. 2008. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc., Sebastopol, CA. 501
[23]
H. Bredin and G. Chollet. 2007. Audiovisual speech synchrony measure: Application to biometrics. EURASIP Journal on Advances in Signal Processing, 2007(070186): 1--11. 520, 521, 522
[24]
C. Bregler and Y. Konig. 1994. "Eigenlips" for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. 669--672. 493, 505
[25]
D. Burnham, D. Estival, S. Fazio, J. Viethen, F. Cox, R. Dale, S. Cassidy, J. Epps, R. Togneri, M. Wagner, Y. Kinoshita, R. Göcke, J. Arciuli, M. Onslow, T. Lewis, A. Butcher, and J. Hajek. 2011. Building an audio-visual corpus of Australian English: large corpus collection with an economical portable and replicable Black Box. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 841--844. 496, 498, 499
[26]
T. Butz and J.-P. Thiran. 2002. Feature space mutual information in speech-video sequences. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), vol. 2, pp. 361--364. 520
[27]
P. Campr, M. Kunesova, J. Vanek, J. Cech, and J. Psutka. 2014. Audio-video speaker diarization for unsupervised speaker and face model creation. In P. Sojka, A. Horak, I. Kopecek, and K. Pala, editors, Text, Speech and Dialogue, vol. LNCS 8655, pp. 465--472. Springer International Publishing, Switzerland. 517
[28]
L. Cappelletta and N. Harte. 2012. Phoneme-to-viseme mapping for visual speech recognition. In Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM), pp. 322--329. 494
[29]
J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner. 2006. The AMI meeting corpus: a pre-announcement. In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction (MLMI)---Part II, vol. LNCS 3869, pp. 28--39. Springer-Verlag, Berlin. 494
[30]
M. Castrillon, O. Deniz, D. Hernandez, and J. Lorenzo. 2011. A comparison of face and facial feature detectors based on the Viola-Jones general object detection framework. Machine Vision and Applications, 22(3): 481--494. 501
[31]
T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma. 2015. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12): 5017--5032. 506
[32]
D. Chandramohan and P. L. Silsbee. 1996. A multiple deformable template approach for visual speech recognition. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), vol. 1, pp. 50--53. 505
[33]
G. Chetty and M. Wagner. 2004. Liveness verification in audio-video speaker authentication. In Proceedings of the Australian International Conference on Speech Science and Technology (SST), pp. 358--363. 520
[34]
C. C. Chibelushi, F. Deravi, and J. S. D. Mason. 2002. A review of speech-based bimodal recognition. IEEE Transactions on Multimedia, 4(1): 23--37. 525
[35]
G. I. Chiou and J.-N. Hwang. 1997. Lipreading from color video. IEEE Transactions on Image Processing, 6(8): 1192--1195. 504, 505
[36]
J. S. Chung and A. Zisserman. 2017a. Out of time: automated lip sync in the wild. In C.-S. Chen, J. Lu, and K.-K. Ma, editors, Computer Vision---ACCV 2016 Workshops, Part II, vol. LNCS 10117, pp. 251--263. Springer International Publishing, Switzerland. 520
[37]
J. S. Chung and A. Zisserman. 2017b. Lip reading in the wild. In S.-H. Lai, V. Lepetit, K. Nishino, and Y. Sato, editors, Computer Vision---ACCV 2016, Part II, vol. LNCS 10112, pp. 87--103. Springer International Publishing, Switzerland. 498, 499
[38]
J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. 2017. Lip reading sentences in the wild. Computing Research Repository, http://arXiv.org/abs/arXiv:1611.05358v2. 490, 491, 496, 499, 504, 506, 513
[39]
R. Collobert, K. Kavukcuoglu, and C. Farabet. 2011. Torch7: A Matlab-like environment for machine learning. In Proceedings of the Neural Information Processing Systems (NIPS) BigLearn Workshop. 501
[40]
D. Comaniciu, V. Ramesh, and P. Meer. 2003. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5): 564--577. 502
[41]
M. Cooke, J. Barker, S. Cunningham, and X. Shao. 2006. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5): 2421--2424. 497, 498
[42]
T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. 1995. Active shape models - Their training and application. Computer Vision and Image Understanding, 61(1): 38--59. 504
[43]
T. F. Cootes, G. J. Edwards, and C. J. Taylor. 1998. Active appearance models. In H. Burkhardt and B. Neumann, editors, Computer Vision---ECCV'98, Part II, vol. LNCS 1407, pp. 484--498. Springer, Berlin. 504
[44]
R. Cutler and L. Davis. 2000. Look who's talking: Speaker detection using video and audio correlation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pp. 1589--1592. 520
[45]
N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886--893. 506
[46]
P. De Cuetos, C. Neti, and A. W. Senior. 2000. Audio-visual intent to speak detection for human computer interaction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2373--2376. 494
[47]
D. Dov, R. Talmon, and I. Cohen. 2015. Audio-visual voice activity detection using diffusion maps. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(4): 732--745. 517
[48]
P. Duchnowski, U. Meier, and A. Waibel. 1994. See me, hear me: Integrating automatic speech recognition and lip-reading. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pp. 547--550. 505, 508, 511
[49]
S. Dupont and J. Luettin. 2000. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2(3): 141--151. 505, 510
[50]
E. El Khouri, C. Senac, and P. Joly. 2014. Audiovisual diarization of people in video content. Multimedia Tools and Applications, 68(3): 747--775. 517
[51]
S. Escalera, J. González, X. Baro, M. Reyes, I. Guyon, V. Athitsos, H. J. Escalante, L. Sigal, A. Argyros, C. Sminchisescu, R. Bowden, and S. Sclaroff. 2013. ChaLearn multi-modal gesture recognition 2013: grand challenge and workshop summary. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), pp. 365--368. 497
[52]
V. Estellers, M. Gurban, and J.-P. Thiran. 2012. On dynamic stream weighting for audio-visual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(4): 1145--1157. 511
[53]
N. Eveno and L. Besacier. 2005. A speaker independent "liveness" test for audio-visual biometrics. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 3081--3084. 520, 521
[54]
FFmpeg. 2000. http://ffmpeg.org. (accessed March 1, 2017). 523, 524
[55]
J. G. Fiscus, J. Ajot, and J. S. Garofolo. 2008. The Rich Transcription 2007 meeting recognition evaluation. In R. Stiefelhagen, R. Bowers, and J. Fiscus, editors, Multimodal Technologies for Perception of Humans, vol. LNCS 4625, pp. 373--389. Springer-Verlag, Berlin. 519
[56]
C. G. Fisher. 1968. Confusions among visually perceived consonants. Journal of Speech, Language, and Hearing Research, 11(4): 796--804. 494
[57]
J. W. Fisher, III and T. Darrell. 2004. Speaker association with signal-level audiovisual fusion. IEEE Transactions on Multimedia, 6(3): 406--413. 520
[58]
FLIR Bumblebee2. 2008. http://www.ptgrey.com/bumblebee2-firewire-stereo-vision-camera-systems. (accessed March 1, 2017).
[59]
A. Fossati, J. Gall, H. Grabner, X. Ren, and K. Konolige, editors. 2013. Consumer Depth Cameras for Computer Vision, Research Topics and Applications. Springer-Verlag, London. 497
[60]
J. Freitas, A. Ferreira, M. Figueiredo, A. Teixeira, and M. S. Dias. 2014. Enhancing multimodal silent speech interfaces with feature selection. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1169--1173. 496
[61]
G. Friedland, H. Hung, and C. Yeo. 2009. Multi-modal speaker diarization of real-world meetings using compressed-domain video features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4069--4072. 517
[62]
G. Galatas, G. Potamianos, D. Kosmopoulos, C. McMurrough, and F. Makedon. 2011. Bilingual corpus for AVASR using multiple sensors and depth information. In Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), pp. 103--106. 497
[63]
A. Garg, G. Potamianos, C. Neti, and T. S. Huang. 2003. Frame-dependent multi-stream reliability indicators for audio-visual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 24--27. 511
[64]
J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue. 1993. TIMIT acoustic-phonetic continuous speech corpus. Technical report, Linguistic Data Consortium (LDC93S1), Philadelphia. 497
[65]
I. D. Gebru, S. Ba, G. Evangelidis, and R. Horaud. 2015. Tracking the active speaker based on a joint audio-visual observation model. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 702--708. 517
[66]
S. Gergen, S. Zeiler, A. H. Abdelaziz, R. Nickel, and D. Kolossa. 2016. Dynamic stream weighting for turbo-decoding-based audiovisual ASR. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2135--2139. 511
[67]
M. Gilbert, A. Acero, J. Cohen, H. Bourlard, S.-F. Chang, and M. Etoh. 2011. Media search in mobile devices [From the Guest Editors]. IEEE Signal Processing Magazine, 28(4): 12--13. 494
[68]
L. Girin, J.-L. Schwartz, and G. Feng. 2001. Audio-visual enhancement of speech in noise. The Journal of the Acoustical Society of America, 109(6): 3007--3020. 509
[69]
R. Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440--1448. 501
[70]
H. Glotin, D. Vergyri, C. Neti, G. Potamianos, and J. Luettin. 2001. Weighting schemes for audio-visual fusion in speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 173--176. 510
[71]
B. Gold, N. Morgan, and D. Ellis. 2011. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley & Sons, Inc., Hoboken, NJ. 509
[72]
A. J. Goldschen, O. N. Garcia, and E. D. Petajan. 1996. Rationale for phoneme-viseme mapping and feature selection in visual speech recognition. In D. G. Stork and M. E. Hennecke, editors, Speechreading by Humans and Machines, pp. 505--515. Springer, Berlin. 494
[73]
M. Gordan, C. Kotropoulos, and I. Pitas. 2002. A support vector machine-based dynamic network for visual speech recognition applications. EURASIP Journal on Applied Signal Processing, 2002(11): 1248--1259. 508
[74]
S. Graf, T. Herbig, M. Buck, and G. Schmidt. 2015. Features for voice activity detection: a comparative analysis. EURASIP Journal on Advances in Signal Processing, 2015(91): 1--15. 517
[75]
K. W. Grant and S. Greenberg. 2001. Speech intelligibility derived from asynchronous processing of auditory-visual information. In Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), pp. 132--137. 493
[76]
A. Graves and N. Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1764--1772. 513
[77]
A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 369--376. 514
[78]
G. Gravier, S. Axelrod, G. Potamianos, and C. Neti. 2002. Maximum entropy and MCE based HMM stream weight estimation for audio-visual ASR. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 853--856. 511
[79]
M. Gurban and J.-P. Thiran. 2006. Multimodal speaker localization in a probabilistic framework. In Proceedings of the European Signal Processing Conference (Eusipco), pp. 1--5. 520
[80]
M. Gurban and J.-P. Thiran. 2009. Information theoretic feature extraction for audio-visual speech recognition. IEEE Transactions on Signal Processing, 57(12): 4765--4776. 507
[81]
S. Gurbuz, Z. Tufekci, E. Patterson, and J. N. Gowdy. 2001. Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 177--180. 505
[82]
D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. 2004. Canonical correlation analysis: An overview with applications to learning methods. Neural Computation, 16(12): 2639--2664. 521
[83]
N. Harte and E. Gillen. 2015. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 17(5): 603--615. 498, 499
[84]
T. J. Hazen, K. Saenko, C.-H. La, and J. R. Glass. 2004. A segment-based audio-visual speech recognizer: data collection, development, and initial experiments. In Proceedings of the International Conference on Multimodal Interfaces (ICMI), pp. 235--242. 497
[85]
M. Heckmann, F. Berthommier, and K. Kroschel. 2002. Noise adaptive stream weighting in audio-visual speech recognition. EURASIP Journal on Applied Signal Processing, 2002(11): 1260--1273. 505, 508, 511
[86]
M. E. Hennecke, D. G. Stork, and K. V. Prasad. 1996. Visionary speech: Looking ahead to practical speechreading systems. In D. G. Stork and M. E. Hennecke, editors, Speechreading by Humans and Machines, pp. 331--349. Springer, Berlin. 505
[87]
P. Heracleous, D. Beautemps, and N. Aboutabit. 2010. Cued speech automatic recognition in normal-hearing and deaf subjects. Speech Communication, 52(6): 504--512. 493
[88]
H. Hermansky, D. P. W. Ellis, and S. Sharma. 2000. Tandem connectionist feature extraction for conventional HMM systems. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 3, pp. 1635--1638. 512
[89]
J. Hershey and J. Movellan. 1999. Audio vision: Using audio-visual synchrony to locate sounds. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, pp. 813--819. MIT Press, Cambridge, MA. 520
[90]
G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6): 82--97. 489, 511
[91]
G. E. Hinton and R. R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504--507. 506
[92]
O. Hosseini Jafari, D. Mitzel, and B. Leibe. 2014. Real-time RGB-D based people detection and tracking for mobile robots and head-worn cameras. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 5636--5643. 497
[93]
HTK. 2000. http://htk.eng.cam.ac.uk/. (accessed March 1, 2017). 509, 523, 524
[94]
J. Huang and B. Kingsbury. 2013. Audio-visual deep learning for noise robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7596--7599. 512
[95]
J. Huang, G. Potamianos, J. Connell, and C. Neti. 2004. Audio-visual speech recognition using an infrared headset. Speech Communication, 44(4): 83--96. 495
[96]
J. B. Jepsen, G. De Clerck, S. Lutalo-Kiingi, and W. B. McGregor, editors. 2015. Sign Languages of the World: A Comparative Handbook. Mouton De Gruyter, Berlin. 493
[97]
J. Jiang, A. Alwan, P. A. Keating, E. T. Auer, Jr., and L. E. Bernstein. 2002. On the relationship between face movements, tongue movements, and speech acoustics. EURASIP Journal on Applied Signal Processing, 2002(11): 1174--1188. 493
[98]
B. Joosten, E. Postma, and E. Krahmer. 2015. Voice activity detection based on facial movement. Journal on Multimodal User Interfaces, 9(3): 183--193. 517
[99]
Kaldi. 2011. http://kaldi-asr.org/. (accessed March 1, 2017). 509
[100]
A. Karpov, L. Akarun, H. Yalcin, A. Ronzhin, B. E. Demiroz, A. Coban, and M. Zelezny. 2014. Audio-visual signal processing in a multimodal assisted living environment. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1023--1027. 495
[101]
M. Kass, A. Witkin, and D. Terzopoulos. 1988. Snakes: Active contour models. International Journal of Computer Vision, 1(4): 321--331. 504
[102]
A. K. Katsaggelos, S. Bahaadini, and R. Molina. 2015. Audiovisual fusion: Challenges and new approaches. Proceedings of the IEEE, 103(9): 1635--1653. 508, 516
[103]
A. Katsamanis, G. Papandreou, and P. Maragos. 2009. Face active appearance modeling and speech acoustic information to recover articulation. IEEE Transactions on Audio, Speech, and Language Processing, 17(3): 411--422. 493
[104]
A. Katsamanis, V. Pitsikalis, S. Theodorakis, and P. Maragos. 2017. Multimodal gesture recognition. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations. Morgan Claypool Publishers, San Rafael, CA. 497, 510, 517, 524
[105]
N. Kawaguchi, S. Matsubara, K. Takeda, and F. Itakura. 2001. Multimedia data collection of in-car speech communication. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pp. 2027--2030. 499
[106]
G. Keren, A. El-Desoky Mousa, O. Pietquin, S. Zafeiriou, and B. Schuller. 2017. Deep learning for multisensorial and multimodal interaction. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan Claypool Publishers, San Rafael, CA. 501, 508
[107]
Kinect for Windows SDK. 2013. http://dev.windows.com/en-us/kinect. (accessed March 1, 2017). 496, 519
[108]
Kinect for Xbox 360. 2010. http://support.xbox.com/en-US/browse/xbox-360/accessories/Kinect. (accessed March 20, 2017). 517
[109]
Kinect for Xbox One. 2012. http://www.xbox.com/en-US/xbox-one/accessories/kinect. (accessed March 1, 2017).
[110]
A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS), pp. 1097--1105. Curran Associates, Inc. 501
[111]
K. Kumar, T. Chen, and R. M. Stern. 2007. Profile view lip reading. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 429--432. 499
[112]
K. Kumar, G. Potamianos, J. Navratil, E. Marcheret, and V. Libal. 2011. Audio-visual speech synchrony detection by a family of bimodal linear prediction models. In B. Bhanu and V. Govindaraju, editors, Multibiometrics for Human Identification, pp. 31--50. Cambridge University Press, New York. 520
[113]
T. Le Cornu and B. Milner. 2015. Voicing classification of visual speech using convolutional neural networks. In Proceedings of the International Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing (FAAVSP), pp. 103--108. 517
[114]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278--2324. 501
[115]
Y. LeCun, Y. Bengio, and G. Hinton. 2015. Deep learning. Nature, 521: 436--444. 501, 512
[116]
B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. Huang. 2004. AVICAR: audio-visual speech corpus in a car environment. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2489--2492. 498, 499
[117]
D. Lee, J. Lee, and K.-E. Kim. 2017. Multi-view automatic lip-reading using neural network. In C.-S. Chen, J. Lu, and K.-K. Ma, editors, Computer Vision---ACCV 2016 Workshops, Part II, vol. LNCS 10117, pp. 290--302. Springer International Publishing, Switzerland. 500
[118]
J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong. 2015. Robust Automatic Speech Recognition---A Bridge to Practical Applications. Academic Press, Amsterdam. 489
[119]
N. Li, S. Dettmer, and M. Shah. 1995. Lipreading using eigensequences. In Proceedings of the International Workshop on Automatic Face and Gesture Recognition (FG), pp. 30--34. 504
[120]
Y. Li, Y. Takashima, T. Takiguchi, and Y. Ariki. 2016. Lip reading using a dynamic feature of lip images and convolutional neural networks. In Proceedings of the IEEE/ACIS International Conference on Computer and Information Science (ICIS), pp. 1--6. 504, 506
[121]
X. Lin, H. Yao, X. Hong, and Q. Wang. 2008. HIT-AVDB-II: A new multi-view and extreme feature cases contained audio-visual database for biometrics. In Proceedings of the Joint Conference on Information Sciences---Computer Vision, Pattern Recognition and Image Processing (CVPRIP). 496, 499
[122]
R. P. Lippmann. 1997. Speech recognition by machines and humans. Speech Communication, 22(1): 1--15. 489
[123]
J. Liu and M. Kavakli. 2010. A survey of speech-hand gesture recognition for the development of multimodal interfaces in computer games. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pp. 1564--1569. 494
[124]
P. Lucey, G. Potamianos, and S. Sridharan. 2007. A unified approach to multi-pose audiovisual ASR. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 650--653. 496, 499, 500
[125]
J. Luig and A. Sontacchi. 2014. A speech database for stress monitoring in the cockpit. Proceedings of the Institution of Mechanical Engineers, Part G: Journal of Aerospace Engineering, 228(2): 284--296. 495
[126]
W. Luo, X. Zhao, and T.-K. Kim. 2014. Multiple object tracking: A review. Computing Research Repository, http://arXiv.org/abs/1409.7618v1. 502
[127]
G. Lv, Y. Fan, D. Jiang, and R. Zhao. 2008. Multi-stream asynchrony modeling for audio visual speech recognition. In F. Mihelič and J. Žibert, editors, Speech Recognition, Technologies and Applications, pp. 297--310. InTech, Vienna. 508
[128]
Y. Mahajan, J. Kim, and C. Davis. 2014. Does elderly speech recognition in noise benefit from spectral and visual cues? In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2021--2025. 491
[129]
S. Mallat. 2010. Recursive interferometric representations. In Proceedings of the European Signal Processing Conference (Eusipco), pp. 716--720. 505
[130]
E. Marcheret, V. Libal, and G. Potamianos. 2007. Dynamic stream weight modeling for audio-visual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 945--948. 511
[131]
E. Marcheret, G. Potamianos, J. Vopicka, and V. Goel. 2015a. Detecting audio-visual synchrony using deep neural networks. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 548--552. 499, 505, 520, 521
[132]
E. Marcheret, G. Potamianos, J. Vopicka, and V. Goel. 2015b. Scattering vs. discrete cosine transform features in visual speech processing. In Proceedings of the International Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing (FAAVSP), pp. 175--180. 506, 513, 517, 522
[133]
M. Marschark, D. LePoutre, and L. Bement. 1998. Mouth movement and signed communication. In R. Campbell, B. Dodd, and D. Burnham, editors, Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-visual Speech, pp. 245--266. Psychology Press Ltd. Publishers, Hove, UK. 493
[134]
D. W. Massaro and D. G. Stork. 1998. Speech recognition and sensory integration. American Scientist, 86(3): 236--244. 491
[135]
I. Matthews and S. Baker. 2004. Active appearance models revisited. International Journal of Computer Vision, 60(2): 135--164. 504
[136]
I. Matthews, G. Potamianos, C. Neti, and J. Luettin. 2001. A comparison of model and transform-based visual features for audio-visual LVCSR. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pp. 825--828. 505
[137]
H. McGurk and J. MacDonald. 1976. Hearing lips and seeing voices. Nature, 264: 746--748. 493
[138]
K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. 1999. XM2VTSDB: The extended M2VTS database. In Proceedings of the International Conference on Audio and Video-based Biometric Person Authentication (AVBPA), pp. 72--76. 498
[139]
Y. Miao and F. Metze. 2016. Open-domain audio-visual speech recognition: a deep learning approach. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 3414--3418. 496, 499
[140]
I. Mporas, O. Kocsis, T. Ganchev, and N. Fakotakis. 2010. Robust speech interaction in motorcycle environment. Expert Systems with Applications, 37(3): 1827--1835. 495
[141]
Y. Mroueh, E. Marcheret, and V. Goel. 2015. Deep multimodal learning for audio-visual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2130--2134. 507, 513, 516
[142]
A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy. 2002. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Applied Signal Processing, 2002(11): 1274--1288. 503, 508, 510
[143]
D. Neil, M. Pfeiffer, and S.-C. Liu. 2016. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, pp. 3882--3890. Curran Associates, Inc. 513
[144]
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. 2011. Multimodal deep learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 689--696. 506, 512
[145]
H. Ninomiya, N. Kitaoka, S. Tamura, Y. Iribe, and K. Takeda. 2015. Integration of deep bottleneck features for audio-visual speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 563--567. 506, 512
[146]
H. J. Nock, G. Iyengar, and C. Neti. 2003. Speaker localisation using audio-visual synchrony: An empirical study. In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR), pp. 488--499. 520
[147]
K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. 2014. Lipreading using convolutional neural network. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1149--1153. 512
[148]
K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. 2015. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4): 722--737. 506, 512
[149]
T. Ojala, M. Pietikäinen, and T. Mäenpää. 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7): 971--987. 506
[150]
Y. Onuma, N. Karnado, H. Saruwatari, and K. Shikano. 2012. Real-time semi-blind speech extraction with speaker direction tracking on Kinect. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA-ASC). 497
[151]
OpenCV. 2000. http://www.opencv.org. (accessed March 1, 2017). 501, 524
[152]
A. Ortega, F. Sukno, E. Lleida, A. Frangi, A. Miguel, L. Buera, and E. Zacur. 2004. AV@CAR: A Spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition. In Proceedings of the Language Resources and Evaluation Conference (LREC), vol. 3, pp. 763--766. 499
[153]
G. Papandreou, A. Katsamanis, V. Pitsikalis, and P. Maragos. 2009. Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 17(3): 423--435. 505, 511
[154]
A. Pass, J. Zhang, and D. Stewart. 2010. An investigation into features for multi-view lipreading. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 2417--2420. 499, 500
[155]
E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy. 2002. CUAVE: A new audio-visual database for multimodal human-computer interface research. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2017--2020. 497, 498
[156]
E. D. Petajan. 1984. Automatic lipreading to enhance speech recognition. In Proceedings of the Global Telecommunications Conference (GlobeCom), pp. 265--272. 490
[157]
S. Petridis and M. Pantic. 2016. Deep complementary bottleneck features for visual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2304--2308. 506
[158]
S. Petridis, Z. Li, and M. Pantic. 2017. End-to-end visual speech recognition with LSTMs. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2592--2596. 504, 514
[159]
A. Potamianos, C. Tzafestas, E. Iosif, F. Kirstein, P. Maragos, K. Dauthenhahn, J. Gustafson, J.-E. Ostergaard, S. Kopp, P. Wik, O. Pietquin, and S. Al Moubayed. 2016. BabyRobot---next generation social robots: Enhancing communication and collaboration development of TD and ASD children by developing and commercially exploiting the next generation of human-robot interaction technologies. In Proceedings of the Workshop on Evaluating Child-Robot Interaction (CRI) at the ACM/IEEE International Conference on Human-Robot Interaction (HRI). 495
[160]
G. Potamianos and H. P. Graf. 1998a. Discriminative training of HMM stream exponents for audio-visual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3733--3736. 511
[161]
G. Potamianos and H. P. Graf. 1998b. Linear discriminant analysis for speechreading. In Proceedings of the IEEE Workshop on Multimedia Signal Processing (MMSP), pp. 221--226. 507
[162]
G. Potamianos and C. Neti. 2003. Audio-visual speech recognition in challenging environments. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pp. 1293--1296. 499, 505, 514
[163]
G. Potamianos and P. Scanlon. 2005. Exploiting lower face symmetry in appearance-based automatic speechreading. In Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), pp. 79--84. 503, 505
[164]
G. Potamianos, H. P. Graf, and E. Cosatto. 1998. An image transform approach for HMM based automatic lipreading. In Proceedings of the IEEE International Conference on Image Processing (ICIP), vol. 3, pp. 173--177. 505
[165]
G. Potamianos, C. Neti, G. Iyengar, A. W. Senior, and A. Verma. 2001. A cascade visual front end for speaker independent automatic speechreading. International Journal of Speech Technology, 4(3--4): 193--208. 507
[166]
G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. 2003. Recent advances in the automatic recognition of audio-visual speech. Proceedings of the IEEE, 91(9): 1306--1326. 490, 497, 503, 507, 508, 510, 514
[167]
S. J. D. Prince. 2012. Computer Vision: Models, Learning, and Inference. Cambridge University Press, Cambridge, UK. 504
[168]
L. Rabiner and B.-H. Juang. 1993. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ. 494, 506, 508
[169]
J. Rajeshwari, K. Karibasappa, and M. T. GopalKrishna. 2014. Survey on skin based face detection on different illumination, posses and occlusion. In Proceedings of the International Conference on Contemporary Computing and Informatics (IC3I), pp. 728--733. 501
[170]
A. Rekik, A. Ben-Hamadou, and W. Mahdi. 2014. A new visual speech recognition approach for RGB-D cameras. In A. Campilho and M. Kamel, editors, Image Analysis and Recognition (ICIAR), Part II, vol. LNCS 8815, pp. 21--28. Springer International Publishing, Switzerland. 497
[171]
E. A. Rua, H. Bredin, C. G. Mateo, G. Chollet, and D. G. Jimenez. 2009. Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models. Pattern Analysis and Applications, 12(3): 271--284. 520, 521
[172]
V. Rudzionis, R. Maskeliunas, and K. Driaunys. 2012. Voice controlled environment for the assistive tools and living space control. In Proceedings of the Federated Conference on Computer Science and Information Systems, pp. 1075--1080. 494, 495
[173]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211--252. 501
[174]
K. Saenko, K. Livescu, J. Glass, and T. Darrell. 2009. Multistream articulatory feature-based models for visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9): 1700--1707. 508
[175]
T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4580--4584. 513
[176]
T. Saitoh, Z. Zhou, G. Zhao, and M. Pietikäinen. 2017. Concatenated frame image based CNN for visual speech recognition. In C.-S. Chen, J. Lu, and K.-K. Ma, editors, Computer Vision---ACCV 2016 Workshops, Part II, vol. LNCS 10117, pp. 277--289. Springer International Publishing, Switzerland. 504, 506
[177]
C. Sanderson and B. C. Lovell. 2009. Multi-region probabilistic histograms for robust and scalable identity inference. In M. Tistarelli and M. S. Nixon, editors, Advances in Biometrics (ICB), vol. LNCS 5558, pp. 199--208. Springer Verlag, Berlin. 498
[178]
N. Sarafianos, T. Giannakopoulos, and S. Petridis. 2016. Audio-visual speaker diarization using Fisher linear semi-discriminant analysis. Multimedia Tools and Applications, 75(1): 115--130. 517
[179]
M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp. 2007. Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Transactions on Multimedia, 9(7): 1396--1403. 520, 521
[180]
P. Scanlon, G. Potamianos, V. Libal, and S. M. Chu. 2004. Mutual information based visual feature selection for lipreading. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pp. 2037--2040. 507
[181]
ScatNet Software. 2013. http://www.di.ens.fr/data/software. (accessed March 1, 2017). 505
[182]
D. Schnelle-Walka and S. Radomski. 2017. Automotive multimodal human-machine interaction. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krüger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software and Commercialization, Emerging Directions. Morgan Claypool Publishers, San Rafael, CA. 494
[183]
A. Shah. 2012. Use voice, gestures to control TV. PCWorld Magazine. http://www.pcworld.com/article/253223/use_voice_gestures_to_control_tv.html. 494
[184]
X. Shao and J. Barker. 2008. Stream weight estimation for multistream audio-visual speech recognition in a multispeaker environment. Speech Communication, 50(4): 337--353. 511
[185]
S. T. Shivappa, M. M. Trivedi, and B. D. Rao. 2010. Audiovisual information fusion in human-computer interfaces and intelligent environments: A survey. Proceedings of the IEEE, 98(10): 1692--1715. 508
[186]
K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR). 501
[187]
M. Slaney and M. Covell. 2000. FaceSync: a linear operator for measuring synchronization of video facial images and audio tracks. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, pp. 814--820. MIT Press, Cambridge, MA. 520
[188]
D. Stewart, R. Seymour, A. Pass, and J. Ming. 2014. Robust audio-visual speech recognition under noisy audio-video conditions. IEEE Transactions on Cybernetics, 44(2): 175--184. 511
[189]
D. G. Stork and M. E. Hennecke, editors. 1996. Speechreading by Humans and Machines. Springer, Berlin. 490
[190]
C. Sui, R. Togneri, and M. Bennamoun. 2015. Extracting deep bottleneck features for visual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1518--1522. 506
[191]
W. H. Sumby and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2): 212--215. 491
[192]
A. Q. Summerfield. 1987. Some preliminaries to a comprehensive account of audiovisual speech perception. In R. Campbell and B. Dodd, editors, Hearing by Eye: The Psychology of Lip-Reading, pp. 3--51. Lawrence Erlbaum Associates, London. 491
[193]
I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS), pp. 3104--3112. Curran Associates, Inc. 514
[194]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1--9. 501
[195]
R. Szeliski. 2011. Computer Vision---Algorithms and Applications. Springer-Verlag, London. 500, 504
[196]
S. Tamura, K. Iwano, and S. Furui. 2005. A stream-weight optimization method for multistream HMMs based on likelihood value normalization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 469--472. 511
[197]
F. Tao, J. H. L. Hansen, and C. Busso. 2016. Improving boundary estimation in audiovisual speech activity detection using Bayesian information criterion. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2130--2134. 517
[198]
I. Tashev. 2013. Kinect development kit: A toolkit for gesture- and speech-based human-machine interaction [Best of the Web]. IEEE Signal Processing Magazine, 30(5): 129--131. 497
[199]
S. Taylor, B.-J. Theobald, and I. Matthews. 2014. The effect of speaking rate on audio and visual speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3037--3041. 494
[200]
P. Teissier, J. Robert-Ribes, J.-L. Schwartz, and A. Guerin-Dugue. 1999. Comparing models for audiovisual fusion in a noisy-vowel recognition task. IEEE Transactions on Speech and Audio Processing, 7(6): 629--642. 509
[201]
L. D. Terissi, G. D. Sad, J. C. Gomez, and M. Parodi. 2015. Audio-visual speech recognition scheme based on wavelets and random forests classification. In A. Pardo and J. Kittler, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP), vol. LNCS 9423, pp. 567--574. Springer International Publishing, Switzerland. 508
[202]
L. Terry. 2011. Audio-Visual Asynchrony Modeling and Analysis for Speech Alignment and Recognition. Ph.D. thesis, Northwestern University, Evanston, IL. 493
[203]
L. H. Terry, D. J. Shiell, and A. K. Katsaggelos. 2008. Feature space video stream consistency estimation for dynamic stream weighting in audio-visual speech recognition. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 1316--1319. 511
[204]
A. Thanda and S. M. Venkatesan. 2016. Audio visual speech recognition using deep recurrent neural networks. Computing Research Repository, http://arXiv.org/abs/arXiv:1611.02879. 513
[205]
K. Thangthai, R. Harvey, S. Cox, and B.-J. Theobald. 2015. Improving lip-reading performance for robust audiovisual speech recognition using DNNs. In Proceedings of the International Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing (FAAVSP), pp. 127--131. 512
[206]
S. Thermos and G. Potamianos. 2016. Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 579--584. 511, 517, 518, 519
[207]
P. Viola and M. Jones. 2001. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 511--518. 501
[208]
A. Vorwerk, X. Wang, D. Kolossa, S. Zeiler, and R. Orglmeister. 2010. WAPUSK20---A database for robust audiovisual speech recognition. In Proceedings of the Language Resources and Evaluation Conference (LREC), pp. 3016--3019. 496, 499
[209]
M. Wand and T. Schultz. 2014. Towards real-life application of EMG-based speech recognition by using unsupervised adaptation. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1189--1193. 496
[210]
M. Wand, J. Koutnik, and J. Schmidhuber. 2016. Lipreading with long short-term memory. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6115--6119. 506, 514
[211]
J. Wang, Y. Gao, J. Zhang, J. Wei, and J. Dang. 2015. Lipreading using profile lips rebuilt by 3D data from the Kinect. Journal of Computational Information Systems, 11(7): 2429--2438. 497
[212]
N. Wang, X. Gao, D. Tao, and X. Li. 2014. Facial feature point detection: A comprehensive survey. Computing Research Repository, http://arXiv.org/abs/arXiv:1410.1037. 501
[213]
T. Watanabe, K. Katsurada, and Y. Kanazawa. 2017. Lip reading from multi view facial images using 3D-AAM. In C.-S. Chen, J. Lu, and K.-K. Ma, editors, Computer Vision---ACCV 2016 Workshops, Part II, vol. LNCS 10117, pp. 303--316. Springer International Publishing, Switzerland. 505
[214]
D. Websdale and B. Milner. 2015. Analysing the importance of different visual feature coefficients. In Proceedings of the International Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing (FAAVSP), pp. 137--142. 507
[215]
C.-H. Wu, J.-C. Lin, and W.-L. Wei. 2014. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies. APSIPA Transactions on Signal and Information Processing, 3(e12): 1--18. 525
[216]
W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. 2017. Achieving human parity in conversational speech recognition. Computing Research Repository, http://arXiv.org/abs/arXiv:1610.05256v2. 489, 513
[217]
M.-H. Yang, D. J. Kriegman, and N. Ahuja. 2002. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1): 34--58. 500
[218]
H. Yehia, P. Rubin, and E. Vatikiotis-Bateson. 1998. Quantitative association of vocal-tract and facial behavior. Speech Communication, 26(1--2): 23--43. 493
[219]
A. Yilmaz, O. Javed, and M. Shah. 2006. Object tracking: A survey. ACM Computing Surveys, 38(4). 502
[220]
T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui. 2003. Audio-visual speech recognition using lip movement extracted from side-face images. In Proceedings of the International Conference on Audio-Visual Speech Processing (AVSP), pp. 117--120.
[221]
S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. 2002. The HTK Book. Cambridge University Engineering Department, Cambridge, UK. 494, 507
[222]
D. Yu and L. Deng. 2015. Automatic Speech Recognition---A Deep Learning Approach. Springer-Verlag, London. 489, 511
[223]
B. P. Yuhas, M. H. Goldstein, Jr., and T. J. Sejnowski. 1989. Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27(11): 65--71. 511
[224]
A. L. Yuille, P. W. Hallinan, and D. S. Cohen. 1992. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2): 99--111. 504
[225]
S. Zafeiriou, C. Zhang, and Z. Zhang. 2015. A survey of face detection in the wild: past, present and future. Computer Vision and Image Understanding, 138: 1--24. 500
[226]
S. Zeiler, J. Cwiklak, and D. Kolossa. 2014. Robust multimodal human machine interaction using the Kinect sensor. In Proceedings of the ITG Symposium on Speech Communication. 497
[227]
G. Zhao, M. Barnard, and M. Pietikäinen. 2009. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 11(7): 1254--1265. 504, 506, 508
[228]
Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen. 2014. A review of recent advances in visual speech decoding. Image and Vision Computing, 32(9): 590--605. 500
[229]
M. Zimmermann, M. M. Ghazi, H. K. Ekenel, and J.-P. Thiran. 2017. Visual speech recognition using PCA networks and LSTMs in a tandem GMM-HMM system. In C.-S. Chen, J. Lu, and K.-K. Ma, editors, Computer Vision---ACCV 2016 Workshops, Part II, vol. LNCS 10117, pp. 264--276. Springer International Publishing, Switzerland. 504, 506

Published In

The Handbook of Multimodal-Multisensor Interfaces: Foundations, User Modeling, and Common Modality Combinations - Volume 1
April 2017
662 pages
ISBN: 9781970001679
DOI: 10.1145/3015783

Publisher

Association for Computing Machinery and Morgan & Claypool
