Abstract
Chances are that most of us have experienced difficulty in listening to our interlocutor during face-to-face conversation in highly noisy environments, such as next to heavy traffic or against a background of high-intensity speech babble or loud music. On such occasions, we may have found ourselves looking at the speaker's lower face as they articulate speech, in order to enhance speech intelligibility. In fact, what we resort to in such circumstances is known as lipreading or speechreading, namely the recognition of the so-called "visual speech modality" and its combination (fusion) with the available noisy audio data.
Similar to humans, automatic speech recognition (ASR) systems also face difficulties in noisy environments. In recent years, ASR technology has made remarkable strides following the adoption of deep-learning techniques [Hinton et al. 2012, Yu and Deng 2015]. This has led to advanced ASR systems that bridge the gap with human performance [Xiong et al. 2017], in contrast to their significant lag two decades earlier, as established by Lippmann [1997]. Nevertheless, the quest for ASR noise robustness, particularly when noise is non-stationary and mismatched to the training data, remains an active research topic [Li et al. 2015].
To help mitigate the aforementioned problem, the question naturally arises as to whether machines can be designed to mimic human speech perception in noise. Namely, can they successfully incorporate visual speech into the ASR pipeline, especially since it represents an additional information source unaffected by the acoustic environment? At Bell Labs, Petajan [1984] was the first to develop and implement an early audio-visual automatic speech recognition (AVASR) system. Since then, the area has witnessed significant research activity, paralleling the advances in traditional audio-only ASR, while also drawing on progress in the computer vision and machine learning fields. Not surprisingly, the adoption of deep-learning techniques has created renewed interest in the field, resulting in remarkable progress in challenging domains, even surpassing human lipreading performance [Chung et al. 2017].
Since the very early works in the field [Stork and Hennecke 1996], the design of AVASR systems has generally followed the basic architecture of Figure 12.1. There, a visual front-end module extracts speech-informative features from video of the speaker's face; these are subsequently fused with acoustic features within the speech recognition process. Clearly, compared to audio-only ASR, visual speech information extraction and audio-visual fusion (or integration) constitute two additional, distinct components on which to focus. Indeed, their robustness under a wide range of audio-visual conditions and their efficient implementation represent significant challenges that, to date, remain the focus of active research. It should be noted that rapid recent advances, leading to so-called "end-to-end" AVASR systems [Assael et al. 2016, Chung et al. 2017], have somewhat blurred the distinction between these two components. Nevertheless, this division remains valuable both for the systematic exposition of the relevant material and for the research and development of new systems.
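To make this two-component view concrete, the following is a minimal, hypothetical Python sketch of the two components added relative to audio-only ASR: a visual front-end that maps mouth-region video frames to per-frame feature vectors, and a simple feature-level (early) fusion of the audio and visual streams before recognition. All function names, feature dimensions, and the random projection standing in for a real visual feature extractor are illustrative assumptions, not part of any system cited above.

import numpy as np

def visual_front_end(mouth_rois: np.ndarray) -> np.ndarray:
    """Map a sequence of mouth region-of-interest images (T x H x W) to
    per-frame visual features (T x 32). A fixed random projection stands in
    for a real appearance- or shape-based feature extractor."""
    T = mouth_rois.shape[0]
    flattened = mouth_rois.reshape(T, -1).astype(np.float64)
    projection = np.random.default_rng(0).standard_normal((flattened.shape[1], 32))
    return flattened @ projection

def fuse_features(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Feature-level (early) fusion: bring the slower visual stream to the
    audio frame rate by repetition, then concatenate the streams per frame."""
    rate = audio_feats.shape[0] // visual_feats.shape[0]   # e.g., 100 Hz audio vs. 25 Hz video
    upsampled = np.repeat(visual_feats, rate, axis=0)[:audio_feats.shape[0]]
    return np.concatenate([audio_feats, upsampled], axis=1)

# Toy usage: 100 audio frames (39-dim, MFCC-like) and 25 video frames (32x32 mouth ROIs).
audio = np.random.default_rng(1).standard_normal((100, 39))
video = np.random.default_rng(2).standard_normal((25, 32, 32))
joint = fuse_features(audio, visual_front_end(video))
print(joint.shape)   # (100, 71): fused features that would replace audio-only features in the recognizer

Plain per-frame concatenation is only the simplest possible fusion strategy; audio-visual fusion for ASR is discussed in Section 12.5.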
In this chapter, we concentrate on AVASR while also addressing other related problems, namely audio-visual speech activity detection, diarization, and synchrony detection. To address these subjects, we first provide additional motivation in Section 12.2, discussing the bimodality of human speech perception and production. In Section 12.3, we overview AVASR research in terms of its potential application scenarios in multimodal interfaces, the visual sensors employed, and the audio-visual databases typically used. In Section 12.4, we cover visual feature extraction, and, in Section 12.5, we discuss audio-visual fusion for ASR, also providing examples of experimental results achieved by AVASR systems. In Section 12.6, we offer a glimpse into additional audio-visual speech applications. We conclude the chapter by enumerating Focus Questions for further study. In addition, we provide a brief Glossary of the chapter's core terminology, serving as a quick reference.