Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6456)

Abstract

In this chapter, we investigate the effects of facial prominence cues, in the form of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment was conducted in which acoustically degraded speech was presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures. The experiment shows that visual prominence gestures synchronized with the auditory prominence significantly increase speech intelligibility compared to the same gestures added to the speech at random.

We also present a study examining how the behavior of the talking head is perceived when gestures are added at pitch movements. Using eye-gaze tracking and questionnaires with 10 moderately hearing-impaired subjects, the gaze data show that when gestures are coupled with pitch movements, users look at the face much as they look at a natural face, as opposed to when the face carries no gestures. The questionnaire results further show that these gestures significantly increase the perceived naturalness and helpfulness of the talking head.
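Coupling gestures to pitch movements implies some rule for placing gesture onsets on the F0 contour. As a hedged sketch under assumptions (the function name, frame rate, and mean-plus-one-standard-deviation threshold are illustrative choices, not the chapter's actual prominence detector), gesture onsets could be scheduled at prominent F0 peaks:

```python
import numpy as np

def pitch_peak_times(f0: np.ndarray, frame_s: float = 0.01,
                     k: float = 1.0) -> list[float]:
    """Times (s) of F0 local maxima exceeding mean + k*std of voiced frames.

    Assumes one F0 value per frame, with unvoiced frames marked as 0.
    """
    voiced = f0[f0 > 0]
    thresh = voiced.mean() + k * voiced.std()
    peaks = [i for i in range(1, len(f0) - 1)
             if f0[i] > thresh and f0[i] >= f0[i - 1] and f0[i] > f0[i + 1]]
    return [round(i * frame_s, 3) for i in peaks]

# Toy contour: two accent peaks over a flat 120 Hz baseline.
f0 = np.full(20, 120.0)
f0[5] = 180.0   # first accent
f0[12] = 170.0  # second accent
print(pitch_peak_times(f0))  # → [0.05, 0.12]
```

Timestamps like these could then trigger head-nod or eyebrow-raise animations, so that visual gestures land on the auditorily prominent syllables rather than at random points.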



Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Al Moubayed, S., Beskow, J., Granström, B., House, D. (2011). Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence. In: Esposito, A., Esposito, A.M., Martone, R., Müller, V.C., Scarpetta, G. (eds) Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces. Theoretical and Practical Issues. Lecture Notes in Computer Science, vol 6456. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-18184-9_6


  • DOI: https://doi.org/10.1007/978-3-642-18184-9_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-18183-2

  • Online ISBN: 978-3-642-18184-9

