Abstract
In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. The setup is primarily intended for recording expressive audiovisual corpora in the context of audiovisual speech synthesis. When recording speech, standard optical motion-capture systems fail to track the articulators finely, especially in the inner mouth region, because some markers disappear during articulation. Moreover, some systems have limited frame rates and are therefore unsuitable for smooth speech tracking. In this work, we demonstrate how these limitations can be overcome by building a heterogeneous system that takes advantage of several tracking systems. Within the scope of this work, we recorded a prototypical corpus with our combined system for a single subject. This corpus was used to validate our multimodal data acquisition protocol and to assess the quality of the expressiveness before recording a larger corpus. We conducted two evaluations of the recorded data: the first concerns the production aspect of speech, and the second focuses on speech perception (both evaluations cover the visual and acoustic modalities). The production analysis allowed us to identify characteristics specific to each expressive context, and showed that the expressive content of the recorded data is globally in line with what is commonly reported in the literature. The perceptual evaluation, conducted as a human emotion recognition task using different types of stimuli, confirmed that the recorded emotions were well perceived.
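A central practical issue raised above is that the combined tracking systems run at different frame rates, so their streams must be brought onto a common timeline before fusion. As a minimal sketch (not the authors' actual pipeline; the frame rates and function names here are illustrative assumptions), a lower-rate marker stream can be linearly interpolated onto the timeline of the fastest system:

```python
import numpy as np

def resample_stream(src_times, src_values, target_times):
    """Linearly interpolate one marker coordinate onto a new timeline.

    src_times, target_times: timestamps in seconds (monotonically increasing).
    src_values: the coordinate samples recorded at src_times.
    """
    return np.interp(target_times, src_times, src_values)

# Hypothetical example: a 60 fps optical stream resampled to a 200 fps timeline.
src_times = np.arange(0.0, 1.0, 1 / 60)           # 60 fps timestamps
src_vals = np.sin(2 * np.pi * 3 * src_times)      # one synthetic marker coordinate
tgt_times = np.arange(0.0, 1.0, 1 / 200)          # 200 fps reference timeline
resampled = resample_stream(src_times, src_vals, tgt_times)
```

In a real setup, each system's stream would first be aligned to a shared clock (e.g., via a synchronization event) before resampling; the sketch only covers the rate-conversion step.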
Notes
The medium-length sentences were not used in the analyses presented in this paper.
Acknowledgements
This work was supported by Region Lorraine (COREXP Project), Inria (ADT Plavis) and the EQUIPEX Ortolang. We also thank our actor F.S. for his participation in this study.
Funding
This work was supported by Region Lorraine (COREXP Project), Inria (ADT Plavis) and the Agence Nationale de la Recherche (EQUIPEX Ortolang).
Cite this article
Dahmani, S., Colotte, V. & Ouni, S. Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform. Lang Resources & Evaluation 54, 943–974 (2020). https://doi.org/10.1007/s10579-020-09500-w