
Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. This setup is mainly aimed at recording an expressive audiovisual corpus in the context of audiovisual speech synthesis. When recording speech, standard optical motion-capture systems fail to track the articulators finely, especially in the inner mouth region, because certain markers disappear during articulation. In addition, some systems have limited frame rates and are not suitable for smooth speech tracking. In this work, we demonstrate how these limitations can be overcome by building a heterogeneous system that takes advantage of several tracking systems. Within the scope of this work, we recorded a prototypical corpus with our combined system for a single subject. This corpus was used to validate our multimodal data acquisition protocol and to assess the quality of the expressiveness before recording a large corpus. We conducted two evaluations of the recorded data: the first concerns the production aspect of speech and the second focuses on the perception aspect (both evaluations cover the visual and acoustic modalities). The production analysis allowed us to identify characteristics specific to each expressive context and showed that the expressive content of the recorded data is globally in line with what is commonly reported in the literature. The perceptual evaluation, conducted as a human emotion recognition task using different types of stimuli, confirmed that the recorded emotions were well perceived.
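The abstract does not detail how the streams from the heterogeneous trackers are brought together. As a rough illustration only, the sketch below (Python with NumPy; the marker counts, frame rates, and function names are our own assumptions, not the authors' protocol) shows the kind of temporal resampling that merging capture systems with different frame rates typically involves.

```python
# Hypothetical sketch (not the authors' pipeline): aligning motion-capture
# streams recorded at different frame rates onto a common timeline by
# linear interpolation, so the fused corpus has a single sampling rate.
import numpy as np

def resample_stream(timestamps, frames, target_times):
    """Linearly interpolate marker trajectories (frames: [T, n_markers, 3])
    from their native timestamps onto target_times."""
    frames = np.asarray(frames, dtype=float)
    T, n_markers, dims = frames.shape
    flat = frames.reshape(T, n_markers * dims)
    out = np.empty((len(target_times), n_markers * dims))
    for k in range(flat.shape[1]):
        out[:, k] = np.interp(target_times, timestamps, flat[:, k])
    return out.reshape(len(target_times), n_markers, dims)

# Example with dummy data: a 60 fps facial stream resampled onto a
# shared 100 Hz timeline (rates and marker count are illustrative).
face_t = np.arange(0.0, 2.0, 1 / 60)            # 60 fps timestamps
face_xyz = np.random.rand(len(face_t), 40, 3)   # 40 facial markers
common_t = np.arange(0.0, 2.0, 1 / 100)         # common 100 Hz timeline
face_100hz = resample_stream(face_t, face_xyz, common_t)
print(face_100hz.shape)  # (200, 40, 3)
```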



Notes

  1. The sentences of medium length have not been used in the different analyses presented in this paper.


Acknowledgements

This work was supported by Region Lorraine (COREXP Project), Inria (ADT Plavis) and the EQUIPEX Ortolang. We also thank our actor F.S. for his participation in this study.

Funding

This work was supported by Region Lorraine (COREXP Project), Inria (ADT Plavis) and the Agence Nationale de la Recherche (EQUIPEX Ortolang).

Author information


Corresponding author

Correspondence to Slim Ouni.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Dahmani, S., Colotte, V. & Ouni, S. Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform. Lang Resources & Evaluation 54, 943–974 (2020). https://doi.org/10.1007/s10579-020-09500-w
