Abstract
In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. The setup is primarily intended for recording expressive audiovisual corpora in the context of audiovisual speech synthesis. When recording speech, standard optical motion-capture systems fail to track the articulators finely, especially in the inner mouth region, because some markers disappear during articulation. Moreover, some systems have limited frame rates and are therefore unsuitable for smooth speech tracking. In this work, we demonstrate how these limitations can be overcome by building a heterogeneous system that takes advantage of several tracking systems. Within the scope of this work, we recorded a prototypical corpus with our combined system for a single subject. This corpus was used to validate our multimodal data acquisition protocol and to assess the quality of the expressiveness before recording a larger corpus. We conducted two evaluations of the recorded data: the first concerns the production aspect of speech, and the second focuses on speech perception (both evaluations cover the visual and acoustic modalities). The production analysis allowed us to identify characteristics specific to each expressive context, and showed that the expressive content of the recorded data is globally in line with what is commonly reported in the literature. The perceptual evaluation, conducted as a human emotion recognition task using different types of stimuli, confirmed that the recorded emotions were well perceived.
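A central practical issue raised above is that the combined tracking systems run at different frame rates, so their streams must be brought onto a common timeline before fusion. As a minimal sketch (not the authors' actual pipeline; the frame rates and function names here are illustrative assumptions), a lower-rate marker stream can be linearly interpolated onto the timeline of the fastest system:

```python
import numpy as np

def resample_stream(src_times, src_values, target_times):
    """Linearly interpolate one marker coordinate onto a new timeline.

    src_times, target_times: timestamps in seconds (monotonically increasing).
    src_values: the coordinate samples recorded at src_times.
    """
    return np.interp(target_times, src_times, src_values)

# Hypothetical example: a 60 fps optical stream resampled to a 200 fps timeline.
src_times = np.arange(0.0, 1.0, 1 / 60)           # 60 fps timestamps
src_vals = np.sin(2 * np.pi * 3 * src_times)      # one synthetic marker coordinate
tgt_times = np.arange(0.0, 1.0, 1 / 200)          # 200 fps reference timeline
resampled = resample_stream(src_times, src_vals, tgt_times)
```

In a real setup, each system's stream would first be aligned to a shared clock (e.g., via a synchronization event) before resampling; the sketch only covers the rate-conversion step.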
Notes
The medium-length sentences were not used in the analyses presented in this paper.
Acknowledgements
This work was supported by Region Lorraine (COREXP Project), Inria (ADT Plavis) and the EQUIPEX Ortolang. We also thank our actor F.S. for his participation in this study.
Funding
This work was supported by Region Lorraine (COREXP Project), Inria (ADT Plavis) and the Agence Nationale de la Recherche (EQUIPEX Ortolang).
Cite this article
Dahmani, S., Colotte, V. & Ouni, S. Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform. Lang Resources & Evaluation 54, 943–974 (2020). https://doi.org/10.1007/s10579-020-09500-w