DOI: 10.1145/3544548.3581210

Performative Vocal Synthesis for Foreign Language Intonation Practice

Published: 19 April 2023

Abstract

Typical foreign language (L2) pronunciation training focuses mainly on individual sounds. Intonation, the pattern of pitch change across words or phrases, is often neglected, despite its key role in word-level intelligibility and in the expression of attitudes and affect. This paper examines hand-controlled real-time vocal synthesis, known as Performative Vocal Synthesis (PVS), as an interaction technique for practicing L2 intonation in computer-aided pronunciation training (CAPT).
We evaluate a tablet-based interface on which users gesturally control the pitch of a pre-recorded utterance by drawing curves on the touchscreen. Twenty-four subjects (12 French learners, 12 British controls) imitated English phrases with their voice and with the interface. An acoustic analysis and an expert perceptual evaluation showed that learners’ gestural imitations of the fall-rise intonation pattern, which is typically difficult for francophones, were more accurate than their vocal imitations, suggesting that PVS can help learners produce intonation patterns beyond the capabilities of their natural voice.
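
To make the gesture-to-pitch idea concrete, the sketch below shows one way a drawn curve's vertical position could be mapped to a fundamental-frequency contour and auditioned. This is a minimal illustration, not the interface evaluated in the paper: the normalized touch height, the two-octave range around a 150 Hz reference, and the sine-tone rendering (standing in for resynthesis of a pre-recorded utterance) are all illustrative assumptions.

```python
# Minimal sketch of a gesture-to-pitch mapping (illustrative only; not the
# system evaluated in the paper). Assumptions: touch height is normalized to
# [0, 1], the screen spans two octaves around a 150 Hz reference, and the
# contour is auditioned as a plain sine tone rather than a resynthesized voice.
import numpy as np
from scipy.io import wavfile

SR = 16000                 # audio sample rate (Hz)
F0_REF = 150.0             # reference pitch at mid-screen height (Hz)
RANGE_SEMITONES = 24.0     # assumed total pitch range of the screen

def curve_to_f0(y_norm: np.ndarray, n_samples: int) -> np.ndarray:
    """Convert normalized touch heights (0..1) to a per-sample f0 contour."""
    semitones = (y_norm - 0.5) * RANGE_SEMITONES      # offset from reference
    f0 = F0_REF * 2.0 ** (semitones / 12.0)           # semitones -> Hz
    # Resample the sparse gesture points to one f0 value per audio sample.
    t_gesture = np.linspace(0.0, 1.0, len(y_norm))
    t_audio = np.linspace(0.0, 1.0, n_samples)
    return np.interp(t_audio, t_gesture, f0)

def synthesize(f0: np.ndarray) -> np.ndarray:
    """Render the contour as a sine tone by integrating instantaneous frequency."""
    phase = 2.0 * np.pi * np.cumsum(f0) / SR
    return 0.3 * np.sin(phase)

if __name__ == "__main__":
    # A fall-rise gesture: the finger moves down, then back up.
    gesture_y = np.concatenate([np.linspace(0.7, 0.3, 30),
                                np.linspace(0.3, 0.8, 20)])
    f0 = curve_to_f0(gesture_y, n_samples=SR)          # one second of audio
    wavfile.write("fall_rise.wav", SR, synthesize(f0).astype(np.float32))
```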

Supplementary Material

  • MP4 File (3544548.3581210-talk-video.mp4): Pre-recorded Video Presentation
  • MP4 File (3544548.3581210-video-preview.mp4): Video Preview
  • MP4 File (3544548.3581210-video-figure.mp4): Video Figure

Published In

CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems
April 2023
14911 pages
ISBN: 9781450394215
DOI: 10.1145/3544548
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 April 2023

Author Tags

  1. CAPT
  2. gesture
  3. intonation
  4. language learning
  5. performative vocal synthesis
  6. prosody
  7. vocal synthesis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CHI '23

Acceptance Rates

Overall Acceptance Rate 6,199 of 26,314 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 136
  • Downloads (last 6 weeks): 18
Reflects downloads up to 16 Feb 2025.

Cited By

  • (2024) Tuning In to Intangibility: Reflections from My First 3 Years of Theremin Learning. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 2649–2659. https://doi.org/10.1145/3643834.3661584. Online publication date: 1 July 2024.
