The Impression of Phones and Prosody Choice in the Gibberish Speech of the Virtual Embodied Conversational Agent Kotaro
Figure 1. Russell’s two-dimensional model of valence and arousal and the mapping of some emotions within it.
Figure 2. “Talk to Kotaro” experiment screen, showing the ECA Kotaro, the video feed from the volunteer’s camera (one of the authors, in this case), and the turn-taking button (green).
Figure 3. $\vec{\delta}_E$ represented in the valence–arousal emotion space, where arrows indicate the valence–arousal change and grey triangles denote utterances that caused no visible emotional impression.
Figure 4. Architecture of the LSTM-based neural network used for analyzing the sentiment of the participants’ voice recordings.
Figure 5. Neural networks used for impression prediction in this work. (a) Architecture of the bi-directional GRU neural network $GRU_{phones}$ for generating a phone-embedding matrix; (b) architecture of the neural network $MLP_{profile+prosody}$ and its variations, where X represents the number of columns of the input vector.
Figure 6. Emotional state changes caused by every utterance in the “Talk to Kotaro” experiment in the valence–arousal space, where a blue arrow denotes a positive change in valence, a red arrow a negative change, and a grey triangle no visible emotional change.
Figure 7. Histograms of $\delta_v$ (top left), $\delta_a$ (top right), and $\|\vec{\delta}_E\|$ (bottom) for every utterance generated in the “Talk to Kotaro” experiment.
Figure 8. Every emotion change $\vec{\delta}_E$ in the data set, represented in the impression space.
Figure 9. Scatter plots of the average valence each time volunteers M6, F3, F7, and F12 listened to a GS utterance, with linear-regression results for each session and across multiple sessions. Points of the same color were obtained in the same session, and the regression line for that session shares the color of its points.
Figure 10. Scatter plots of the average arousal each time volunteers M2, M8, F3, and F7 listened to a GS utterance, with linear-regression results for each session and across multiple sessions. Points of the same color were obtained in the same session, and the regression line for that session shares the color of its points.
Figure 11. Pairwise Stuart–Kendall correlation coefficient matrices, where the top number in each cell is the coefficient and the number in parentheses the associated p-value.
Figure 12. Comparison between the actual impression and the impression predicted by $MLP_{profile+prosody}$ for (top left) training data, (top right) validation data, and (bottom) test data.
Figure 13. Embedding values of IPA vowels for valence and arousal estimation. IPA symbols in red were absent from the generated utterances; other symbols are colored according to the index of the cluster they belong to, as shown by the rightmost color bar. (a) Vowel embedding values for valence change estimation; (b) vowel embedding values for arousal change estimation.
Figure 14. IPA consonant table with embedding values for valence change estimation. IPA symbols in red were absent from the generated utterances; other symbols are colored according to the index of the cluster they belong to, as shown by the rightmost color bar.
Figure 15. IPA consonant table with embedding values for arousal change estimation. IPA symbols in red were absent from the generated utterances; other symbols are colored according to the index of the cluster they belong to, as shown by the rightmost color bar.
Figure 16. Comparison between the actual impression and the impression predicted by $GRU_{phones}$ for (top left) training data, (top right) validation data, and (bottom) test data.
Figure 17. Comparison between the actual impression and the impression predicted by GSIP for (top left) training data, (top right) validation data, and (bottom) test data.
Figure 18. Male (blue), female (yellow), and overall (green) responses to prompts 1 to 5 of the optional Likert scale questionnaire. The median response is highlighted in orange; outliers are shown as small circles.
Figure 19. Male (blue), female (yellow), and overall (green) responses to prompts 6 to 10 of the optional Likert scale questionnaire. The median response is highlighted in orange; outliers are shown as small circles.
Figure 20. Bar plots of the male and female responses to prompts 1 to 5 of the optional Likert scale questionnaire.
Figure 21. Bar plots of the male and female responses to prompts 6 to 10 of the optional Likert scale questionnaire.
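The captions above share a common notation. As a brief reminder, reconstructed from the captions themselves (the formal definitions appear in the body of the paper): with $\delta_v$ and $\delta_a$ denoting the change in a participant’s estimated valence and arousal from before an utterance to after it, the emotional state change of Figures 3, 6, and 7 is

```latex
\vec{\delta}_E = (\delta_v, \delta_a), \qquad
\lVert \vec{\delta}_E \rVert = \sqrt{\delta_v^2 + \delta_a^2}
```

a vector in Russell’s valence–arousal plane whose norm measures the overall strength of the impression an utterance made.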
Abstract
1. Introduction
2. Background
2.1. Gibberish Speech
2.2. Prosody
2.3. Valence and Arousal
2.4. Speech Act
2.5. Statistical Bootstrapping
2.6. Literature Review
3. Materials and Methods
3.1. Talk to Kotaro: A Web Crowdsourcing Experiment
3.1.1. IPA-Based Gibberish Speech Generation
Algorithm 1: IPA gibberish speech generation algorithm
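A minimal sketch of how such IPA-based generation can be implemented, under stated assumptions: a reduced, hypothetical phone inventory written in eSpeak’s Kirshenbaum (ASCII-IPA) notation, simple consonant–vowel syllables, and uniformly sampled prosodic parameters. The paper’s Algorithm 1 draws on the full IPA inventory; everything below is illustrative only.

```python
import random
import subprocess

# Hypothetical, reduced phone inventory in eSpeak's Kirshenbaum (ASCII-IPA)
# notation; the actual algorithm samples from the full IPA inventory.
VOWELS = ["a", "e", "i", "o", "u", "@"]
CONSONANTS = ["p", "b", "t", "d", "k", "g", "m", "n", "s", "z", "l", "r"]

def generate_utterance(rng=random):
    """Sample a gibberish utterance as a string of consonant-vowel syllables."""
    n_syllables = rng.randint(2, 6)
    return "".join(rng.choice(CONSONANTS) + rng.choice(VOWELS)
                   for _ in range(n_syllables))

def speak(utterance, rng=random):
    """Synthesize the utterance with randomized prosody via the eSpeak CLI."""
    speed = rng.randint(80, 260)   # words per minute (eSpeak -s)
    pitch = rng.randint(0, 99)     # eSpeak pitch scale 0-99 (-p)
    volume = rng.randint(50, 150)  # amplitude, default 100 (-a)
    # Double square brackets make eSpeak read its phoneme mnemonics directly.
    subprocess.run(["espeak", "-s", str(speed), "-p", str(pitch),
                    "-a", str(volume), f"[[{utterance}]]"], check=True)

if __name__ == "__main__":
    utterance = generate_utterance()
    print("utterance:", utterance)
    speak(utterance)
```

The prosodic ranges above follow eSpeak’s documented parameter scales and are not the sampling ranges actually used in the experiment, which Algorithm 1 specifies.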
3.1.2. Likert Scale Questionnaire
- (P1) Talking with the robot avatar was interesting;
- (P2) Variation of the speech characteristics made conversation more natural;
- (P3) Some randomly generated words are less pleasant than others;
- (P4) Some speech characteristics, such as speed, loudness or pitch, influence more than others;
- (P5) Different random words didn’t have an impact on your enjoyment;
- (P6) You felt that the robot was answering your speech accordingly;
- (P7) Longer phrases were more interesting;
- (P8) The turn-based conversation felt unnatural;
- (P9) Foreign-sounding phones were more interesting;
- (P10) The robot seemed to be intelligent.
3.2. Neural Network Architectures for Emotion Analysis
3.2.1. Emotion Estimation from Video
3.2.2. Sentiment Analysis of Recorded Speech
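Figure 4 shows an LSTM operating on acoustic features of the recordings. A minimal sketch of a typical front end for such a network, assuming MFCC input (MFCC appears in the paper’s abbreviation list) extracted with librosa; the frame and coefficient counts are illustrative:

```python
import numpy as np
import librosa

def mfcc_sequence(path, n_mfcc=13, sr=16000, max_frames=200):
    """Load one recording and return a fixed-length (max_frames, n_mfcc)
    MFCC sequence, zero-padded or truncated, suitable as LSTM input."""
    signal, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
    padded = np.zeros((max_frames, n_mfcc), dtype=np.float32)
    n = min(max_frames, mfcc.shape[0])
    padded[:n] = mfcc[:n]
    return padded
```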
3.2.3. Gibberish Speech Impression Prediction System
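Figure 5 sketches the two predictor networks. A minimal Keras sketch consistent with its caption, with every width and size hypothetical: $GRU_{phones}$ is a bi-directional GRU whose learned embedding weights are the per-phone values later plotted on the IPA charts (one model per target, matching the separate valence and arousal panels of Figures 13–15), and $MLP_{profile+prosody}$ is a dense network over X columns of profile and prosody features.

```python
from tensorflow.keras import layers, models

N_PHONES = 100  # assumption: number of distinct IPA phones (index 0 = padding)
MAX_LEN = 32    # assumption: maximum utterance length in phones
X_COLS = 16     # assumption: number of profile + prosody feature columns

def build_gru_phones():
    """Bi-directional GRU over phone indices. The Embedding weights are the
    per-phone 'embedding values' visualized on the IPA charts; one such
    model is trained per target (valence change, arousal change)."""
    phone_ids = layers.Input(shape=(MAX_LEN,), dtype="int32")
    emb = layers.Embedding(N_PHONES + 1, 1, mask_zero=True)(phone_ids)
    h = layers.Bidirectional(layers.GRU(32))(emb)
    out = layers.Dense(1, activation="tanh")(h)  # predicted change in [-1, 1]
    return models.Model(phone_ids, out)

def build_mlp_profile_prosody():
    """MLP over X columns of concatenated participant-profile and prosody features."""
    features = layers.Input(shape=(X_COLS,))
    h = layers.Dense(64, activation="relu")(features)
    h = layers.Dense(32, activation="relu")(h)
    out = layers.Dense(2, activation="tanh")(h)  # (delta_v, delta_a)
    return models.Model(features, out)
```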
4. Results
4.1. Breakdown of Participants’ Profiles
4.2. Impression Estimation from Video and Prosody Correlation
Emotional State Change Estimation Error
4.3. Analysis of the Recorded Speech Supports the Findings of the Video Analysis
4.4. Phone Embedding Analysis
4.5. GSIP Evaluation
4.6. Likert Scale Questionnaire Analysis
- P1: Talking with the robot avatar was interesting
- P2: Variation of the speech characteristics made conversation more natural
- P3: Some randomly generated words are less pleasant than others
- P4: Some speech characteristics, such as speed, loudness or pitch, influence more than others
- P5: Different random words didn’t have an impact on your enjoyment
- P6: You felt that the robot was answering your speech accordingly
- P7: Longer phrases were more interesting
- P8: The turn-based conversation felt unnatural
- P9: Foreign-sounding phones were more interesting
- P10: The robot seemed to be intelligent.
5. Discussion
5.1. Effects of Kotaro’s Gibberish Speech on Listeners
5.2. Effects of Prosody, Duration of Interaction, and Phone Choice
5.3. Performance of the GSIP System
6. Conclusions and Future Works
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
CUI | Conversational User Interface
ECA | Embodied Conversational Agent
GRU | Gated Recurrent Unit
GS | Gibberish Speech
GSIP | Gibberish Speech Impression Prediction System
GUI | Graphical User Interface
IPA | International Phonetic Alphabet
KDE | Kernel Density Estimation
MANOVA | Multivariate Analysis of Variance
MFCC | Mel-Frequency Cepstral Coefficients
MLP | Multilayer Perceptron
NLP | Natural Language Processing
NN | Neural Network
SFU | Semantic-Free Utterance
Country/Region of Origin | Male | Female | All | Mother Language | Male | Female | All | Language Spoken (Total Speakers) | Male | Female | All
---|---|---|---|---|---|---|---|---|---|---|---
Japan | 9 | 10 | 19 | Japanese | 9 | 11 | 20 | English | 19 | 15 | 34
Brazil | 6 | 1 | 7 | Portuguese (Brazil) | 6 | 1 | 7 | Japanese | 14 | 11 | 25
Malaysia | 2 | 0 | 2 | Mandarin | 3 | 0 | 3 | Portuguese (Brazil) | 6 | 1 | 7
China | 1 | 0 | 1 | Cantonese | 0 | 1 | 1 | Mandarin | 3 | 0 | 3
Hong Kong (China) | 1 | 0 | 1 | English | 0 | 1 | 1 | Malaysian | 2 | 0 | 2
India | 0 | 1 | 1 | Marathi | 0 | 1 | 1 | Arabic | 1 | 0 | 1
Peru | 1 | 0 | 1 | Spanish | 1 | 0 | 1 | Cantonese | 1 | 0 | 1
USA | 0 | 1 | 1 | Arabic | 1 | 0 | 1 | Spanish | 1 | 0 | 1
Bangladesh | 1 | 0 | 1 | Sinhala | 1 | 0 | 1 | Sanskrit | 0 | 1 | 1
Egypt | 1 | 0 | 1 | Bengali | 1 | 0 | 1 | Korean | 0 | 1 | 1
Sri Lanka | 1 | 0 | 1 | Sinhala | 1 | 0 | 1 | | | |
Undisclosed | 0 | 1 | 1 | Bengali | 1 | 0 | 1 | | | |
 | | | | Hindi | 0 | 1 | 1 | | | |
 | | | | Marathi | 0 | 1 | 1 | | | |
Statistic | Value | Num DF | Den DF | F Value | Pr > F
---|---|---|---|---|---
Wilks’ lambda | 0.9950 | 2.0000 | 628.0000 | 1.5839 | 0.2060
Pillai’s trace | 0.0050 | 2.0000 | 628.0000 | 1.5839 | 0.2060
Hotelling–Lawley trace | 0.0050 | 2.0000 | 628.0000 | 1.5839 | 0.2060
Roy’s greatest root | 0.0050 | 2.0000 | 628.0000 | 1.5839 | 0.2060
Statistic | Value | Num DF | Den DF | F Value | Pr > F
---|---|---|---|---|---
Wilks’ lambda | 0.9716 | 4.0000 | 1120.0000 | 4.0687 | 0.0028
Pillai’s trace | 0.0285 | 4.0000 | 1122.0000 | 4.0534 | 0.0029
Hotelling–Lawley trace | 0.0292 | 4.0000 | 670.9614 | 4.0889 | 0.0028
Roy’s greatest root | 0.0274 | 2.0000 | 561.0000 | 7.6855 | 0.0005
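The four statistics reported above (Wilks’ lambda, Pillai’s trace, Hotelling–Lawley trace, Roy’s greatest root) are the standard MANOVA test battery. A minimal sketch of how such a test can be run with statsmodels, on a hypothetical data frame whose column names are illustrative:

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical input: one row per utterance, with the measured valence and
# arousal changes and a categorical factor to test (e.g., participant gender).
df = pd.DataFrame({
    "delta_v": [0.05, -0.02, 0.11, -0.07, 0.03, 0.00, 0.08, -0.04],
    "delta_a": [0.01, 0.04, -0.03, 0.02, -0.05, 0.06, -0.01, 0.03],
    "group":   ["M", "F", "M", "F", "M", "F", "M", "F"],
})

# Fits the multivariate linear model (delta_v, delta_a) ~ group and reports
# Wilks' lambda, Pillai's trace, Hotelling-Lawley trace, and Roy's greatest root.
fit = MANOVA.from_formula("delta_v + delta_a ~ group", data=df)
print(fit.mv_test())
```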
Group | Prosodic Parameter | Valence | Arousal
---|---|---|---
General | Speed | [−0.065, 0.038] | [−0.036, 0.070]
General | Volume | [−0.095, 0.014] | [−0.040, 0.068]
General | Pitch | [0.0045, 0.110] | [−0.054, 0.052]
Male | Speed | [−0.076, 0.063] | [−0.061, 0.077]
Male | Volume | [−0.091, 0.050] | [−0.074, 0.064]
Male | Pitch | [−0.020, 0.114] | [−0.048, 0.076]
Female | Speed | [−0.062, 0.098] | [−0.119, 0.061]
Female | Volume | [−0.156, 0.025] | [−0.060, 0.114]
Female | Pitch | [−0.063, 0.101] | [−0.076, 0.106]
Brazilian | Speed | [−0.058, 0.127] | [−0.168, 0.029]
Brazilian | Volume | [−0.087, 0.100] | [−0.100, 0.095]
Brazilian | Pitch | [−0.117, 0.060] | [−0.018, 0.142]
Japanese | Speed | [−0.053, 0.103] | [−0.090, 0.086]
Japanese | Volume | [−0.136, 0.041] | [−0.067, 0.100]
Japanese | Pitch | [−0.051, 0.113] | [−0.083, 0.085]
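Confidence intervals like those above can be obtained by bootstrapping the correlation coefficient. A minimal sketch using SciPy’s BCa bootstrap (the bias-corrected method discussed in the Background), assuming the intervals are for the Stuart–Kendall coefficient of Figure 11 and using synthetic paired data with illustrative names:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic paired observations: one prosodic parameter (e.g., speed) and the
# valence change measured for the same utterances.
speed = rng.uniform(80, 260, size=200)
delta_v = 0.0005 * (speed - 170) + rng.normal(0.0, 0.1, size=200)

def tau_c(x, y):
    # Stuart-Kendall tau_c, as used in the correlation analysis.
    return stats.kendalltau(x, y, variant="c").statistic

# BCa bootstrap confidence interval for the paired statistic.
res = stats.bootstrap((speed, delta_v), tau_c, paired=True, vectorized=False,
                      confidence_level=0.95, method="BCa", random_state=rng)
print(res.confidence_interval)  # -> ConfidenceInterval(low=..., high=...)
```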
Emotion Label | Number of Samples |
---|---|
Disgust | 118 |
Angry | 113 |
Happy | 78 |
Surprised | 54 |
Fearful | 46 |
Sad | 43 |
Calm | 38 |
Neutral | 27 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).