
Speech Synthesis With Mixed Emotions

Published: 30 December 2022

Abstract

Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles but also exploits the ordinal nature of emotions by quantifying the differences from other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. Objective and subjective evaluations validate the effectiveness of the proposed framework. To the best of our knowledge, this is the first study on modelling, synthesizing, and evaluating mixed emotions in speech.
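As a rough illustration of the run-time control described in the abstract, the sketch below shows how an emotion attribute vector could encode a desired mixture and be passed to a synthesizer as a conditioning input. It is a minimal sketch under stated assumptions, not the authors' implementation: all names (EMOTIONS, make_attribute_vector, model.synthesize) are hypothetical, and the only assumption taken from the abstract is that one weight per emotion category defines the mixture.

# Hypothetical sketch: an emotion attribute vector assigns a weight to each
# emotion category; a trained emotional TTS model would condition on this
# vector at run-time to render a mixture of emotions.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise"]  # illustrative categories

def make_attribute_vector(weights):
    """Build an attribute vector; emotions not listed in `weights` get 0."""
    vec = np.array([weights.get(e, 0.0) for e in EMOTIONS], dtype=np.float32)
    total = vec.sum()
    return vec / total if total > 0 else vec  # normalize so the mixture sums to 1

# Example: speech that is mostly sad with a touch of anger.
attr = make_attribute_vector({"sad": 0.7, "angry": 0.3})

# A sequence-to-sequence emotional TTS model would take the text and the
# attribute vector as inputs (hypothetical interface, not the authors' API):
# wav = model.synthesize("I never expected it to end this way.", emotion_attributes=attr)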


Cited By

  • Empathy by Design: The Influence of Trembling AI Voices on Prosocial Behavior, IEEE Transactions on Affective Computing, vol. 15, no. 3, pp. 1253-1263, Jul. 2024, doi: 10.1109/TAFFC.2023.3332742
  • RSET: Remapping-Based Sorting Method for Emotion Transfer Speech Synthesis, Web and Big Data, pp. 90-104, Aug. 2024, doi: 10.1007/978-981-97-7232-2_7
  • PiCo-VITS: Leveraging Pitch Contours for Fine-Grained Emotional Speech Synthesis, Text, Speech, and Dialogue, pp. 210-221, Sep. 2024, doi: 10.1007/978-3-031-70566-3_19


Published In

IEEE Transactions on Affective Computing, Volume 14, Issue 4
Oct.-Dec. 2023
832 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 30 December 2022

Qualifiers

  • Research-article
