DOI: 10.1145/3456126.3456129

SeeSpeech: See Emotions in The Speech

Published: 29 June 2021

Abstract

At present, machine understanding of speech focuses mostly on semantics, yet speech also carries emotion. Emotion can not only reinforce semantic content but even change it. This paper presents SeeSpeech, a system for classifying the emotion expressed in speech. SeeSpeech uses MCEP (mel-cepstral coefficients) as the speech emotion feature and feeds it into a CNN and a Transformer in parallel. To obtain richer features, the CNN branch uses batch normalization while the Transformer branch uses layer normalization; the outputs of the two branches are then combined, and the emotion class is obtained through a SoftMax layer. SeeSpeech achieves a peak classification accuracy of 97% on the RAVDESS dataset and 85% in a test on an actual edge gateway. These results show that SeeSpeech performs encouragingly on speech emotion classification and has broad application prospects in human-computer interaction.
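The abstract describes the dual-branch design only at a high level. The sketch below is a minimal, assumption-based illustration in PyTorch of that idea: MCEP frames pass through a CNN branch with batch normalization and a Transformer branch (whose encoder layers apply layer normalization internally), the two branch outputs are concatenated, and a SoftMax yields the emotion probabilities. All concrete dimensions here (24 MCEP coefficients, 8 RAVDESS emotion classes, hidden sizes, layer counts) are illustrative guesses, not the authors' configuration.

```python
# Minimal sketch of a SeeSpeech-style dual-branch classifier.
# Dimensions and layer counts are illustrative assumptions only.
import torch
import torch.nn as nn

class DualBranchEmotionNet(nn.Module):
    def __init__(self, n_mcep=24, n_classes=8, d_model=128):
        super().__init__()
        # CNN branch with batch normalization over the frame-by-coefficient "image".
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # -> (batch, 32, 1, 1)
        )
        # Transformer branch; nn.TransformerEncoderLayer applies layer normalization internally.
        self.proj = nn.Linear(n_mcep, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Combine both branch outputs and classify.
        self.classifier = nn.Linear(32 + d_model, n_classes)

    def forward(self, mcep):
        # mcep: (batch, frames, n_mcep)
        cnn_feat = self.cnn(mcep.unsqueeze(1)).flatten(1)            # (batch, 32)
        trans_feat = self.transformer(self.proj(mcep)).mean(dim=1)   # (batch, d_model)
        logits = self.classifier(torch.cat([cnn_feat, trans_feat], dim=1))
        return logits.softmax(dim=-1)   # emotion class probabilities

# Example: a batch of 4 utterances, 200 frames of 24 MCEP coefficients each.
model = DualBranchEmotionNet()
probs = model(torch.randn(4, 200, 24))
print(probs.shape)  # torch.Size([4, 8])
```

In this sketch the combination step is a simple concatenation of the pooled CNN features and the averaged Transformer features; the paper does not state which fusion it uses, so this is one plausible reading of "combines the output of CNN and Transformer."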


    Published In

    ASSE '21: 2021 2nd Asia Service Sciences and Software Engineering Conference
    February 2021
    143 pages
    ISBN: 9781450389082
    DOI: 10.1145/3456126

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Deep learning
    2. Emotion classification
    3. Emotions in speech

    Qualifiers

    • Research-article
    • Research
    • Refereed limited
