Abstract
This paper proposes a framework for realizing sign language to emotional speech conversion by deep learning. We first adopt a deep belief network (DBN) to extract features of sign language and a deep neural network (DNN) to extract features of facial expression. We then train two support vector machines (SVMs) to classify the sign language and the facial expression, recognizing the text of the sign language and the emotion tag of the facial expression. We also train a set of DNN-based emotional speech acoustic models by speaker adaptive training on a multi-speaker emotional speech corpus. Finally, we select the DNN-based emotional speech acoustic model matching the emotion tag to synthesize emotional speech from the text recognized from the sign language. Objective tests show that the recognition rate for static sign language is 92.8%, and that facial expression recognition achieves 94.6% on the extended Cohn-Kanade database (CK+) and 80.3% on the JAFFE database. Subjective evaluation demonstrates that the synthesized emotional speech obtains an emotional mean opinion score of 4.2. A pleasure-arousal-dominance (PAD) evaluation shows that the PAD values of the facial expressions are close to those of the synthesized emotional speech.
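To make the described pipeline concrete, the following is a minimal sketch of the overall flow, not the authors' implementation: the DBN/DNN feature extractors, the toy data, and the emotion-dependent synthesis step are hypothetical placeholders, and only the two SVM classifiers use a real library (scikit-learn).

# A minimal sketch of the pipeline described above, under stated assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder feature extractors (stand-ins for the DBN and DNN front-ends).
def dbn_sign_features(image):
    """Stand-in for DBN features of a sign language image."""
    return image.reshape(-1)[:256]

def dnn_face_features(image):
    """Stand-in for DNN features of a facial expression image."""
    return image.reshape(-1)[:256]

# Train the two SVM classifiers on toy data (5 sign classes, 3 emotion tags).
sign_X, sign_y = rng.normal(size=(40, 256)), rng.integers(0, 5, 40)
face_X, face_y = rng.normal(size=(40, 256)), rng.integers(0, 3, 40)
sign_svm = SVC().fit(sign_X, sign_y)   # sign language features -> text label
face_svm = SVC().fit(face_X, face_y)   # facial expression features -> emotion tag

def synthesize(text_label, emotion_tag):
    """Select an acoustic model for the emotion tag and return a waveform.
    A real system would run a DNN acoustic model and a vocoder such as WORLD;
    here a silent placeholder waveform is returned."""
    return np.zeros(16000)

# End-to-end conversion for one (sign image, face image) pair.
sign_img = rng.normal(size=(16, 16))
face_img = rng.normal(size=(16, 16))
text_label = sign_svm.predict([dbn_sign_features(sign_img)])[0]
emotion_tag = face_svm.predict([dnn_face_features(face_img)])[0]
waveform = synthesize(text_label, emotion_tag)
print(text_label, emotion_tag, waveform.shape)

In the proposed framework, the deep models serve only as feature extractors and acoustic models, while the SVMs provide the discrete text and emotion decisions that drive model selection at synthesis time.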
The research leading to these results was partly funded by the National Natural Science Foundation of China (Grant Nos. 11664036 and 61263036), the High School Science and Technology Innovation Team Project of Gansu (2017C-03), and the Natural Science Foundation of Gansu (Grant No. 1506RJYA126).