Abstract
Speaker verification (SV) is an important branch of speaker recognition, and several approaches have been investigated over the last few decades. Deep learning has attracted growing interest among speech processing researchers and was recently introduced into speaker recognition. In most cases, deep learning models are adapted from speech recognition applications and applied to speaker recognition, where they have proven competitive with state-of-the-art approaches. Nevertheless, the use of deep learning in speaker recognition remains tied to speech recognition. In this study, we propose a new way to use deep neural networks (DNNs) in speaker recognition, with the aim of making it easier for the DNN to learn the distribution of the features. This work is motivated by our previous study, in which we proposed a novel scoring method that works very well with clean speech but needs improvement under noisy conditions. We therefore aim to transform the extracted feature vectors (MFCCs) into enhanced feature vectors, which we call Deep Speaker Features (DeepSFs). Experiments were conducted on the THUYG-20 SRE corpus, and significant results were achieved: the new method outperformed both i-vector/PLDA and our baseline system in clean and noisy conditions.
Cite this article
Hourri, S., Kharroubi, J. A deep learning approach for speaker recognition. Int J Speech Technol 23, 123–131 (2020). https://doi.org/10.1007/s10772-019-09665-y
DOI: https://doi.org/10.1007/s10772-019-09665-y