Abstract
For speaker recognition systems built on the Gaussian mixture model–universal background model (GMM–UBM), short utterances provide too little corpus data and cause a serious drop in recognition rate. This paper proposes a short utterance sample compensation method based on the generative adversarial network (GAN) to address this problem. Through adversarial training of a generator network and a discriminator network, the method compensates short utterance samples into speech samples that carry sufficient speaker identity information. To avoid mode collapse and gradient instability during GAN training, the method uses the condition information of the conditional GAN to guide the generator's compensation process, and introduces a generator compensation performance measurement training task and a discriminator feature tag training task to stabilize training. Finally, the proposed compensation method is evaluated on a GMM–UBM speaker recognition system. The experimental results indicate that the presented method effectively reduces the equal error rate of the speaker recognition system in short utterance environments.
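To make the idea concrete, the sketch below shows one way such a conditional compensation network could be assembled: a generator that maps short-utterance features plus condition information to compensated features, and a discriminator with an adversarial head and a speaker-label (feature tag) head, trained with an additional L1 term standing in for the compensation performance measurement. All dimensions, layer sizes, and loss weights here are hypothetical assumptions; the paper's actual architecture is not specified in the abstract.

```python
# Minimal conditional-GAN compensation sketch (PyTorch). Feature dimension,
# condition vector, layer sizes, and loss weighting are illustrative
# assumptions, not the configuration reported in the paper.
import torch
import torch.nn as nn

FEAT_DIM = 39    # hypothetical MFCC(+delta) feature dimension
COND_DIM = 10    # hypothetical condition-information dimension
NUM_SPK = 100    # hypothetical number of training speakers


class Generator(nn.Module):
    """Maps a short-utterance feature vector plus condition information to a
    compensated feature vector intended to carry fuller speaker identity."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + COND_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM),
        )

    def forward(self, short_feat, cond):
        return self.net(torch.cat([short_feat, cond], dim=-1))


class Discriminator(nn.Module):
    """Two output heads: a real/compensated score (adversarial task) and a
    speaker-label prediction (the 'feature tag' auxiliary task)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(FEAT_DIM + COND_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 256), nn.LeakyReLU(0.2),
        )
        self.adv_head = nn.Linear(256, 1)
        self.tag_head = nn.Linear(256, NUM_SPK)

    def forward(self, feat, cond):
        h = self.body(torch.cat([feat, cond], dim=-1))
        return self.adv_head(h), self.tag_head(h)


bce, ce, l1 = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss(), nn.L1Loss()


def generator_loss(adv_logit, tag_logit, spk_label, fake_feat, full_feat, lam=10.0):
    """Adversarial term + speaker-tag term + an L1 'compensation performance'
    term pulling compensated features toward full-utterance features."""
    adv = bce(adv_logit, torch.ones_like(adv_logit))   # fool the discriminator
    tag = ce(tag_logit, spk_label)                     # preserve speaker identity
    comp = l1(fake_feat, full_feat)                    # measure compensation quality
    return adv + tag + lam * comp


# Toy forward pass with random tensors to show the expected shapes.
G, D = Generator(), Discriminator()
short_feat = torch.randn(8, FEAT_DIM)
full_feat = torch.randn(8, FEAT_DIM)
cond = torch.randn(8, COND_DIM)
spk_label = torch.randint(0, NUM_SPK, (8,))
fake_feat = G(short_feat, cond)
adv_logit, tag_logit = D(fake_feat, cond)
loss_g = generator_loss(adv_logit, tag_logit, spk_label, fake_feat, full_feat)
```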
Funding
This research was supported by the Natural Science Foundation of Chongqing, China (cstc2017jcyjA0893) and by the project on theoretical and applied research into an enhanced Raman biosensor chip based on a plasma waveguide (csts2017jcyjAX0427).
Cite this article
Hu, Z., Fu, Y., Luo, Y. et al. Speaker recognition based on short utterance compensation method of generative adversarial networks. Int J Speech Technol 23, 443–450 (2020). https://doi.org/10.1007/s10772-020-09711-0