Overview of Voice Conversion Methods Based on Deep Learning
Abstract
1. Introduction
- Speaker identity extraction: extracting information about the speaker’s identity from the speech signal.
- Linguistic content extraction: extracting time-dependent information from the utterance, or appropriately processing other data (e.g., text), to capture the spoken content together with its rhythm and intonation.
- Encoder: integrating the above extractions into a suitable latent representation. Because both the encoder input and the linguistic content embeddings are time-dependent, these two tasks are often combined in a single module.
- Decoder/vocoder: processing the encoder output to synthesize the converted waveform. The intermediate representation is frequently a spectrogram; however, some systems merge the decoder with the vocoder to avoid extra models or unnecessary intermediate representations [7]. A minimal code sketch of how these components fit together is given after this list.
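The sketch below illustrates how these components typically connect: a time-invariant speaker embedding and a time-dependent content representation are combined by a decoder that predicts a mel-spectrogram, which a neural vocoder would then render as audio. It is a minimal, hypothetical skeleton under assumed shapes and module names (SpeakerEncoder, ContentEncoder, Decoder), not the architecture of any particular system surveyed here.

```python
# Minimal, illustrative voice-conversion skeleton (not a specific published system).
# Assumes mel-spectrogram inputs of shape (batch, n_mels, frames).
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Collapses an utterance into a single time-invariant speaker embedding."""
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mel):                                  # mel: (B, n_mels, T)
        _, h = self.rnn(mel.transpose(1, 2))                 # keep the final hidden state
        return nn.functional.normalize(h[-1], dim=-1)        # (B, emb_dim)


class ContentEncoder(nn.Module):
    """Produces a time-dependent sequence of linguistic-content embeddings."""
    def __init__(self, n_mels=80, content_dim=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, content_dim, kernel_size=5, padding=2)

    def forward(self, mel):                                  # (B, n_mels, T) -> (B, content_dim, T)
        return torch.relu(self.conv(mel))


class Decoder(nn.Module):
    """Fuses content frames with the target-speaker embedding and predicts a mel-spectrogram."""
    def __init__(self, content_dim=128, emb_dim=256, n_mels=80):
        super().__init__()
        self.proj = nn.Conv1d(content_dim + emb_dim, n_mels, kernel_size=5, padding=2)

    def forward(self, content, spk_emb):
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, content.size(-1))  # broadcast over time
        return self.proj(torch.cat([content, spk], dim=1))            # (B, n_mels, T)


def convert(source_mel, target_mel, spk_enc, cont_enc, dec):
    """Source utterance + target-speaker reference -> converted mel-spectrogram.
    A neural vocoder (e.g., HiFi-GAN-style) would then turn the mel into a waveform."""
    return dec(cont_enc(source_mel), spk_enc(target_mel))


if __name__ == "__main__":
    spk_enc, cont_enc, dec = SpeakerEncoder(), ContentEncoder(), Decoder()
    src, tgt = torch.randn(1, 80, 200), torch.randn(1, 80, 150)
    print(convert(src, tgt, spk_enc, cont_enc, dec).shape)  # torch.Size([1, 80, 200])
```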
2. Voice Conversion Process
2.1. Speaker Identity Extraction
2.2. Linguistic Content Extraction
2.3. Generation
2.4. Vocoders
2.5. Datasets
2.6. Model Inputs
2.7. Evaluation Methods
3. Challenges
4. Conclusions
5. Future Directions
- The complex, individual nature of human speech: voice conversion is a task of great complexity, as it requires an understanding of various aspects of sound, such as tone, timbre, intonation, and tempo.
- Real-time performance requirements: in some cases, voice conversion must run with low enough latency that the user hears the result in real time.
- Satisfactory results: the quality of the converted voice can be crucial, especially for commercial applications, so algorithm developers must ensure that the results are good enough to be useful.
- Flexibility of operation: the algorithm should adapt to the available data, benefit from higher-quality data when it is provided, and handle utterances of varying length smoothly.
- Developing appropriate metrics for evaluating performance: putting voice conversion algorithms into practice requires determining how well they work, so developers must design quality assessment metrics that consider various aspects of voice conversion, such as speech fluency, naturalness, and intelligibility. An objective-metric sketch is given after this list.
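As one example of an objective metric that complements subjective MOS listening tests, the snippet below computes a mel-cepstral-distortion-style score between a converted utterance and a reference. It is a minimal illustration under stated assumptions: MFCCs stand in for mel-cepstral coefficients, frames are simply truncated to equal length rather than time-aligned with dynamic time warping, and the file paths are placeholders.

```python
# Illustrative mel-cepstral-distortion (MCD)-style metric between two utterances.
# Assumptions: MFCCs approximate mel-cepstral coefficients; frames are truncated
# to equal length instead of being time-aligned; file paths are placeholders.
import numpy as np
import librosa


def mcd_like(ref_path, conv_path, sr=16000, n_mfcc=13):
    ref, _ = librosa.load(ref_path, sr=sr)
    conv, _ = librosa.load(conv_path, sr=sr)

    # Drop the 0th coefficient (overall energy), as is conventional for MCD.
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)[1:]
    c_conv = librosa.feature.mfcc(y=conv, sr=sr, n_mfcc=n_mfcc)[1:]

    t = min(c_ref.shape[1], c_conv.shape[1])
    diff = c_ref[:, :t] - c_conv[:, :t]

    # MCD = (10 / ln 10) * sqrt(2 * sum_d (c_ref_d - c_conv_d)^2), averaged over frames.
    return float(np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff**2, axis=0))))


# Example (hypothetical paths):
# print(mcd_like("reference.wav", "converted.wav"))
```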
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Childers, D.G.; Wu, K.; Hicks, D.M.; Yegnanarayana, B. Voice conversion. Speech Commun. 1989, 8, 147–158. [Google Scholar] [CrossRef]
- Mohammadi, S.H.; Kain, A. An Overview of Voice Conversion Systems. Speech Commun. 2017, 88, 65–82. [Google Scholar] [CrossRef]
- Sisman, B.; Yamagishi, J.; King, S.; Li, H. An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 132–157. [Google Scholar] [CrossRef]
- Variani, E.; Lei, X.; McDermott, E.; Moreno, I.L.; Gonzalez-Dominguez, J. Deep Neural Networks for Small Footprint Text-dependent Speaker Verification. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014. [Google Scholar]
- Wu, Z.; Li, H. Voice conversion versus speaker verification: An overview. APSIPA Trans. Signal Inf. Process. 2014, 3, e17. [Google Scholar] [CrossRef]
- Chorowski, J.; Weiss, R.J.; Bengio, S.; van den Oord, A. Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 2041–2053. [Google Scholar] [CrossRef] [Green Version]
- Kashkin, A.; Karpukhin, I.; Shishkin, S. HiFi-VC: High Quality ASR-Based Voice Conversion. arXiv 2022, arXiv:2203.16937. [Google Scholar]
- Chen, M.; Zhou, Y.; Huang, H.; Hain, T. Efficient Non-Autoregressive GAN Voice Conversion using VQWav2vec Features and Dynamic Convolution. arXiv 2022, arXiv:2203.17172. [Google Scholar]
- Baevski, A.; Schneider, S.; Auli, M. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. arXiv 2019, arXiv:1910.05453. [Google Scholar]
- Wu, F.; Fan, A.; Baevski, A.; Dauphin, Y.N.; Auli, M. Pay Less Attention with Lightweight and Dynamic Convolutions. arXiv 2019, arXiv:1901.10430. [Google Scholar]
- Siuzdak, H.; Dura, P.; van Rijn, P.; Jacoby, N. WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. arXiv 2022, arXiv:2203.16930. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Long, Z.; Zheng, Y.; Yu, M.; Xin, J. Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE. arXiv 2022, arXiv:2203.16037. [Google Scholar]
- Lian, J.; Zhang, C.; Yu, D. Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion. arXiv 2022, arXiv:2203.16705. [Google Scholar]
- Li, Y.A.; Zare, A.; Mesgarani, N. StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion. arXiv 2021, arXiv:2107.10394. [Google Scholar]
- Li, J.; Tu, W.; Xiao, L. FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion. arXiv 2022, arXiv:2210.15418. [Google Scholar]
- Kim, J.; Kong, J.; Son, J. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. Proc. Mach. Learn. Res. 2021, 139, 5530–5540. [Google Scholar]
- Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized End-to-End Loss for Speaker Verification. arXiv 2017, arXiv:1710.10467. [Google Scholar]
- Jia, Y.; Zhang, Y.; Weiss, R.J.; Wang, Q.; Shen, J.; Ren, F.; Chen, Z.; Nguyen, P.; Pang, R.; Moreno, I.L.; et al. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. arXiv 2018, arXiv:1806.04558. [Google Scholar]
- Qian, K.; Zhang, Y.; Chang, S.; Yang, X.; Hasegawa-Johnson, M. AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss. arXiv 2019, arXiv:1905.05879. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. arXiv 2020, arXiv:2005.07143. [Google Scholar]
- Yang, B.; Lyu, J.; Zhang, S.; Qi, Y.; Xin, J. Channel pruning for deep neural networks via a relaxed groupwise splitting method. In Proceedings of the 2019 2nd International Conference on Artificial Intelligence for Industries, AI4I 2019, Laguna Hills, CA, USA, 25–27 September 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 97–98. [Google Scholar] [CrossRef]
- Chou, J.; Yeh, C.; Lee, H. One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization. arXiv 2019, arXiv:1904.05742. [Google Scholar]
- Liu, S.; Cao, Y.; Wang, D.; Wu, X.; Liu, X.; Meng, H. Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1717–1728. [Google Scholar] [CrossRef]
- Polyak, A.; Wolf, L.; Taigman, Y. TTS Skins: Speaker Conversion via ASR. arXiv 2019, arXiv:1904.08983. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv 2019, arXiv:2005.08100. [Google Scholar]
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-training for Speech Recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
- Kum, S.; Nam, J. Joint detection and classification of singing voice melody using convolutional recurrent neural networks. Appl. Sci. 2019, 9, 1324. [Google Scholar] [CrossRef] [Green Version]
- Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
- Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using Real NVP. arXiv 2016, arXiv:1605.08803. [Google Scholar]
- Huang, X.; Belongie, S. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. arXiv 2017, arXiv:1703.06868. [Google Scholar]
- Yamamoto, R.; Song, E.; Kim, J.-M. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. arXiv 2019, arXiv:1910.11480. [Google Scholar]
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. arXiv 2019, arXiv:1912.04958. [Google Scholar]
- Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 17022–17033. [Google Scholar]
- van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
- Nguyen, B.; Cardinaux, F. NVC-Net: End-to-End Adversarial Voice Conversion. arXiv 2021, arXiv:2106.00992. [Google Scholar]
- Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the 2017 International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.; et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
- Li, Y.; Mandt, S. Disentangled Sequential Autoencoder. arXiv 2018, arXiv:1803.02991. [Google Scholar]
- van den Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel Recurrent Neural Networks. arXiv 2016, arXiv:1601.06759. [Google Scholar]
- Holschneider, M.; Kronland-Martinet, R.; Morlet, J.; Tchamitchian, P. A Real-Time Algorithm for Signal Analysis with the Help of the Wavelet Transform. In Wavelets: Time-Frequency Methods and Phase Space, Proceedings of the International Conference, Marseille, France, 14–18 December 1987; Springer: Berlin/Heidelberg, Germany, 1990; pp. 286–297. [Google Scholar] [CrossRef]
- Dutilleux, P. An implementation of the “algorithme a trous” to compute the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, Proceedings of the International Conference, Marseille, France, 14–18 December 1987; Springer: Berlin/Heidelberg, Germany, 1989. [Google Scholar]
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
- Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
- Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar]
- Yamamoto, R.; Song, E.; Kim, J.M. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Graz, Austria, 15–19 September 2019; International Speech Communication Association: Baixas, France, 2019; pp. 699–703. [Google Scholar] [CrossRef] [Green Version]
- Arik, S.O.; Jun, H.; Diamos, G. Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks. IEEE Signal Process. Lett. 2018, 26, 94–98. [Google Scholar] [CrossRef] [Green Version]
- Wang, X.; Takaki, S.; Yamagishi, J. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv 2018, arXiv:1810.11946. [Google Scholar]
- Wang, C.; Zeng, C.; He, X. HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation. arXiv 2022, arXiv:2210.12740. [Google Scholar]
- Donahue, C.; McAuley, J.; Puckette, M. Adversarial Audio Synthesis. arXiv 2018, arXiv:1802.04208. [Google Scholar]
- Kumar, K.; Kumar, R.; de Boissiere, T.; Gestin, L.; Teoh, W.Z.; Sotelo, J.; de Brebisson, A.; Bengio, Y.; Courville, A. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. arXiv 2019, arXiv:1910.06711. [Google Scholar]
- Liu, Z.; Mak, B. Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers. arXiv 2019, arXiv:1911.11601. [Google Scholar]
- Bakhturina, E.; Lavrukhin, V.; Ginsburg, B.; Zhang, Y. Hi-Fi Multi-Speaker English TTS Dataset. arXiv 2021, arXiv:2104.01497. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 19–24 April 2015; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
- Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common Voice: A Massively-Multilingual Speech Corpus. arXiv 2019, arXiv:1912.06670. [Google Scholar]
- Ephrat, A.; Mosseri, I.; Lang, O.; Dekel, T.; Wilson, K.; Hassidim, A.; Freeman, W.T.; Rubinstein, M. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation. arXiv 2018, arXiv:1804.03619. [Google Scholar] [CrossRef] [Green Version]
- Garofolo, J.S. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Linguist. Data Consort. 1993. [Google Scholar] [CrossRef]
- Zhao, Y.; Huang, W.-C.; Tian, X.; Yamagishi, J.; Das, R.K.; Kinnunen, T.; Ling, Z.; Toda, T. Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion. arXiv 2020, arXiv:2008.12527. [Google Scholar]
- Takamichi, S.; Mitsui, K.; Saito, Y.; Koriyama, T.; Tanji, N.; Saruwatari, H. JVS corpus: Free Japanese multi-speaker voice corpus. arXiv 2019, arXiv:1908.06248. [Google Scholar]
- Zhou, K.; Sisman, B.; Liu, R.; Li, H. Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset. arXiv 2020, arXiv:2010.14794. [Google Scholar]
- Lo, C.-C.; Fu, S.-W.; Huang, W.-C.; Wang, X.; Yamagishi, J.; Tsao, Y.; Wang, H.-M. MOSNet: Deep Learning based Objective Assessment for Voice Conversion. arXiv 2019, arXiv:1904.08352. [Google Scholar]
- Lian, J.; Zhang, C.; Anumanchipalli, G.K.; Yu, D. Towards Improved Zero-shot Voice Conversion with Conditional DSVAE. arXiv 2022, arXiv:2205.05227. [Google Scholar]
- Zen, H.; Dang, V.; Clark, R.; Zhang, Y.; Weiss, R.J.; Jia, Y.; Chen, Z.; Wu, Y. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. arXiv 2019, arXiv:1904.02882. [Google Scholar]
| Paper | Speaker Identity Extraction | Linguistic Content Extraction | Generation | Vocoder |
|---|---|---|---|---|
| [8] | GE2E [18] | VQWav2vec features [9] | Dynamic convolutions [10], WadaIN [34], LN + RC as in [21] | Parallel WaveGAN [33] |
| [7] | 5-layer residual FC similar to [37] | Linguistic encoder: TTS Skins [26], Conformer [27]; F0 encoder: BNE-Seq2seqMoL [25] | Modified HiFi-GAN [35] | Modified HiFi-GAN [35] |
| [11] | ECAPA-TDNN [22] | Wav2vec [28] | Vec2wav model based on HiFi-GAN [35] | Vec2wav model based on HiFi-GAN [35] |
| [13] | β-VAE [38] (average distribution) | β-VAE [38] (individual distribution for each audio) | β-VAE [38], RGSM [23], attention [21], Post-Net as in [39] | WaveNet [36] |
| [14] | Modified DSVAE [40] (time-invariant disentanglement) | Modified DSVAE [40] (time-variant disentanglement) | Modified AutoVC decoder [20] | WaveNet [36], HiFi-GAN [35] |
| [15] | Mapping network/style encoder | Encoder + F0 encoder: JDC network [29] | Encoder output + F0 output + style injected by AdaIN [32] | Parallel WaveGAN [33] |
| [16] | LSTM based on [25] | Prior encoder: WavLM [30] bottleneck extractor; posterior encoder based on flow used only during training | HiFi-GAN [35] | HiFi-GAN [35] |
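Several entries in the table inject the target-speaker style into the generator through (weight-)adaptive instance normalization [32,34]. The following is a minimal sketch of plain AdaIN over 1-D feature maps, assuming PyTorch tensors shaped (batch, channels, frames); it is illustrative only and not taken from any of the cited implementations.

```python
# Minimal adaptive instance normalization (AdaIN): normalize the content features
# per channel over time, then re-scale and re-shift them with statistics derived
# from the target-speaker style features. Shapes and names are assumptions.
import torch


def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Per-channel statistics over the time axis.
    c_mean = content.mean(dim=2, keepdim=True)
    c_std = content.std(dim=2, keepdim=True) + eps
    s_mean = style.mean(dim=2, keepdim=True)
    s_std = style.std(dim=2, keepdim=True) + eps

    # Whiten the content, then apply the style's scale and shift.
    return s_std * (content - c_mean) / c_std + s_mean


if __name__ == "__main__":
    content = torch.randn(2, 128, 200)   # e.g., content-encoder output
    style = torch.randn(2, 128, 80)      # e.g., features of a target-speaker reference
    print(adain(content, style).shape)   # torch.Size([2, 128, 200])
```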
| Paper | Evaluation Methods | MOS M-M Quality | MOS M-M Similarity | MOS A-A Quality | MOS A-A Similarity | Dataset | Public Code/Demo |
|---|---|---|---|---|---|---|---|
| [8] | MCD/MOS/CER/WER/MOSNet [64] | 3.81 | 3.87 | - | - | VCC2020 [61] | ✓ 1/✓ 2 |
| [7] | MOS/WER/CER/PCC | 4.08 | 4.08 | 4.03 | 3.02 | VCTK [55] | ✓ 3 |
| [11] | MOS | 4.09 | - | - | - | VCTK [55], Hi-Fi TTS [56], LibriSpeech [57], CommonVoice [58], AVSpeech [59] | ✗/✓ 4 |
| [13] | MOSNet [64], average speaker classification accuracy | 3.74 | - | 3.58 | - | VCTK [55] | ✗/✗ |
| [14] (WaveNet) | EER/MOS | 3.40 | 3.56 | 3.22 | 3.54 | VCTK [55], TIMIT [60] | ✗/✓ 5 |
| [14] (HiFi-GAN [65]) | EER/MOS | 3.76 | 3.83 | 3.65 | 3.89 | VCTK [55], TIMIT [60] | ✗/✓ 5 |
| [15] | MOS/MOSNet [64]/CLS/CER | 4.09 | 3.86 | - | - | VCTK [55], JVS [62], ESD [63] | ✓ 6/✓ 7 |
| [16] | MOS/WER/CER/PCC | 4.01 | 3.80 | 4.06 | 2.83 | VCTK [55], LibriTTS [66] | ✓ 8/✓ 9 |
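Some of the objective columns above (WER, CER, PCC) can be computed automatically by transcribing the converted speech with an ASR system and comparing pitch contours against the source. The snippet below sketches only the scoring step, under assumptions: transcripts and F0 tracks are placeholders that would in practice come from an ASR model and a pitch extractor, the jiwer package is assumed for the edit-distance metrics, and PCC is interpreted here as the Pearson correlation of F0 contours.

```python
# Sketch of objective scores of the kind reported in the table: word/character
# error rate on ASR transcripts of converted speech, and Pearson correlation
# (PCC) of F0 contours. Inputs are placeholders; jiwer is an assumed dependency.
import numpy as np
import jiwer


def wer_cer(reference_text: str, hypothesis_text: str):
    return jiwer.wer(reference_text, hypothesis_text), jiwer.cer(reference_text, hypothesis_text)


def f0_pcc(f0_ref: np.ndarray, f0_conv: np.ndarray) -> float:
    # Compare only frames that are voiced (F0 > 0) in both signals, truncated to equal length.
    t = min(len(f0_ref), len(f0_conv))
    ref, conv = f0_ref[:t], f0_conv[:t]
    voiced = (ref > 0) & (conv > 0)
    return float(np.corrcoef(ref[voiced], conv[voiced])[0, 1])


if __name__ == "__main__":
    print(wer_cer("the cat sat on the mat", "the cat sat on a mat"))
    print(f0_pcc(np.array([110.0, 112.0, 0.0, 115.0]), np.array([108.0, 111.0, 118.0, 116.0])))
```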
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).