TTS synthesis with bidirectional LSTM based recurrent neural networks
Fifteenth annual conference of the international speech communication …, 2014, isca-archive.org
Abstract
Feed-forward deep neural network (DNN)-based text-to-speech (TTS) systems have recently been shown to outperform decision-tree clustered, context-dependent HMM TTS systems [1, 4]. However, long-span contextual effects in a speech utterance are still difficult to accommodate, due to the intrinsically feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory, dynamic features are commonly used to constrain speech parameter trajectory generation in HMM-based TTS [2]. In this paper, recurrent neural networks (RNNs) with bidirectional long short-term memory (BLSTM) cells are adopted to capture the correlation or co-occurrence information between any two instants in a speech utterance for parametric TTS synthesis. Experimental results show that a hybrid system of DNN and BLSTM-RNN, i.e., lower hidden layers with a feed-forward structure cascaded with upper hidden layers having a bidirectional LSTM-RNN structure, can outperform both the conventional decision-tree based HMM system and a DNN TTS system, objectively and subjectively. The speech trajectory generated by the BLSTM-RNN TTS is fairly smooth, and no dynamic constraints are needed.
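The hybrid architecture the abstract describes (lower feed-forward layers cascaded with an upper bidirectional LSTM layer, whose forward and backward passes let each output frame depend on the whole utterance) can be sketched as follows. This is a minimal illustrative forward pass in NumPy, not the paper's implementation: all weights, dimensions, and function names (`LSTMCell`, `hybrid_forward`, etc.) are hypothetical, and training, input features, and output vocoder parameters are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Single-direction LSTM cell; weights are random, for illustration only."""
    def __init__(self, in_dim, hid_dim, rng):
        # One stacked weight matrix for the input, forget, cell, and output gates.
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

def run_lstm(cell, xs, reverse=False):
    """Run the cell over the frame sequence, forward or backward in time."""
    h = np.zeros(cell.hid_dim)
    c = np.zeros(cell.hid_dim)
    order = reversed(range(len(xs))) if reverse else range(len(xs))
    out = [None] * len(xs)
    for t in order:
        h, c = cell.step(xs[t], h, c)
        out[t] = h
    return out

def hybrid_forward(xs, ff_W, fwd_cell, bwd_cell):
    """Lower feed-forward (tanh) layer, then a bidirectional LSTM layer whose
    forward and backward hidden states are concatenated per frame."""
    hs = [np.tanh(ff_W @ x) for x in xs]
    f_out = run_lstm(fwd_cell, hs)
    b_out = run_lstm(bwd_cell, hs, reverse=True)
    return [np.concatenate([f, b]) for f, b in zip(f_out, b_out)]

rng = np.random.default_rng(0)
T, in_dim, ff_dim, hid = 5, 8, 6, 4   # toy sizes, not from the paper
xs = [rng.standard_normal(in_dim) for _ in range(T)]
ff_W = rng.standard_normal((ff_dim, in_dim)) * 0.1
fwd = LSTMCell(ff_dim, hid, rng)
bwd = LSTMCell(ff_dim, hid, rng)
ys = hybrid_forward(xs, ff_W, fwd, bwd)
print(len(ys), ys[0].shape)  # T frames, each 2*hid-dimensional
```

Because every output frame mixes a forward state (summarizing the past) with a backward state (summarizing the future), the generated trajectory is already context-smoothed, which is the abstract's rationale for dropping the explicit dynamic-feature constraints of HMM-based TTS.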