TTS synthesis with bidirectional LSTM based recurrent neural networks
Fifteenth annual conference of the international speech communication …, 2014, isca-archive.org
Abstract
Feed-forward deep neural network (DNN)-based text-to-speech (TTS) systems have recently been shown to outperform decision-tree clustered, context-dependent HMM TTS systems [1, 4]. However, long-span contextual effects in a speech utterance are still difficult to accommodate, due to the intrinsically feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory, dynamic features are commonly used to constrain speech parameter trajectory generation in HMM-based TTS [2]. In this paper, recurrent neural networks (RNNs) with bidirectional long short-term memory (BLSTM) cells are adopted to capture the correlation or co-occurrence information between any two instants in a speech utterance for parametric TTS synthesis. Experimental results show that a hybrid system of DNN and BLSTM-RNN, i.e., lower hidden layers with a feed-forward structure cascaded with upper hidden layers having a bidirectional LSTM-RNN structure, can outperform both the conventional decision-tree based HMM system and a DNN TTS system, objectively and subjectively. The speech trajectory generated by the BLSTM-RNN TTS is fairly smooth, and no dynamic constraints are needed.
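The hybrid architecture the abstract describes (lower feed-forward layers cascaded with an upper bidirectional LSTM layer, whose forward and backward passes let each output frame depend on the whole utterance) can be sketched as follows. This is a minimal illustrative forward pass in NumPy, not the paper's implementation: all weights, dimensions, and function names (`LSTMCell`, `hybrid_forward`, etc.) are hypothetical, and training, input features, and output vocoder parameters are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Single-direction LSTM cell; weights are random, for illustration only."""
    def __init__(self, in_dim, hid_dim, rng):
        # One stacked weight matrix for the input, forget, cell, and output gates.
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

def run_lstm(cell, xs, reverse=False):
    """Run the cell over the frame sequence, forward or backward in time."""
    h = np.zeros(cell.hid_dim)
    c = np.zeros(cell.hid_dim)
    order = reversed(range(len(xs))) if reverse else range(len(xs))
    out = [None] * len(xs)
    for t in order:
        h, c = cell.step(xs[t], h, c)
        out[t] = h
    return out

def hybrid_forward(xs, ff_W, fwd_cell, bwd_cell):
    """Lower feed-forward (tanh) layer, then a bidirectional LSTM layer whose
    forward and backward hidden states are concatenated per frame."""
    hs = [np.tanh(ff_W @ x) for x in xs]
    f_out = run_lstm(fwd_cell, hs)
    b_out = run_lstm(bwd_cell, hs, reverse=True)
    return [np.concatenate([f, b]) for f, b in zip(f_out, b_out)]

rng = np.random.default_rng(0)
T, in_dim, ff_dim, hid = 5, 8, 6, 4   # toy sizes, not from the paper
xs = [rng.standard_normal(in_dim) for _ in range(T)]
ff_W = rng.standard_normal((ff_dim, in_dim)) * 0.1
fwd = LSTMCell(ff_dim, hid, rng)
bwd = LSTMCell(ff_dim, hid, rng)
ys = hybrid_forward(xs, ff_W, fwd, bwd)
print(len(ys), ys[0].shape)  # T frames, each 2*hid-dimensional
```

Because every output frame mixes a forward state (summarizing the past) with a backward state (summarizing the future), the generated trajectory is already context-smoothed, which is the abstract's rationale for dropping the explicit dynamic-feature constraints of HMM-based TTS.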