Abstract
In this paper, we present a new technique for extracting a noise-robust representation of speech signals, called the spectro-temporal power spectrum. The technique applies a simple 2-D filter to the speech spectrogram to highlight the movements of spectral peaks. Because spectral peaks are the high-SNR (signal-to-noise ratio) regions of the speech spectrogram, we expect that emphasizing them will improve recognition performance. In addition, the 2-D filter encodes the spectro-temporal information around each frequency component into the frequency representation of the speech signal. This information helps the recognizer to better identify the true state to which each frame should be allocated. Experimental results on the Aurora 2 task show that, when the proposed features are combined with cepstral mean and variance normalization, error rate improvements of about 40% and 35% over the baseline system are obtained for test sets A and B, respectively. Further improvement was achieved when the proposed features were extracted from enhanced spectra obtained by applying the advanced front-end routine. Moreover, a phone recognition task evaluated on the TIMIT database showed that the proposed method outperforms the baseline methods. These improvements are obtained with a very simple, easy-to-implement routine, which makes the method suitable for practical systems.
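The two generic operations named in the abstract, 2-D filtering of a spectrogram and mean and variance normalization, can be illustrated with a minimal NumPy sketch. The abstract does not specify the filter coefficients, so the 3x3 center-weighted kernel below is purely a hypothetical placeholder, and the function names `filter2d_same` and `cmvn` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def filter2d_same(spec, kernel):
    """Correlate a (frequency x time) spectrogram with a small 2-D kernel.

    The output has the same shape as the input; borders are handled by
    replicating edge values. The kernel must have odd dimensions.
    """
    kf, kt = kernel.shape
    pf, pt = kf // 2, kt // 2
    padded = np.pad(spec, ((pf, pf), (pt, pt)), mode="edge")
    n_freq, n_time = spec.shape
    out = np.zeros((n_freq, n_time), dtype=float)
    # Accumulate one shifted, weighted copy of the spectrogram per kernel tap.
    for i in range(kf):
        for j in range(kt):
            out += kernel[i, j] * padded[i:i + n_freq, j:j + n_time]
    return out


def cmvn(features, eps=1e-8):
    """Normalize each feature dimension (row) to zero mean and unit variance over time."""
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    return (features - mean) / np.maximum(std, eps)


# Hypothetical kernel (an assumption, NOT the filter from the paper):
# a normalized, center-weighted average over neighboring time-frequency bins.
kernel = np.array([[1.0, 2.0, 1.0],
                   [2.0, 4.0, 2.0],
                   [1.0, 2.0, 1.0]])
kernel /= kernel.sum()

# Synthetic stand-in for a power spectrogram: 64 frequency bins, 100 frames.
rng = np.random.default_rng(0)
power_spec = rng.random((64, 100))

filtered = filter2d_same(power_spec, kernel)
normalized = cmvn(filtered)
```

In a real front end, the filtered spectrum would typically pass through further stages (for example a filter bank and cepstral transform) before normalization; the sketch applies `cmvn` directly to the filtered spectrogram only to keep the example short.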
Acknowledgements
This work was in part supported by a grant from the Iran Telecommunication Research Center (ITRC).
Cite this article
Riazati Seresht, H., Ahadi, S.M. & Seyedin, S. Spectro-temporal Power Spectrum Features for Noise Robust ASR. Circuits Syst Signal Process 36, 3222–3242 (2017). https://doi.org/10.1007/s00034-016-0434-0
DOI: https://doi.org/10.1007/s00034-016-0434-0