Comparison of different implementations of MFCC

Zheng Fang¹,
Zhang Guoliang¹ &
Song Zhanjiang¹

Abstract

The performance of Mel-Frequency Cepstrum Coefficients (MFCC) may be affected by (1) the number of filters, (2) the shape of the filters, (3) the way in which the filters are spaced, and (4) the way in which the power spectrum is warped. In this paper, several comparison experiments are carried out to find the best implementation. The traditional MFCC calculation excludes the 0th coefficient because it is regarded as somewhat unreliable. Based on analysis and experiments, the authors find that it can be regarded as the generalized frequency band energy (FBE) and is hence useful, which results in the FBE-MFCC. The authors also propose a better analysis of the frame energy, namely auto-regressive analysis, which outperforms its 1st and/or 2nd order differential derivatives. Experiments with the “863” Speech Database show that, compared with the traditional MFCC with its corresponding auto-regressive analysis coefficients, the FBE-MFCC and the frame energy with their corresponding auto-regressive analysis coefficients form the best combination, reducing the Chinese syllable error rate (CSER) by about 10%, while the FBE-MFCC with its corresponding auto-regressive analysis coefficients alone reduces CSER by 2.5%. Comparison experiments are also carried out on a spontaneous Chinese speech database, the Chinese Annotated Spontaneous Speech (CASS) corpus, where the FBE-MFCC reduces the error rate by about 2.9% on average.
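
To make the four design choices concrete, the following is a minimal single-frame MFCC sketch in Python with NumPy. It is not the authors' implementation: the mel formula (2595 and 700 constants), the triangular filter shape, the Hamming window, the default of 24 filters and 12 cepstral coefficients, and the names `triangular_filterbank`, `mfcc_frame` and `keep_c0` are all assumptions chosen for illustration. The sketch only shows where the number, shape and spacing of filters enter the computation, and what keeping or dropping the 0th coefficient means in code.

```python
import numpy as np


def hz_to_mel(f):
    # A common mel-scale formula; the paper may use a different warping.
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def triangular_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        # Triangle: the smaller of the two slopes, clipped at zero.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc_frame(frame, sample_rate, n_filters=24, n_ceps=12, keep_c0=True):
    """MFCCs for one frame; keep_c0 controls whether coefficient 0 is kept."""
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    bank = triangular_filterbank(n_filters, n_fft, sample_rate)
    log_energy = np.log(bank @ power + 1e-10)  # small floor avoids log(0)
    # Unnormalised DCT-II basis. Row k = 0 is all ones, so ceps[0] is the
    # sum of the log filter-bank energies.
    k = np.arange(n_ceps + 1)[:, None]
    n = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (n + 0.5) / n_filters)
    ceps = basis @ log_energy
    return ceps if keep_c0 else ceps[1:]
```

Calling `mfcc_frame(frame, 16000, keep_c0=False)` on a 400-sample frame gives the 12 coefficients of a conventional setup, while `keep_c0=True` returns 13 values with the 0th coefficient first. In this sketch that coefficient is the sum of the log filter-bank energies, which is one way to read the abstract's description of it as a generalized frequency band energy.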

Author information

Authors and Affiliations

Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, P.R. China
Zheng Fang, Zhang Guoliang & Song Zhanjiang

Corresponding author

Correspondence to Zheng Fang.

About this article

Cite this article

Zheng, F., Zhang, G. & Song, Z. Comparison of different implementations of MFCC. J. Comput. Sci. & Technol. 16, 582–589 (2001). https://doi.org/10.1007/BF02943243

Received: 15 October 1999
Revised: 23 February 2001
Issue Date: November 2001
DOI: https://doi.org/10.1007/BF02943243
