HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications

Published: 01 September 2017

Abstract

Building a large vocabulary continuous speech recognition (LVCSR) system requires many hours of segmented and labelled speech data. Arabic, like many other low-resource languages, lacks such data, but automatic segmentation has proved to be a good alternative for making these resources available. In this paper, we propose combining hidden Markov models (HMMs) and support vector machines (SVMs) to segment the speech waveform into phoneme units and to label them. The HMMs generate the sequence of phonemes and their boundaries; the SVM refines the boundaries and corrects the labels. The resulting segmented and labelled units may serve as a training set for speech recognition applications. The HMM/SVM segmentation algorithm is assessed using both the hit rate and the word error rate (WER); the resulting scores are compared with those obtained from manual segmentation and with those obtained from the well-known embedded learning algorithm. The results show that the speech recognizer built upon the HMM/SVM segmentation outperforms the one built upon the embedded learning segmentation by about 0.05% in terms of WER, even in noisy conditions.
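The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the one-to-one matching rule, and the 20 ms tolerance are assumptions made for illustration, and the abstract does not state which tolerance the paper uses. The hit rate is taken here as the fraction of manual (reference) boundaries that have an automatic boundary within the tolerance, and the WER as the word-level edit distance divided by the number of reference words.

```python
from typing import Sequence


def hit_rate(reference: Sequence[float], hypothesis: Sequence[float],
             tolerance: float = 0.020) -> float:
    """Fraction of reference boundaries (in seconds) that are matched by a
    hypothesis boundary within +/- tolerance.

    The 20 ms default and the one-to-one matching are assumptions,
    not values taken from the paper.
    """
    used = [False] * len(hypothesis)
    hits = 0
    for ref in reference:
        best = None
        for i, hyp in enumerate(hypothesis):
            if used[i] or abs(hyp - ref) > tolerance:
                continue
            # Keep the closest unused hypothesis boundary.
            if best is None or abs(hyp - ref) < abs(hypothesis[best] - ref):
                best = i
        if best is not None:
            used[best] = True
            hits += 1
    return hits / len(reference) if reference else 0.0


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as a Levenshtein distance over words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference) if reference else 0.0
```

For example, calling `hit_rate` with manual boundaries `[0.10, 0.25, 0.40, 0.62]` and automatic boundaries `[0.11, 0.29, 0.41, 0.61]` counts three of the four manual boundaries as hits, because the second automatic boundary lies 40 ms from its reference.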

Published In

International Journal of Speech Technology  Volume 20, Issue 3
September 2017
311 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Arabic language
  2. HMM
  3. SVM
  4. Speech recognition
  5. Speech segmentation

Qualifiers

  • Article
