Image-based features for speech signal classification

Himadri Mukherjee¹, Ankita Dhar¹ (ORCID: orcid.org/0000-0003-1122-2250), Sk Md Obaidullah², Santanu Phadikar³ & Kaushik Roy¹

Abstract

Analyzing speech signals is a crucial task within pattern classification. People often mix different languages while talking, which complicates this task. This is especially common in India, where the languages in use differ from one state to another. The southern part of India is particularly affected, which makes distinguishing its languages important. In this paper, we propose image-based features for speech signal classification, because distinct patterns can be identified by visualizing the speech signals. Modified Mel frequency cepstral coefficient (MFCC) features, namely MFCC-Statistics Grade (MFCC-SG), were extracted, visualized using plotting techniques, and then fed to a convolutional neural network. The study covers the top four languages, namely Telugu, Tamil, Malayalam, and Kannada. Experiments were performed on more than 900 hours of data collected from YouTube, yielding over 150,000 images, and the highest accuracy obtained was 94.51%.
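
To make the pipeline in the abstract more concrete, the following is a minimal sketch of the first two stages: extracting cepstral features and turning them into an image-like array. It computes plain MFCCs with NumPy only; it does not implement MFCC-SG, whose definition is not given in the abstract. The frame length, hop size, filter count, the synthetic test signal, and all function names are assumptions chosen for illustration, not the authors' settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    """Plain MFCCs (assumed parameters); returns (n_frames, n_ceps)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    log_energy = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II basis decorrelates the log filterbank energies.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1)
                   / (2 * n_filters))
    return log_energy @ basis.T

def to_image(features):
    """Min-max scale a feature matrix to an 8-bit grayscale array."""
    lo, hi = features.min(), features.max()
    scaled = (features - lo) / (hi - lo + 1e-12)
    return (255 * scaled).astype(np.uint8).T  # rows: coefficients, cols: frames

# Synthetic one-second signal standing in for a speech clip (assumption).
sr = 16000
rng = np.random.default_rng(0)
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(sr)

image = to_image(mfcc(signal, sr))
print(image.shape, image.dtype)
```

The resulting array has one row per cepstral coefficient and one column per frame, so it can be saved as a picture or passed to an image classifier such as a convolutional neural network. The paper visualizes its features with plotting techniques; the direct grayscale mapping used here is only a stand-in for that step.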

Author information

Authors and Affiliations

Department of Computer Science, West Bengal State University, Kolkata, India
Himadri Mukherjee, Ankita Dhar & Kaushik Roy
Department of Computer Science and Engineering, Aliah University, Kolkata, India
Sk Md Obaidullah
Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, Kolkata, India
Santanu Phadikar

Corresponding author

Correspondence to Ankita Dhar.

About this article

Cite this article

Mukherjee, H., Dhar, A., Obaidullah, S.M. et al. Image-based features for speech signal classification. Multimed Tools Appl 79, 34913–34929 (2020). https://doi.org/10.1007/s11042-019-08553-6

Received: 10 May 2019
Revised: 21 October 2019
Accepted: 27 November 2019
Published: 28 February 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s11042-019-08553-6
