Speaker diarization system using MKMFCC parameterization and WLI-fuzzy clustering

V. Subba Ramaiah¹ & R. Rajeswara Rao²

Abstract

Speaker diarization is the task of determining "who spoke when": labelling each time region of an audio stream with the speaker who was active in it. In previous work, a model-based speaker diarization system was presented that used tangential weighted Mel frequency cepstral coefficients (TMFCC) as the feature parameters for voice activity detection and the Lion optimization algorithm to cluster the audio stream into speaker groups. This paper proposes a speaker diarization system that uses multiple kernel weighted Mel frequency cepstral coefficient (MKMFCC) parameterization and Wu-and-Li Index (WLI)-fuzzy clustering. First, MKMFCC, which weights the MFCCs with multiple kernels such as the tangential and exponential kernels, is proposed for feature parameterization. Second, a clustering algorithm called WLI-fuzzy clustering is proposed to group the segments that belong to the same speaker. The proposed system is evaluated on the publicly available ELSDSR corpus, using an audio signal with seven different speakers. Performance is analysed using the diarization error rate, F-measure and false alarm rate. The results show that the proposed system tracks the active speakers among multiple speakers with improved tracking accuracy.
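To make the diarization error rate mentioned above concrete, the following is a minimal sketch of a frame-level version of the standard definition: missed speech, false alarm speech and speaker confusion, summed and divided by the total reference speech. It is not the paper's scoring procedure, which the abstract does not describe. The function name `frame_level_der`, the per-frame label representation (`None` for non-speech), the absence of overlapping speech, and the assumption that hypothesis labels have already been mapped onto reference labels are all illustrative assumptions.

```python
from typing import Optional, Sequence


def frame_level_der(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> float:
    """Frame-level diarization error rate (hypothetical helper).

    Each sequence holds one speaker label per frame; None marks non-speech.
    Assumes hypothesis labels are already mapped to reference labels and
    that at most one speaker is active per frame.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = speech_frames = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech_frames += 1
            if hyp is None:
                missed += 1        # speech in reference, silence in hypothesis
            elif hyp != ref:
                confusion += 1     # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # silence in reference, speech in hypothesis

    if speech_frames == 0:
        raise ValueError("reference contains no speech frames")

    return (missed + false_alarm + confusion) / speech_frames


if __name__ == "__main__":
    # Toy example: 6 reference speech frames, with 1 missed frame,
    # 1 false alarm frame and 1 confused frame.
    ref = ["A", "A", "B", "B", None, None, "A", "A"]
    hyp = ["A", "A", "B", "A", None, "B", "A", None]
    print(frame_level_der(ref, hyp))  # 0.5
```

Evaluation toolkits typically score on time-stamped segments, find the best one-to-one mapping between hypothesis and reference speakers, and may ignore a short collar around segment boundaries; this sketch omits all of these for brevity.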

Author information

Authors and Affiliations

Mahatma Gandhi Institute of Technology, Kokapet, Hyderabad, Telangana, 500075, India
V. Subba Ramaiah
JNTUK, Kakinada, Andhra Pradesh, 535002, India
R. Rajeswara Rao

Corresponding author

Correspondence to V. Subba Ramaiah.

About this article

Cite this article

Ramaiah, V.S., Rao, R.R. Speaker diarization system using MKMFCC parameterization and WLI-fuzzy clustering. Int J Speech Technol 19, 945–963 (2016). https://doi.org/10.1007/s10772-016-9384-y

Received: 29 July 2016
Accepted: 08 October 2016
Published: 20 October 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s10772-016-9384-y
