Abstract
Speaker diarization is the task of determining “who spoke when?”, that is, assigning speaker labels to the time regions in which each speaker spoke. In previous work, a model-based speaker diarization system was built using tangential weighted Mel frequency cepstral coefficients as the feature parameters for voice activity detection and the Lion optimization algorithm for clustering the audio stream into speaker groups. In this paper, a speaker diarization system is proposed that uses multiple kernel weighted Mel frequency cepstral coefficient (MKMFCC) parameterization and Wu-and-Li Index (WLI)-fuzzy clustering. First, MKMFCC, which weights the MFCCs with multiple kernels such as the tangential and exponential kernels, is proposed for feature parameterization. Second, a clustering algorithm called WLI-fuzzy clustering is proposed for grouping the segments that belong to the same speaker. The proposed system is evaluated on the publicly available ELSDSR corpus, using audio signals with seven different speakers. Performance is analysed using the diarization error rate, F-measure and false alarm rate. The results show that the proposed system tracks the active speakers among multiple speakers with improved tracking accuracy.
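To make the diarization error rate mentioned above concrete, the following is a minimal sketch of a frame-level computation, assuming the commonly used definition (missed speech plus false alarm plus speaker confusion, divided by the total reference speech). It is not the authors' evaluation code. The function name `frame_level_der`, the per-frame label representation (`None` for non-speech) and the assumption that hypothesis labels have already been mapped onto reference speaker IDs are all illustrative choices.

```python
from typing import Dict, Optional, Sequence


def frame_level_der(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> Dict[str, float]:
    """Frame-level diarization error rate (illustrative sketch).

    Each sequence holds one speaker label per frame; None marks non-speech.
    Assumption: hypothesis labels are already mapped to reference speaker IDs.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1              # frame contains reference speech
            if hyp is None:
                missed += 1          # speech labelled as non-speech
            elif hyp != ref:
                confusion += 1       # speech assigned to the wrong speaker
        elif hyp is not None:
            false_alarm += 1         # non-speech labelled as speech

    if speech == 0:
        raise ValueError("reference contains no speech frames")

    return {
        "missed": missed,
        "false_alarm": false_alarm,
        "confusion": confusion,
        "speech": speech,
        "der": (missed + false_alarm + confusion) / speech,
    }


# Hypothetical toy example: 8 frames, two speakers "A" and "B".
reference = ["A", "A", "B", "B", None, None, "A", "A"]
hypothesis = ["A", "B", "B", None, "A", None, "A", "A"]
result = frame_level_der(reference, hypothesis)
print(result["der"])
```

In this toy example there is one confused frame, one missed frame and one false-alarm frame over six reference speech frames. In practice the label mapping is usually found by an optimal assignment between hypothesis clusters and reference speakers, which this sketch leaves out.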
Cite this article
Ramaiah, V.S., Rao, R.R. Speaker diarization system using MKMFCC parameterization and WLI-fuzzy clustering. Int J Speech Technol 19, 945–963 (2016). https://doi.org/10.1007/s10772-016-9384-y
DOI: https://doi.org/10.1007/s10772-016-9384-y