Abstract
In audio stream containing multiple speakers, speaker diarization aids in ascertaining “who speak when”. This is an unsupervised task as there is no prior information about the speakers. It labels the speech signal conforming to the identity of the speaker, namely, input audio stream is partitioned into homogeneous segments. In this work, we present a novel speaker diarization system using the Tangent weighted Mel frequency cepstral coefficient (TMFCC) as the feature parameter and Lion algorithm for the clustering of the voice activity detected audio streams into particular speaker groups. Thus the two main tasks of the speaker indexing, i.e., speaker segmentation and speaker clustering, are improved. The TMFCC makes use of the low energy frame as well as the high energy frame with more effect, improving the performance of the proposed system. The experiments using the audio signal from the ELSDSR corpus datasets having three speakers, four speakers and five speakers are analyzed for the proposed system. The evaluation of the proposed speaker diarization system based on the tracking distance, tracking time as the evaluation metrics is done and the experimental results show that the speaker diarization system with the TMFCC parameterization and Lion based clustering is found to be superior over existing diarization systems with 95% tracking accuracy.
Similar content being viewed by others
References
MOATTAR M H, HOMAYOUNPOUR M M. A reveiw on speaker diarization systems and approaches [J]. Speech Communications. 2012, 54(10): 1065-1103.
TRANTER S E, DOUGLAS A. Reynolds, an overview of automatic speaker diarization systems [J]. IEEE Transactions on Audio, Speech and Language Processing, 2006, 14(5): 1557–1565.
KENNY P, GUPTA V, STAFYLAKIS T, OUELLET P, ALAM J. Deep neural networks for extracting baum welch statistics for speaker recognition [C]//Proceedings of the Speaker and Language Recognition. 2014: 293-298.
SAYOUD H, OUAMOUR S, KHENNOUF S. Virtual system of speaker tracking by camera using an audio-based source localization [C]//Proceedings of World Congress on Engineering. 2012, 2.
HUANG Y, BENESTY J, ELKO G W. Micro phone arrays for video camera steering [M]//Acoustic Signal Processing for Telecommunication. Hingham, MA, USA: Kluwer Academic Publishers, 2000: 239-260.
CHEN Jian-feng, LOUIS S, WEE S. A new approach for speaker tracking in reverberant environment [J]. Signal Processing, 2002, 82: 1023–1028.
HU M, SHARMA D, DOCLO S, BROOKES M, NAYLOR P A. Speaker Change detection and speaker diarization using spatial information [C]//Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Brisbane, QLD, Australia: IEEE, 2015: 5743-5747.
MOATTAR M H, HOMAYOUNPOUR M M. Variational conditional random fields for online speaker detection and tracking [J]. Speech Communication, 2012, 54: 763–780.
SUN X, FOOTE J, KIMBER D, MANJUNATH B S. Region of interest extraction and virtual camera control based on panoramic video [J]. IEEE Transactions on Multimedia, 2005, 7(5): 981–990.
CHEN Yun-qiang, RUI Yong, Real-time speaker tracking using particle filter sensor fusion [J]. Proceedings of the IEEE, 2004, 92(3): 485–494.
SWAMY R K, RAMA M K, YEGNANARAYANA B. Determining number of speakers from multi-speaker speech signals using excitation source information [J]. IEEE Signal Processing Letters, 2007, 14(7): 481–484.
PERTILA P. Online blind speech separation using multiple acoustic speaker tracking and time-frequency masking [J]. Computer Speech and Language, 2013, 27: 683–702.
MA Zhong-hong, YANG Yong, GE Qi, DENG Li-jun, XU Zhen-xin, SUN Xu-na. Nonlinear filtering method of zero-order term suppression for improving the image quality in off-axis holography [J]. Optics Communications, 2014, 315: 232–237.
YE Tian, CHEN Zhe, YIN Fu-liang. Distributed Kalman filter-based speaker tracking in microphone array networks [J]. Applied Acoustics, 2015, 89: 71–77.
RAJAKUMAR B. The Lion’s algorithm: A new nature-inspired search algorithm [J]. Procedia Technology, 2012, 6: 126–135.
DUNN R B, REYNOLDS D A, QUATIERI T F. Approaches to speaker detection and tracking in conversational speech [J]. Digital Signal Processing, 2000, 10: 93–112.
DAI Xiao-feng, LAHDESMAKI H, YLI-HARJA O. A stratified beta-gaussian mixture model for clustering genes with multiple data sources [C]//Proceedings of Biocomputation, Bioinformatics, and Biomedical Technologies. Bucharest, Romania: IEEE, 2008: 94-99.
MARKOVIC I, PETROVIC I. Speaker localization and tracking with a microphone array on a mobile robot using von Mises distribution and particle filtering [J]. Robotics and Autonomous Systems, 2010, 58: 1185–1196.
YEGNANARAYANA B, MAHADEVA PRASANNA S R. Analysis of instantaneous F0 contours from two speakers mixed signal using zero frequency filtering [C]//Proceedings of Acoustics Speech and Signal Processing. Dallas, TX, USA: IEEE, 2010: 5074-5077.
ALAM M J, OUELLET P, KENNY P, O’SHAUGHNESSY D. Comparative evaluation of feature normalization techniques for speaker verification [J]. Advances in Nonlinear Speech Processing, 2011, 7015: 246–253.
KUMAR K, KIM C, STERN R M. Delta-spectral cepstral Coefficients for robust speech recognition [C]//Proceedings of ICASSP. Prague, Czech: IEEE, 2011: 4784-4787.
GUPTA V, BOULIANNE G, KENNY P, OUELLET P, DUMOUCHEL P. Speaker diarization of the French broadcast news [C]//Proceedings of ICASSP. Las Vegas, NV, USA: IEEE, 2008: 4365-4368.
BARRAS C, ZHU Xuan. MEIGNIER S, GAUVAIN J L. Multistage speaker diarization of broadcast news [J]. IEEE Transactions on Audio, Speech and Language Processing. 2006, 14(5): 1505-1512.
MIRO X A, BOZONNET S, EVANS N, FREDOUILLE C, FRIEDLAND G, VINYALS O, DIARIZATION S. Speaker diarization: A review of recent research [J]. IEEE Transactions on Audio, Speech and Language Processing, 2012, 20(2): 356–370.
CAMPBELL W M, STURIM D E, REYNOLDS D A. Support vector machines using GMM supervectors for speaker verification [J]. IEEE Signal Processing Letters, 2006, 13(5): 308–311.
PEELING P, CEMGIL A T, GODSILL S. Bayesian hierarchical models and inference for musical audio processing [C]//Proceedings of IEEE Wireless Pervasive Computing. Las Vegas, NV, USA: IEEE, 2008: 278-282.
ZHENG Rong, ZHANG Ce, ZHANG Shan-shan, XU Bo. Variational bayes based i-vector for speaker diarization of telephone Conversations [C]//Proceedings of IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP). Florence, Italy, IEEE, 2014: 91-95.
KENNY P, GUPTA V, STAFYLAKIS T, OUELLET P, ALAM J. Deep neural networks for Baum-Welch statistics for speaker Recognition [C]//Proceedings of Neural Networks for Speaker and Language Modelling, 2014.
FLSDSR corpus dataset. [2016–05–02]. http://cogsys.compute.dtu. dk/soundshare/elsdsr.zip.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Subba Ramaiah, V., Rajeswara Rao, R. A novel approach for speaker diarization system using TMFCC parameterization and Lion optimization. J. Cent. South Univ. 24, 2649–2663 (2017). https://doi.org/10.1007/s11771-017-3678-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11771-017-3678-3