
Interrelate Training and Clustering for Online Speaker Diarization

Published: 01 February 2024

Abstract

In clustering-based speaker diarization systems, the embedding clusters of different speakers vary widely in size and density, which complicates accurate clustering. Nevertheless, by exploiting the overall distance relationships among speaker embeddings, sophisticated offline clustering algorithms can assign most embeddings to the correct cluster. In online scenarios, however, such complete distance relationships cannot be obtained because embeddings arrive incrementally. Consequently, determining the number of clusters and then correctly grouping the embeddings becomes challenging in an online fashion. Furthermore, errors accumulate quickly over time if the online clustering algorithm assigns embeddings to the wrong clusters early on. To address these problems, we design a novel framework for online clustering. To reduce the high variability of speaker embeddings, we propose the clustering guided embedding extractor training (CGEET) algorithm, which encourages the embedding regions of different speakers to be of similar size and thereby simplifies the distance relationships among embeddings. CGEET captures the distance information of the entire speaker embedding space and provides it to the online clustering algorithm. With this preliminary information, the distance thresholds guided online clustering (DTGOC) algorithm processes incoming embeddings with a divide-and-conquer approach: it first handles embeddings with explicit distance relationships and then searches for possible path combinations between them and the remaining embeddings in an online fashion. Moreover, to exploit the distance relationships of embeddings that are far apart in time, an online re-clustering strategy is incorporated into DTGOC, which alleviates error accumulation during online clustering. With these innovations, our online clustering system achieves a 14.00% diarization error rate (DER) with a 0.25 s collar at 2.5 s latency on AISHELL-4, compared with 14.54% for the offline agglomerative hierarchical clustering system.
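To make the assignment logic described above concrete, the sketch below shows a much simplified, hypothetical version of threshold-guided online clustering in Python: each incoming embedding joins the nearest centroid when the cosine distance falls below a preset threshold (in the paper, such threshold information is informed by CGEET training), a new cluster is opened otherwise, and a periodic re-clustering pass re-assigns buffered embeddings to illustrate how error accumulation can be alleviated. The class name, parameters, and re-clustering strategy are illustrative assumptions, not the paper's actual DTGOC implementation.

```python
import numpy as np


class ThresholdOnlineClusterer:
    """Hypothetical sketch of threshold-guided online clustering.

    An incoming speaker embedding is assigned to the nearest existing
    cluster centroid if the cosine distance is below `threshold`;
    otherwise a new cluster is opened. Every `recluster_every`
    embeddings, a simple re-clustering pass re-assigns all buffered
    embeddings to their nearest current centroids.
    """

    def __init__(self, threshold=0.4, recluster_every=50):
        self.threshold = threshold            # assumed global distance threshold
        self.recluster_every = recluster_every
        self.embeddings = []                  # all embeddings seen so far
        self.labels = []                      # current label of each embedding
        self.centroids = []                   # one centroid per cluster

    @staticmethod
    def _cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def add(self, emb):
        """Assign one incoming embedding and return its (current) cluster label."""
        emb = np.asarray(emb, dtype=np.float32)
        if self.centroids:
            dists = [self._cosine_distance(emb, c) for c in self.centroids]
            best = int(np.argmin(dists))
            if dists[best] < self.threshold:
                label = best                  # close enough: join existing cluster
            else:
                label = len(self.centroids)   # too far from all clusters: open a new one
                self.centroids.append(emb.copy())
        else:
            label = 0
            self.centroids.append(emb.copy())

        self.embeddings.append(emb)
        self.labels.append(label)
        self._update_centroid(label)

        if len(self.embeddings) % self.recluster_every == 0:
            self._recluster()                 # periodic pass to correct early mistakes
        return self.labels[-1]

    def _update_centroid(self, label):
        members = [e for e, l in zip(self.embeddings, self.labels) if l == label]
        if members:                           # a cluster may lose all members after re-clustering
            self.centroids[label] = np.mean(members, axis=0)

    def _recluster(self):
        # Simplistic re-clustering: re-assign every buffered embedding to its
        # nearest current centroid, then re-estimate all centroids.
        for i, emb in enumerate(self.embeddings):
            dists = [self._cosine_distance(emb, c) for c in self.centroids]
            self.labels[i] = int(np.argmin(dists))
        for k in range(len(self.centroids)):
            self._update_centroid(k)
```

In a streaming pipeline, `add` would be called once per newly extracted speaker embedding. Here the distance threshold and re-clustering interval are free parameters chosen by hand, whereas the paper derives its distance thresholds from CGEET training and uses a more elaborate divide-and-conquer path search in DTGOC.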



    Information & Contributors

    Information

    Published In

    cover image IEEE/ACM Transactions on Audio, Speech and Language Processing
    IEEE/ACM Transactions on Audio, Speech and Language Processing  Volume 32, Issue
    2024
    4633 pages
    ISSN:2329-9290
    EISSN:2329-9304
    Issue’s Table of Contents

    Publisher

    IEEE Press

    Publication History

    Published: 01 February 2024
    Published in TASLP Volume 32
