
Interrelate Training and Clustering for Online Speaker Diarization

Published: 01 February 2024

Abstract

In clustering-based speaker diarization systems, the embedding clusters of different speakers vary widely in size and density, which complicates accurate clustering. Nevertheless, by exploiting the overall distance relationships among speaker embeddings, sophisticated offline clustering algorithms can assign most embeddings to the correct cluster. In online scenarios, however, such complete distance relationships cannot be obtained because embeddings arrive incrementally. Consequently, determining the number of clusters and then correctly grouping the embeddings becomes challenging in an online fashion. Furthermore, errors accumulate quickly over time if the online clustering algorithm assigns embeddings to the wrong clusters early on. To address these problems, we design a novel framework for online clustering. To reduce the high variability of speaker embeddings, we propose the clustering guided embedding extractor training (CGEET) algorithm, which encourages the embedding regions of different speakers to be of similar size and thereby simplifies the distance relationships among embeddings. CGEET captures the distance information of the entire speaker embedding space and provides it to the online clustering algorithm. With this preliminary information, the distance thresholds guided online clustering (DTGOC) algorithm processes incoming embeddings with a divide-and-conquer approach: it first handles embeddings with explicit distance relationships and then searches for possible path combinations between them and the remaining embeddings in an online fashion. Moreover, to exploit the distance relationships of embeddings that are far apart in time, an online re-clustering strategy is incorporated into DTGOC, which alleviates error accumulation during online clustering. With these innovations, our online clustering system achieves a 14.00% diarization error rate (DER) with a 0.25 s collar at 2.5 s latency on AISHELL-4, compared with 14.54% for the offline agglomerative hierarchical clustering system.
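To make the assignment logic described above concrete, the sketch below shows a much simplified, hypothetical version of threshold-guided online clustering in Python: each incoming embedding joins the nearest centroid when the cosine distance falls below a preset threshold (in the paper, such threshold information is informed by CGEET training), a new cluster is opened otherwise, and a periodic re-clustering pass re-assigns buffered embeddings to illustrate how error accumulation can be alleviated. The class name, parameters, and re-clustering strategy are illustrative assumptions, not the paper's actual DTGOC implementation.

```python
import numpy as np


class ThresholdOnlineClusterer:
    """Hypothetical sketch of threshold-guided online clustering.

    An incoming speaker embedding is assigned to the nearest existing
    cluster centroid if the cosine distance is below `threshold`;
    otherwise a new cluster is opened. Every `recluster_every`
    embeddings, a simple re-clustering pass re-assigns all buffered
    embeddings to their nearest current centroids.
    """

    def __init__(self, threshold=0.4, recluster_every=50):
        self.threshold = threshold            # assumed global distance threshold
        self.recluster_every = recluster_every
        self.embeddings = []                  # all embeddings seen so far
        self.labels = []                      # current label of each embedding
        self.centroids = []                   # one centroid per cluster

    @staticmethod
    def _cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def add(self, emb):
        """Assign one incoming embedding and return its (current) cluster label."""
        emb = np.asarray(emb, dtype=np.float32)
        if self.centroids:
            dists = [self._cosine_distance(emb, c) for c in self.centroids]
            best = int(np.argmin(dists))
            if dists[best] < self.threshold:
                label = best                  # close enough: join existing cluster
            else:
                label = len(self.centroids)   # too far from all clusters: open a new one
                self.centroids.append(emb.copy())
        else:
            label = 0
            self.centroids.append(emb.copy())

        self.embeddings.append(emb)
        self.labels.append(label)
        self._update_centroid(label)

        if len(self.embeddings) % self.recluster_every == 0:
            self._recluster()                 # periodic pass to correct early mistakes
        return self.labels[-1]

    def _update_centroid(self, label):
        members = [e for e, l in zip(self.embeddings, self.labels) if l == label]
        if members:                           # a cluster may lose all members after re-clustering
            self.centroids[label] = np.mean(members, axis=0)

    def _recluster(self):
        # Simplistic re-clustering: re-assign every buffered embedding to its
        # nearest current centroid, then re-estimate all centroids.
        for i, emb in enumerate(self.embeddings):
            dists = [self._cosine_distance(emb, c) for c in self.centroids]
            self.labels[i] = int(np.argmin(dists))
        for k in range(len(self.centroids)):
            self._update_centroid(k)
```

In a streaming pipeline, `add` would be called once per newly extracted speaker embedding. Here the distance threshold and re-clustering interval are free parameters chosen by hand, whereas the paper derives its distance thresholds from CGEET training and uses a more elaborate divide-and-conquer path search in DTGOC.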



    Information & Contributors

    Information

    Published In

    cover image IEEE/ACM Transactions on Audio, Speech and Language Processing
    IEEE/ACM Transactions on Audio, Speech and Language Processing  Volume 32, Issue
    2024
    4633 pages
    ISSN:2329-9290
    EISSN:2329-9304
    Issue’s Table of Contents

    Publisher

    IEEE Press

    Publication History

    Published: 01 February 2024
    Published in TASLP Volume 32
