Computer Science > Sound

arXiv:2406.06992 (cs)

[Submitted on 11 Jun 2024 (v1), last revised 13 Jun 2024 (this version, v2)]

Title: Scaling up masked audio encoder learning for general audio classification

Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

Abstract: Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music, and environmental information, as shown in nearest-neighbor classification experiments. Code is available at this https URL.

Comments:	Interspeech 2024
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2406.06992 [cs.SD]
	(or arXiv:2406.06992v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2406.06992

Submission history

From: Heinrich Dinkel
[v1] Tue, 11 Jun 2024 06:44:54 UTC (7,466 KB)
[v2] Thu, 13 Jun 2024 04:38:02 UTC (7,466 KB)
