research-article

Self-Supervised Audio Teacher-Student Transformer for Both Clip-Level and Frame-Level Tasks

Published: 11 January 2024

Abstract

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, which generally include clip-level and frame-level tasks. Although frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies evaluate primarily on clip-level downstream tasks. To tackle both clip-level and frame-level tasks, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We carefully design the view-creation strategies for ATST-Clip and ATST-Frame: ATST-Clip uses segment-wise data augmentations, whereas ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, performance can be further improved by combining the two models through knowledge distillation.
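
The abstract describes a teacher-student pre-training scheme in which two augmented views of each training clip are encoded and the student network learns to match the teacher's representation. Schemes of this kind are commonly implemented with an exponential-moving-average (EMA) teacher (as in BYOL/DINO); the PyTorch sketch below assumes that convention, since the abstract does not spell out the update rule. The encoder, the augmentation, and the loss are simplified, hypothetical stand-ins, not the actual ATST-Clip or ATST-Frame components.

# Minimal, hypothetical sketch of a teacher-student (EMA) pre-training step,
# illustrating only the general mechanism mentioned in the abstract.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransformerEncoder(nn.Module):
    """Placeholder encoder: per-frame linear projection + Transformer layers."""
    def __init__(self, n_mels=64, dim=256, depth=4, heads=4):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)  # one token per spectrogram frame
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, spec):                  # spec: (batch, frames, n_mels)
        return self.encoder(self.proj(spec))  # (batch, frames, dim)

def augment(spec):
    # Stand-in for view creation; the paper uses segment-/frame-wise
    # augmentations and masking, which are not reproduced here.
    return spec + 0.05 * torch.randn_like(spec)

student = TinyTransformerEncoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                   # teacher receives no gradients

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
ema_decay = 0.999

batch = torch.randn(8, 100, 64)               # fake log-mel batch: (B, frames, mels)
view1, view2 = augment(batch), augment(batch)

# Student predicts the teacher's representation of the other view.
student_out = student(view1)
with torch.no_grad():
    teacher_out = teacher(view2)

# Simple frame-wise agreement loss (cosine); the actual ATST objective may differ.
loss = 1 - F.cosine_similarity(student_out, teacher_out, dim=-1).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

# EMA update of the teacher from the student.
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(ema_decay).add_(ps, alpha=1 - ema_decay)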

Cited By

  • (2024) "Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training," Proceedings of the 32nd ACM International Conference on Multimedia, pp. 7356–7365. DOI: 10.1145/3664647.3681145. Online publication date: 28-Oct-2024.
  • (2024) "Sound Activity-Aware Based Cross-Task Collaborative Training for Semi-Supervised Sound Event Detection," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 32, pp. 3947–3959. DOI: 10.1109/TASLP.2024.3451983. Online publication date: 1-Jan-2024.

      Published In

      IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 32, 2024
      4633 pages
      ISSN: 2329-9290
      EISSN: 2329-9304

      Publisher

      IEEE Press

      Publication History

      Published: 11 January 2024
      Published in TASLP Volume 32

      Qualifiers

      • Research-article

      Article Metrics

      • Downloads (Last 12 months): 10
      • Downloads (Last 6 weeks): 0
      Reflects downloads up to 26 Nov 2024
