research-article

Self-Supervised Audio Teacher-Student Transformer for Both Clip-Level and Frame-Level Tasks

Published: 11 January 2024

Abstract

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, which generally include clip-level and frame-level tasks. Although frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies evaluate primarily on clip-level downstream tasks. To tackle both clip-level and frame-level tasks, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We carefully design the view-creation strategies for ATST-Clip and ATST-Frame: ATST-Clip uses segment-wise data augmentations, whereas ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, performance can be further improved by combining the two models through knowledge distillation.
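
The abstract describes a teacher-student pre-training scheme in which two augmented views of each training clip are encoded and the student network learns to match the teacher's representation. Schemes of this kind are commonly implemented with an exponential-moving-average (EMA) teacher (as in BYOL/DINO); the PyTorch sketch below assumes that convention, since the abstract does not spell out the update rule. The encoder, the augmentation, and the loss are simplified, hypothetical stand-ins, not the actual ATST-Clip or ATST-Frame components.

# Minimal, hypothetical sketch of a teacher-student (EMA) pre-training step,
# illustrating only the general mechanism mentioned in the abstract.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransformerEncoder(nn.Module):
    """Placeholder encoder: per-frame linear projection + Transformer layers."""
    def __init__(self, n_mels=64, dim=256, depth=4, heads=4):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)  # one token per spectrogram frame
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, spec):                  # spec: (batch, frames, n_mels)
        return self.encoder(self.proj(spec))  # (batch, frames, dim)

def augment(spec):
    # Stand-in for view creation; the paper uses segment-/frame-wise
    # augmentations and masking, which are not reproduced here.
    return spec + 0.05 * torch.randn_like(spec)

student = TinyTransformerEncoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                   # teacher receives no gradients

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
ema_decay = 0.999

batch = torch.randn(8, 100, 64)               # fake log-mel batch: (B, frames, mels)
view1, view2 = augment(batch), augment(batch)

# Student predicts the teacher's representation of the other view.
student_out = student(view1)
with torch.no_grad():
    teacher_out = teacher(view2)

# Simple frame-wise agreement loss (cosine); the actual ATST objective may differ.
loss = 1 - F.cosine_similarity(student_out, teacher_out, dim=-1).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

# EMA update of the teacher from the student.
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(ema_decay).add_(ps, alpha=1 - ema_decay)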

Cited By

  • (2024) "Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training," Proceedings of the 32nd ACM International Conference on Multimedia, pp. 7356–7365. DOI: 10.1145/3664647.3681145. Online publication date: 28-Oct-2024.
  • (2024) "Sound Activity-Aware Based Cross-Task Collaborative Training for Semi-Supervised Sound Event Detection," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 32, pp. 3947–3959. DOI: 10.1109/TASLP.2024.3451983. Online publication date: 1-Jan-2024.

      Published In

      IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 32, 2024
      4633 pages
      ISSN: 2329-9290
      EISSN: 2329-9304

      Publisher

      IEEE Press

      Publication History

      Published: 11 January 2024
      Published in TASLP Volume 32

      Qualifiers

      • Research-article

      Article Metrics

      • Downloads (Last 12 months): 10
      • Downloads (Last 6 weeks): 0
      Reflects downloads up to 26 Nov 2024
