
Deep semantic learning for acoustic scene classification

Published: 03 January 2024

Abstract

Acoustic scene classification (ASC) is the task of identifying the acoustic environment, or scene, in which an audio signal was recorded. In this work, we propose an encoder-decoder approach to ASC adapted from SegNet, an architecture originally developed for image semantic segmentation. We also propose a novel feature normalization method, Mixup Normalization, which combines channel-wise instance normalization with the Mixup method to learn scene-relevant information while discarding device-specific information. In addition, we propose an event extraction block that extracts accurate semantic segmentation regions from the segmentation network, imitating the effect of image segmentation on audio features. With four data augmentation techniques, our best single system achieved an average accuracy of 71.26% across devices on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 ASC Task 1A dataset, exceeding the DCASE 2020 Challenge Task 1A baseline system by a margin of at least 17%. Compared with other state-of-the-art CNN models, it achieves lower complexity and higher performance without using any supplementary data beyond the official challenge dataset.
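
The Mixup Normalization idea summarized above can be sketched in code. The snippet below is a minimal illustration under stated assumptions, not the authors' exact formulation: it instance-normalizes each channel of a batch of log-mel spectrogram features over its frequency-time plane and then blends the normalized features across the batch with a Mixup coefficient. The class name, the Beta parameter `alpha`, and the choice to mix only features (not labels) inside this layer are assumptions made for illustration.

```python
import numpy as np
import torch
import torch.nn as nn


class MixupNormalization(nn.Module):
    """Illustrative sketch of the Mixup Normalization idea (not the
    paper's exact formulation): channel-wise instance normalization
    followed by Mixup-style blending across the batch."""

    def __init__(self, eps: float = 1e-5, alpha: float = 0.4):
        super().__init__()
        self.eps = eps      # numerical stability for the normalization
        self.alpha = alpha  # Beta(alpha, alpha) Mixup parameter (assumed value)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) log-mel spectrogram features.
        # Normalize each channel of each sample over its own freq-time
        # plane, suppressing device-dependent level and coloration.
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True)
        x_norm = (x - mean) / (std + self.eps)

        if not self.training:
            return x_norm

        # Mixup across the batch: blend each normalized sample with a
        # randomly chosen partner so the network cannot rely on
        # statistics tied to a single recording device.
        lam = float(np.random.beta(self.alpha, self.alpha))
        perm = torch.randperm(x.size(0), device=x.device)
        return lam * x_norm + (1.0 - lam) * x_norm[perm]
```

In a full training pipeline, the same mixing coefficient would typically also be applied to the class targets when Mixup is used as a label-level augmentation; since that detail is not stated here, the sketch mixes features only.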




Published In

EURASIP Journal on Audio, Speech, and Music Processing  Volume 2024, Issue 1
Nov 2024
826 pages

Publisher

Hindawi Limited

London, United Kingdom

Publication History

Published: 03 January 2024
Accepted: 01 December 2023
Received: 16 April 2022

Author Tags

  1. Acoustic scene classification
  2. Audio semantic
  3. Mini-SegNet
  4. Mixup Normalization
  5. DCASE 2020

Qualifiers

  • Research-article
