
Deep semantic learning for acoustic scene classification

Published: 03 January 2024

Abstract

Acoustic scene classification (ASC) is the task of identifying the acoustic environment, or scene, in which an audio signal was recorded. In this work, we propose an encoder-decoder approach to ASC adapted from SegNet, an architecture originally developed for image semantic segmentation. We also propose a novel feature normalization method, Mixup Normalization, which combines channel-wise instance normalization with the Mixup method to learn scene-relevant information while discarding device-specific information. In addition, we propose an event extraction block that extracts accurate semantic segmentation regions from the segmentation network, imitating the effect of image segmentation on audio features. With four data augmentation techniques, our best single system achieved an average accuracy of 71.26% across devices on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 ASC Task 1A dataset, exceeding the DCASE 2020 Challenge Task 1A baseline system by a margin of at least 17%. Compared with other state-of-the-art CNN models, it achieves lower complexity and higher performance without using any supplementary data beyond the official challenge dataset.
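
The Mixup Normalization idea summarized above can be sketched in code. The snippet below is a minimal illustration under stated assumptions, not the authors' exact formulation: it instance-normalizes each channel of a batch of log-mel spectrogram features over its frequency-time plane and then blends the normalized features across the batch with a Mixup coefficient. The class name, the Beta parameter `alpha`, and the choice to mix only features (not labels) inside this layer are assumptions made for illustration.

```python
import numpy as np
import torch
import torch.nn as nn


class MixupNormalization(nn.Module):
    """Illustrative sketch of the Mixup Normalization idea (not the
    paper's exact formulation): channel-wise instance normalization
    followed by Mixup-style blending across the batch."""

    def __init__(self, eps: float = 1e-5, alpha: float = 0.4):
        super().__init__()
        self.eps = eps      # numerical stability for the normalization
        self.alpha = alpha  # Beta(alpha, alpha) Mixup parameter (assumed value)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) log-mel spectrogram features.
        # Normalize each channel of each sample over its own freq-time
        # plane, suppressing device-dependent level and coloration.
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True)
        x_norm = (x - mean) / (std + self.eps)

        if not self.training:
            return x_norm

        # Mixup across the batch: blend each normalized sample with a
        # randomly chosen partner so the network cannot rely on
        # statistics tied to a single recording device.
        lam = float(np.random.beta(self.alpha, self.alpha))
        perm = torch.randperm(x.size(0), device=x.device)
        return lam * x_norm + (1.0 - lam) * x_norm[perm]
```

In a full training pipeline, the same mixing coefficient would typically also be applied to the class targets when Mixup is used as a label-level augmentation; since that detail is not stated here, the sketch mixes features only.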




Published In

EURASIP Journal on Audio, Speech, and Music Processing  Volume 2024, Issue 1
Nov 2024
826 pages

Publisher

Hindawi Limited

London, United Kingdom

Publication History

Published: 03 January 2024
Accepted: 01 December 2023
Received: 16 April 2022

Author Tags

  1. Acoustic scene classification
  2. Audio semantic
  3. Mini-SegNet
  4. Mixup Normalization
  5. DCASE 2020

Qualifiers

  • Research-article
