research-article

Multitask learning for acoustic scene classification with topic-based soft labels and a mutual attention mechanism

Authors:

Chengli Sun

Volume 268, Issue C

https://doi.org/10.1016/j.knosys.2023.110460

Published: 23 May 2023

Abstract

Acoustic scene classification (ASC) is a fundamental task in computational sound scene analysis that aims to identify the acoustic environment from audio. Many multitask learning (MTL) models have been proposed in computational sound scene analysis, but most of them target acoustic event detection (AED). Existing MTL models for ASC usually share knowledge between the primary and auxiliary tasks only through the shared layers and train the network with hard labels. They do not exploit the information contained in the primary and auxiliary tasks to improve generalization performance, and they do not model the relationships between events, scenes, or groups. Moreover, some models suffer from subjectivity because they generate labels via observation; subjective labels can introduce unreasonable information, which may limit improvements in system performance. To address these issues, we propose a novel MTL scheme for ASC that employs a mutual attention mechanism to explore the information contained in the primary and auxiliary tasks and a neural topic model to generate soft group labels automatically. The proposed method can model the relationships between groups and allows the primary and auxiliary tasks to use each other’s information to improve generalization performance. Experimental results on two real-world datasets show that our MTL scheme exploits the auxiliary task to improve the performance of the primary ASC task and achieves significant improvements over the baselines.
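
As an illustration of the kind of training objective such a scheme implies, the following is a minimal sketch of a two-task loss that combines a hard-label cross-entropy for the primary scene task with a soft-label cross-entropy for the auxiliary group task. It is not the authors' formulation: the abstract does not specify the loss, and the function names, the `aux_weight` parameter, and the example values are all assumptions. In the paper the soft group labels are generated by a neural topic model; here they are written by hand, and the mutual attention mechanism is left out entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities.

    Works for one-hot (hard) targets and for soft targets that sum to 1.
    """
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


def multitask_loss(scene_logits, scene_onehot, group_logits, group_soft,
                   aux_weight=0.5):
    """Primary scene loss (hard labels) plus a weighted auxiliary group loss
    (soft labels). aux_weight is a hypothetical trade-off hyperparameter."""
    scene_loss = cross_entropy(scene_onehot, softmax(scene_logits))
    group_loss = cross_entropy(group_soft, softmax(group_logits))
    return scene_loss + aux_weight * group_loss


# Hypothetical example: 3 scene classes, 2 scene groups.
scene_logits = [2.0, 0.5, -1.0]   # network output for the primary task
scene_onehot = [1.0, 0.0, 0.0]    # hard scene label
group_logits = [0.3, 0.1]         # network output for the auxiliary task
group_soft = [0.7, 0.3]           # hand-written soft group label (assumed)

loss = multitask_loss(scene_logits, scene_onehot, group_logits, group_soft)
print(loss)
```

Setting `aux_weight` to 0 reduces the objective to single-task training on the scene labels, which is a convenient baseline when checking whether the auxiliary task helps.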

Cited By

Tang Y, Huang H, Shi X, Mao X, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H (2024) Beyond Labels and Topics: Discovering Causal Relationships in Neural Topic Modeling. Proceedings of the ACM Web Conference 2024, pp. 4460-4469. doi:10.1145/3589334.3645715. Online publication date: 13-May-2024
https://dl.acm.org/doi/10.1145/3589334.3645715

Published In

Knowledge-Based Systems, Volume 268, Issue C

May 2023

554 pages

ISSN: 0950-7051

Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands
