research-article

Multitask learning for acoustic scene classification with topic-based soft labels and a mutual attention mechanism

Published: 23 May 2023

Abstract

Acoustic scene classification (ASC) is a fundamental task in computational sound scene analysis that aims to identify the acoustic environment from audio. Many multitask learning (MTL) models have been proposed in computational sound scene analysis, but most of them target acoustic event detection (AED). Existing MTL models for ASC usually share knowledge between the primary and auxiliary tasks only through the shared layers and train the network with hard labels. As a result, they do not exploit the information contained in the primary and auxiliary tasks to improve generalization performance, and they do not model the relationships between events, scenes, or groups. Moreover, some models generate labels from observations, which makes the labels subjective; such subjectivity can introduce unreasonable information and may limit improvements in system performance. To address these issues, we propose a novel MTL scheme for ASC that employs a mutual attention mechanism to explore the information contained in the primary and auxiliary tasks and a neural topic model to generate soft group labels automatically. The proposed method can model the relationships between groups and allows the primary and auxiliary tasks to use each other’s information to improve generalization performance. Experimental results on two real-world datasets show that our MTL scheme makes effective use of the auxiliary task to improve the performance of the primary ASC task and achieves significant improvements over the baselines.
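
The contrast the abstract draws between hard labels and soft group labels can be illustrated with a small loss computation. The sketch below is not the authors' implementation: it assumes a plain weighted sum of a cross-entropy on one-hot scene labels and a cross-entropy on soft group labels, and the names `multitask_loss` and `aux_weight` are hypothetical. The mutual attention mechanism and the neural topic model are not modeled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits).

    `target` may be one-hot (a hard label) or any probability vector
    (a soft label); the formula is the same in both cases.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def multitask_loss(scene_logits, scene_onehot, group_logits, group_soft,
                   aux_weight=0.5):
    """Primary scene loss plus a weighted auxiliary group loss.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    primary = cross_entropy(scene_onehot, scene_logits)
    auxiliary = cross_entropy(group_soft, group_logits)
    return primary + aux_weight * auxiliary


# Illustrative call: three scene classes (hard label), two groups (soft label).
loss = multitask_loss(
    scene_logits=[2.0, 0.5, -1.0],
    scene_onehot=[1.0, 0.0, 0.0],
    group_logits=[0.3, -0.3],
    group_soft=[0.7, 0.3],
)
```

With a soft target, probability mass assigned to a related group is penalized less than it would be under a one-hot target, which is one way an auxiliary task can carry information about relationships between groups.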

Cited By

  • (2024) Beyond Labels and Topics: Discovering Causal Relationships in Neural Topic Modeling. Proceedings of the ACM Web Conference 2024, pp. 4460–4469. doi:10.1145/3589334.3645715. Online publication date: 13-May-2024

Information & Contributors

Information

Published In

Knowledge-Based Systems  Volume 268, Issue C
May 2023
554 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 23 May 2023

Author Tags

  1. Acoustic scene classification
  2. Multitask learning
  3. Neural topic model
  4. Soft label
  5. Mutual attention

Qualifiers

  • Research-article
