CNN-LTE: A class of 1-X pooling convolutional neural networks on label tree embeddings for audio scene classification

Published: 05 March 2017
DOI: 10.1109/ICASSP.2017.7952133

Abstract

We present in this work an approach for audio scene classification. First, given the label set of the scenes, a label tree is automatically constructed in which the labels are grouped into meta-classes. This category taxonomy is then used in the feature extraction step, where an audio scene instance is transformed into a label tree embedding image whose elements indicate the likelihoods that the scene instance belongs to the different meta-classes. A class of simple 1-X (i.e., 1-max, 1-mean, and 1-mix) pooling convolutional neural networks, tailored for the task at hand, is finally learned on top of the image features for scene recognition. Experimental results on the DCASE 2013 and DCASE 2016 datasets demonstrate the efficiency of the proposed method.
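
To make the 1-X pooling idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the embedding height, filter count, kernel width, class count, and the class name OneXPoolingCNN are all illustrative assumptions. Each convolutional feature map computed over the label tree embedding image is collapsed to a single value per filter, via its maximum (1-max), its mean (1-mean), or both concatenated (1-mix), before a linear layer produces scene scores.

```python
# Minimal sketch of a 1-X pooling CNN over label tree embedding (LTE)
# "images". All sizes and names here are illustrative assumptions,
# not the configuration used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneXPoolingCNN(nn.Module):  # hypothetical name
    def __init__(self, embed_h=20, n_filters=64, kernel_w=3,
                 n_classes=15, mode="mix"):
        super().__init__()
        # Filters span the full embedding height (an assumption) and
        # slide along the time axis only.
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(embed_h, kernel_w))
        self.mode = mode
        fc_in = 2 * n_filters if mode == "mix" else n_filters
        self.fc = nn.Linear(fc_in, n_classes)

    def forward(self, x):                    # x: (batch, 1, embed_h, time)
        h = F.relu(self.conv(x)).squeeze(2)  # (batch, n_filters, time')
        if self.mode == "max":               # 1-max: strongest response
            z = h.amax(dim=2)
        elif self.mode == "mean":            # 1-mean: average response
            z = h.mean(dim=2)
        else:                                # 1-mix: concatenate both
            z = torch.cat([h.amax(dim=2), h.mean(dim=2)], dim=1)
        return self.fc(z)                    # unnormalized scene scores

# Example: 8 hypothetical 20x50 LTE images -> scores for 15 scene classes
scores = OneXPoolingCNN(mode="mix")(torch.randn(8, 1, 20, 50))
```

Collapsing the time axis to one value per filter makes the decision insensitive to where along the recording the discriminative pattern occurs, which is the appeal of this pooling family for scene-level labels.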


Cited By

  • (2023) Multitask learning for acoustic scene classification with topic-based soft labels and a mutual attention mechanism, Knowledge-Based Systems, vol. 268, May 2023. DOI: 10.1016/j.knosys.2023.110460

Published In

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2017, 6527 pages

Publisher

IEEE Press
