CNN-LTE: A class of 1-X pooling convolutional neural networks on label tree embeddings for audio scene classification

Published: 05 March 2017
DOI: 10.1109/ICASSP.2017.7952133

Abstract

We present in this work an approach for audio scene classification. First, given the label set of the scenes, a label tree is automatically constructed in which the labels are grouped into meta-classes. This category taxonomy is then used in the feature extraction step, where an audio scene instance is transformed into a label tree embedding image whose elements indicate the likelihoods that the scene instance belongs to the different meta-classes. A class of simple 1-X (i.e., 1-max, 1-mean, and 1-mix) pooling convolutional neural networks, tailored for the task at hand, is finally learned on top of the image features for scene recognition. Experimental results on the DCASE 2013 and DCASE 2016 datasets demonstrate the efficiency of the proposed method.
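
To make the 1-X pooling idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the embedding height, filter count, kernel width, class count, and the class name OneXPoolingCNN are all illustrative assumptions. Each convolutional feature map computed over the label tree embedding image is collapsed to a single value per filter, via its maximum (1-max), its mean (1-mean), or both concatenated (1-mix), before a linear layer produces scene scores.

```python
# Minimal sketch of a 1-X pooling CNN over label tree embedding (LTE)
# "images". All sizes and names here are illustrative assumptions,
# not the configuration used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneXPoolingCNN(nn.Module):  # hypothetical name
    def __init__(self, embed_h=20, n_filters=64, kernel_w=3,
                 n_classes=15, mode="mix"):
        super().__init__()
        # Filters span the full embedding height (an assumption) and
        # slide along the time axis only.
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(embed_h, kernel_w))
        self.mode = mode
        fc_in = 2 * n_filters if mode == "mix" else n_filters
        self.fc = nn.Linear(fc_in, n_classes)

    def forward(self, x):                    # x: (batch, 1, embed_h, time)
        h = F.relu(self.conv(x)).squeeze(2)  # (batch, n_filters, time')
        if self.mode == "max":               # 1-max: strongest response
            z = h.amax(dim=2)
        elif self.mode == "mean":            # 1-mean: average response
            z = h.mean(dim=2)
        else:                                # 1-mix: concatenate both
            z = torch.cat([h.amax(dim=2), h.mean(dim=2)], dim=1)
        return self.fc(z)                    # unnormalized scene scores

# Example: 8 hypothetical 20x50 LTE images -> scores for 15 scene classes
scores = OneXPoolingCNN(mode="mix")(torch.randn(8, 1, 20, 50))
```

Collapsing the time axis to one value per filter makes the decision insensitive to where along the recording the discriminative pattern occurs, which is the appeal of this pooling family for scene-level labels.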


Cited By

  • (2023) Multitask learning for acoustic scene classification with topic-based soft labels and a mutual attention mechanism, Knowledge-Based Systems, vol. 268, May 2023. DOI: 10.1016/j.knosys.2023.110460

Published In

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2017, 6527 pages

Publisher

IEEE Press
