Common sounds in bedrooms (CSIBE) corpora for sound event recognition of domestic robots

  • Original Research Paper
Intelligent Service Robotics

Abstract

Although sound event recognition has attracted much attention in the scientific community, applications in the robotics domain have received little focus. This paper publishes a new database and evaluates classifiers on it to guide the future practical development of domestic robots. A corpus (CSIBE-RAW) was collected from the internet to build acoustic models that recognize 13 sound events and ignore ambient sounds. As a case study, CSIBE-RAW was rerecorded in four room settings (CSIBE-AIBO) to create reverberation-tolerant classifiers for a Sony ERS-7. Of the eight classifiers reviewed, the convolutional neural network achieved the best accuracy (95.07%) after multi-conditional learning, and it was suitable for real-time classification on the robot. The effects of lossy audio codecs were also studied: lossy encoder-tolerant audio statistics were specified for the feature vector, and the Ogg Vorbis encoder with 128 kbit VBR was found to be the best option for storing large datasets, reaching a compression ratio of 1:8 without any significant accuracy loss.
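
The idea of multi-conditional learning mentioned in the abstract can be illustrated with a minimal sketch: samples recorded under several acoustic conditions are pooled into one training set, so that the model sees both clean and reverberant versions of each event. Everything below is an assumption made for illustration only: the two-dimensional feature vectors, the labels `knock` and `alarm`, the condition names and the nearest-centroid classifier are hypothetical stand-ins, not the paper's features, sound classes or convolutional neural network.

```python
from collections import defaultdict

# Hypothetical toy data: condition name -> list of (feature_vector, label).
# "clean" stands in for an internet-sourced corpus, "room_a" for a rerecording
# in which reverberation shifts the feature values.
conditions = {
    "clean": [([0.0, 0.0], "knock"), ([0.2, 0.1], "knock"),
              ([4.0, 4.0], "alarm"), ([4.2, 3.9], "alarm")],
    "room_a": [([1.0, 1.0], "knock"), ([1.2, 0.9], "knock"),
               ([5.0, 5.0], "alarm"), ([5.1, 5.2], "alarm")],
}


def pool_conditions(datasets):
    """Merge the samples of all recording conditions into one training set."""
    pooled = []
    for samples in datasets.values():
        pooled.extend(samples)
    return pooled


def train_centroids(samples):
    """Compute one mean feature vector (centroid) per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for features, label in samples
                  if classify(centroids, features) == label)
    return correct / len(samples)


# Train on the pooled (multi-condition) set, then score unseen samples.
model = train_centroids(pool_conditions(conditions))
held_out = [([0.5, 0.4], "knock"), ([4.8, 4.6], "alarm")]
print(accuracy(model, held_out))
```

In this sketch the pooling step is the only part that corresponds to multi-conditional learning; any classifier could be trained on the pooled set in place of the nearest-centroid stand-in.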

Figs. 1 to 7 (images not reproduced). Note to Fig. 2: this figure was generated by adapting the code from https://github.com/gwding/draw_convnet

Notes

  1. https://xiph.org/vorbis/.

  2. http://lame.sourceforge.net.

  3. https://github.com/jamiebullock/LibXtract.

  4. https://github.com/tiny-dnn/tiny-dnn.

Acknowledgements

The Nokia Foundation provided a grant (ID: 201510141) for two months in 2015, during which the CSIBE-RAW dataset was collected, CSIBE-AIBO was recorded and the initial baseline system was implemented. We give special thanks to Toni Heittola and Annamaria Mesaros from Tampere University of Technology for their invaluable comments, which helped us overcome some problems.

Author information

Corresponding author

Correspondence to Csaba Kertész.

About this article

Cite this article

Kertész, C., Turunen, M. Common sounds in bedrooms (CSIBE) corpora for sound event recognition of domestic robots. Intel Serv Robotics 11, 335–346 (2018). https://doi.org/10.1007/s11370-018-0258-9

  • DOI: https://doi.org/10.1007/s11370-018-0258-9
