Abstract
Although sound event recognition has attracted much attention in the scientific community, applications in the robotics domain have received little focus. This paper publishes a new database and evaluates classifiers on it to guide the future practical development of domestic robots. A corpus (CSIBE-RAW) was collected from the internet to build acoustic models that recognize 13 sound events and ignore ambient sounds. As a case study, CSIBE-RAW was rerecorded in four room settings (CSIBE-AIBO) to create reverberation-tolerant classifiers for a Sony ERS-7. Of the eight classifiers evaluated, the convolutional neural network achieved the best accuracy (95.07%) after multi-conditional learning, and it was suitable for real-time classification on the robot. The effects of lossy audio codecs were also studied: audio statistics tolerant to lossy encoding were specified for the feature vector, and the Ogg Vorbis encoder with 128 kbit VBR was found to be the best choice for storing large amounts of data, reaching a compression ratio of 1:8 without any significant accuracy loss.
Acknowledgements
The Nokia Foundation provided a grant (ID: 201510141) for two months in 2015, during which the CSIBE-RAW dataset was collected, CSIBE-AIBO was recorded and the initial baseline system was implemented. We especially thank Toni Heittola and Annamaria Mesaros from Tampere University of Technology for their invaluable comments, which helped us overcome several problems.
Cite this article
Kertész, C., Turunen, M. Common sounds in bedrooms (CSIBE) corpora for sound event recognition of domestic robots. Intel Serv Robotics 11, 335–346 (2018). https://doi.org/10.1007/s11370-018-0258-9
DOI: https://doi.org/10.1007/s11370-018-0258-9